askforalfred / alfred

ALFRED - A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
MIT License

Questions about the leaderboard submission format #87

Closed soyeonm closed 3 years ago

soyeonm commented 3 years ago

Hello,

I have a few questions about the submission format to the leaderboard.

  1. As in the screenshot below, the output of "leaderboard.py" only uses "task_id" as the key. I noticed that there are multiple episodes with the same "task_id" (e.g., episodes that differ only in their repeat index 0, 1, 2). If I do not have time to produce this output for every episode of "tests_seen"/"tests_unseen", will I get a different result from the leaderboard server if I run only one episode per "task_id" (e.g., only the episodes with repeat index 0), compared to running on all of "tests_seen"/"tests_unseen"?

[Screenshot (2021-05-28): leaderboard.py output, keyed only by "task_id"]

  2. If I have actions such as "LookUp_30" in my submission to the leaderboard, will this action be executed as intended (look up 30 degrees), or will the agent only look up 15 degrees, because your leaderboard server is based on this repository?

(Your repository seems to interpret "LookUp_x" / "LookDown_x" as looking up/down by "AGENT_HORIZON_ADJ" degrees, as defined in gen/constants.py.)

  3. The leaderboard rule states not to use any metadata from AI2THOR, but even your repository's "va_interact" uses the ground-truth segmentation mask from AI2THOR when deciding which ObjectID to pick up, etc. (as you explained here: https://github.com/askforalfred/alfred/issues/70 ).

Are models submitted to the leaderboard allowed to use this "va_interact" function (i.e., use the ground-truth segmentation mask from AI2THOR only when deciding which ObjectID to interact with)?

Thank you, as always!

MohitShridhar commented 3 years ago

@soyeonm

  1. Right, there are 3 repeats because each task has 3 language annotations. It's the same task that the agent needs to solve, but the instructions for achieving it were written by 3 separate annotators. For the leaderboard, any episodes you skip are counted as failures, so I would recommend running on all of them (see the sketch after this list).

  2. No, sorry: for the purpose of a standardized and fair evaluation, we require all submissions to use the same action space. The leaderboard server won't recognize LookUp_30; instead you have to call LookUp_15 twice (see the helper sketch after this list). Keeping the action space consistent is necessary for a fair comparison of path-length-weighted scores across models.

  3. va_interact is part of the ALFRED API for interacting with objects, and you can certainly use it for leaderboard submissions (see the usage sketch below). The rule refers to other metadata from the simulator, such as ground-truth object names, locations, poses, and other properties, which won't be available to an agent in a realistic embodied setting.
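
For point 1, here is a minimal sketch of enumerating every test episode, assuming the splits-file layout used in this repo (data/splits/oct21.json, with "task" and "repeat_idx" fields); the main point is that all repeat indices of each task need to be covered:

```python
import json
from collections import defaultdict

# Assumption: splits file layout as in this repo (data/splits/oct21.json),
# where each entry has a "task" path and a "repeat_idx" field.
with open("data/splits/oct21.json") as f:
    splits = json.load(f)

for split in ("tests_seen", "tests_unseen"):
    episodes_per_task = defaultdict(list)
    for ep in splits[split]:
        episodes_per_task[ep["task"]].append(ep["repeat_idx"])

    total = sum(len(v) for v in episodes_per_task.values())
    print(f"{split}: {len(episodes_per_task)} tasks, {total} episodes")
    # Every repeat_idx (typically 0, 1, 2) must appear in the submission;
    # a missing episode is scored as a failure on the leaderboard.
```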
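
For point 2, a small hypothetical helper (not part of the ALFRED API) that rewrites larger look actions into repeated 15-degree steps before building the submission:

```python
def expand_look_action(action):
    """Rewrite LookUp_x / LookDown_x into repeated 15-degree steps, since the
    leaderboard only accepts the standard action space (LookUp_15 / LookDown_15)."""
    if action.startswith(("LookUp_", "LookDown_")):
        name, angle = action.rsplit("_", 1)
        steps, remainder = divmod(int(angle), 15)
        assert remainder == 0, f"{action} is not a multiple of 15 degrees"
        return [f"{name}_15"] * steps
    return [action]

# e.g. expand_look_action("LookUp_30") -> ["LookUp_15", "LookUp_15"]
```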
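
For point 3, a rough usage sketch of va_interact with a mask predicted by the agent's own model; the exact signature and return values are in env/thor_env.py, so treat this as illustrative rather than definitive:

```python
import numpy as np
from env.thor_env import ThorEnv  # ALFRED's THOR wrapper in this repo

env = ThorEnv()
# ... reset env to the current trajectory and navigate to the target ...

# Placeholder mask: in practice this comes from your own segmentation model,
# not from simulator metadata.
pred_mask = np.zeros((300, 300), dtype=bool)

# Illustrative call; return values may differ, check env/thor_env.py.
success, _, _, err, api_action = env.va_interact(
    "PickupObject", interact_mask=pred_mask)
# Internally, va_interact resolves the mask to an ObjectID; the agent itself
# never reads ground-truth names, poses, or other metadata from the simulator.
```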

Hope this helps! Good luck!