askforalfred / alfred

ALFRED - A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Create a static dataset for evaluation #99

Closed · smittal10 closed this issue 2 years ago

smittal10 commented 2 years ago

Hi, is there a way to evaluate the seq2seq model checkpoint on a static test dataset that doesn't require online interaction with the simulator? Currently, eval_seq2seq.py evaluates the checkpoint interactively against the simulator. We want to run inference on a small device with limited compute resources, and running the simulator there is not feasible.

MohitShridhar commented 2 years ago

Hi @smittal10, I can think of two ways:

  1. Compute the loss on the validation set, or a small portion of it. Note, however, that a low validation loss doesn't necessarily correspond to good task performance, because an agent can take an action different from the expert's and still complete the task. A minimal offline-loss sketch is included below.
  2. Run eval_seq2seq.py on a small subset of validation tasks that captures a good distribution of task types in the dataset (clean, pick, etc.); see the split-filtering sketch below.
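
For option 1, a minimal sketch of what the offline loss computation could look like is below. The featurize/forward/compute_loss calls are assumed to mirror what the training loop uses; the exact method names and signatures may differ, so adapt them to whatever the checkpointed model class actually provides:

```python
# Sketch: teacher-forced validation loss without the simulator.
# The featurize/forward/compute_loss calls are assumptions about the Seq2Seq
# module's interface -- check models/model/seq2seq.py and adjust to match.

import torch

def offline_val_loss(model, val_tasks, device="cpu", max_tasks=200):
    """Average teacher-forced loss over a subset of validation trajectories."""
    model.eval()
    model.to(device)
    total, count = 0.0, 0

    with torch.no_grad():
        for task in val_tasks[:max_tasks]:
            batch = [task]                                 # single-trajectory "batch"
            feat = model.featurize(batch)                  # assumed: tensors from traj json
            out = model.forward(feat)                      # teacher-forced forward pass
            losses = model.compute_loss(out, batch, feat)  # assumed: dict of loss terms
            total += sum(v.item() for v in losses.values())
            count += 1

    return total / max(count, 1)
```

You'd load the checkpoint the same way eval_seq2seq.py does and pass in the parsed traj_data JSONs for valid_seen / valid_unseen; no THOR instance is needed since everything is teacher-forced.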
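
For option 2, one way to keep the subset representative is to write a reduced splits file, stratified by task type, and point eval_seq2seq.py at it instead of the full split. This is only a sketch: it assumes the standard splits layout (as in data/splits/oct21.json), where each entry's "task" path is prefixed by its task type, and that the eval script lets you pass an alternative splits file:

```python
# Sketch: build a small, task-type-balanced split for evaluation.
# Assumes each entry's "task" path starts with the task type, e.g.
# "pick_and_place_simple-..." or "pick_clean_then_place_in_recep-...".

import json
import random
from collections import defaultdict

def make_mini_split(splits_path, out_path, split="valid_seen", per_type=5, seed=0):
    with open(splits_path) as f:
        splits = json.load(f)

    # group validation entries by task type (prefix of the task path)
    by_type = defaultdict(list)
    for entry in splits[split]:
        task_type = entry["task"].split("-")[0]
        by_type[task_type].append(entry)

    # sample a few trajectories per task type
    rng = random.Random(seed)
    mini = []
    for task_type, entries in sorted(by_type.items()):
        mini.extend(rng.sample(entries, min(per_type, len(entries))))

    # write a copy of the splits file with only the chosen split shrunk
    splits_out = dict(splits)
    splits_out[split] = mini
    with open(out_path, "w") as f:
        json.dump(splits_out, f, indent=2)
    print(f"{split}: kept {len(mini)} tasks across {len(by_type)} task types")

# Example (hypothetical paths):
# make_mini_split("data/splits/oct21.json", "data/splits/mini_valid.json")
```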

Hope this helps!

MohitShridhar commented 2 years ago

Closing due to inactivity.