DigiRL-agent / digirl

Official repo for paper DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning.

How to reproduce the results in Table 1? #6

Closed by Z-MU-Z 4 months ago

Z-MU-Z commented 4 months ago

Hi,

Thank you for your interesting work. I have downloaded the weights from Huggingface (general-off2on-digirl) and am trying to reproduce the results in Table 1. However, I have encountered a few issues:

  1. Config Modifications: scripts/config/main/eval_only.yaml seems to need modifications. For example, I had to add task_split: "test". There might be other necessary changes, but I am not sure what they are; for instance, sample_mode is currently set to random by default.

  2. Code Modifications: The code might also need modifications. The paper says, "We use all 545 tasks in the training set for training and the first 96 tasks in the test set for testing," but I haven't found where testing on the first 96 tasks is implemented in the code.

  3. Overlap Between Test and Training Sets: There seems to be significant overlap between the test and training sets. I checked digirl/environment/android/assets/task_set/general_test.txt and digirl/environment/android/assets/task_set/general_train.txt and found many tasks present in both files (a quick check is sketched after the examples below). This looks like a form of data leakage. Is this expected? For example:

    Search for hotels in Washington DC
    What's the news in India?
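
For reference, a minimal way to count the overlap, assuming both task files list one task per line:

```python
# Count tasks that appear in both the train and test splits.
# Paths are relative to the repo root.
train_path = "digirl/environment/android/assets/task_set/general_train.txt"
test_path = "digirl/environment/android/assets/task_set/general_test.txt"

with open(train_path) as f:
    train_tasks = {line.strip() for line in f if line.strip()}
with open(test_path) as f:
    test_tasks = {line.strip() for line in f if line.strip()}

overlap = train_tasks & test_tasks
print(f"{len(overlap)} of {len(test_tasks)} test tasks also appear in the train split")
```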

Could you provide guidance on how to correctly set up the configuration and code to reproduce the results in Table 1? Thank you!

BiEchi commented 4 months ago

Thanks for your questions. I've added a patch section to README.md on reproducing the paper results; please make sure to read it before moving on to my answers.

  1. scripts/config/main/eval_only.yaml inherits from its parent, scripts/config/main/default.yaml. This is a cascading configuration: the loader first reads the parent (default.yaml) and then applies the child (eval_only.yaml) on top. For example, the task_split option is specified in default.yaml. sample_mode should be sequential when evaluating on the test set (see the sketch after this list).
  2. As mentioned in the patch, make sure rollout_size * eval_iterations = 96, so that evaluation covers exactly the first 96 test tasks.
  3. Yes, this is expected. It is a known issue with the task set we use (Android-in-the-Wild), and it is also why some entries in Table 1 of our paper show higher test scores than train scores, which is a bit counter-intuitive. You can always change the task set to anything else you like after validating your workflow - all you need to do is provide the instructions in text.
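
For concreteness, here is a sketch of the overrides such an eval_only.yaml would end up with. The rollout_size and eval_iterations values below are illustrative, not the repo's defaults; any pair whose product is 96 works, and the exact key names should be checked against default.yaml:

```yaml
# eval_only.yaml (child) - keys here override default.yaml (parent).
task_split: "test"        # evaluate on the test split (otherwise set in default.yaml)
sample_mode: "sequential" # walk through tasks in order instead of sampling at random
rollout_size: 16          # illustrative: 16 tasks per evaluation iteration
eval_iterations: 6        # illustrative: 16 * 6 = 96 -> exactly the first 96 test tasks
```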
Z-MU-Z commented 4 months ago

Thanks for your helpful answers. I have another question to confirm: I found that there is no app store in the screenshots from my emulator. Is this normal? If so, it seems that tasks like "Install an app" cannot be completed. [screenshot attached]

YifeiZhou02 commented 4 months ago

Thanks. Yes, install is currently not supported in our environment (partly because of the long time it takes to install an app). Luckily, app installation accounts for only a very small portion (<5%) of the subsets studied in our paper (General and Web Shopping), so this is only a big issue if you want to try the Install subset of AitW.

Z-MU-Z commented 4 months ago

Thanks again. That clears everything up. I will close this issue now.