Thanks for your questions. I've added a patch section in README.md on reproducing the paper results; please make sure to read it before moving on to my answers.
`scripts/config/main/eval_only.yaml` comes with its parent, `scripts/config/main/default.yaml`. This is a cascading configuration: the compiler first looks for settings in the parent (`default.yaml`) and then in the child (`eval_only.yaml`). For example, the `task_split` setting is specified in `default.yaml`. For evaluating on the test set, `sample_mode` should be `sequential` and `rollout_size * eval_iterations` should equal 96.
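Concretely, the cascade resolves roughly like this (a minimal sketch for intuition only; the merge logic below is hypothetical, not the project's actual config loader):

```python
import yaml

# Load the parent first, then let the child override it (hypothetical
# merge; the real loader may differ, but the precedence is the same).
with open("scripts/config/main/default.yaml") as f:
    config = yaml.safe_load(f)
with open("scripts/config/main/eval_only.yaml") as f:
    config.update(yaml.safe_load(f))

# Settings needed for test-set evaluation, per the discussion above.
assert config["task_split"] == "test"
assert config["sample_mode"] == "sequential"
assert config["rollout_size"] * config["eval_iterations"] == 96
```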
Thanks for your helpful answers. I have another question to confirm: I found that there is no app store in the screenshots from my simulator. Is this normal? If so, it seems that the task "Install an app" cannot be completed.
Thanks. Yeah, app installation is currently not supported in our environment (also due to the long time it takes to install an app). Luckily, "app installation" accounts for only a very small portion (<5%) of the subsets studied in our paper (General and Web Shopping), so it would only be a big issue if you want to try the Install subset of AitW.
Thanks again. All my questions are now answered, so I will close this issue.
Hi,
Thank you for your interesting work. I have downloaded the weights from Hugging Face (`general-off2on-digirl`) and am trying to reproduce the results in Table 1, but I have encountered a few issues:
1. **Config Modifications:** `scripts/config/main/eval_only.yaml` seems to need modifications. For example, I had to add `task_split: "test"`. There might be other necessary changes, but I am not sure what they are; for instance, `sample_mode` is currently set to `random` by default.
2. **Code Modifications:** The code might also need modifications. The paper mentions, "We use all 545 tasks in the training set for training and the first 96 tasks in the test set for testing," but I haven't found any implementation for testing on the first 96 tasks in the code (see the first sketch below for what I expected to find).
3. **Overlap Between Test and Training Sets:** There seems to be significant overlap between the test and training sets. I checked `digirl/environment/android/assets/task_set/general_test.txt` and `digirl/environment/android/assets/task_set/general_train.txt` and found many tasks present in both. This appears to be a form of data leakage. Is this normal? (The second sketch below is the check I ran.)
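For item 2, this is roughly the behavior I expected to find somewhere in the code (a hypothetical sketch, not code from the repo; the file path is the one quoted above):

```python
# Hypothetical sketch of "evaluate on the first 96 test tasks": read the
# test task file in order and keep only the first 96 entries. This is
# what sequential sampling with rollout_size * eval_iterations = 96
# would effectively cover.
with open("digirl/environment/android/assets/task_set/general_test.txt") as f:
    test_tasks = [line.strip() for line in f if line.strip()]

eval_tasks = test_tasks[:96]
print(f"Evaluating {len(eval_tasks)} of {len(test_tasks)} test tasks")
```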
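For item 3, this is the quick check I ran to count the overlap (a sketch; it assumes one task per line in each file):

```python
# Count tasks that appear in both the train and test task files
# (assumes one task per line; paths are the ones quoted above).
def load_tasks(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

train = load_tasks("digirl/environment/android/assets/task_set/general_train.txt")
test = load_tasks("digirl/environment/android/assets/task_set/general_test.txt")

overlap = train & test
print(f"{len(overlap)} of {len(test)} test tasks also appear in the train set")
```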
Could you provide guidance on how to correctly set up the configuration and code to reproduce the results in Table 1? Thank you!