GoooKuuu closed this issue 3 weeks ago
Thanks for your interest in our work. The gap is due to a difference in evaluation metrics. Most prior papers that evaluate on AitW use single-step match rate (i.e. whether each predicted action matches the action recorded in the offline dataset), which avoids the need to interact with an Android emulator. Our work instead reports task success rate, measured while the agent interacts with the Android emulator.
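To make the distinction concrete, here is a toy sketch of how the two metrics can diverge on the same agent. It is not the evaluation code from either repository: the function names, the action strings, and the `outcomes` list are all made up for illustration, and the per-task outcomes are simply assumed rather than produced by an emulator.

```python
from typing import List


def step_match_rate(predicted: List[List[str]], reference: List[List[str]]) -> float:
    """Offline metric: fraction of individual steps whose predicted action
    equals the action logged in the dataset (hypothetical helper)."""
    matches = 0
    total = 0
    for pred_traj, ref_traj in zip(predicted, reference):
        for pred_action, ref_action in zip(pred_traj, ref_traj):
            matches += int(pred_action == ref_action)
            total += 1
    return matches / total if total else 0.0


def task_success_rate(outcomes: List[bool]) -> float:
    """Online metric: fraction of full episodes judged successful after
    rolling out in the environment (hypothetical helper)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Toy data: 4 tasks, 5 steps each. Three trajectories contain one wrong step.
reference = [["open", "tap", "type", "enter", "done"]] * 4
predicted = [
    ["open", "tap", "type", "enter", "done"],   # every step matches
    ["open", "tap", "type", "scroll", "done"],  # one mismatch
    ["open", "swipe", "type", "enter", "done"], # one mismatch
    ["home", "tap", "type", "enter", "done"],   # one mismatch
]

# Assumption for illustration only: a single wrong action derails the episode,
# so only the first task would be judged successful in an online rollout.
outcomes = [True, False, False, False]

print(f"step match rate:   {step_match_rate(predicted, reference):.2f}")
print(f"task success rate: {task_success_rate(outcomes):.2f}")
```

In this toy case 17 of 20 steps match the reference while only 1 of 4 tasks succeeds, which is the kind of gap that can separate a step-level number from an end-to-end one.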
Closing due to inactivity.
Hi authors, I have successfully run the evaluations for DigiRL and AutoGUI, and the results align with Table 1 from your paper. However, I have a question regarding AutoGUI. The original AutoGUI paper reports a success rate of approximately 65% on the 'General' category, which is much higher than the result I observed using your code (around 17% on 'General'). Could this be due to differences in evaluation standards, or am I missing something? Thank you again for your excellent work!