Discrepancy in AutoGUI Evaluation Results

DigiRL-agent / digirl

Official repo for paper DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning.

Apache License 2.0


Discrepancy in AutoGUI Evaluation Results #21

Closed GoooKuuu closed 3 weeks ago

GoooKuuu commented 1 month ago

Hi authors, I have successfully run the evaluations for DigiRL and AutoGUI, and the results align with Table 1 of your paper. However, I have a question regarding AutoGUI. The original AutoGUI paper reports a success rate of approximately 65% on the 'General' category, which is much higher than the result I observed using your code (around 17% on 'General'). Could this be due to differences in evaluation standards, or am I missing something? Thank you again for your excellent work!

YifeiZhou02 commented 1 month ago

Thanks for your interest in our work. The gap is due to a difference in evaluation metrics. Most prior papers that evaluate on AitW report single-step match rate (i.e., whether each predicted action matches the action recorded in the offline dataset), which avoids the need to interact with an Android emulator. Our work instead reports task success rate, measured while the agent interacts with the Android emulator.
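To make the difference between the two metrics concrete, here is a minimal toy sketch. It is not the DigiRL or AutoGUI evaluation code: the function names, the episode data, and the lookup-table policy are all hypothetical. In particular, `task_success_rate` assumes a task succeeds only if every action in the episode matches the reference, which is a crude stand-in for a real emulator rollout where success is judged on the final outcome.

```python
from typing import Callable, List, Tuple

Policy = Callable[[str], str]
Episode = List[Tuple[str, str]]  # (observation, reference action) pairs


def single_step_match_rate(policy: Policy, episodes: List[Episode]) -> float:
    """Offline metric: fraction of individual steps where the predicted
    action equals the action recorded in the dataset."""
    pairs = [pair for episode in episodes for pair in episode]
    hits = sum(policy(obs) == ref for obs, ref in pairs)
    return hits / len(pairs)


def task_success_rate(policy: Policy, episodes: List[Episode]) -> float:
    """Toy stand-in for an interactive metric (assumption): a task counts
    as solved only if every step in the episode is correct, so a single
    mistake fails the whole task."""
    solved = sum(
        all(policy(obs) == ref for obs, ref in episode) for episode in episodes
    )
    return solved / len(episodes)


# Hypothetical data: two short tasks with made-up screens and actions.
EPISODES: List[Episode] = [
    [("home", "open_app"), ("app", "tap_search"), ("search", "type_query")],
    [("home", "open_app"), ("app", "tap_settings")],
]

# Hypothetical policy: a fixed lookup table that is right on most steps
# but makes one mistake in each task.
_TABLE = {"home": "open_app", "app": "tap_search", "search": "press_back"}


def toy_policy(observation: str) -> str:
    return _TABLE[observation]


if __name__ == "__main__":
    print(single_step_match_rate(toy_policy, EPISODES))  # 0.6 (3 of 5 steps)
    print(task_success_rate(toy_policy, EPISODES))       # 0.0 (0 of 2 tasks)
```

In this toy setup the same policy scores 60% on single-step matching and 0% on task success, because one wrong action is enough to fail a multi-step task. The numbers are illustrative only and say nothing about the actual 65% versus 17% figures discussed above.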

BiEchi commented 3 weeks ago

Closing due to inactivity.