Cannot get the correct result with AutoUI-DigiRL #28

mrFranklin commented 4 days ago

Hi. I have been testing one simple task for some time in order to evaluate the AutoUI-DigiRL model, but the result is always wrong. Is this normal, or is one of my steps incorrect?

The main steps: (Some other less important steps have been omitted.)

Modify task_set/general_test.txt to leave only one simple task: "Open the files app".
Modify default.yaml to set bsize=1 and rollout_size=1, so that only the one task above is run.

Download the AutoUI base model and place it in the path specified by policy_lm in default.yaml; download general-off2on-digirl.zip and unzip it to the path specified by save_path in eval_only.yaml. The logs below show that the checkpoint was loaded.

>>> Using DigiRL trainer
>>> Loading from previous checkpoint
[2024-11-14 15:42:03,792][accelerate.accelerator][INFO] - Loading states from /home/***/digirl/logs/trainer.pt
[2024-11-14 15:42:04,933][accelerate.checkpointing][INFO] - All model weights loaded successfully
[2024-11-14 15:42:04,933][accelerate.checkpointing][INFO] - All optimizer states loaded successfully
[2024-11-14 15:42:04,933][accelerate.checkpointing][INFO] - All scheduler states loaded successfully
[2024-11-14 15:42:04,933][accelerate.checkpointing][INFO] - All dataloader sampler states loaded successfully
[2024-11-14 15:42:04,937][accelerate.checkpointing][INFO] - All random states loaded successfully
[2024-11-14 15:42:04,937][accelerate.accelerator][INFO] - Loading in 0 custom states

Modify the call_gemini function to always return 0, because I want to check the results manually and no score is needed during this process.
Run the eval script and check the screenshots.
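
The call_gemini change described above could look roughly like the following. This is only a sketch: the real signature of call_gemini is not shown in this thread, so the stub accepts any arguments, and the example arguments are hypothetical. Note that with a constant return value of 0, any score the evaluator reports presumably no longer reflects what the agent did, so only the screenshots are meaningful.

```python
def call_gemini(*args, **kwargs):
    """Stub evaluator (assumed interface): skip the Gemini request entirely.

    The real function's parameters are not shown in this thread, so this
    stub accepts anything and always returns 0.
    """
    return 0


# Hypothetical call: the arguments are placeholders, not the repo's real ones.
score = call_gemini("screenshot.png", "Open the files app")
print(score)
```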

The model does not perform the correct actions, and the Files app is not opened. I also tested the task "Set an alarm for 6pm", and it did not produce the correct result either.

BiEchi commented 3 days ago

Thanks for your interest in our work! The steps you described should be correct. After DigiRL training, we also found that the agent was not able to complete these tasks. There's nothing you did wrong. This is probably because the pretrained AutoUI agent was not able to explore this task correctly. You're more than encouraged to think of approaches to improve the agent.

mrFranklin commented 9 hours ago

Thank you for replying. In your report, the "AitW General Subset Success Rate" for "SFT+DigiRL + AutoUI + Offline" is 61.5%, and for "SFT+DigiRL + AutoUI + Offline/Online" it is 71.9%. Could I reproduce this success rate using the steps and model above? I have run several tasks and none of them succeeded, so I strongly suspect there is an issue with my steps or with the model I'm using.
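
One rough way to judge whether "no successes" is compatible with a reported rate is to compute the chance of zero successes in n runs. The sketch below assumes that every task succeeds independently with the same probability as the reported 61.5%. That assumption is strong and may not hold, since individual tasks differ in difficulty. The function name and the values of n are illustrative only.

```python
def prob_zero_successes(p: float, n: int) -> float:
    """Probability of 0 successes in n independent trials with success rate p."""
    return (1.0 - p) ** n


# 0.615 is the Offline success rate quoted above; the n values are hypothetical.
for n in (1, 5, 10):
    print(f"n={n}: P(no successes) = {prob_zero_successes(0.615, n):.5f}")
```

Under this assumption a single failed task says little, while a long run of failures across varied tasks would be unlikely if the setup matched the reported one.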
