DigiRL-agent / digirl

Official repo for the paper "DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning".
Apache License 2.0

Why use Auto-UI instead of a larger VLM? #8

Closed. SiyuanMaCS closed this issue 1 month ago.

SiyuanMaCS commented 1 month ago

Why do you only train Auto-UI? Auto-UI seems to me to be a traditional RL policy model, not an LLM/VLM agent.

BiEchi commented 1 month ago

Thanks for your question. AutoUI (https://huggingface.co/cooelf/Auto-UI) is a VLM consisting of CLIP + T5, not a traditional RL policy model (paper: https://arxiv.org/abs/2309.11436). It's only 1.5B parameters, so it offers fairly good accessibility. We're also actively exploring the performance of the DigiRL algorithm on larger models.
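
For readers unfamiliar with this style of architecture, here is a minimal conceptual sketch of a CLIP-encoder + T5-decoder policy. This is not the official Auto-UI code; the class, the projection layer, and the checkpoint names are illustrative assumptions.

```python
# Conceptual sketch of a CLIP + T5 device-control policy.
# NOT the official Auto-UI implementation; all wiring here is illustrative.
import torch
from transformers import CLIPVisionModel, T5ForConditionalGeneration

class ScreenAgent(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Vision encoder: turns a screenshot into patch features.
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
        # Encoder-decoder LM: generates the action string.
        self.t5 = T5ForConditionalGeneration.from_pretrained("t5-large")
        # Project CLIP patch features into the T5 embedding space.
        self.proj = torch.nn.Linear(self.vision.config.hidden_size,
                                    self.t5.config.d_model)

    @torch.no_grad()
    def act(self, pixel_values, instruction_ids):
        img = self.proj(self.vision(pixel_values).last_hidden_state)
        txt = self.t5.get_input_embeddings()(instruction_ids)
        # Encoder sees [image patches; instruction tokens]; decoder emits the action.
        fused = torch.cat([img, txt], dim=1)
        return self.t5.generate(inputs_embeds=fused, max_new_tokens=32)
```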

SiyuanMaCS commented 1 month ago

Thank you for your reply. I found the methods and findings in this paper very interesting.

I still think there are fundamental differences between Auto-UI and other VLMs as multimodal agents. T5 is not a chat LLM such as LLaMA or GPT. Also, Auto-UI is only trained to predict action plans, so it should not have general visual instruction-following ability.

VLMs like LLaVA and GPT-4V rely on their reasoning and planning abilities to act as agents (e.g., CoT is crucial to this), whereas Auto-UI predicts action plans directly from its training data. Therefore, Auto-UI works in a completely different way from these VLMs. The contrast in output styles is sketched below.
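
To make the contrast concrete, here is a hypothetical side-by-side of the two output styles (the exact formats are illustrative, not quoted from either paper):

```text
Instruction: "Open the Settings app"            (plus a screenshot)

Direct-prediction agent (Auto-UI style):
  output: "action_type: DUAL_POINT, touch_point: [0.52, 0.81]"

Reasoning agent (LLaVA / GPT-4V style):
  output: "Thought: the Settings icon is in the bottom row, so I should tap it.
           Action: tap(0.52, 0.81)"
```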

Would similar RL methods work on these stronger VLMs?

The paper is good work, and I think this question is interesting to ask.


BiEchi commented 1 month ago

Thanks for your insightful question. We do observe different behaviors with larger VLMs like LLaVA; you can check out another paper from us. Basically, RL with CoT involved is more challenging because you need to balance the thought tokens and the action tokens. Note, though, that at the behavioral level the model size (and even the architecture) doesn't necessarily matter that much; the critic's function-approximation capability and the initial actor performance may be more of a concern.
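
As a concrete illustration of the thought/action balancing point, one option is to weight the per-token log-probabilities of thought tokens and action tokens differently in an advantage-weighted update. This is a minimal sketch under assumed tensor names, not the DigiRL loss:

```python
# Sketch: advantage-weighted log-likelihood with separate weights for
# CoT ("thought") tokens and action tokens. Illustrative only; the
# weighting scheme and names are assumptions, not the DigiRL code.
import torch
import torch.nn.functional as F

def thought_action_pg_loss(logits, labels, advantages, thought_mask,
                           thought_weight=0.1, action_weight=1.0):
    # logits: (B, L, V); labels: (B, L); advantages: (B,)
    # thought_mask: (B, L), 1 where a token belongs to the CoT segment.
    logp = -F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    m = thought_mask.float()
    # Down-weight thought tokens so reward credit concentrates on the action.
    w = m * thought_weight + (1.0 - m) * action_weight
    # REINFORCE-style objective: maximize advantage-weighted log-likelihood.
    return -(advantages.unsqueeze(-1) * w * logp).mean()
```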

BiEchi commented 1 month ago

Closed due to inactivity.

zhiyuanhubj commented 1 month ago

Hello, I am currently conducting similar work using larger VLMs for web/app agents and would like to use your method as a baseline. I am also curious whether you will release results for larger VLMs like CogAgent or CogVLM. Such general VLMs would be more promising and useful than Auto-UI, both in academic research and in real-world applications.

BiEchi commented 1 month ago

Hi @zhiyuanhubj, thanks for reaching out. We're currently making some algorithmic improvements, including scaling things up. This will take some more time, but we'd be happy if you follow our line of work!

BiEchi commented 1 month ago

Closing due to inactivity.