Thank you for your reply. I found the methods and findings in this paper very interesting.
I still think there are fundamental differences between AutoUI and other VLMs as multimodal agents. T5 is not a chatbot LLM such as Llama or GPT. Also, AutoUI is only trained to predict action plans, so AutoUI should not have general visual instruction-following ability.
VLMs like LLaVA and GPT-4V rely on their reasoning and planning abilities to act as agents (e.g., CoT is crucial to this), whereas AutoUI predicts action plans directly from its training data. Therefore, AutoUI works in a completely different way from these VLMs.
Would similar RL methods work on these stronger VLMs?
The paper is good work, and I think this question is interesting to ask.
On Wed, Jul 17, 2024, 23:16 Jack BAI @.***> wrote:
Thanks for your question. AutoUI https://huggingface.co/cooelf/Auto-UI is a VLM, consisting of CLIP + T5, not a traditional RL policy model. AutoUI paper here https://arxiv.org/abs/2309.11436. It's only 1.5B, so it provides fairly good accessibility. We're also actively exploring the performance of the DigiRL algorithm on larger models.
Thanks for your insightful question. We do observe different behaviors with larger VLMs like LLaVA; you can check out another paper from us. Basically, RL with CoT involved is more challenging because you need to balance the thought tokens and the action tokens. Note, though, that at the behavioral level the model size (even the architecture) doesn't necessarily matter that much - the critic's function approximation capability and the initial actor performance might be more of a concern.
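To make the "balancing thought tokens and action tokens" point concrete, here is a minimal sketch of one common way to do it: weight the per-token losses differently depending on whether a token belongs to the CoT "thought" span or the action span. All names and the weighting scheme here are illustrative assumptions, not taken from the DigiRL codebase.

```python
import numpy as np

def weighted_token_loss(token_nll, is_action, action_w=1.0, thought_w=0.1):
    """Combine per-token negative log-likelihoods, weighting action tokens
    and CoT 'thought' tokens differently.

    token_nll : per-token NLL values, shape (seq_len,)
    is_action : boolean mask, True where the token is part of the action
    action_w / thought_w : relative weights (illustrative defaults)
    """
    token_nll = np.asarray(token_nll, dtype=float)
    is_action = np.asarray(is_action, dtype=bool)
    # Each token gets its weight from whichever span it falls in.
    weights = np.where(is_action, action_w, thought_w)
    # Weighted average so the loss scale is comparable across weightings.
    return float((token_nll * weights).sum() / weights.sum())

# Example: two action tokens with NLL 2.0, two thought tokens with NLL 1.0,
# thought tokens down-weighted to 0.5.
loss = weighted_token_loss([2.0, 2.0, 1.0, 1.0],
                           [True, True, False, False],
                           action_w=1.0, thought_w=0.5)
```

Down-weighting the thought tokens keeps the policy gradient focused on the executable action while still shaping the reasoning trace; the right ratio is an empirical question.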
Closed due to inactivity.
Hello, I am currently conducting similar work using larger VLMs for web/app agents and would like to use your method as a baseline approach. I am also curious whether you will release results on larger VLMs like CogAgent or CogVLM. Such general VLMs would be more promising and useful than Auto-UI, both in academic research and in real-world applications.
Hi @zhiyuanhubj, thanks for reaching out. We're currently working on some algorithmic improvements, including scaling things up. This will take some more time, but we'll be happy if you follow our line of work!
Closing due to inactivity.
Why do you only train Auto-UI? Auto-UI seems to me to be a traditional RL policy model, not an LLM/VLM agent.