cooelf / Auto-GUI

Official implementation for "You Only Look at Screens: Multimodal Chain-of-Action Agents" (Findings of ACL 2024)
https://arxiv.org/abs/2309.11436
Apache License 2.0

Hello Great work Guys! #3

Closed AsadMir10 closed 11 months ago

AsadMir10 commented 11 months ago

I just wanted to use your model through the Hugging Face model library, but I don't see any usage instructions. Will you be adding usage instructions or a model card any time soon?

cooelf commented 11 months ago

Hi, this is a good point. This work makes several custom modifications to the model architecture, so I am not sure it will work smoothly with the standard Hugging Face model library. I will hopefully try it in the next two weeks.

AsadMir10 commented 11 months ago

Great, I'll be waiting for it to go live. I was also wondering about the role of BLIP-2 in the stack. From what I understood, you are just using BLIP-2 for feature extraction and then creating an embedded dataset, while the real meat of the architecture is the multimodal T5. Correct me if I'm wrong, and thanks for the update! A rough sketch of the feature-extraction step I had in mind is below.
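This sketch only assumes the standard transformers Blip2Processor / Blip2Model API, an off-the-shelf BLIP-2 checkpoint, and made-up screenshot paths; the repo may well use a different BLIP-2 stage (e.g. the Q-Former) for its actual features:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2Model

# Hypothetical checkpoint and screenshot paths, just to illustrate the idea.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda").eval()

screenshot_paths = ["screens/episode_0_step_0.png", "screens/episode_0_step_1.png"]

features = []
with torch.no_grad():
    for path in screenshot_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
        # Frozen vision-encoder features; shape (1, num_patches, vision_hidden_size).
        vision_out = model.get_image_features(**inputs)
        features.append(vision_out.last_hidden_state.float().cpu())

# Save the "embedded dataset" of precomputed screen features for later training.
torch.save(torch.cat(features, dim=0), "screen_features.pt")
```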

cooelf commented 11 months ago

Yes. We use BLIP2 just as the feature extractor. A projection layer is used to adapt the vision features to the T5 encoder-decoder architecture.
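A minimal sketch of how such a projection could feed BLIP-2 features into a T5 encoder-decoder. The dimensions, the plain linear projector, and the target action string are illustrative assumptions, not the exact setup in this repo:

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# Illustrative sizes: BLIP-2 ViT-g hidden size ~1408, t5-base d_model = 768.
VISION_DIM, T5_DIM = 1408, 768

t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5TokenizerFast.from_pretrained("t5-base")

# Hypothetical projection layer adapting frozen vision features to the T5 embedding space.
projector = nn.Linear(VISION_DIM, T5_DIM)

# Precomputed BLIP-2 features for one screenshot: (1, num_patches, VISION_DIM).
vision_feats = torch.randn(1, 257, VISION_DIM)

# Encode the textual part (goal + previous actions) as usual.
text = "Goal: open the settings app. Previous actions: none."
enc = tokenizer(text, return_tensors="pt")
text_embeds = t5.get_input_embeddings()(enc.input_ids)  # (1, T, T5_DIM)

# Prepend the projected screen tokens to the text token embeddings.
screen_embeds = projector(vision_feats)                  # (1, 257, T5_DIM)
inputs_embeds = torch.cat([screen_embeds, text_embeds], dim=1)
attention_mask = torch.cat(
    [torch.ones(screen_embeds.shape[:2], dtype=torch.long), enc.attention_mask], dim=1
)

# Train against a target action string (format here is made up for illustration).
labels = tokenizer(
    "action_type: click, touch_point: [0.5, 0.5]", return_tensors="pt"
).input_ids
out = t5(inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=labels)
print(out.loss)
```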

AsadMir10 commented 11 months ago

Great, understood.