OS-Copilot / OS-Atlas

OS-ATLAS: A Foundation Action Model For Generalist GUI Agents
Apache License 2.0
Conversational Model #6

Open aswad546 opened 2 weeks ago

aswad546 commented 2 weeks ago

Hello,

Thank you for sharing OS-ATLAS. It seems the system would work well in a conversational setting. For that, I assume you would feed the chat history in with each prompt, so I was wondering how much history the model can work with, i.e., what the token limit is for the 7B parameter model.
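To make the question concrete, this is roughly the kind of history trimming I have in mind. It is only a minimal sketch under my own assumptions: `trim_history`, `whitespace_count`, and the message format are hypothetical names, not part of OS-ATLAS, and the word-based counter is a stand-in for the model's real tokenizer. Screenshots would also consume tokens and are not counted here.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def trim_history(
    messages: List[Message],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[Message]:
    """Keep the most recent messages whose combined token count fits max_tokens."""
    kept: List[Message] = []
    used = 0
    # Walk backwards so the newest turns are considered first.
    for msg in reversed(messages):
        cost = count_tokens(msg["content"])
        if used + cost > max_tokens:
            break  # this turn and everything older is dropped
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def whitespace_count(text: str) -> int:
    # Placeholder: counts whitespace-separated words, not real model tokens.
    return len(text.split())
```

The budget passed as `max_tokens` is exactly the number I am asking about; in practice the counter would be swapped for the tokenizer that ships with the 7B model.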

As I understand it, the preprocessing shown in the example for the 4B model is required by the underlying InternVL2 model and is not needed for the 7B model, which is based on Qwen2-VL. (Please correct me if I am wrong.)

Also, I am planning to use this for desktop websites on Linux. Is the Unified Action Space prompt recommended for this use case?

The prompt used in the GitHub README is fairly simple. Does that imply it works well with the 7B model as is, or do you recommend using the Unified Action Space prompt from the paper?

Thank you!

aswad546 commented 1 week ago

Also I have noticed the model struggles to generate bounding boxes for multiple elements at a time. Is this something you trained for or is this a limitation of the underlying VLM?

jasonlee-sf commented 1 week ago

> Also I have noticed the model struggles to generate bounding boxes for multiple elements at a time. Is this something you trained for or is this a limitation of the underlying VLM?

+1. Also seeing this. Online inference is fine but batched inference seems broken.

CarlHuangNuc commented 1 week ago

@jasonlee-sf what do you mean by "Online inference is fine but batched inference seems broken."? Could you give a more detailed description?

jasonlee-sf commented 1 week ago

I created a new issue with more detail here: https://github.com/OS-Copilot/OS-Atlas/issues/17

jasonlee-sf commented 1 week ago

@CarlHuangNuc ^^