Open aswad546 opened 2 weeks ago
Also, I have noticed the model struggles to generate bounding boxes for multiple elements at a time. Is this something you trained for, or is it a limitation of the underlying VLM?
+1. Also seeing this. Online inference is fine but batched inference seems broken.
@jasonlee-sf What do you mean by "Online inference is fine but batched inference seems broken."? Could you give a more detailed description?
Created a new issue with more detail here https://github.com/OS-Copilot/OS-Atlas/issues/17
@CarlHuangNuc ^^
Hello,
Thank you for sharing OS-ATLAS. The system seems well suited to a conversational setting; I assume you would feed the chat history in with each prompt (how much history can the model work with?). That is why I was wondering about the token limit for the 7B-parameter model.
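For concreteness, here is a minimal sketch of how I imagine threading chat history into each request. This assumes the 7B model accepts the standard Qwen2-VL-style messages format; the function name, history layout, and the CLICK-style answer string are my own illustration, not taken from the OS-ATLAS repo.

```python
# Hedged sketch: packing prior (user, assistant) turns plus a new
# screenshot-grounded instruction into a Qwen2-VL-style messages list.
# How much of this history fits is bounded by the model's token limit,
# which is exactly what I am asking about.

def build_messages(history, screenshot_path, instruction):
    """history: list of (user_text, assistant_text) pairs from earlier turns.
    Returns a messages list with older turns as plain text and the new
    turn carrying the current screenshot."""
    messages = []
    for user_text, assistant_text in history:
        messages.append({"role": "user",
                         "content": [{"type": "text", "text": user_text}]})
        messages.append({"role": "assistant",
                         "content": [{"type": "text", "text": assistant_text}]})
    messages.append({"role": "user",
                     "content": [{"type": "image", "image": screenshot_path},
                                 {"type": "text", "text": instruction}]})
    return messages

msgs = build_messages(
    history=[("Open the settings menu.", "CLICK <point>[[120, 40]]</point>")],
    screenshot_path="screen.png",
    instruction="Now click the Wi-Fi toggle.",
)
```

Is this roughly the intended usage pattern, or should history be summarized/truncated before it is sent?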
The preprocessing you show as an example for the 4B model is due to the underlying InternVL2 model and is not required for the 7B model based on Qwen2-VL. (Please correct me if I am wrong about this.)
Also, I am planning to use this for desktop websites on Linux. Is the Unified Action Space prompt recommended for this use case?
The prompt used in the GitHub README is somewhat simplistic. Does that imply it works well with the 7B model, or do you recommend using the Unified Action Space prompt from the paper?
Thank you!