All-Hands-AI / OpenHands

🙌 OpenHands: Code Less, Make More
https://all-hands.dev
MIT License
31.38k stars 3.62k forks source link

[Feature] Support vision inputs for LLM with vision capabilities #2590

Closed xingyaoww closed 1 month ago

xingyaoww commented 2 months ago

What problem or use case are you trying to solve?

Context: https://opendevin.slack.com/archives/C06P5NCGSFP/p1719073107473339

It will be very helpful for the agent to actually "see," especially if you ask the agent to develop a web page / frontend UI / game.

Describe the UX of the solution you'd like

Do you have thoughts on the technical implementation?

LiteLLM already have vision model supports: https://litellm.vercel.app/docs/completion/vision#checking-if-a-model-supports-vision

We should throw out an error if user choose to use a model without vision support, yet uploaded an image.

Describe alternatives you've considered

Additional context

rezzie-rich commented 2 months ago

Having a multi llm support for task specific agents will be highly beneficial , as mentioned in #486

Phi3-vision could be used to see while deepseek-coder or llama3 generates the codes.

PierrunoYT commented 2 months ago

Yeah this would be sick

rezzie-rich commented 2 months ago

@xingyaoww @kaushikdkrikhanu, it would be great if besides uploading images, users could also give a URL for agents to extract the UI elements.

example use case would be passing in a blog site(i.e., forbe) and asking it to create a similar blog site or even make changes to that like prompting to make a website like Forbe but in 'this' & 'that' color theme, etc.

Playwright can be used to browse and extract ui elements from a given website.

this request might be outside of the scope of this issue, but still decided to include it as you guys are laying the foundation.

Also, since now there's a merge-able PR #2756 for configuring different llm per agent, there should be an option to configure a different llm for vision related tasks. Of course, there will be a default setup in case the same llm like gpt-4o is used everywhere.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

xingyaoww commented 1 month ago

Resolved by #2848.