[Feature] Support vision inputs for LLM with vision capabilities

xingyaoww commented 2 months ago

What problem or use case are you trying to solve?

Context: https://opendevin.slack.com/archives/C06P5NCGSFP/p1719073107473339

It will be very helpful for the agent to actually "see," especially if you ask the agent to develop a web page / frontend UI / game.

Describe the UX of the solution you'd like

[ ] Backend: We should use litellm to enable vision model support (https://litellm.vercel.app/docs/completion/vision#checking-if-a-model-supports-vision) in the OpenAI API format where image can be represented as base64 to pass into chat completion call. The chat Event (text-only now) needs to be modified to support passing images as "base64" to the backend: maybe each chat event can be an interleaved list of [text, image_in_base64, text, ...].
[ ] Frontend: the user can upload images to the chat by (1) pasting, (2) clicking to upload, and/or (3) referring to files inside the workspace (e.g., @/workspace/screenshot.png -- this maybe too complicated and can leave to future). And once the image is added, we should show the thumbnail of the added images. The chat Event need to be modified to support passing images as "base64" to the backend.

Do you have thoughts on the technical implementation?

LiteLLM already have vision model supports: https://litellm.vercel.app/docs/completion/vision#checking-if-a-model-supports-vision

We should throw out an error if user choose to use a model without vision support, yet uploaded an image.

Describe alternatives you've considered

Additional context

rezzie-rich commented 2 months ago

Having a multi llm support for task specific agents will be highly beneficial , as mentioned in #486

Phi3-vision could be used to see while deepseek-coder or llama3 generates the codes.

PierrunoYT commented 2 months ago

Yeah this would be sick

rezzie-rich commented 2 months ago

@xingyaoww @kaushikdkrikhanu, it would be great if besides uploading images, users could also give a URL for agents to extract the UI elements.

example use case would be passing in a blog site(i.e., forbe) and asking it to create a similar blog site or even make changes to that like prompting to make a website like Forbe but in 'this' & 'that' color theme, etc.

Playwright can be used to browse and extract ui elements from a given website.

this request might be outside of the scope of this issue, but still decided to include it as you guys are laying the foundation.

Also, since now there's a merge-able PR #2756 for configuring different llm per agent, there should be an option to configure a different llm for vision related tasks. Of course, there will be a default setup in case the same llm like gpt-4o is used everywhere.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

xingyaoww commented 1 month ago

Resolved by #2848.

All-Hands-AI / OpenHands

[Feature] Support vision inputs for LLM with vision capabilities #2590