Closed xingyaoww closed 1 month ago
Having multi-LLM support for task-specific agents would be highly beneficial, as mentioned in #486.
Phi-3-vision could be used to see while DeepSeek-Coder or Llama 3 generates the code.
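As a purely illustrative sketch of that idea (the function and model names below are assumptions, not an existing OpenHands API): route a chat event to the vision model when it carries an image, and to the coding model otherwise.

```python
# Hypothetical task-based model routing; the model names are placeholders
# for whatever the user configures, not a committed setup.
VISION_MODEL = "phi-3-vision"   # used to "see" screenshots / UI images
CODE_MODEL = "deepseek-coder"   # used to generate the code

def pick_model(has_images: bool) -> str:
    """Route a chat event to the vision LLM if it contains images."""
    return VISION_MODEL if has_images else CODE_MODEL
```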
Yeah this would be sick
@xingyaoww @kaushikdkrikhanu, it would be great if, besides uploading images, users could also give a URL for agents to extract the UI elements from.
An example use case would be passing in a blog site (e.g., Forbes) and asking the agent to create a similar blog site, or to make changes to it, such as prompting it to build a website like Forbes but in 'this' and 'that' color theme, etc.
Playwright can be used to browse and extract UI elements from a given website.
This request might be outside the scope of this issue, but I decided to include it anyway since you are laying the foundation here.
Also, since there is now a mergeable PR #2756 for configuring a different LLM per agent, there should be an option to configure a different LLM for vision-related tasks. Of course, there will be a default setup in case the same LLM, like gpt-4o, is used everywhere.
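For illustration only, a per-purpose LLM setup in `config.toml` might look like the sketch below; the section and key names here are assumptions on my part, since the exact schema depends on how #2756 lands:

```toml
# Hypothetical config.toml sketch -- section/key names are assumptions,
# not the actual schema from PR #2756.
[llm]                 # default, used everywhere unless overridden
model = "gpt-4o"

[llm.vision]          # override for vision-related tasks
model = "phi-3-vision"

[llm.coding]          # override for code generation
model = "deepseek-coder"
```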
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Resolved by #2848.
What problem or use case are you trying to solve?
Context: https://opendevin.slack.com/archives/C06P5NCGSFP/p1719073107473339
It will be very helpful for the agent to actually "see," especially if you ask the agent to develop a web page / frontend UI / game.
Describe the UX of the solution you'd like
- [ ] Backend: We should use litellm to enable vision model support (https://litellm.vercel.app/docs/completion/vision#checking-if-a-model-supports-vision) in the OpenAI API format, where an image can be represented as `base64` and passed into the chat completion call. The chat `Event` (text-only now) needs to be modified to support passing images as base64 to the backend: maybe each chat event can be an interleaved list of `[text, image_in_base64, text, ...]`.
- [ ] Frontend: the user can upload images to the chat by (1) pasting, (2) clicking to upload, and/or (3) referring to files inside the workspace (e.g., `@/workspace/screenshot.png` -- this may be too complicated and can be left for the future). Once an image is added, we should show thumbnails of the added images. The chat `Event` needs to be modified to support passing images as base64 to the backend.

Do you have thoughts on the technical implementation?
LiteLLM already has vision model support: https://litellm.vercel.app/docs/completion/vision#checking-if-a-model-supports-vision
We should throw an error if the user chooses a model without vision support, yet uploads an image.
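As a sketch of the interleaved-list idea under the OpenAI content-parts format that LiteLLM accepts for vision models (the helper names here are hypothetical, not part of the OpenHands codebase):

```python
import base64

def encode_image(raw: bytes, mime: str = "image/png") -> str:
    """Wrap raw image bytes as a data URL for an OpenAI-format image_url part."""
    b64 = base64.b64encode(raw).decode("utf-8")
    return f"data:{mime};base64,{b64}"

def format_chat_content(parts: list) -> list:
    """Convert an interleaved [text, image_bytes, text, ...] list into
    OpenAI-format content parts; bytes become base64 image_url entries."""
    content = []
    for part in parts:
        if isinstance(part, bytes):
            content.append(
                {"type": "image_url", "image_url": {"url": encode_image(part)}}
            )
        else:
            content.append({"type": "text", "text": part})
    return content

# A chat event with one text part and one (fake) image part:
message = {
    "role": "user",
    "content": format_chat_content(["What does this UI show?", b"<png bytes>"]),
}
```

Before the actual `litellm.completion(...)` call, `litellm.supports_vision(model=...)` (documented at the link above) can gate this path and raise the error described when the selected model cannot accept images.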
Describe alternatives you've considered
Additional context