TabbyML / tabby


Support image inputs in chat completion API (like OpenAI) #3169

Open Boscop opened 1 month ago

Boscop commented 1 month ago

Thanks for making Tabby, it's great :)

I want to build a local assistant on top of Tabby's HTTP API, and this assistant should support image inputs in chat. Like with the OpenAI API: https://platform.openai.com/docs/guides/vision/quickstart

When I tried this example on Tabby's local HTTP API:

$ curl "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'

I got this error:

Failed to deserialize the JSON body into the target type: messages[0]: data did not match any variant of untagged enum ChatCompletionRequestMessage at line 19 column 5

So it seems that Tabby doesn't support image inputs in chat completion requests.

Would it be possible to add support for image inputs in the chat completions API? πŸ™‚ Either as a URL or a base64-encoded image. (Like in ollama https://github.com/ollama/ollama/issues/3690)
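For illustration, the base64 variant in the OpenAI-style format embeds the image as a data URL instead of a web URL; the snippet below is just a sketch of that shape, with a placeholder instead of a real base64 payload:

"image_url": {
  "url": "data:image/jpeg;base64,<base64-encoded image data>"
}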


Please reply with a πŸ‘ if you want this feature.

sqwishy commented 1 month ago

Does it work if you include the "detail" field along with the image URL?

Try changing

"image_url": {
  "url": "..."
}

to

"image_url": {
  "url": "...",
  "detail": "auto"
}

Edit: I think the deserialization error in the response you posted comes from the missing detail field. When I adjusted the request you posted to include that field, the response was successful; but I think it just said that it can't view images. My guess is you need a different kind of model, but I don't know anything about this stuff. So probably not what you were looking for, sorry ._.
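For reference, this is roughly the adjusted request I tried (same local endpoint and placeholder token as in the original post, with only the detail field added):

$ curl "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
              "detail": "auto"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'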

wsxiaoys commented 1 month ago

Before we delve into the implementation details, could you share the use case for utilizing image input? What value do you gain by using Tabby, instead of directly interacting with a chat completion endpoint that includes an image API (e.g., the GPT-4 series)?

Boscop commented 1 month ago

@wsxiaoys Sure :) My use case is: I want to use Tabby via the API to generate code not just from instructions but also from attached images, such as screenshots of technical spec documents (e.g. hardware device specs or images captured from PDFs), and also from documents via embedding/RAG, by extracting text from PDFs, technical documentation related to the project, etc.

E.g. a lot of PDFs about hardware devices have tables with binary layouts etc. I often take a screenshot and tell Claude to write code based on it, which works, but then it's always outside the context of the codebase. I want to do it in context, locally, with Tabby.

Basically I want to extend Tabby's functionality with this, by using it via the API. Maybe I'll even end up writing my own coding assistant, and I want to use Tabby as the backend via the API.