Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Support OpenAI Vision API #258

Open tleyden opened 7 months ago

tleyden commented 7 months ago

I'm running the backend with:

sh -c './llava-v1.5-7b-q4.llamafile -ngl 9999'

and I'm able to use the web UI to upload images and send chat queries. However, when I try to use the JSON API from code, I get a 500 error:

"500 Internal Server Error\n[json.exception.type_error.302] type must be string, but is array"

Here is the JSON request I'm sending:

{
   "model":"LLaMA_CPP",
   "messages":[
      {
         "role":"user",
         "content":[
            {
               "type":"text",
               "text":"What is this an image of?"
            },
            {
               "type":"image_url",
               "image_url":{
                  "url":"data:image/jpeg;base64,iVBORwAA <snip ... long 2.2 MB base 64 image> ElFTkSuQmCC"
               }
            }
         ]
      }
   ],
   "max_tokens":4096
}

This has the same structure as the OpenAI GPT Vision example.

Is there an example of how to format the request when passing both text and image?

I couldn't find any examples in the docs of doing this particular type of inference via the JSON API. Any hints would be greatly appreciated!
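For reference, here is a minimal Python sketch of how the request body above could be built programmatically. The server address and endpoint path are assumptions (a llamafile server with an OpenAI-compatible `/v1/chat/completions` endpoint on localhost); the payload shape itself is copied from the JSON above.

```python
import base64
import json

def build_vision_payload(question: str, image_bytes: bytes,
                         model: str = "LLaMA_CPP") -> dict:
    """Build an OpenAI Vision-style chat request with one text part
    and one base64 data-URL image part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
        "max_tokens": 4096,
    }

payload = build_vision_payload("What is this an image of?", b"\xff\xd8fake-jpeg-bytes")
print(json.dumps(payload)[:60])

# Sending it would then look something like (requires the `requests`
# package and a running server -- both assumptions):
# requests.post("http://localhost:8080/v1/chat/completions", json=payload)
```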

tleyden commented 7 months ago

This looks like where it's expecting a string rather than an array in the content field:

https://github.com/Mozilla-Ocho/llamafile/blob/dfd333589abd55574ea2d2165aa18e3658045e80/llama.cpp/server/utils.h#L180
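To illustrate the mismatch: that code assumes `content` is a plain string, while the OpenAI Vision format sends an array of typed parts. A tolerant parser would need to accept both shapes. Here is a hypothetical Python sketch of that normalization (the real fix would live in the C++ server; the function name and return shape here are my own):

```python
def split_content(content):
    """Accept either content shape from a chat message:
    a plain string, or an OpenAI Vision-style list of typed parts.
    Returns (prompt_text, image_urls)."""
    if isinstance(content, str):
        return content, []
    texts, images = [], []
    for part in content:
        if part.get("type") == "text":
            texts.append(part["text"])
        elif part.get("type") == "image_url":
            images.append(part["image_url"]["url"])
    return "\n".join(texts), images
```

With this in place, the array form from the request above would yield the question text plus one data URL, and the legacy string form would pass through unchanged.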

jart commented 6 months ago

There's no support for this yet but we can add it.