ggerganov / llama.cpp

LLM inference in C/C++

OpenAI incompatible image handling in server multimodal #4771

Closed gelim closed 7 months ago

gelim commented 10 months ago

Hello, while testing LLaVA-13B with the server implementation I got a 500 error caused by `content` being a list of dicts rather than a simple string.

$ curl -H "Content-Type: application/json" -X POST -s $SERVER/v1/chat/completions -d '{"messages": [{"role": "user", "content": "hello"}]}'

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"Hi there! How can I help you today?","role":"assistant"}}],[...]

$ curl -H "Content-Type: application/json" -X POST -s $SERVER/v1/chat/completions -d '{"messages": [{"role": "user", "content": [{"type":"text","text":"hello"}]}]}'

The second request above triggers the 500. This demonstrates the issue when using an OpenAI-REST-aware frontend that pushes text together with a picture inside the `content` key, like this:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "describe the picture"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/webp;base64,AAAAAA==",
            "detail": "auto"
          }
        }
      ]
    }
  ],
  "model": "llava-13b",
  "frequency_penalty": 0,
  "max_tokens": 4000,
  "presence_penalty": 0,
  "temperature": 0.1,
  "top_p": 1,
  "user": "foobar"
}
gelim commented 10 months ago

If I understand correctly, this is more a feature that is not yet implemented in server.cpp than a bug per se.

Here is the OpenAI API documentation for reference: https://platform.openai.com/docs/api-reference/chat/create

[Screenshot (2024-01-04) of the OpenAI API reference showing the message `content` array format]
gelim commented 10 months ago

OK, after digging a bit, I can see that the code in examples/server/server.cpp and examples/server/public/index.html is definitely not OpenAI REST API compatible.

The format info is in the server's README.md.

gelim commented 10 months ago

I monkey-patched api_like_OAI.py. This is highly untested and does not handle several pictures being sent during the chat session.

The main idea is to catch messages whose 'content' is typed as a list, extract the 'image_url' base64 data, convert it to JPEG (forcing that, as my frontend sends WebP), and create the root key 'image_data' with the data plus an id. The user message in the prompt is then updated with a reference to that image id.

To be done: add multi-image support.
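
To make the transformation concrete, here is a minimal sketch of the before/after shapes. The fixed image id 10 and the [img-10] prompt tag match the patch posted below; the prompt template itself is illustrative:

```python
# What the OpenAI-style frontend sends: 'content' is a list of typed parts.
openai_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "describe the picture"},
        {"type": "image_url",
         "image_url": {"url": "data:image/webp;base64,AAAAAA=="}},
    ],
}

# What the proxy forwards to llama.cpp's /completion endpoint: the prompt
# carries an "[img-10]" placeholder, and the (JPEG-converted) base64 bytes
# travel in the root-level "image_data" key.
native_payload = {
    "prompt": "USER: [img-10]describe the picture</s>ASSISTANT:",
    "image_data": [{"data": "<base64 JPEG bytes>", "id": 10}],
}
```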

kevkid commented 9 months ago

I am experiencing the same thing. I tried using this code and could not get it to work:

```python
import base64
import requests

# NOTE: CONTEXT is defined but never sent in the request below.
CONTEXT = "You are LLaVA, a large language and vision assistant trained by UW Madison WAIV Lab. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. Follow the instructions carefully and explain your answers in detail.### Human: Hi!### Assistant: Hi there! How can I help you today?\n"

with open('image.jpg', 'rb') as f:
    img_str = base64.b64encode(f.read()).decode('utf-8')

data = {
    "messages": [
        {
            "role": "user",
            "image_url": f"data:image/jpeg;base64,{img_str}"
        },
        {
            "role": "user",
            "content": "what is in this image?"
        }
    ]
}
response = requests.post('http://<addr>:<port>/v1/chat/completions', json=data)
```
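
For reference, the native server API does accept images, just not in that shape: the image goes in a root-level image_data list, and the prompt references it with an [img-<id>] tag (this is exactly what gelim's patch below produces). A minimal sketch of a direct /completion request, assuming a server listening on localhost:8080:

```python
import base64
import requests

# Read and base64-encode the image.
with open('image.jpg', 'rb') as f:
    img_str = base64.b64encode(f.read()).decode('utf-8')

# Native llama.cpp server format: "[img-10]" in the prompt refers to the
# entry with "id": 10 in the root-level "image_data" list.
data = {
    "prompt": "USER: [img-10]what is in this image?\nASSISTANT:",
    "image_data": [{"data": img_str, "id": 10}],
    "n_predict": 128,
}
response = requests.post('http://localhost:8080/completion', json=data)
print(response.json()['content'])
```
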
gelim commented 9 months ago

Yes, you need to do the JSON adaptation yourself. I can post my crappy code later for people to improve.

kevkid commented 9 months ago

> Yes, you need to do the JSON adaptation yourself. I can post my crappy code later for people to improve.

Would you be kind enough to drop your code in a gist or give an example? Thank you.

gelim commented 9 months ago
diff --git a/examples/server/api_like_OAI.py b/examples/server/api_like_OAI.py
index 607fe49..6638081 100755
--- a/examples/server/api_like_OAI.py
+++ b/examples/server/api_like_OAI.py
@@ -39,20 +39,52 @@ def convert_chat(messages):
     user_n = args.user_name
     ai_n = args.ai_name
     stop = args.stop
-
+    multimodal = str()
     prompt = "" + args.chat_prompt + stop

     for line in messages:
         if (line["role"] == "system"):
             prompt += f"{system_n}{line['content']}{stop}"
         if (line["role"] == "user"):
-            prompt += f"{user_n}{line['content']}{stop}"
+            # multimodal heuristic
+            if isinstance(line['content'], list):
+                for cont in line['content']:
+                    multimodal="[img-10]"
+                    if cont['type'] == 'text':
+                        prompt += f"{user_n}{multimodal}{cont['text']}{stop}"
+            else: prompt += f"{user_n}{multimodal}{line['content']}{stop}"
         if (line["role"] == "assistant"):
             prompt += f"{ai_n}{line['content']}{stop}"
     prompt += ai_n.rstrip()

     return prompt

+# from any image format in base64 to JPEG in base64
+# using Pillow lib
+def multimodal_convert_pic(image_b64):
+    from base64 import b64decode,b64encode
+    from io import BytesIO
+    from PIL import Image
+
+    webp_bytes = b64decode(image_b64)
+    im = Image.open(BytesIO(webp_bytes))
+    if im.mode != 'RGB': im = im.convert('RGB')
+    jpg_data = BytesIO()
+    im.save(jpg_data, 'JPEG')
+    jpg_data.seek(0)
+    return b64encode(jpg_data.read()).decode()
+
+def multimodal_extract_image(body):
+    for line in body['messages']:
+        if not line['role'] == 'user': continue
+        if not isinstance(line['content'], list): continue  # plain string 'content': no image
+        for cont in line['content']:
+            if cont['type'] == 'image_url':
+                url = cont['image_url']['url']
+                start = url.find(',') + 1
+                return multimodal_convert_pic(url[start:])
+    return False
+
 def make_postData(body, chat=False, stream=False):
     postData = {}
     if (chat):
@@ -81,6 +113,9 @@
     postData["stream"] = stream
     postData["cache_prompt"] = True
     postData["slot_id"] = slot_id
+    # multimodal detection
+    pic_data = multimodal_extract_image(body)
+    if pic_data: postData["image_data"] = [{"data": pic_data, "id": 10}]
     return postData

 def make_resData(data, chat=False, promptToken=[]):

Launching the proxy with: ./api_like_OAI.py --llama-api http://llamacpp_listening_ip:llamacpp_port --host proxy_listening_ip --port proxy_port

Forwarded message to [llamacpp_listening_ip:llamacpp_port] will look like this:

POST /completion HTTP/1.1
Host: 172.17.1.1:8480
User-Agent: python-requests/2.31.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 5381

{"prompt": "A chat between a curious user and an artificial intelligence assistant. The assistant follows the given rules no matter what.</s>USER: [img-10]describe this picture</s>ASSISTANT:", "temperature": 1, "top_p": 1, "n_predict": 4000, "presence_penalty": 0, "frequency_penalty": 0, "stop": ["</s>"], "n_keep": -1, "stream": true, "cache_prompt": true, "slot_id": -1, "image_data": [{"data": "/9j/4AA[***STRIPPED BASE64 JPEG****]RQB//2Q==", "id": 10}]}

Then point your OpenAI-protocol-speaking frontend at baseUrl = http://proxy_listening_ip:proxy_port/v1.
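
A quick end-to-end test from Python, assuming the proxy listens on localhost:8081 (the body mirrors the OpenAI-style request from the first comment):

```python
import base64
import requests

with open('image.jpg', 'rb') as f:
    img_b64 = base64.b64encode(f.read()).decode('utf-8')

# OpenAI-style multimodal chat request; the proxy rewrites it into the
# native prompt + image_data format shown above.
resp = requests.post('http://localhost:8081/v1/chat/completions', json={
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "describe the picture"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        ],
    }],
    "max_tokens": 300,
})
print(resp.json()['choices'][0]['message']['content'])
```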

gelim commented 9 months ago

This is now getting more interesting with LLaVA 1.6 being released; the results on their demo are much more usable than 1.5. Waiting for llama.cpp to be updated (#5267), as loading the current GGUFs gives the same quality as 1.5.

github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.