etkecc / baibot

🤖 A Matrix bot for using different capabilities (text-generation, text-to-speech, speech-to-text, image-generation, etc.) of AI / Large Language Models (OpenAI, Anthropic, etc.)
GNU Affero General Public License v3.0
45 stars 4 forks

Image to text support? #5

Open saket424 opened 2 months ago

saket424 commented 2 months ago

I see text-to-image as a supported feature. How about image-to-text? There are quite a few capable multimodal self-hosted models these days, such as moondream2 and minicpm2.6, that are supported in Ollama and similar.

Is that functionality implicitly supported?

saket424 commented 2 months ago

LocalAI supports multimodal chat completions with gpt-4-vision-preview. Can I try baibot with gpt-4-vision-preview instead of gpt-4?

      - id: localai
        provider: localai
        config:
          base_url: http://172.17.0.1:8080/v1
          api_key: null
          text_generation:
            model_id: gpt-4-vision-preview
            prompt: You are a brief, but helpful bot.
            temperature: 1.0
            max_response_tokens: 16384
            max_context_tokens: 128000
And the corresponding LocalAI model definition (vision.yaml):

name: gpt-4-vision-preview

roles:
  user: "USER:"
  assistant: "ASSISTANT:"
  system: "SYSTEM:"

mmproj: llava-v1.6-7b-mmproj-f16.gguf
parameters:
  model: llava-v1.6-mistral-7b.Q5_K_M.gguf
  temperature: 0.2
  top_k: 40
  top_p: 0.95
  seed: -1

template:
  chat: |
    A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
    {{.Input}}
    ASSISTANT:

download_files:
- filename: llava-v1.6-mistral-7b.Q5_K_M.gguf
  uri: huggingface://cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf
- filename: llava-v1.6-7b-mmproj-f16.gguf
  uri: huggingface://cjpais/llava-1.6-mistral-7b-gguf/mmproj-model-f16.gguf

usage: |
    curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "gpt-4-vision-preview",
        "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
Running it against my LocalAI instance works:

curl http://172.17.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "gpt-4-vision-preview", "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
{"created":1726522282,"object":"chat.completion","id":"3a66a0dd-9899-49df-93c4-a2d36309642e","model":"gpt-4-vision-preview","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The image shows a wooden pathway leading through a field of tall grass. The pathway appears to be a simple, unpaved trail, possibly in a rural or natural setting. The sky is clear and blue, suggesting a sunny day. There are no visible landmarks or distinctive features in the background, which gives the impression of a peaceful, open landscape. \u003c/s\u003e"}}],"usage":{"prompt_tokens":1,"completion_tokens":76,"total_tokens":77}}
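For a Matrix bot, the image would come from an upload rather than a public URL. The same chat completions endpoint also accepts the image inline as a base64 data URI in the image_url field, which is a sketch of what baibot would need to send. Filenames here are hypothetical placeholders:

```shell
# Sketch: build a multimodal chat completion request body with an inline
# base64-encoded image instead of a public URL.
# 'printf' stands in for a real image here; in practice you would use:
#   IMG_B64=$(base64 < your-image.jpg | tr -d '\n')
IMG_B64=$(printf 'not-a-real-image' | base64 | tr -d '\n')

cat > request.json <<EOF
{
  "model": "gpt-4-vision-preview",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What is in the image?"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,$IMG_B64"}}
    ]
  }]
}
EOF

# Then POST it, as in the curl examples above:
#   curl http://172.17.0.1:8080/v1/chat/completions \
#     -H "Content-Type: application/json" -d @request.json
```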
saket424 commented 2 months ago

gpt-4-vision-preview does not appear to be supported by baibot -- only gpt-4 for the moment

~/baibot/src/agent/provider/localai$ cat mod.rs 
// LocalAI is based on OpenAI (async-openai), because it seems to be fully compatible.
// Moreover, openai_api_rust does not support speech-to-text, so if we wish to use this feature
// we need to stick to async-openai.

use super::openai_compat::Config;

pub fn default_config() -> Config {
    let mut config = Config {
        base_url: "http://my-localai-self-hosted-service:8080/v1".to_owned(),

        ..Default::default()
    };

    if let Some(ref mut config) = config.text_generation.as_mut() {
        config.model_id = "gpt-4".to_owned();
        config.max_context_tokens = 128_000;
        config.max_response_tokens = 4096;
    }

    if let Some(ref mut config) = config.text_to_speech.as_mut() {
        config.model_id = "tts-1".to_owned();
    }

    if let Some(ref mut config) = config.speech_to_text.as_mut() {
        config.model_id = "whisper-1".to_owned();
    }

    if let Some(ref mut config) = config.image_generation.as_mut() {
        config.model_id = "stablediffusion".to_owned();
    }

    config
} 
spantaleev commented 2 months ago

This is a valid feature request.

baibot currently ignores all images sent by you. It doesn't support feeding them to a model yet.

spantaleev commented 2 months ago

To address your previous comment:

gpt-4-vision-preview does not appear to be supported by baibot -- only gpt-4 for the moment

You're pasting an excerpt from the code which defines the default configuration for models created on the localai provider. This configuration inherits from the "OpenAI compatible" provider and customizes the models to some sane defaults for the LocalAI provider.

The fact that gpt-4 is hardcoded in the default configuration does not mean you can't change it. When creating a new agent dynamically (e.g. !bai agent create-room-local localai my-new-localai-agent), you will be shown the default configuration (which specifies the gpt-4 model), but you can change it however you'd like. You can also define the agent statically (in your YAML configuration).

Perhaps specifying a gpt-4-vision-preview model would make LocalAI route your queries to a different model.

Regardless, baibot cannot send images to the model, so what you're trying to do cannot be done yet.


For completeness, it should be noted that for the actual OpenAI API (recommended to be used via the openai provider), gpt-4-vision-preview is no longer a valid model.

If you try to use it, you get an error:

invalid_request_error: The model gpt-4-vision-preview has been deprecated, learn more here: https://platform.openai.com/docs/deprecations (code: model_not_found)

Here's the relevant part:

On June 6th, 2024, we notified developers using gpt-4-32k and gpt-4-vision-preview of their upcoming deprecations in one year and six months respectively. As of June 17, 2024, only existing users of these models will be able to continue using them.

Using gpt-4o is the new equivalent to using gpt-4-vision-preview.

saket424 commented 2 months ago

Thanks @spantaleev. In preparation for this new feature request for baibot, I will open an issue with LocalAI to let them know that gpt-4-vision-preview is deprecated and that it should instead be named gpt-4o for OpenAI API compatibility. This should get mapped to the llava-1.6-mistral model that the stock Docker CUDA 12 LocalAI v2.20.1 image comes pre-installed with.

The references to gpt-4-vision-preview in:

https://github.com/mudler/LocalAI/blob/master/aio/gpu-8g/vision.yaml

https://github.com/mudler/LocalAI/blob/master/aio/cpu/vision.yaml

https://github.com/mudler/LocalAI/blob/master/aio/intel/vision.yaml

need to be changed to gpt-4o, as you point out.
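The rename would amount to a one-line change in each of those vision.yaml files, along these lines (a sketch; all other fields stay as they are):

```yaml
# In each aio/*/vision.yaml:
# before:
#   name: gpt-4-vision-preview
# after:
name: gpt-4o
```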

saket424 commented 2 months ago

I opened this LocalAI issue https://github.com/mudler/LocalAI/issues/3596

saket424 commented 1 week ago

@spantaleev Any progress on this? I would love for baibot to weigh in when an image and an associated prompt are uploaded. This should be relatively straightforward to support, as it is an extended multimodal use of the existing text chat completion API.
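For reference, the extension is visible in the request shape used earlier in this thread: the user message content changes from a plain string to an array of typed parts (the URL below is illustrative):

```json
{
  "role": "user",
  "content": [
    {"type": "text", "text": "What is in the image?"},
    {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
  ]
}
```

A text-only message would instead use "content": "What is in the image?", so the rest of the chat completion request stays the same.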