BoundaryML / baml

BAML is a language that helps you get structured data from LLMs, with the best DX possible. Works with all languages. Check out the promptfiddle.com playground
https://docs.boundaryml.com
Apache License 2.0

Open source models support #969

Closed NikiBase closed 1 month ago

NikiBase commented 1 month ago

Hello,

First of all, thank you for building this amazing tool! I was wondering if there is any chance of integrating open-source LMMs such as https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct, which is not currently supported by the ollama provider, or whether there is a way to self-host instead of calling an external provider.

Thank you again!

hellovai commented 1 month ago

That's interesting, how are you calling models like that today @NikiBase?

NikiBase commented 1 month ago

Hi @hellovai, I use Hugging Face, but I don't know whether it is currently possible to use it as a provider for BAML. Do you have any suggestions on this? Thank you in advance.

hellovai commented 1 month ago

@NikiBase if you can post the curl/python request you use to query the model that would be great.

I think it might already be possible, since I believe Hugging Face provides an OpenAI-compatible interface.

NikiBase commented 1 month ago

@hellovai I basically run the example inference code provided on the model's Hugging Face page:

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

But I am working on a FastAPI endpoint for the inference. I would like to test both approaches, but if the FastAPI route is easier and already supported, I would like to know how I can do that for now.

Thank you

hellovai commented 1 month ago

For now, that would likely be the easiest way:

  1. Make a FastAPI endpoint that is OpenAI-compatible (a rough sketch of what that could look like is below)
  2. Create an openai-generic client in BAML that points at the local server
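
The sketch below is only illustrative, not a tested implementation: run_inference is a placeholder for the processor.apply_chat_template / model.generate call from the transformers snippet above, the handler returns the minimal fields an OpenAI-style client expects, and streaming responses are not handled.

from typing import Any, Dict, List
import time
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[Dict[str, Any]]
    max_tokens: int = 128

def run_inference(messages: List[Dict[str, Any]], max_new_tokens: int) -> str:
    # Placeholder: plug in processor.apply_chat_template + model.generate here,
    # as in the transformers snippet above. Returning a canned string lets you
    # verify the BAML wiring before the model is hooked up.
    return "Hello from the local Qwen2-VL server."

@app.post("/v1/chat/completions")
def chat_completions(req: ChatCompletionRequest) -> Dict[str, Any]:
    text = run_inference(req.messages, req.max_tokens)
    # Minimal OpenAI-style chat.completion response body.
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }

If you run that with, say, uvicorn server:app --port 8000, the BAML client's base_url would point at something like http://localhost:8000/v1.
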
NikiBase commented 1 month ago

I have seen this example. It would serve as a template for my use case, where I should just change the clients.baml file in baml_src to point to my local server, right? Thank you

hellovai commented 1 month ago

Yep! And it should "just work". If you run into anything, just tag me on here! If you do get it working, I'd love for you to share the code so we can add it to our docs on how to support Hugging Face local models.
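
Once clients.baml points at your server, calling the generated code doesn't change. Roughly like this, where the function name ExtractReceipt and the receipt URL are just placeholders for whatever you define in your own baml_src:

# Hypothetical usage of the generated Python client; ExtractReceipt and the
# image URL are placeholders, not real code from this project.
from baml_client import b
from baml_py import Image

receipt = b.ExtractReceipt(Image.from_url("https://example.com/receipt.jpg"))
print(receipt)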

NikiBase commented 1 month ago

Sure! I will be testing this weekend and getting familiar with the BAML syntax. I will write you back soon. Thank you very much and have a nice weekend!

NikiBase commented 1 month ago

Hello @hellovai,

I was working on this over the weekend and cannot manage to send requests to my FastAPI endpoint, which takes an image as an input to my model. The issue is with the BAML client connection, which throws an invalid-provider error. I attach the BAML client declaration and the error I got.

client<llm> Qwen2VL7B {
  provider "openai-generic"
  options {
    base_url "<url_to_fastapi_endpoint>"
    model "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4"
  }
}

The error I get:

Unspecified error code: 2
reqwest::Error { kind: Request, source: "JsValue(TypeError: Failed to fetch\nTypeError: Failed to fetch\n    at __wbg_fetch_1e4e8ed1f64c7e28 \n    at baml_schema_build.wasm.reqwest::wasm::client::js_fetch::h4c26ca3258bac32c (wasm://wasm/baml_schema_build.wasm-025cbcb2:wasm-function[7062]:0x64cf68)\n    at baml_schema_build.wasm.reqwest::wasm::client::fetch::{{closure}}::hcd7f917ec0bd47ec (wasm://wasm/baml_schema_build.wasm-025cbcb2:wasm-function[365]:0x1b6bc9)\n    at baml_schema_build.wasm.<baml_runtime::internal::llm_client::primitive::openai::openai_client::OpenAIClient as baml_runtime::internal::llm_client::traits::chat::WithStreamChat>::stream_chat::{{closure}}::hd7fa3de64ebf7d94 (wasm://wasm/baml_schema_build.wasm-025cbcb2:wasm-function[392]:0x1cf672)\n    at baml_schema_build.wasm.baml_runtime::BamlRuntime::run_test::{{closure}}::hf221190f10327e32 (wasm://wasm/baml_schema_build.wasm-025cbcb2:wasm-function[139]:0x356a3)\n    at baml_schema_build.wasm.wasm_bindgen_futures::future_to_promise::{{closure}}::{{closure}}::h63ca6808f76398b7 (wasm://wasm/baml_schema_build.wasm-025cbcb2:wasm-function[635]:0x27702c)\n    at baml_schema_build.wasm.wasm_bindgen_futures::queue::Queue::new::{{closure}}::h2bd4eab1d7cf1be5 (wasm://wasm/baml_schema_build.wasm-025cbcb2:wasm-function[3450]:0x52c2aa)\n    at baml_schema_build.wasm.<dyn core::ops::function::FnMut<(A,)>+Output = R as wasm_bindgen::closure::WasmClosure>::describe::invoke::h4787a4021637fe52 (wasm://wasm/baml_schema_build.wasm-025cbcb2:wasm-function[12894]:0x6dbc20)\n    at __wbg_adapter_41\n    at real " }

Check the webview network tab for more details. Command Palette -> Open webview developer tools. 

I was also trying to use another deployment framework, vLLM, since I worked with it recently, but I ran into some install-from-source issues and could not get any further. I will try another LLM, since the one I am using depends on the current dev version of the transformers library, which does not work well with vLLM at the moment.

Do you have any suggestions on the FastAPI solution? I would really appreciate any comment. Thank you a lot

hellovai commented 1 month ago

@NikiBase, that's odd, let me try running this myself. Is that URL reachable for me?

For context: I have it working with ollama.

And can you confirm which version of BAML you're using?

NikiBase commented 1 month ago

Sorry, that URL is not accessible externally. The problem is that the model is not available in ollama. I am using BAML 0.56.1.

hellovai commented 1 month ago

Hmm, an idea: does the raw cURL request work for you? (You should be able to pull out the exact cURL request being made under the hood from the playground.) See the [ ] Raw Curl button right above the prompt preview.
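
You can also replay roughly the same request outside the playground with the openai Python client pointed at your server; the base_url, model name, and image URL below are placeholders for your setup.

# Hypothetical sanity check that bypasses BAML entirely; replace the base_url,
# model name, and image URL with your own values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder: your local server
    api_key="not-needed",                 # most local servers ignore the key
)

resp = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this receipt."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/receipt.jpg"},
                },
            ],
        }
    ],
)
print(resp.choices[0].message.content)

If that call fails in the same way, the issue is likely on the server side rather than in the BAML client configuration.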

Sidd065 commented 1 month ago

@NikiBase llama.cpp supports Qwen models and has an OpenAI Compatible HTTP Server.

There is also a FastAPI server for llama.cpp. You can run the model on that server and create an openai-generic client for BAML.

NikiBase commented 1 month ago

Hi @hellovai @Sidd065, thank you for the recommendations. I will check them out and write back when I have something working.

NikiBase commented 1 month ago

I was trying to run qwen2-vl-7b-instruct using llama.cpp, but it seems it is not supported yet: https://github.com/ggerganov/llama.cpp/issues/9246

I will try another model until then. Do you have any suggestions for a lightweight LMM that performs well at extracting data from receipts? @hellovai @Sidd065 Thank you!

NikiBase commented 1 month ago

In the end, I used this model https://huggingface.co/openbmb/MiniCPM-V-2_6, served with vllm serve, and connected it with BAML. However, I had some troubles along the way that I would like to list, in case anyone else runs into them:

Ben-Epstein commented 3 weeks ago

@NikiBase you may want to try ramalama with a gguf variant of qwen

ramalama --nocontainer serve 'huggingface://Qwen/Qwen2-7B-Instruct-GGUF/qwen2-7b-instruct-q2_k.gguf'

(This is a small quant; you probably want to use q6: https://huggingface.co/Qwen/Qwen2-7B-Instruct-GGUF)