That's interesting, how are you calling models like that today @NikiBase?
Hi @hellovai, I use Hugging Face, but I don't know whether it is currently possible to use it as a provider for BAML. Do you have any suggestions on this? Thank you in advance
@NikiBase if you can post the curl/python request you use to query the model that would be great.
I think it might be possible already, as I believe Hugging Face provides an OpenAI-compatible interface.
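For example, a minimal sketch, assuming the model is deployed behind an OpenAI-compatible route (such as text-generation-inference's /v1/chat/completions); the base URL, API key, model name, and image support below are placeholders that depend on the actual deployment:

# Hypothetical sketch: querying a self-hosted Hugging Face model through an
# OpenAI-compatible endpoint (e.g. a text-generation-inference deployment).
# base_url and api_key are placeholders; whether image content parts are
# accepted depends on the serving stack.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)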
@hellovai I basically run the example inference call provided on the model's Hugging Face page:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
But I am working on a FastAPI endpoint for the inference. I would like to test both approaches, but if the FastAPI route is easier and already supported, I would like to know how I can do that for now.
Thank you
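For reference, a minimal sketch of what such a wrapper could look like: a FastAPI app exposing an OpenAI-style /v1/chat/completions route that BAML's openai-generic provider can call. run_qwen_inference and the response fields are simplified placeholders, not a complete implementation.

# Hypothetical sketch of an OpenAI-compatible FastAPI wrapper around the Qwen2-VL
# pipeline shown above. Streaming, error handling, and auth are omitted.
import time
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ChatRequest(BaseModel):
    model: str
    messages: list
    max_tokens: int = 128


def run_qwen_inference(messages: list, max_new_tokens: int) -> str:
    # Placeholder: plug in the apply_chat_template / process_vision_info /
    # model.generate / batch_decode steps from the snippet above and return
    # the decoded assistant text.
    raise NotImplementedError


@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    output_text = run_qwen_inference(req.messages, req.max_tokens)
    # Return only the fields a typical OpenAI-style client needs.
    return {
        "id": "chatcmpl-local",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": output_text},
                "finish_reason": "stop",
            }
        ],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }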
For now, that would likely be the easiest way: an openai-generic client in BAML that points at the local server.
I have seen this example. This would serve as an example for my use case, where I should just change the clients.baml file in baml_src to point to my local server, right? Thank you
Yep! And it should "just work". If you run into anything, just tag me on here! If you do get it working, I'd love for you to share the code so we can add it to our docs for how to support Hugging Face local models.
Sure! I will be testing this weekend and getting familiar with the BAML syntax, and will write you back soon. Thank you very much and have a nice weekend!
Hello @hellovai,
I was working on this over the weekend and could not manage to send requests to my FastAPI endpoint, which takes an image as input to the model. The issue is with the connection from the BAML client, which throws a "not valid provider" error. I attach the BAML client declaration and the error I got:
client<llm> Qwen2VL7B {
  provider "openai-generic"
  options {
    base_url "<url_to_fastapi_endpoint>"
    model "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4"
  }
}
The error I get
Unspecified error code: 2
reqwest::Error { kind: Request, source: "JsValue(TypeError: Failed to fetch\nTypeError: Failed to fetch\n at __wbg_fetch_1e4e8ed1f64c7e28 \n at baml_schema_build.wasm.reqwest::wasm::client::js_fetch::h4c26ca3258bac32c (wasm://wasm/baml_schema_build.wasm-025cbcb2:wasm-function[7062]:0x64cf68)\n at baml_schema_build.wasm.reqwest::wasm::client::fetch::{{closure}}::hcd7f917ec0bd47ec (wasm://wasm/baml_schema_build.wasm-025cbcb2:wasm-function[365]:0x1b6bc9)\n at baml_schema_build.wasm.<baml_runtime::internal::llm_client::primitive::openai::openai_client::OpenAIClient as baml_runtime::internal::llm_client::traits::chat::WithStreamChat>::stream_chat::{{closure}}::hd7fa3de64ebf7d94 (wasm://wasm/baml_schema_build.wasm-025cbcb2:wasm-function[392]:0x1cf672)\n at baml_schema_build.wasm.baml_runtime::BamlRuntime::run_test::{{closure}}::hf221190f10327e32 (wasm://wasm/baml_schema_build.wasm-025cbcb2:wasm-function[139]:0x356a3)\n at baml_schema_build.wasm.wasm_bindgen_futures::future_to_promise::{{closure}}::{{closure}}::h63ca6808f76398b7 (wasm://wasm/baml_schema_build.wasm-025cbcb2:wasm-function[635]:0x27702c)\n at baml_schema_build.wasm.wasm_bindgen_futures::queue::Queue::new::{{closure}}::h2bd4eab1d7cf1be5 (wasm://wasm/baml_schema_build.wasm-025cbcb2:wasm-function[3450]:0x52c2aa)\n at baml_schema_build.wasm.<dyn core::ops::function::FnMut<(A,)>+Output = R as wasm_bindgen::closure::WasmClosure>::describe::invoke::h4787a4021637fe52 (wasm://wasm/baml_schema_build.wasm-025cbcb2:wasm-function[12894]:0x6dbc20)\n at __wbg_adapter_41\n at real " }
Check the webview network tab for more details. Command Palette -> Open webview developer tools.
I also tried another deployment framework, vLLM, since I have worked with it recently, but I ran into some install-from-source issues and could not get any further. I may try another LLM, since the one I am using depends on the current dev version of the transformers library, which does not work well with vLLM right now.
Do you have any suggestions on the FastAPI solution? I would really appreciate any comments. Thank you a lot
@NikiBase, that's odd; let me try running this myself. Is that URL pingable by me?
for context: i have it working with ollama
and can you confirm what version of BAML you're using?
Sorry, that URL is not available. The problem is that the model is not available in Ollama. I am using BAML 0.56.1.
Hmm, an idea: does the raw cURL request work for you? (You should be able to pull out the exact cURL request being made under the hood from the playground.) See the [ ] Raw Curl button right above the prompt preview.
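As a related sanity check, a minimal sketch under the assumption that the FastAPI server exposes an OpenAI-style /v1/chat/completions route: call it directly from Python, outside the playground webview, to see whether the "Failed to fetch" comes from the server itself or from the playground (e.g. CORS). The URL and payload below are placeholders.

# Hypothetical direct request to the FastAPI endpoint, bypassing the BAML playground.
import requests

resp = requests.post(
    "<url_to_fastapi_endpoint>/chat/completions",  # replace with the real endpoint
    json={
        "model": "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4",
        "messages": [{"role": "user", "content": "Describe this image."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.status_code)
print(resp.text)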
@NikiBase llama.cpp supports Qwen models and has an OpenAI Compatible HTTP Server.
There is also a FastAPI server for llama.cpp. You can run the model on that server and create an openai-generic client for BAML.
Hi @hellovai @Sidd065, thank you for the recommendations. I will check them and write back when I have something working.
I was trying to run qwen2-vl-7b-instruct using llama.cpp, but it seems it is not supported yet: https://github.com/ggerganov/llama.cpp/issues/9246
I will try another model until then. Do you have any suggestions for a lightweight LMM that performs well at extracting data from receipts? @hellovai @Sidd065 Thank you!
In the end, I used this model https://huggingface.co/openbmb/MiniCPM-V-2_6 served with vllm serve and connected it with BAML. However, I had some troubles that I would like to list here, in case anyone else runs into them (a quick sanity-check sketch for the vLLM server follows the list):
- Serving the model: vllm serve "openbmb/MiniCPM-V-2_6" --trust-remote-code --dtype auto --api-key token-abc123
- client.baml: the new client, following the docs
- main.baml (in case you are following this example here; in case you are not, search for generator.baml). This happened to me in the 0.56.1 version.
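A minimal sketch, assuming vLLM's default OpenAI-compatible server at http://localhost:8000/v1 and the --api-key from the command above; the image URL and prompt are placeholders. It can be useful to confirm the server responds before pointing the BAML client at it.

# Hypothetical sanity check against the vLLM OpenAI-compatible server started above.
# Adjust the base_url and api_key for your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-2_6",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/receipt.jpg"}},
                {"type": "text", "text": "Extract the total amount from this receipt."},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)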
@NikiBase you may want to try ramalama with a GGUF variant of Qwen:
ramalama --nocontainer serve 'huggingface://Qwen/Qwen2-7B-Instruct-GGUF/qwen2-7b-instruct-q2_k.gguf'
(This is a small quant; you probably want to use q6: https://huggingface.co/Qwen/Qwen2-7B-Instruct-GGUF)
Hello,
First of all, thank you for building this amazing tool! I was wondering whether there is any chance of integrating open-source LMMs, for example https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct, which is not currently supported by the ollama provider. Or whether there is a way to self-host instead of calling an external provider.
Thank you again!