Closed naifmeh closed 2 months ago
I tried running the model directly through llama.cpp and the logs are clearer on what might be causing the error:
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 363, got 362
According to this similar issue, it seems to be related to the quantization script, but I might be wrong.
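The error above compares two tensor counts. One way to check the count a file itself declares is to read its header. The following is a minimal sketch, assuming the GGUF v2/v3 header layout (4-byte magic `GGUF`, a 32-bit version, a 64-bit tensor count, a 64-bit metadata key/value count, all little-endian) and a little-endian host. `GgufHeader` and `read_gguf_header` are hypothetical names for illustration, not part of llama.cpp.

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>

// Fixed-size prefix of a GGUF file (assumed v2/v3 layout, little-endian).
struct GgufHeader {
    std::uint32_t version;
    std::uint64_t tensor_count;
    std::uint64_t metadata_kv_count;
};

// Hypothetical helper (not a llama.cpp API): fills `out` and returns true,
// or returns false if the file cannot be read or lacks the "GGUF" magic.
// Assumes the host is little-endian, like the file format.
bool read_gguf_header(const std::string& path, GgufHeader& out) {
    std::ifstream in(path, std::ios::binary);
    char raw[24];
    if (!in.read(raw, sizeof(raw))) return false;
    if (std::memcmp(raw, "GGUF", 4) != 0) return false;
    // Copy each field out of the raw bytes at its assumed offset.
    std::memcpy(&out.version, raw + 4, sizeof(out.version));
    std::memcpy(&out.tensor_count, raw + 8, sizeof(out.tensor_count));
    std::memcpy(&out.metadata_kv_count, raw + 16, sizeof(out.metadata_kv_count));
    return true;
}
```

Calling `read_gguf_header("model.gguf", header)` and printing `header.tensor_count` would show which of the two numbers in the log comes from the file, which may help narrow down whether the conversion script or the loader is off by one.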
Our modified llama.cpp has not been merged into the official llama.cpp yet. Please try this PR.
Answering complete. If you have more questions, please continue to ask!
MiniCPM-Llama3-V 2.5 can run with llama.cpp now! See our fork of llama.cpp for more details.
Our model in GGUF format is available here: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf
@naifmeh
@Cuiunbo Awesome, it's looking great! Thanks :)
I had an error when running make:
examples/minicpmv/minicpmv.cpp: In function ‘std::pair<int, int> get_refine_size(std::pair<int, int>, std::pair<int, int>, int, int, bool)’:
examples/minicpmv/minicpmv.cpp:395:59: error: could not convert ‘std::make_tuple(_Elements&& ...) [with _Elements = {int&, int&}](grid_height)’ from ‘std::tuple<int, int>’ to ‘std::pair<int, int>’
395 | auto best_grid_size = find_best_resize(std::make_tuple(grid_width, grid_height), scale_resolution, patch_size, allow_upscale);
| ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
| |
| std::tuple<int, int>
examples/minicpmv/minicpmv.cpp:400:54: error: conversion from ‘std::tuple<int, int>’ to non-scalar type ‘std::pair<int, int>’ requested
400 | std::pair<int, int> refine_size = std::make_tuple(best_grid_width * grid_x, best_grid_height * grid_y);
| ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
examples/minicpmv/minicpmv.cpp: At global scope:
What fixed it for me was to explicitly convert the std::tuple to the expected std::pair on the lines where this happens.
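An explicit conversion of this kind could look like the sketch below. `to_pair` and `area` are hypothetical names used only for illustration; `area` stands in for any function that, like `find_best_resize` in the error log, expects a `std::pair<int, int>`.

```cpp
#include <cassert>
#include <tuple>
#include <utility>

// Hypothetical helper: builds a std::pair from a two-element std::tuple.
template <typename A, typename B>
std::pair<A, B> to_pair(const std::tuple<A, B>& t) {
    return std::pair<A, B>(std::get<0>(t), std::get<1>(t));
}

// Stand-in for a function that takes a std::pair<int, int> argument.
int area(std::pair<int, int> size) {
    return size.first * size.second;
}

// Small self-check showing the intended use at a call site.
inline void to_pair_self_check() {
    int grid_width = 3, grid_height = 4;
    assert(area(to_pair(std::make_tuple(grid_width, grid_height))) == 12);
}
```

Building the pair directly with `std::make_pair` avoids the helper entirely, so the conversion is mainly useful when the tuple comes from code that cannot be changed.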
I've also tried running the available GGUF version. It seems to run correctly, but the output is wildly different from the int4 version of the model that runs through the transformers library. From what I understand, Q4_K_M is supposed to be comparable in precision to an int4 version of a model, right?
In my case, the same prompts result in two very different responses from the model, and the int4 version always gives the better one.
Hi naifmeh, to fix the above errors, edit the file minicpmv.cpp, which is located under examples/minicpmv.
Change line 395 to:
auto best_grid_size = find_best_resize(std::make_pair(grid_width, grid_height), scale_resolution, patch_size, allow_upscale); // make_pair instead of make_tuple, so the argument matches std::pair<int, int>
and change line 400 to:
std::pair<int, int> refine_size = std::make_pair(best_grid_width * grid_x, best_grid_height * grid_y);
I have also created a pull request for this change under the llama-cpp repo. @Cuiunbo
Thanks a lot for the feedback. We also found a difference between the llama.cpp and int4 versions, and we are trying to find the problem. @naifmeh
@harvestingmoon Thanks. Are you talking about the official repository or our fork?
std::tuple is supported since C++11, but it's better to replace it with std::pair here.
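Switching to `std::pair` does not give up tuple-style unpacking at call sites, because `std::tie` accepts assignment from a pair. The sketch below illustrates this; `scaled_size` is a hypothetical stand-in for a helper that returns a width and height, not a function from minicpmv.cpp.

```cpp
#include <cassert>
#include <tuple>
#include <utility>

// Hypothetical stand-in for a helper that returns a (width, height) pair.
std::pair<int, int> scaled_size(int width, int height, int factor) {
    return std::make_pair(width * factor, height * factor);
}

// std::tie builds a tuple of references; assigning a std::pair to it
// unpacks both members into the referenced variables.
inline void unpack_pair_example() {
    int width = 0, height = 0;
    std::tie(width, height) = scaled_size(3, 4, 2);
    assert(width == 6 && height == 8);
}
```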
cool, merged. ^_^
Could you send me one or two cases so I can check the accuracy difference you mentioned?
Sure!
This one includes a simple screenshot of Amazon that I've used to test the OCR capabilities: https://ibb.co/q9FB0kX Here is my prompt and the output with the Q4_K_M version:
$ ./minicpmv-cli --model ../models/minicpm25-q4km.gguf --mmproj ../models/mmproj-minicpm25.gguf --image test_img3.jpg -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 -p "List the items that are included in the dropdown menu." --n-gpu-layers 40
<user>List the items that are included in the dropdown menu.
<assistant>
The dropdown menu includes items such as "Salle de bain et douche", "Cuisine et alimentation", and "Ménage et bricolage".
Here is the output for the same prompt with the transformers library and the int4 version:
The dropdown menu includes the following items:
1. Air fryer
2. Air fryer Philips
3. Air fryer Moulinex
4. Air fryer Cotelec
5. Accessories
6. Air fryer Philips XXL
7. Air fryer Philips XL
8. Air fryer 2 compartments
9. Air fryer 8L
These items seem to be related to kitchen appliances, specifically air fryers from various brands and models.
The image is a stock picture of a living room, taken from here
I'm asking the model to describe the house equipment that is present in the picture.
With the Q4_K_M quantization:
$ ./minicpmv-cli --model ../models/minicpm25-q4km.gguf --mmproj ../models/mmproj-minicpm25.gguf --image test_img4.jpg -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 -p "List all the house elements that are present in this picture." --n-gpu-layers 40
<user>List all the house elements that are present in this picture.
<assistant>
The image displays an interior space that appears to be a living room or lounge area within a home. Key elements include a wooden staircase with a glass balustrade leading to an upper level, which suggests a multi-storey dwelling. There's a plush seating area featuring a beige sofa adorned with throw pillows and a coffee table, indicating a space for relaxation or socializing. A bookshelf filled with books indicates a leisure or study zone. The presence of a fireplace suggests a central gathering spot for warmth and ambiance. The image captures these elements from various angles, providing a comprehensive view of the interior design and layout.
This example generates an answer that is closer to the int4 version of the model, but the first part of the answer is still off.
The output with the INT4 version:
The elements of the house present in the picture include a living room, ceiling with wooden beams, chandeliers, a fireplace, wall-mounted television, built-in bookshelves, sofa, armchairs, coffee table, side tables, lamps, decorative items, and windows.
@naifmeh We have now solved this problem. Please try it; we look forward to your feedback!
Thanks @Cuiunbo, I will try asap and get back to you.
@Cuiunbo
I tried again with the same images and prompts, using the files included in commit 6366d62d6b16b4376d976399cf402eb40829ebb1.
I also downloaded the latest GGUF model (Q4_K_M).
For the first prompt, which tested the OCR capabilities of the quantized model, this is what I get when I run the command a few times:
$ ./minicpmv-cli --model ../LocalAI/models/minicpm25-q4km_new.gguf --mmproj ../LocalAI/models/mmproj-minicpm25.gguf --image test_img3.jpg -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 -p "List the items that are included in the dropdown menu." --n-gpu-layers 40
<user>List the items that are included in the dropdown menu.
<assistant>
The dropdown menu includes items such as air fryer, coffee maker, microwave, iron, hair dryer, steam cleaner, vacuum cleaner, and kitchen utensils.
<user>List the items that are included in the dropdown menu.
<assistant>
The dropdown menu includes a variety of items such as air fryer accessories, air fryers, air purifiers, aquariums, bed linen, bicycles, books, garden tools, kitchen utensils, kitchenware, lamps, laptops, mirrors, pet products, photo frames, power tools, projectors, sewing machines, smart home devices, tablets, televisions, toys, and vacuum cleaners.
<user>List the items that are included in the dropdown menu.
<assistant>
The dropdown menu includes various categories such as 'Articles de toilette', 'Cadeaux d'entreprise', 'Camping et randonnée', 'Chaînes TV', 'Chromotherapie', 'Conso en ligne', 'Cordage et nœuds', 'Création artisanale', 'Cuisine et repas', 'Découvrir Amazon', 'Divers et variés', 'Droguerie', 'Electronique grand public', 'Enfants, bébés, maternité', 'Équipement de bureau', 'Équipements de cuisine', 'Étiquettes et étiquettes', 'Ferrailles et outils', 'Gardening', 'Gourmandise', 'Gros lots', 'Histoire et actualité', 'Hobby & Sortir', 'Hygiène et beauté', 'Informatique, telecom', 'Jardin', 'Jeux et jouets', 'Journées fériées', 'Livres, BD, DVD', 'Maison et jardin', 'Meubles et décoration', 'Médecine et soins personnels', 'Menuiserie, ébénisterie', '
I see no major changes to the output when I play with the temperature. It's also not much better when running the model entirely on the CPU. Sometimes the output is just blank.
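For context on what the temperature setting does: the runs above use sampling (--temp 0.7 with --top-p and --top-k), so some run-to-run variation is expected even with identical inputs. The sketch below shows the general idea of temperature-scaled softmax. It is an illustration of the concept only, not llama.cpp's sampler code, and `softmax_with_temperature` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Converts logits to probabilities after dividing them by a temperature.
// Lower temperatures concentrate probability on the highest logit;
// higher temperatures flatten the distribution.
std::vector<double> softmax_with_temperature(const std::vector<double>& logits,
                                             double temperature) {
    assert(!logits.empty() && temperature > 0.0);
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        // Subtract the max before exponentiating for numerical stability.
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;  // normalize so the values sum to 1
    return probs;
}
```

If the top candidates have similar logits, changing the temperature moderately will not make the output stable, which would be consistent with the varied answers above. Comparing backends with a fixed random seed, where the tool exposes one, makes such comparisons easier to interpret.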
The output for the second image and prompt is better than it was, though:
<user>List all the house elements that are present in this picture.
<assistant>
The image shows a living room with several elements such as a fireplace, a sofa, a coffee table, a rug, a light fixture, a picture frame, and decorative items like a lamp, books, and a throw pillow.
I've also tried the first prompt/image with the Q6_K version, with similar results. Something else I noted with the Q6_K model is that it is particularly slow for this specific prompt/image. When I run the model on the second prompt/image, it runs quickly and returns a consistent output. Edit: I forgot to offload the layers to the GPU
@tc-mb Have a look.
First of all, thank you for your impressive work! I've found that your model fares better than the latest LLAVA (13B) on some of my tasks. I've tried running the GGUF version of MiniCPM-V2.0 on LocalAI v2.15.0 using the llama.cpp backend but it can't seem to load the CLIP model. I've made sure to include both the mmproj and the model files.
The loading fails with the following log lines:
I'm attempting to run it on an RTX 3080 with 10GB of VRAM, and I've tried both the Q8 and the f16 versions along with the mmproj from here: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF
Please find the complete log below:
LocalAI (llama.cpp backend) logs
```bash
8:30PM DBG Request received: {"model":"minicpm","language":"","n":0,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","response_format":{},"size":"","prompt":null,"instruction":"","input":null,"stop":null,"messages":[{"role":"user","content":[{"text":"List all the elements that you see. Do not repeat yourself.","type":"text"},{"image_url":{"url":"https://img.leboncoin.fr/api/v1/lbcpb1/images/82/03/15/8203153649130fb8a70f4f49986280025bb71044.jpg?rule=ad-large"},"type":"image_url"}]}],"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"backend":"","model_base_name":""}
8:30PM DBG Configuration read: &{PredictionOptions:{Model:minicpm-v2-f16.gguf Language: N:0 TopP:0xc0000e8028 TopK:0xc0000e8020 Temperature:0xc00040e408 Maxtokens:0xc0000e8098 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:1.05 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0000e8100 TypicalP:0xc0000e80f8 Seed:0xc0000e8120 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:minicpm F16:0xc0000e80c0 Threads:0xc00040e3c0 Debug:0xc0000e8840 Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama-cpp TemplateConfig:{Chat:A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. {{.Input}} ASSISTANT: ChatMessage: Completion: Edit: Functions: UseTokenizerTemplate:false} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false NoGrammar:false ResponseRegex:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0000e80f0 MirostatTAU:0xc0000e80e8 Mirostat:0xc0000e80e0 NGPULayers:0xc00040e3c8 MMap:0xc00040e400 MMlock:0xc0000e8119 LowVRAM:0xc0000e8119 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0000e80b0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj:minicpm-mmproj.gguf RopeScaling:1 32000 ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}
8:30PM DBG Parameters: &{PredictionOptions:{Model:minicpm-v2-f16.gguf Language: N:0 TopP:0xc0000e8028 TopK:0xc0000e8020 Temperature:0xc00040e408 Maxtokens:0xc0000e8098 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:1.05 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0000e8100 TypicalP:0xc0000e80f8 Seed:0xc0000e8120 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:minicpm F16:0xc0000e80c0 Threads:0xc00040e3c0 Debug:0xc0000e8840 Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama-cpp TemplateConfig:{Chat:A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. {{.Input}} ASSISTANT: ChatMessage: Completion: Edit: Functions: UseTokenizerTemplate:false} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false NoGrammar:false ResponseRegex:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0000e80f0 MirostatTAU:0xc0000e80e8 Mirostat:0xc0000e80e0 NGPULayers:0xc00040e3c8 MMap:0xc00040e400 MMlock:0xc0000e8119 LowVRAM:0xc0000e8119 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0000e80b0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj:minicpm-mmproj.gguf RopeScaling:1 32000 ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}
8:30PM DBG Prompt (before templating): USER:[img-0]List all the elements that you see. Do not repeat yourself.
8:30PM DBG Template found, input modified to: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER:[img-0]List all the elements that you see. Do not repeat yourself. ASSISTANT:
8:30PM DBG Prompt (after templating): A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER:[img-0]List all the elements that you see. Do not repeat yourself. ASSISTANT:
8:30PM INF Loading model 'minicpm-v2-f16.gguf' with backend llama-cpp
8:30PM DBG Stopping all backends except 'minicpm-v2-f16.gguf'
8:30PM DBG Loading model in memory from file: /models/minicpm-v2-f16.gguf
8:30PM DBG Loading Model minicpm-v2-f16.gguf with gRPC (file: /models/minicpm-v2-f16.gguf) (backend: llama-cpp): {backendString:llama-cpp model:minicpm-v2-f16.gguf threads:11 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc00019b800 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh parler-tts:/build/backend/python/parler-tts/run.sh petals:/build/backend/python/petals/run.sh rerankers:/build/backend/python/rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false}
8:30PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp
8:30PM DBG GRPC Service for minicpm-v2-f16.gguf will be running at: '127.0.0.1:33079'
8:30PM DBG GRPC Service state dir: /tmp/go-processmanager4275177599
8:30PM DBG GRPC Service Started
8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stdout Server listening on 127.0.0.1:33079
8:30PM DBG GRPC Service Ready
8:30PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:
```
I'm not sure what might be causing the loading to fail.
Thank you!