OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Apache License 2.0

GGUF versions don't seem to run on llama.cpp (through LocalAI) #114

Closed: naifmeh closed this issue 2 months ago

naifmeh commented 3 months ago

First of all, thank you for your impressive work! I've found that your model fares better than the latest LLaVA (13B) on some of my tasks. I've tried running the GGUF version of MiniCPM-V 2.0 on LocalAI v2.15.0 using the llama.cpp backend, but it can't seem to load the CLIP model. I've made sure to include both the mmproj and the model files.

The loading fails with the following log lines:

8:31PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr key clip.vision.image_grid_pinpoints not found in file
8:31PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr key clip.vision.mm_patch_merge_type not found in file
8:31PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr key clip.vision.image_crop_resolution not found in file
8:31PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: failed to load vision model tensors

I'm attempting to run it on an RTX 3080 with 10GB of VRAM, and I've tried both the Q8 and f16 versions along with the mmproj from here: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF
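
For reference, a minimal LocalAI model definition pairing the two files looks roughly like this (a sketch only; the keys follow LocalAI's YAML model config format and the values are assumptions based on the configuration dump below):

```bash
# Sketch of a LocalAI model definition pairing the LLM GGUF with its mmproj file.
# Field names follow LocalAI's YAML model config; file names are assumptions.
cat > models/minicpm.yaml <<'EOF'
name: minicpm
backend: llama-cpp
parameters:
  model: minicpm-v2-f16.gguf
mmproj: minicpm-mmproj.gguf
EOF
```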

Please find the complete log below:

LocalAI (llama.cpp backend) logs ```bash 8:30PM DBG Request received: {"model":"minicpm","language":"","n":0,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_pena lty":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale": 0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","response_format":{},"size":"","prompt":null,"instruction":"","input":null,"stop":null,"messages":[{"role":"user","co ntent":[{"text":"List all the elements that you see. Do not repeat yourself.","type":"text"},{"image_url":{"url":"https://img.leboncoin.fr/api/v1/lbcpb1/images/82/03/15/8203153649130fb8 a70f4f49986280025bb71044.jpg?rule=ad-large"},"type":"image_url"}]}],"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"ba ckend":"","model_base_name":""} 8:30PM DBG Configuration read: &{PredictionOptions:{Model:minicpm-v2-f16.gguf Language: N:0 TopP:0xc0000e8028 TopK:0xc0000e8020 Temperature:0xc00040e408 Maxtokens:0xc0000e8098 Echo:fals e Batch:0 IgnoreEOS:false RepeatPenalty:1.05 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0000e8100 TypicalP:0xc0000e80f8 Seed:0xc0000e8120 NegativePrompt: RopeFreqBase:0 RopeFreq Scale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:minicpm F16:0xc0000e80c0 Threads:0xc00040e3c0 Debug:0xc0000e8840 Roles:map[assistant:ASSISTANT: system:S YSTEM: user:USER:] Embeddings:false Backend:llama-cpp TemplateConfig:{Chat:A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed , and polite answers to the human's questions. 
{{.Input}} ASSISTANT: ChatMessage: Completion: Edit: Functions: UseTokenizerTemplate:false} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{Disabl eNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false NoGrammar:false ResponseRegex:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNo rmEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0000e80f0 MirostatTAU:0xc0000e80e8 Mirostat:0xc0000e80e0 NGPULayers:0xc00040e3c8 MMap:0xc00040e40 0 MMlock:0xc0000e8119 LowVRAM:0xc0000e8119 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0000e80b0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMat Q:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj:minicpm-mmproj.gguf Rope Scaling:1 32000 ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false Pi pelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} C UDA:false DownloadFiles:[] Description: Usage:} 8:30PM DBG Parameters: &{PredictionOptions:{Model:minicpm-v2-f16.gguf Language: N:0 TopP:0xc0000e8028 TopK:0xc0000e8020 Temperature:0xc00040e408 Maxtokens:0xc0000e8098 Echo:false Batch: 0 IgnoreEOS:false RepeatPenalty:1.05 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0000e8100 TypicalP:0xc0000e80f8 Seed:0xc0000e8120 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:minicpm F16:0xc0000e80c0 Threads:0xc00040e3c0 Debug:0xc0000e8840 Roles:map[assistant:ASSISTANT: system:SYSTEM: u ser:USER:] Embeddings:false Backend:llama-cpp TemplateConfig:{Chat:A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and po lite answers to the human's questions. 
{{.Input}} ASSISTANT: ChatMessage: Completion: Edit: Functions: UseTokenizerTemplate:false} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{Disabl eNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false NoGrammar:false ResponseRegex:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNo rmEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0000e80f0 MirostatTAU:0xc0000e80e8 Mirostat:0xc0000e80e0 NGPULayers:0xc00040e3c8 MMap:0xc00040e40 0 MMlock:0xc0000e8119 LowVRAM:0xc0000e8119 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0000e80b0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMat Q:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj:minicpm-mmproj.gguf Rope Scaling:1 32000 ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false Pi pelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} C UDA:false DownloadFiles:[] Description: Usage:} 8:30PM DBG Prompt (before templating): USER:[img-0]List all the elements that you see. Do not repeat yourself. 8:30PM DBG Template found, input modified to: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the h uman's questions. USER:[img-0]List all the elements that you see. Do not repeat yourself. ASSISTANT: 8:30PM DBG Prompt (after templating): A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's q uestions. USER:[img-0]List all the elements that you see. Do not repeat yourself. 
ASSISTANT: 8:30PM INF Loading model 'minicpm-v2-f16.gguf' with backend llama-cpp 8:30PM DBG Stopping all backends except 'minicpm-v2-f16.gguf' 8:30PM DBG Loading model in memory from file: /models/minicpm-v2-f16.gguf 8:30PM DBG Loading Model minicpm-v2-f16.gguf with gRPC (file: /models/minicpm-v2-f16.gguf) (backend: llama-cpp): {backendString:llama-cpp model:minicpm-v2-f16.gguf threads:11 assetDir:/ tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc00019b800 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui: /build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingfa ce-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh parler-tts:/build/backend/python/parler-tts/run.sh petals:/build/backend/python/ petals/run.sh rerankers:/build/backend/python/rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run .sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcA ttemptsDelay:2 singleActiveBackend:true parallelRequests:false} 8:30PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp 8:30PM DBG GRPC Service for minicpm-v2-f16.gguf will be running at: '127.0.0.1:33079' 8:30PM DBG GRPC Service state dir: /tmp/go-processmanager4275177599 8:30PM DBG GRPC Service Started 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stdout Server listening on 127.0.0.1:33079 8:30PM DBG GRPC Service Ready 8:30PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:minicpm-v2-f16.gguf Co ntextSize:512 Seed:1552041902 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:30 MainGPU: TensorSplit: Threads:11 L ibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/minicpm-v2-f16.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: Sc hedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Qu antization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj:minicpm-mmproj.gguf RopeScaling:1 32000 YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:} 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stdout {"timestamp":1716323441,"level":"INFO","function":"load_model","line":449,"message":"Multi Modal Mode Enabled"} 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: description: image encoder for LLaVA 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: GGUF version: 3 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: alignment: 32 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: n_tensors: 440 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: n_kv: 18 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: ftype: f16 8:30PM DBG 
GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from /models/minicpm-mmproj.gguf 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 0: general.architecture str = clip 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 1: clip.has_text_encoder bool = false 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 2: clip.has_vision_encoder bool = true 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 3: clip.has_llava_projector bool = true 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 4: general.file_type u32 = 1 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 5: general.description str = image encoder for LLaVA 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 6: clip.projector_type str = resampler 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 7: clip.vision.image_size u32 = 448 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 8: clip.vision.patch_size u32 = 14 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1152 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4304 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 11: clip.vision.projection_dim u32 = 0 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000001 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 14: clip.vision.block_count u32 = 26 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000] 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000] 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - kv 17: clip.use_gelu bool = true 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - type f32: 277 tensors 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: - type f16: 163 tensors 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr ggml_cuda_init: found 1 CUDA devices: 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: CLIP using CUDA backend 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: text_encoder: 0 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr 
clip_model_load: vision_encoder: 1 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: llava_projector: 1 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: model size: 828.18 MB 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: metadata size: 0.17 MB 8:30PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: params backend buffer size = 828.18 MB (440 tensors) 8:31PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr key clip.vision.image_grid_pinpoints not found in file 8:31PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr key clip.vision.mm_patch_merge_type not found in file 8:31PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr key clip.vision.image_crop_resolution not found in file 8:31PM DBG GRPC(minicpm-v2-f16.gguf-127.0.0.1:33079): stderr clip_model_load: failed to load vision model tensors 8:31PM ERR Server error error="could not load model: rpc error: code = Unknown desc = Unexpected error in RPC handling" ```

I'm not sure what might be causing the loading to fail.

Thank you!

naifmeh commented 3 months ago

I tried running the model directly through llama.cpp and the logs are clearer on what might be causing the error:

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 363, got 362

According to this similar issue, it seems to be related to the quantization script, but I might be wrong.

iceflame89 commented 3 months ago

Our modified llama.cpp has not been merged into the official llama.cpp yet; please try this PR.

Cuiunbo commented 3 months ago

Answering complete. If you have more questions, please continue to ask!

Cuiunbo commented 3 months ago

MiniCPM-Llama3-V 2.5 can run with llama.cpp now! See our fork of llama.cpp for more details.

And here is our model in GGUF format: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf
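
A minimal sketch of building the fork and running the GGUF files (the branch name, build flag, and local file names are assumptions; adjust them to the PR and the files you download):

```bash
# Sketch: build the OpenBMB llama.cpp fork and run the GGUF model with its mmproj.
# Branch name, build flag, and local file names are assumptions.
git clone https://github.com/OpenBMB/llama.cpp
cd llama.cpp
git checkout minicpm-v2.5          # branch name assumed; use the branch from the PR
make LLAMA_CUDA=1                  # or plain `make` for a CPU-only build

./minicpmv-cli \
  --model ../models/minicpm25-q4km.gguf \
  --mmproj ../models/mmproj-minicpm25.gguf \
  --image test.jpg \
  -c 4096 --temp 0.7 \
  -p "Describe this image." \
  --n-gpu-layers 40
```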

@naifmeh

naifmeh commented 3 months ago

@Cuiunbo Awesome, it's looking great! Thanks :)

I had an error when running make:

examples/minicpmv/minicpmv.cpp: In function ‘std::pair<int, int> get_refine_size(std::pair<int, int>, std::pair<int, int>, int, int, bool)’:
examples/minicpmv/minicpmv.cpp:395:59: error: could not convert ‘std::make_tuple(_Elements&& ...) [with _Elements = {int&, int&}](grid_height)’ from ‘std::tuple<int, int>’ to ‘std::pair<int, int>’
  395 |     auto best_grid_size = find_best_resize(std::make_tuple(grid_width, grid_height), scale_resolution, patch_size, allow_upscale);
      |                                            ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                           |
      |                                                           std::tuple<int, int>
examples/minicpmv/minicpmv.cpp:400:54: error: conversion from ‘std::tuple<int, int>’ to non-scalar type ‘std::pair<int, int>’ requested
  400 |     std::pair<int, int> refine_size = std::make_tuple(best_grid_width * grid_x, best_grid_height * grid_y);
      |                                       ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
examples/minicpmv/minicpmv.cpp: At global scope:

What fixed it for me was to explicitly convert the std::tuple to the expected std::pair on the lines where this happens.

I've also tried running the available GGUF version; it seems to run correctly, but the output is wildly different from the int4 version of the model run through the transformers library. From what I understand, Q4_K_M is supposed to be comparable in precision to an int4 version of a model, right?

In my case, the same prompts result in two very different responses from the model, and the comparison is always in favor of the int4 version.
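
One way to rule out the quantization level itself would be to re-quantize the f16 GGUF at higher precision with llama.cpp's quantize tool and compare against Q4_K_M; a minimal sketch, with placeholder file names:

```bash
# Sketch: produce higher-precision quantizations from the f16 GGUF for comparison.
# File names are placeholders; the `quantize` binary is built alongside the examples.
./quantize ../models/minicpm25-f16.gguf ../models/minicpm25-q6k.gguf Q6_K
./quantize ../models/minicpm25-f16.gguf ../models/minicpm25-q8.gguf Q8_0
```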

harvestingmoon commented 3 months ago

Hi naifmeh, to fix the above build error you can edit the file minicpmv.cpp, which is located under examples/minicpmv.

There, change line 395 to:

    auto best_grid_size = find_best_resize(std::make_pair(grid_width, grid_height), scale_resolution, patch_size, allow_upscale); // (new line) => fixes conversion for make_tuple to make_pair

And change line 400 to:

    std::pair<int, int> refine_size = std::make_pair(best_grid_width * grid_x, best_grid_height * grid_y); 

I have also created a pull request for this change under the llama-cpp repo. @Cuiunbo

Cuiunbo commented 3 months ago

Thanks a lot for the feedback. We also found a difference between the llama.cpp and int4 versions and are trying to track down the problem. @naifmeh

Cuiunbo commented 3 months ago

Hi naifmeh, to fix the above code you can do this on the file minicpmv.cpp which is located under examples/minicpmv

There what you can do is change the line on 395 to

    auto best_grid_size = find_best_resize(std::make_pair(grid_width, grid_height), scale_resolution, patch_size, allow_upscale); // (new line) => fixes conversion for make_tuple to make_pair

As well as change line 400 to

    std::pair<int, int> refine_size = std::make_pair(best_grid_width * grid_x, best_grid_height * grid_y); 

I have also created a pull request for this change under the llama-cpp repo. @Cuiunbo

@harvestingmoon Thanks! Are you talking about the official repository or our fork?

tc-mb commented 3 months ago

@Cuiunbo Awesome, it's looking great! Thanks :)

I had an error when running make:

examples/minicpmv/minicpmv.cpp: In function ‘std::pair<int, int> get_refine_size(std::pair<int, int>, std::pair<int, int>, int, int, bool)’:
examples/minicpmv/minicpmv.cpp:395:59: error: could not convert ‘std::make_tuple(_Elements&& ...) [with _Elements = {int&, int&}](grid_height)’ from ‘std::tuple<int, int>’ to ‘std::pair<int, int>’
  395 |     auto best_grid_size = find_best_resize(std::make_tuple(grid_width, grid_height), scale_resolution, patch_size, allow_upscale);
      |                                            ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                           |
      |                                                           std::tuple<int, int>
examples/minicpmv/minicpmv.cpp:400:54: error: conversion from ‘std::tuple<int, int>’ to non-scalar type ‘std::pair<int, int>’ requested
  400 |     std::pair<int, int> refine_size = std::make_tuple(best_grid_width * grid_x, best_grid_height * grid_y);
      |                                       ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
examples/minicpmv/minicpmv.cpp: At global scope:

What fixed it for me was to explicitely convert the std::tuple to the expected std::pair for the lines where this happens.

I've also tried running the available GGUF version, it seems to run correctly but the output is wildly different from the int4 version of the model that runs through the transformers library. From what I understand, Q4_K_M is supposed to be comparable in precision to an int4 version of a model, right?

In my case, the same prompts results in two very differents responses from the model, and it is always in favor of the int4 version.

std::tuple is a C++11 feature; it's better to replace it with std::pair here.

tc-mb commented 3 months ago

Hi naifmeh, to fix the above code you can do this on the file minicpmv.cpp which is located under examples/minicpmv

There what you can do is change the line on 395 to

    auto best_grid_size = find_best_resize(std::make_pair(grid_width, grid_height), scale_resolution, patch_size, allow_upscale); // (new line) => fixes conversion for make_tuple to make_pair

As well as change line 400 to

    std::pair<int, int> refine_size = std::make_pair(best_grid_width * grid_x, best_grid_height * grid_y); 

I have also created a pull request for this change under the llama-cpp repo. @Cuiunbo

cool, merged. ^_^

tc-mb commented 3 months ago

@Cuiunbo Awesome, it's looking great! Thanks :)

I had an error when running make:

examples/minicpmv/minicpmv.cpp: In function ‘std::pair<int, int> get_refine_size(std::pair<int, int>, std::pair<int, int>, int, int, bool)’:
examples/minicpmv/minicpmv.cpp:395:59: error: could not convert ‘std::make_tuple(_Elements&& ...) [with _Elements = {int&, int&}](grid_height)’ from ‘std::tuple<int, int>’ to ‘std::pair<int, int>’
  395 |     auto best_grid_size = find_best_resize(std::make_tuple(grid_width, grid_height), scale_resolution, patch_size, allow_upscale);
      |                                            ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                           |
      |                                                           std::tuple<int, int>
examples/minicpmv/minicpmv.cpp:400:54: error: conversion from ‘std::tuple<int, int>’ to non-scalar type ‘std::pair<int, int>’ requested
  400 |     std::pair<int, int> refine_size = std::make_tuple(best_grid_width * grid_x, best_grid_height * grid_y);
      |                                       ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
examples/minicpmv/minicpmv.cpp: At global scope:

What fixed it for me was to explicitely convert the std::tuple to the expected std::pair for the lines where this happens.

I've also tried running the available GGUF version, it seems to run correctly but the output is wildly different from the int4 version of the model that runs through the transformers library. From what I understand, Q4_K_M is supposed to be comparable in precision to an int4 version of a model, right?

In my case, the same prompts results in two very differents responses from the model, and it is always in favor of the int4 version.

Could you send me one or two cases so I can check the accuracy difference you mentioned?

naifmeh commented 3 months ago

@Cuiunbo Awesome, it's looking great! Thanks :) I had an error when running make:

examples/minicpmv/minicpmv.cpp: In function ‘std::pair<int, int> get_refine_size(std::pair<int, int>, std::pair<int, int>, int, int, bool)’:
examples/minicpmv/minicpmv.cpp:395:59: error: could not convert ‘std::make_tuple(_Elements&& ...) [with _Elements = {int&, int&}](grid_height)’ from ‘std::tuple<int, int>’ to ‘std::pair<int, int>’
  395 |     auto best_grid_size = find_best_resize(std::make_tuple(grid_width, grid_height), scale_resolution, patch_size, allow_upscale);
      |                                            ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                           |
      |                                                           std::tuple<int, int>
examples/minicpmv/minicpmv.cpp:400:54: error: conversion from ‘std::tuple<int, int>’ to non-scalar type ‘std::pair<int, int>’ requested
  400 |     std::pair<int, int> refine_size = std::make_tuple(best_grid_width * grid_x, best_grid_height * grid_y);
      |                                       ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
examples/minicpmv/minicpmv.cpp: At global scope:

What fixed it for me was to explicitely convert the std::tuple to the expected std::pair for the lines where this happens. I've also tried running the available GGUF version, it seems to run correctly but the output is wildly different from the int4 version of the model that runs through the transformers library. From what I understand, Q4_K_M is supposed to be comparable in precision to an int4 version of a model, right? In my case, the same prompts results in two very differents responses from the model, and it is always in favor of the int4 version.

Could you send me one or two case to check the accuracy difference you said?

Sure!

First example

This one is a simple screenshot of Amazon that I've used to test the OCR capabilities: https://ibb.co/q9FB0kX

Here is my prompt and the output with the Q4_K_M version:

$ ./minicpmv-cli --model ../models/minicpm25-q4km.gguf --mmproj ../models/mmproj-minicpm25.gguf --image test_img3.jpg -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 -p "List the items that are included in the dropdown menu." --n-gpu-layers 40
<user>List the items that are included in the dropdown menu.
<assistant>
The dropdown menu includes items such as "Salle de bain et douche", "Cuisine et alimentation", and "Ménage et bricolage".

Here is the output for the same prompt with the transformers library and the int4 version:

The dropdown menu includes the following items:

1. Air fryer
2. Air fryer Philips
3. Air fryer Moulinex
4. Air fryer Cotelec
5. Accessories
6. Air fryer Philips XXL
7. Air fryer Philips XL
8. Air fryer 2 compartments
9. Air fryer 8L

These items seem to be related to kitchen appliances, specifically air fryers from various brands and models.

Second example

The image is a stock picture of a living room, taken from here

I'm asking the model to describe the house equipment present in the picture.

With the Q4_K_M quantization:

$ ./minicpmv-cli --model ../models/minicpm25-q4km.gguf --mmproj ../models/mmproj-minicpm25.gguf --image test_img4.jpg -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 -p "List all the house elements that are present in this picture." --n-gpu-layers 40

<user>List all the house elements that are present in this picture.
<assistant>
The image displays an interior space that appears to be a living room or lounge area within a home. Key elements include a wooden staircase with a glass balustrade leading to an upper level, which suggests a multi-storey dwelling. There's a plush seating area featuring a beige sofa adorned with throw pillows and a coffee table, indicating a space for relaxation or socializing. A bookshelf filled with books indicates a leisure or study zone. The presence of a fireplace suggests a central gathering spot for warmth and ambiance. The image captures these elements from various angles, providing a comprehensive view of the interior design and layout.

This example generates an answer that is closer to the int4 version of the model but the first part of the answer is still off.

The output with the INT4 version:

The elements of the house present in the picture include a living room, ceiling with wooden beams, chandeliers, a fireplace, wall-mounted television, built-in bookshelves, sofa, armchairs, coffee table, side tables, lamps, decorative items, and windows.

Cuiunbo commented 3 months ago

@naifmeh We have now solved this problem. Please try it; looking forward to your feedback!

naifmeh commented 3 months ago

@naifmeh now we have solved this problem, please try it, looking forward to your feedback!

Thanks @Cuiunbo, I will try asap and get back to you.

naifmeh commented 3 months ago

@Cuiunbo

I tried with the same images and prompts, using the files included in commit 6366d62d6b16b4376d976399cf402eb40829ebb1. I also downloaded the latest GGUF model (Q4_K_M).
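
For reference, pinning the fork to that commit looks roughly like this (the remote layout and build flag are assumptions):

```bash
# Sketch: check out the exact commit tested here and rebuild (remote/flag assumed).
cd llama.cpp
git fetch origin
git checkout 6366d62d6b16b4376d976399cf402eb40829ebb1
make LLAMA_CUDA=1
```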

For the first prompt, which tested the OCR capabilities of the quantized model, this is what I get when I run the command a few times:

$ ./minicpmv-cli --model ../LocalAI/models/minicpm25-q4km_new.gguf --mmproj ../LocalAI/models/mmproj-minicpm25.gguf --image test_img3.jpg -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 -p "List the items that are included in the dropdown menu." --n-gpu-layers 40

<user>List the items that are included in the dropdown menu.
<assistant>
The dropdown menu includes items such as air fryer, coffee maker, microwave, iron, hair dryer, steam cleaner, vacuum cleaner, and kitchen utensils.
<user>List the items that are included in the dropdown menu.
<assistant>
The dropdown menu includes a variety of items such as air fryer accessories, air fryers, air purifiers, aquariums, bed linen, bicycles, books, garden tools, kitchen utensils, kitchenware, lamps, laptops, mirrors, pet products, photo frames, power tools, projectors, sewing machines, smart home devices, tablets, televisions, toys, and vacuum cleaners.
<user>List the items that are included in the dropdown menu.
<assistant>
The dropdown menu includes various categories such as 'Articles de toilette', 'Cadeaux d'entreprise', 'Camping et randonnée', 'Chaînes TV', 'Chromotherapie', 'Conso en ligne', 'Cordage et nœuds', 'Création artisanale', 'Cuisine et repas', 'Découvrir Amazon', 'Divers et variés', 'Droguerie', 'Electronique grand public', 'Enfants, bébés, maternité', 'Équipement de bureau', 'Équipements de cuisine', 'Étiquettes et étiquettes', 'Ferrailles et outils', 'Gardening', 'Gourmandise', 'Gros lots', 'Histoire et actualité', 'Hobby & Sortir', 'Hygiène et beauté', 'Informatique, telecom', 'Jardin', 'Jeux et jouets', 'Journées fériées', 'Livres, BD, DVD', 'Maison et jardin', 'Meubles et décoration', 'Médecine et soins personnels', 'Menuiserie, ébénisterie', '

I see no major changes to the output when I play with the temperature. It's also not much better when running the model entirely on the CPU. Sometimes the output is just blank.

The output for the second image and prompt is better than it was, though:

<user>List all the house elements that are present in this picture.
<assistant>
The image shows a living room with several elements such as a fireplace, a sofa, a coffee table, a rug, a light fixture, a picture frame, and decorative items like a lamp, books, and a throw pillow.

I've also tried the first prompt/image with the Q6_K version, with similar results. Something else I noted with the Q6_K model is that it is particularly slow for this specific prompt/image. When I run the model on the second prompt/image, it runs quickly and returns a consistent output. Edit: the slowness was because I had forgotten to offload the layers to the GPU.

Cuiunbo commented 3 months ago

@tc-mb Have a look.