OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Apache License 2.0

Is llama.cpp deployment supported? #16

Closed WangFengtu1996 closed 3 months ago

WangFengtu1996 commented 7 months ago
iceflame89 commented 7 months ago

We are working on it, please stay tuned.

CyberTimon commented 7 months ago

I'm interested in this feature too! Thanks

iceflame89 commented 5 months ago

coming soon!

CyberTimon commented 5 months ago

Nice! Looking forward to it :)

leeaction commented 5 months ago

Any update? I'm really looking forward to it.

zhengxingmao commented 5 months ago

+1

Achazwl commented 4 months ago

Created a PR: https://github.com/ggerganov/llama.cpp/pull/6919. I created a folder called "minicpmv" in the examples folder of llama.cpp. More details can be found in llama.cpp/examples/minicpmv/README.md.

leeaction commented 4 months ago

> Created a PR: ggerganov/llama.cpp#6919. I created a folder called "minicpmv" in the examples folder of llama.cpp.

Hi Achazwl, can you provide the minicpmv gguf Q4_K_M file? That would make it easy to deploy locally with ollama.
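For reference, importing a local gguf into ollama goes through a Modelfile; here is a minimal sketch, assuming the quantized file is named MiniCPM-V-2.Q4_K_M.gguf and using an arbitrary local tag:

```sh
# Minimal Modelfile pointing ollama at a local gguf (file name is an assumption)
cat > Modelfile <<'EOF'
FROM ./MiniCPM-V-2.Q4_K_M.gguf
EOF

# Register the model under a local tag, then run it
ollama create minicpm-v2 -f Modelfile
ollama run minicpm-v2
```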

Achazwl commented 4 months ago

> Hi Achazwl, can you provide the minicpmv gguf Q4_K_M file? That would make it easy to deploy locally with ollama.

Thanks mzwing for uploading: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF/tree/main
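For anyone following along, the files can be fetched with the Hugging Face CLI; a sketch, assuming huggingface-cli is installed and the exact file name matches the repo listing:

```sh
# Download one quantization from the repo above (exact file name is an assumption)
huggingface-cli download mzwing/MiniCPM-V-2-GGUF MiniCPM-V-2.Q4_K_M.gguf --local-dir .
```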

leeaction commented 4 months ago

> Thanks mzwing for uploading: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF/tree/main

Thanks for providing the MiniCPM-V-2-GGUF model, but when I import it into ollama with a Modelfile, the log shows that loading the model failed. The error message is below.

The main error is `done_getting_tensors: wrong number of tensors; expected 363, got 362`.

Any help resolving this issue would be appreciated.

Ollama version: 0.1.37

Logs:

```
llama_model_loader: loaded meta data with 23 key-value pairs and 363 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-e336d2d263931fce7f8006ea7e283911854458c4faf6efe79f8ebd8730be16e3 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv  0: general.architecture str = minicpm
llama_model_loader: - kv  1: general.name str = MiniCPM
llama_model_loader: - kv  2: minicpm.context_length u32 = 4096
llama_model_loader: - kv  3: minicpm.embedding_length u32 = 2304
llama_model_loader: - kv  4: minicpm.block_count u32 = 40
llama_model_loader: - kv  5: minicpm.feed_forward_length u32 = 5760
llama_model_loader: - kv  6: minicpm.rope.dimension_count u32 = 64
llama_model_loader: - kv  7: minicpm.attention.head_count u32 = 36
llama_model_loader: - kv  8: minicpm.attention.head_count_kv u32 = 36
llama_model_loader: - kv  9: minicpm.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: minicpm.tie_lm_head bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,122753] = ["", "", "", "", "<C...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,122753] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,122753] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type  f32:  81 tensors
llama_model_loader: - type q5_0:  20 tensors
llama_model_loader: - type q8_0:  20 tensors
llama_model_loader: - type q4_K: 221 tensors
llama_model_loader: - type q6_K:  21 tensors
time=2024-05-14T12:08:40.645+08:00 level=INFO source=server.go:524 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minicpm
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 122753
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 36
llm_load_print_meta: n_head_kv = 36
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2304
llm_load_print_meta: n_embd_v_gqa = 2304
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5760
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.01 B
llm_load_print_meta: model size = 1.82 GiB (5.21 BPW)
llm_load_print_meta: general.name = MiniCPM
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 0 ''
llm_load_print_meta: LF token = 1099 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size = 0.37 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 363, got 362
llama_load_model_from_file: exception loading model
terminate called after throwing an instance of 'std::runtime_error'
what(): done_getting_tensors: wrong number of tensors; expected 363, got 362
```

Achazwl commented 4 months ago

> Thanks for providing the MiniCPM-V-2-GGUF model, but when I import it into ollama with a Modelfile, the log shows that loading the model failed. The main error is `done_getting_tensors: wrong number of tensors; expected 363, got 362`.

My PR also updated a few lines of MiniCPM's code in the llama.cpp file itself; are those modifications synced?

https://github.com/ggerganov/llama.cpp/pull/6919/files#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348ef

Achazwl commented 4 months ago

@mzwing Would my latest commit https://github.com/ggerganov/llama.cpp/pull/6919/commits/a76fbcd05054e39e8be325c10320397775d42ac3, which tidies up the modifications outside the minicpmv folder, change the gguf file?
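One quick way to check is to re-run the conversion with the new commit and compare checksums against the published files; a minimal sketch (both file paths are assumptions):

```sh
# If the commit changed the conversion output, the hashes will differ
sha256sum old/MiniCPM-V-2.Q4_K_M.gguf new/MiniCPM-V-2.Q4_K_M.gguf
```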

mzwing commented 4 months ago

> @mzwing Would my latest commit ggerganov/llama.cpp@a76fbcd, which tidies up the modifications outside the minicpmv folder, change the gguf file?

Not tested yet.

I will update the Huggingface repo tonight if the gguf files are changed.

mzwing commented 4 months ago

@leeaction Are you sure that your ollama supports MiniCPM-V-2? This model may need a manual build with the PR applied.

leeaction commented 4 months ago

> @leeaction Are you sure that your ollama supports MiniCPM-V-2? This model may need a manual build with the PR applied.

Hmm, I imported the gguf file below into ollama. I thought these files were compiled with the PR, or are they not?

Or should I compile the ollama binary with the PR locally and use that?

👇🏻👇🏻👇🏻

> Thanks mzwing for uploading: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF/tree/main

mzwing commented 4 months ago

Yes, my quantized models were built with the PR.

> Or should I compile the ollama binary with the PR locally and use that?

Yes, you should.
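Since the PR targets llama.cpp, a manual build looks roughly like the following sketch, using GitHub's pull-request refs; the local branch name is arbitrary:

```sh
# Fetch and check out PR #6919 discussed in this thread
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git fetch origin pull/6919/head:minicpmv-pr
git checkout minicpmv-pr

# Build (llama.cpp at this point built with plain make)
make
```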

leeaction commented 4 months ago

> Yes, you should.

Thank you for the information.

mzwing commented 4 months ago

> I will update the Huggingface repo tonight if the gguf files are changed.

@Achazwl @leeaction I confirm that the gguf files are changed.

Uploading... Please wait for a while.

mzwing commented 4 months ago

> Uploading... Please wait for a while.

Uploaded. Please test.

@leeaction

Cuiunbo commented 3 months ago

MiniCPM-Llama3-V 2.5 can run with llama.cpp now! See our fork of llama.cpp for more details.

And here is our model in gguf format: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf
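A sketch of running the gguf files with the fork's CLI; the binary name minicpmv-cli appears later in this thread, while the exact paths and flags (which follow llava-cli conventions) are assumptions:

```sh
# Run the language gguf plus the vision projector on one image (paths are assumptions)
./minicpmv-cli \
    -m ./MiniCPM-Llama3-V-2_5/ggml-model-Q4_K_M.gguf \
    --mmproj ./MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf \
    --image ./demo.jpg \
    -p "Describe this image."
```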

CyberTimon commented 3 months ago

Can you please upload Q5_K_M if you have time? It has a good tradeoff between quality and size.

mzwing commented 3 months ago

> Can you please upload Q5_K_M if you have time? It has a good tradeoff between quality and size.

I will give it a try at noon (UTC+8).

I encountered some bugs while building OpenBMB/llama.cpp, and my RAM is not enough to convert the model into gguf.

Now that I have time to look into it again, I found that OpenBMB has already uploaded all sizes of the model, so it seems I no longer need to do it.

Sorry for the inconvenience I caused.

huisai commented 3 months ago

@Achazwl What is the runtime on a phone? encode_image_with_clip alone takes as long as 140s. Is that normal, and can it be optimized? The input image resolution is 512x512; the phone CPU is an 8gen3 with 12G of RAM. Here are my timings on the phone:

```
llama_print_timings:        load time = 146004.56 ms
llama_print_timings:      sample time =      6.43 ms /    83 runs   (    0.08 ms per token, 12900.22 tokens per second)
llama_print_timings: prompt eval time =  15484.59 ms /   238 tokens (   65.06 ms per token,    15.37 tokens per second)
llama_print_timings:        eval time =   7342.00 ms /    82 runs   (   89.54 ms per token,    11.17 tokens per second)
llama_print_timings:       total time = 153497.50 ms /   320 tokens
```

Achazwl commented 3 months ago

That is not normal. I have a few reference results for the CLIP time of a 448x448 image on an 8gen3:

  1. If you follow llama.cpp's approach of compiling on a computer and copying the binary to the phone, it takes about 40+s.
  2. If you compile llama.cpp directly in termux on the phone and run it there, it only takes about 10+s (see the build sketch below).
  3. The MiniCPM-V-2.5 version has been developed further on top of this PR in OpenBMB/llama.cpp, with substantial additional speedups for the image part on phones; see there for details.
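A sketch of option 2, the on-device build in Termux; package names follow Termux's pkg tool, and any acceleration flags (e.g. OpenCL/CLBlast) are left out:

```sh
# Inside Termux on the phone: install a toolchain and build natively
pkg install git clang make
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```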
mzwing commented 3 months ago

@huisai Did you enable CLBlast acceleration?

This timing is definitely abnormal, unless your phone is running some high-load processes in the background.

Also, which quantization of the model are you using?

huisai commented 3 months ago

> @huisai Did you enable CLBlast acceleration?
>
> This timing is definitely abnormal, unless your phone is running some high-load processes in the background.
>
> Also, which quantization of the model are you using?

CLBlast is not enabled, but even without it the difference should not be this big. The models are MiniCPM-V-2-mmproj.F16.gguf and MiniCPM-V-2.Q5_K_M.gguf.

allen20200111 commented 1 month ago

Is deployment still unsupported? This page says it is still under development: https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/server

mzwing commented 1 month ago

> Is deployment still unsupported? This page says it is still under development: https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/server

According to https://github.com/ggerganov/llama.cpp/pull/7599, development should be close to complete.

Besides, you can actually deploy it even while it is still under development :) You may just need to compile it yourself.

gejian-iscas commented 3 weeks ago

I followed the Linux steps on an Android phone, but ran into a `bash: minicpmv-cli: Permission denied` error. What is the problem?

mzwing commented 3 weeks ago

> I followed the Linux steps on an Android phone, but ran into a `bash: minicpmv-cli: Permission denied` error. What is the problem?

Try `chmod +x ./minicpmv-cli`.

gejian-iscas commented 3 weeks ago

> Try `chmod +x ./minicpmv-cli`.

Thanks for your reply! But even after running the chmod command, I still cannot run minicpmv-cli; it is still Permission denied.

mzwing commented 3 weeks ago

> Thanks for your reply! But even after running the chmod command, I still cannot run minicpmv-cli; it is still Permission denied.

Then maybe try `chmod 755 -R .`.
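If neither chmod invocation helps, one common cause on Android is that the binary sits on a mount carrying the noexec flag (shared storage usually does); a hedged sketch of the usual workaround, moving it onto an exec-enabled mount such as Termux's home directory:

```sh
# Shared storage is often mounted noexec; executables must live on an exec-enabled mount
cp ./minicpmv-cli "$HOME/"
cd "$HOME"
chmod +x ./minicpmv-cli
./minicpmv-cli --help
```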