OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Apache License 2.0

[BUG] llama.cpp CLIP cannot encode some images after building the graph #650

Open yzyhyt opened 1 month ago

yzyhyt commented 1 month ago

是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?

该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?

当前行为 | Current Behavior

For certain input images, the program stops after clip_image_preprocess and clip_image_build_graph and then exits. Normally it would proceed to encode_image_with_clip after clip_image_build_graph, but here it does not. I suspect that uhd_slice_image resizes the image to a size too small for CLIP to encode, or that the image itself is too small. However, the official demo handles these same images without problems. I am using the ggml-model-Q4_K_M.gguf quantized model with llama.cpp.

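The size hypothesis can be sanity-checked offline. The sketch below is a guess, assuming the vision encoder uses a 14×14 patch size (the actual value is stored in the mmproj gguf metadata and may differ); it reports how a preprocessed slice divides into patches:

```python
# Check whether a preprocessed slice size divides evenly into ViT patches.
# PATCH_SIZE = 14 is an assumption; the real value comes from the mmproj
# gguf metadata and may differ.
PATCH_SIZE = 14

def patch_grid(width: int, height: int, patch: int = PATCH_SIZE):
    """Return (cols, rows, total_patches), or None if the slice size is
    not an exact multiple of the patch size."""
    if width % patch or height % patch:
        return None
    cols, rows = width // patch, height // patch
    return cols, rows, cols * rows

# Dimensions reported by clip_image_preprocess for a failing image: 1050 x 196
print(patch_grid(1050, 196))  # -> (75, 14, 1050)
```

Running this on the sizes printed by clip_image_preprocess for both failing and working images would show whether the failures correlate with unusually small or non-divisible slice dimensions.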

期望行为 | Expected Behavior

The program should process the images.


复现方法 | Steps To Reproduce

llama-minicpmv-cli.exe -m ggml-model-Q4_K_M.gguf --mmproj mmproj-model-f16.gguf -c 512 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image QQQQQQQ_8408.jpg -p "return the text"
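To narrow down which images trigger the exit, the reproduction command above can be wrapped in a small script that runs the CLI over a folder of images and collects the ones that fail. The executable name, model paths, and flags simply mirror the command above; adjust them for your setup:

```python
# Run llama-minicpmv-cli over every .jpg in a folder and report failures.
# Paths and flags mirror the reproduction command; adjust for your setup.
import subprocess
from pathlib import Path

def build_cmd(image: str) -> list:
    return [
        "llama-minicpmv-cli.exe",
        "-m", "ggml-model-Q4_K_M.gguf",
        "--mmproj", "mmproj-model-f16.gguf",
        "-c", "512", "--temp", "0.7", "--top-p", "0.8",
        "--top-k", "100", "--repeat-penalty", "1.05",
        "--image", image,
        "-p", "return the text",
    ]

def find_failing(folder: str) -> list:
    failing = []
    for img in sorted(Path(folder).glob("*.jpg")):
        proc = subprocess.run(build_cmd(str(img)), capture_output=True)
        if proc.returncode != 0:  # CLI exited abnormally on this image
            failing.append(img.name)
    return failing
```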

运行环境 | Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

备注 | Anything else?

clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 0
clip_model_load: minicpmv_projector: 1
clip_model_load: model size: 996.02 MB
clip_model_load: metadata size: 0.16 MB
clip_model_load: params backend buffer size = 996.02 MB (455 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_image_build_graph: 448 448
clip_model_load: compute allocated memory: 102.80 MB
uhd_slice_image: multiple 1
clip_image_preprocess: 1050 196
clip_image_build_graph: 1050 196
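To compare the sizes the pipeline reports for failing versus working images, the relevant (width, height) pairs can be pulled out of a captured log with a small parser; the sample log string below is the tail of the log above:

```python
# Extract the (width, height) pairs that clip_image_preprocess and
# clip_image_build_graph print, for side-by-side comparison across runs.
import re

SAMPLE_LOG = """\
clip_image_build_graph: 448 448
uhd_slice_image: multiple 1
clip_image_preprocess: 1050 196
clip_image_build_graph: 1050 196
"""

def extract_sizes(log: str) -> list:
    """Return (stage, width, height) tuples in the order they appear."""
    pat = re.compile(r"(clip_image_preprocess|clip_image_build_graph): (\d+) (\d+)")
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in pat.finditer(log)]

print(extract_sizes(SAMPLE_LOG))
# -> [('clip_image_build_graph', 448, 448),
#     ('clip_image_preprocess', 1050, 196),
#     ('clip_image_build_graph', 1050, 196)]
```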