OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Apache License 2.0

[BUG] number of image start tokens and image end tokens mismatch #483

Open Aguin opened 3 weeks ago

Aguin commented 3 weeks ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

When image_start_tokens and image_end_tokens have different lengths, valid_image_nums is set to the longer of the two, so torch.hstack fails with a tensor size mismatch. Should max be min? https://huggingface.co/openbmb/MiniCPM-V-2_6/blob/main/processing_minicpmv.py#L119
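
A minimal sketch of the failure mode (hypothetical token positions; not the repository's exact code, but the same pattern as L119):

```python
import torch

# Truncation can leave more start markers than end markers (or vice versa).
image_start_tokens = torch.tensor([5, 20, 35])  # three image starts found
image_end_tokens = torch.tensor([12, 27])       # only two image ends survive

# With max, slicing cannot extend the shorter tensor: the slices keep
# lengths 3 and 2, and torch.hstack raises a size-mismatch RuntimeError.
valid_image_nums = max(len(image_start_tokens), len(image_end_tokens))  # 3

# With min, both slices have length 2, so the stack succeeds and only
# complete (start, end) pairs are kept:
valid_image_nums = min(len(image_start_tokens), len(image_end_tokens))  # 2
image_bounds = torch.hstack([
    image_start_tokens[:valid_image_nums].unsqueeze(-1),
    image_end_tokens[:valid_image_nums].unsqueeze(-1),
])
print(image_bounds)  # tensor([[ 5, 12], [20, 27]])
```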

Expected Behavior

No response

Steps To Reproduce

Run the video example (https://github.com/OpenBMB/MiniCPM-V?tab=readme-ov-file#chat-with-video) with video_path="./assets/demo_video.mp4".

Environment

- OS: Ubuntu 20.04
- Python: 3.10
- Transformers: 4.40.0
- PyTorch: 2.1.2
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 11.8

Anything else?

No response

LDLINGLINGLING commented 3 weeks ago

Hello, maybe your model length setting is too small and the video is too long.
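
If it helps, a hypothetical sketch of raising that limit, assuming your local chat() accepts a max_inp_length argument (it defaults to 8192 in some MiniCPM-V releases; verify against the modeling_minicpmv.py in your checkout):

```python
# Hypothetical: model, msgs, and tokenizer follow the README video example;
# confirm max_inp_length exists in your local modeling_minicpmv.py.
answer = model.chat(
    image=None,
    msgs=msgs,              # frames + question from the video example
    tokenizer=tokenizer,
    max_inp_length=16384,   # assumed default is 8192; raise it, or sample fewer frames
)
```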

Aguin commented 3 weeks ago

> Hello, maybe your model length setting is too small and the video is too long.

@LDLINGLINGLING Yes, downsampling can work around this, but I still think L119 is incorrect, since it stacks two tensors of different lengths.

Liu0329 commented 3 weeks ago

Where do I set the model length?

nanamma commented 1 week ago

> Where do I set the model length?

Set MAX_NUM_FRAMES = 40 (the example uses 64; if CUDA OOMs during video inference, set a smaller number). 40 is probably about the largest value that fits, because the maximum number of input tokens is 8192.
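
For reference, MAX_NUM_FRAMES is the frame-sampling cap in the README's video example; a sketch of that sampler, adapted from the README (assumes decord and PIL are installed):

```python
from decord import VideoReader, cpu  # pip install decord
from PIL import Image

MAX_NUM_FRAMES = 40  # README default is 64; lower it if CUDA OOMs

def encode_video(video_path):
    """Sample frames at ~1 fps, capped at MAX_NUM_FRAMES, uniformly spaced."""
    def uniform_sample(seq, n):
        gap = len(seq) / n
        return [seq[int(i * gap + gap / 2)] for i in range(n)]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps())              # one frame per second
    frame_idx = list(range(0, len(vr), sample_fps))
    if len(frame_idx) > MAX_NUM_FRAMES:               # cap the total frame count
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(f.astype('uint8')) for f in frames]

frames = encode_video("./assets/demo_video.mp4")
print(f"sampled {len(frames)} frames")
```

Fewer sampled frames means fewer image tokens, which keeps the prompt under the 8192-token limit.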