OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Apache License 2.0

[BUG] number of image start tokens and image end tokens mismatch #483

Open Aguin opened 3 weeks ago

Aguin commented 3 weeks ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

When image_start_tokens and image_end_tokens have different lengths, valid_image_nums is set to the longer of the two, so torch.hstack fails with a tensor size mismatch. Should max be min? https://huggingface.co/openbmb/MiniCPM-V-2_6/blob/main/processing_minicpmv.py#L119
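
A minimal sketch of the failure mode (hypothetical token positions; not the repository's exact code, but the same pattern as L119):

```python
import torch

# Truncation can leave more start markers than end markers (or vice versa).
image_start_tokens = torch.tensor([5, 20, 35])  # three image starts found
image_end_tokens = torch.tensor([12, 27])       # only two image ends survive

# With max, slicing cannot extend the shorter tensor: the slices keep
# lengths 3 and 2, and torch.hstack raises a size-mismatch RuntimeError.
valid_image_nums = max(len(image_start_tokens), len(image_end_tokens))  # 3

# With min, both slices have length 2, so the stack succeeds and only
# complete (start, end) pairs are kept:
valid_image_nums = min(len(image_start_tokens), len(image_end_tokens))  # 2
image_bounds = torch.hstack([
    image_start_tokens[:valid_image_nums].unsqueeze(-1),
    image_end_tokens[:valid_image_nums].unsqueeze(-1),
])
print(image_bounds)  # tensor([[ 5, 12], [20, 27]])
```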

Expected Behavior

No response

Steps To Reproduce

Run the video example (https://github.com/OpenBMB/MiniCPM-V?tab=readme-ov-file#chat-with-video) with video_path="./assets/demo_video.mp4".

Environment

- OS: Ubuntu 20.04
- Python: 3.10
- Transformers: 4.40.0
- PyTorch: 2.1.2
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 11.8

Anything else?

No response

LDLINGLINGLING commented 3 weeks ago

Hello, maybe your model length setting is too small and the video is too long.
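
If it helps, a hypothetical sketch of raising that limit, assuming your local chat() accepts a max_inp_length argument (it defaults to 8192 in some MiniCPM-V releases; verify against the modeling_minicpmv.py in your checkout):

```python
# Hypothetical: model, msgs, and tokenizer follow the README video example;
# confirm max_inp_length exists in your local modeling_minicpmv.py.
answer = model.chat(
    image=None,
    msgs=msgs,              # frames + question from the video example
    tokenizer=tokenizer,
    max_inp_length=16384,   # assumed default is 8192; raise it, or sample fewer frames
)
```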

Aguin commented 3 weeks ago

> Hello, maybe your model length setting is too small and the video is too long.

@LDLINGLINGLING Yes, downsampling can work around this, but I still think L119 is incorrect, since it stacks two tensors of different lengths.

Liu0329 commented 3 weeks ago

Where do I set the model length?

nanamma commented 1 week ago

> Where do I set the model length?

Set MAX_NUM_FRAMES = 40 (the example uses 64; if CUDA OOMs during video inference, set a smaller number). 40 is probably about the largest value that fits, because the maximum number of input tokens is 8192.
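
For reference, MAX_NUM_FRAMES is the frame-sampling cap in the README's video example; a sketch of that sampler, adapted from the README (assumes decord and PIL are installed):

```python
from decord import VideoReader, cpu  # pip install decord
from PIL import Image

MAX_NUM_FRAMES = 40  # README default is 64; lower it if CUDA OOMs

def encode_video(video_path):
    """Sample frames at ~1 fps, capped at MAX_NUM_FRAMES, uniformly spaced."""
    def uniform_sample(seq, n):
        gap = len(seq) / n
        return [seq[int(i * gap + gap / 2)] for i in range(n)]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps())              # one frame per second
    frame_idx = list(range(0, len(vr), sample_fps))
    if len(frame_idx) > MAX_NUM_FRAMES:               # cap the total frame count
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(f.astype('uint8')) for f in frames]

frames = encode_video("./assets/demo_video.mp4")
print(f"sampled {len(frames)} frames")
```

Fewer sampled frames means fewer image tokens, which keeps the prompt under the 8192-token limit.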