Question on multi-image input

AILab-CVC / SEED-Bench

(CVPR 2024) A benchmark for evaluating Multimodal LLMs using multiple-choice questions.



Question on multi-image input #24

Open auhowielau opened 8 months ago

auhowielau commented 8 months ago

Some models (e.g., LLaVA 1.5) cannot take more than three images as input because of their input length limit (e.g., 2048 tokens). However, Evaluation Dimensions 17-24 of SEED-Bench-2 may require inputs of up to 8 images. How do you handle such situations? Thanks!

Bohao-Lee commented 8 months ago

In our code, we concatenate the images to handle such situations, just as we do for other models. In our experiments, the LLaVA model outputs a reasonable loss.

auhowielau commented 8 months ago

For the LLaVA 1.5 model, does the concat operation transform N input images into N×576 visual tokens? If so, would an input of 8 frames be truncated, since 576×8 = 4608 tokens far exceeds the input length limit of 2048? Thanks!
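
For reference, the arithmetic behind this question can be sketched in a few lines of Python. This is only a back-of-the-envelope check, not SEED-Bench or LLaVA code: the 576 tokens per image and the 2048-token limit are the figures quoted in this thread, and the helper names (`visual_token_budget`, `max_images_that_fit`) and the `text_tokens` parameter are hypothetical, introduced here for illustration.

```python
# Figures quoted in this thread (assumptions, not read from any model config).
TOKENS_PER_IMAGE = 576   # visual tokens per image for LLaVA 1.5
CONTEXT_LIMIT = 2048     # input length limit in tokens


def visual_token_budget(num_images, text_tokens=0,
                        tokens_per_image=TOKENS_PER_IMAGE,
                        context_limit=CONTEXT_LIMIT):
    """Return token counts for a prompt with `num_images` images.

    Hypothetical helper: it only multiplies and compares numbers and
    does not call into any model or tokenizer.
    """
    visual = num_images * tokens_per_image
    total = visual + text_tokens
    overflow = max(0, total - context_limit)
    return {
        "visual_tokens": visual,
        "total_tokens": total,
        "overflow": overflow,      # tokens beyond the limit
        "fits": overflow == 0,
    }


def max_images_that_fit(text_tokens=0,
                        tokens_per_image=TOKENS_PER_IMAGE,
                        context_limit=CONTEXT_LIMIT):
    """Largest number of images whose tokens fit next to the text tokens."""
    return max(0, (context_limit - text_tokens) // tokens_per_image)


if __name__ == "__main__":
    for n in (3, 4, 8):
        print(n, visual_token_budget(n))
    print("max images with no text:", max_images_that_fit())
```

Under these assumptions, three images fit (1728 tokens) while four do not (2304 tokens), which matches the ">3 images" limit mentioned in the first comment, and eight images would exceed the limit by 2560 tokens before any text tokens are counted.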