NVlabs / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

Multi-Image or Multi-Video Inference Example #97

Open · chancharikmitra opened this issue 1 month ago

chancharikmitra commented 1 month ago

Hello, and thanks for such a great contribution to the field of interleaved LMMs; this is really impressive work. I was wondering whether there is an example of the prompt format for multi-image or multi-video inference (similar to what is shown in the in-context learning examples). Does it involve appending multiple <image> tokens at the specified locations, with the images and videos then inserted in order?

From my understanding of the run_vila.py script, the way to construct an ICL input for images (and the analogous structure for videos) would be as follows:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/Llama-3-VILA1.5-8b \
    --conv-mode llama_3 \
    --query "<image>\n ICL text 1 <image>\n ICL text 2 <image>\n" \
    --image-file "img1.png,img2.png,img3.png"
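
In other words, my working assumption (just a sanity-check sketch of my mental model, not VILA code) is that the i-th <image> placeholder in --query is paired with the i-th entry of --image-file:

# Sanity-check sketch (my assumption, not VILA code): the i-th <image>
# placeholder in --query should correspond to the i-th file in --image-file.
query = "<image>\n ICL text 1 <image>\n ICL text 2 <image>\n"
image_files = "img1.png,img2.png,img3.png".split(",")
assert query.count("<image>") == len(image_files), "placeholder/file count mismatch"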

However, I am not sure whether the positions of the <image> tokens are actually respected by the model during generation. Looking at llava_llama.py, the method for preparing the multimodal inputs is inherited from LLaVA, which (I believe) simply concatenates the image features rather than embedding them at the locations of the <image> tokens.

I may have missed something as I am still new to the codebase and exploring the model more deeply. Would appreciate any clarification on the point about multi-image and multi-video inputs. Thanks!

Edit: After looking more closely, the way I have formatted the prompt (with '\n' included) does seem to align with your code. However, I see in the paper that the image tokens are enumerated:

(screenshot from the paper showing the enumerated image-token format)

Edit 2:

  1. As a side note, I get this warning a lot:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. Setting pad_token_id to eos_token_id:128001 for open-end generation.

I think the pad token is fine since it is automatically set to the eos_token, but what about the attention mask? I see no mention of it when I evaluate on datasets like SEEDBench, and I do get uncharacteristically low accuracy on those benchmarks, which I am trying to track down (see the workaround sketch after this list).

  2. I also noticed that the run_vila.py script does not have 'llama_3' as a conv_mode option. Is it possible that VILA-1.5 used a different conv_mode?
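
Regarding point 1, the generic way I understand this is handled with plain transformers is to tokenize with an explicit mask and pass it (plus pad_token_id) to generate(). A minimal sketch with a stand-in model, not taken from the VILA scripts:

# Generic transformers sketch (my own workaround idea, not from the VILA repo):
# build an explicit attention_mask and pad_token_id so generate() stops warning.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for whatever tokenizer/LM the VILA checkpoint wraps
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token  # mirrors what the warning does implicitly

inputs = tokenizer("An example prompt", return_tensors="pt", padding=True)
output_ids = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # explicit mask removes the warning
    pad_token_id=tokenizer.pad_token_id,
    max_new_tokens=32,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
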
DtYXs commented 1 month ago

Hello, I think that in the VILA code the images are embedded specifically at the locations of the <image> tokens: https://github.com/NVlabs/VILA/blob/0085724ca3363dc62eaa0d7ef1a30ad20df5c8da/llava/model/llava_arch.py#L371-L391
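
Roughly, the idea is the following (a simplified sketch of the splicing logic, not the actual code at that link):

# Simplified sketch of the splicing idea (not the actual llava_arch.py code):
# keep the text-token embeddings and, at each <image> placeholder position,
# insert that image's feature sequence instead.
import torch

IMAGE_TOKEN_INDEX = -200  # assumed placeholder id; the real constant lives in llava/constants.py

def splice_multimodal(input_ids, text_embeds, image_features):
    """input_ids: (seq,), text_embeds: (seq, dim), image_features: list of (n_i, dim) tensors."""
    pieces, img_idx = [], 0
    for pos in range(input_ids.shape[0]):
        if input_ids[pos].item() == IMAGE_TOKEN_INDEX:
            pieces.append(image_features[img_idx])   # i-th image goes where its token sits
            img_idx += 1
        else:
            pieces.append(text_embeds[pos:pos + 1])  # ordinary text-token embedding
    return torch.cat(pieces, dim=0)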

chancharikmitra commented 1 month ago

Thank you @DtYXs for the clarification about the <image> token placement! Given that, do you have any insight into why zero-shot performance of VILA-1.5-8b might come out lower than reported? The few-shot improvements are as impressive as advertised. Perhaps it is related to my concerns about masking and the conv_mode formatting; however, looking more closely at the eval scripts, I see that the conv_mode is passed through directly, so 'llama_3' would indeed have been used.
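
For reference, a quick way to check which conv_mode names a given checkout actually registers, assuming VILA keeps the LLaVA-style conversation registry in llava/conversation.py:

# Assuming VILA keeps the LLaVA-style conversation registry (llava/conversation.py);
# this just prints the conv_mode names the local checkout knows about.
from llava.conversation import conv_templates

print(sorted(conv_templates.keys()))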