chancharikmitra opened 4 months ago
Hello, I think in the VILA code, images are embedded specifically at the locations of the `<image>` tokens: https://github.com/NVlabs/VILA/blob/0085724ca3363dc62eaa0d7ef1a30ad20df5c8da/llava/model/llava_arch.py#L371-L391
Thank you @DtYXs for the clarification about the `<image>` token and its placement! Given that, do you have any insight into why zero-shot performance of VILA-1.5-8b might be lower than what is reported? The few-shot improvements are as strong as advertised. Perhaps it is related to my concerns about masking and the `conv_mode` formatting. However, looking more closely at the eval scripts, I see that the `conv_mode` is passed directly, so `llama_3` would indeed have been used.
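For reference, this is how I understand the `llama_3` `conv_mode` shaping a query before tokenization, assuming VILA keeps the upstream LLaVA conversation API (`llava/conversation.py`); the question string is just a placeholder:

```python
from llava.conversation import conv_templates

question = "<image>\n What is shown in this image?"

conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], question)  # user turn
conv.append_message(conv.roles[1], None)      # empty assistant turn to start generation
prompt = conv.get_prompt()
print(prompt)  # the llama-3 chat-formatted string that actually goes to the tokenizer
```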
Hello, and thanks for such a great contribution to the field of interleaved LMMs! This is really great work. I was wondering if there is an example of the format for multi-image or multi-video inference (similar to what is shown in the in-context learning examples)? Does it involve appending multiple `<image>` tokens at the specified locations? And are the images and videos then inserted sequentially?

From my understanding of the `run_vila.py` script, the way to build an ICL input for images (and the corresponding structure for videos, of course) would be as follows:
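Something roughly like this (my best guess at the intended format; the captions, question, and file names are placeholders):

```python
# Interleaved few-shot prompt: one <image>\n per image, with the images passed
# to the model in the same order as the placeholders appear in the text.
icl_prompt = (
    "<image>\n A photo of a cat sleeping on a couch.\n"
    "<image>\n A photo of a dog playing in a park.\n"
    "<image>\n What is shown in this image?"
)
image_files = ["cat.jpg", "dog.jpg", "query.jpg"]  # loaded/preprocessed before inference
```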
However, I am not sure whether the positions of the `<image>` tokens are actually considered by the model during generation, because looking at `llava_llama.py`, the method for preparing the multimodal inputs is inherited from LLaVA, which I believe simply concatenates the image features rather than embedding them at the locations of the `<image>` tokens. I may have missed something, as I am still new to the codebase and exploring the model more deeply. I would appreciate any clarification on the point about multi-image and multi-video inputs. Thanks!
Edit: After looking more deeply, it seems to me that the way I have formatted the prompt (with '\n' included) aligns with your code. However, I see in your paper that the image tokens are enumerated.
Edit 2: I also get this warning during generation:

> The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
I think the pad token is fine, since it is automatically set to the eos token, but what about the attention mask? I see no mention of it when I try to evaluate on datasets like SEEDBench, and I do seem to get uncharacteristically low accuracy on those benchmarks, which I am trying to track down.
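For reference, the workaround I am experimenting with looks roughly like this (a generic Hugging Face sketch with a placeholder checkpoint, not VILA's actual loading code, which goes through the llava builder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model name; in practice the VILA checkpoint is loaded via llava's builder.
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Describe the image."
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,  # pass the mask explicitly to avoid the warning
    pad_token_id=tokenizer.eos_token_id,   # make the eos fallback explicit
    max_new_tokens=64,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```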