dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Apache License 2.0

Question about Model's Understanding #15

Closed Journey7331 closed 6 months ago

Journey7331 commented 6 months ago

Thank you for your great work! The idea of the context token and content token is excellent, and it greatly enhances the model's capabilities.

However, while reading the paper, I have a question that may seem silly. :P

For instance, in the movie Interstellar

Furthermore, I'm curious about the extent to which the context token and content token, learned from the visual tokens, affect the model's understanding ability.

Again, thanks a lot for your great work! :)

yanwei-li commented 6 months ago

Thanks for your interest in our work. We concatenate the subtitle (if it exists) with the image tokens in the format [<image-i><subtitle-i>] for each frame, as presented in Stage 3 of Figure 3. In this manner, the LLM understands the characters in each movie.
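To make that concrete, here is a minimal sketch (not the code in this repository) of how per-frame image placeholders and subtitle text could be interleaved in that format; the `IMAGE_TOKEN` name and the `build_video_prompt` helper below are illustrative assumptions, not the real API:

```python
# Minimal sketch (not the repository's code): building the per-frame input
# sequence in the [<image-i><subtitle-i>] format described above.
# IMAGE_TOKEN and build_video_prompt are illustrative names only.
from typing import List, Optional

IMAGE_TOKEN = "<image>"  # placeholder later replaced by that frame's visual tokens


def build_video_prompt(num_frames: int, subtitles: Optional[List[str]] = None) -> str:
    """Concatenate one image placeholder per frame, followed by that frame's
    subtitle text when it exists: <image-0><subtitle-0><image-1>..."""
    pieces = []
    for i in range(num_frames):
        pieces.append(IMAGE_TOKEN)
        if subtitles is not None and subtitles[i]:
            pieces.append(subtitles[i])
    return "".join(pieces)


# Example: 3 sampled frames, the middle one has no aligned subtitle.
print(build_video_prompt(3, ["We used to look up at the sky...", "", "Stay."]))
```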

Of course, we also compare with pure subtitles as input in Figure 9 to show that the effectiveness comes from the combination of image and subtitle, not from the subtitle alone.

Journey7331 commented 6 months ago

Thanks for replying!

So when the input is a pure video without subtitles, the model will still understand the character relationships and the narrative content to some extent, but it may not know the names of the characters or other proper nouns mentioned in the video. In this case, the model's understanding relies only on visual cues and contextual information within the scenes.

If I got anything wrong, please correct me. :)

yanwei-li commented 6 months ago

Yes, of course. You can extract features from the video to replace subtitles for better understanding.
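As a rough sketch of one way to do that (under my own assumptions, not the repository's pipeline): run an off-the-shelf ASR model such as openai-whisper on the extracted audio track, align the transcript segments to the sampled frame timestamps, and feed the aligned text in place of subtitles. The `align_segments_to_frames` helper, the 1 fps sampling rate, and the file name are illustrative:

```python
# Rough sketch: when a video has no subtitle file, generate per-frame text with
# an off-the-shelf ASR model (openai-whisper) and align it to sampled frames.
# The alignment helper, 1 fps sampling, and file name are assumptions here.
import whisper


def align_segments_to_frames(segments, num_frames, fps=1.0):
    """Assign each sampled frame the concatenated ASR text whose time span
    overlaps that frame's timestamp (frames sampled at `fps`)."""
    frame_text = ["" for _ in range(num_frames)]
    for seg in segments:  # each segment has 'start', 'end', 'text'
        for i in range(num_frames):
            t = i / fps
            if seg["start"] <= t < seg["end"]:
                frame_text[i] += seg["text"]
    return frame_text


model = whisper.load_model("base")
result = model.transcribe("movie_audio.wav")  # audio track extracted beforehand
subs = align_segments_to_frames(result["segments"], num_frames=3600, fps=1.0)
# subs[i] can then be interleaved with frame i's image tokens, as in Stage 3.
```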

Yxxxb commented 4 months ago

Thank you for your great work.

Regarding adding subtitles, I still have the following questions:

  1. If you do not use subtitles for training, and do not change any other part of the model architecture and design, in other words, for video tokens only the sequence of image tokens is used, can the model still understand long videos? Does the model have the ability to find a needle in a haystack, or to answer detailed questions about an hour-long video?
  2. If subtitles are not added, there is obviously an order-of-magnitude difference between the number of input visual tokens and the number of text tokens. Will such an imbalance affect the model's performance?
  3. After adding subtitles for training, can the model run inference on videos without subtitles? If so, how should inference be performed, and how should the subtitles be set up?

Thanks.