dvlab-research / LLaMA-VID

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)
Apache License 2.0

Questions about the subtitles. #66

Open Yxxxb opened 7 months ago

Yxxxb commented 7 months ago

Thank you for your great work.

Regarding adding subtitles, I still have the following questions:

  1. If you do not use subtitles for training, and do not change any other part of the model architecture or design — in other words, for video tokens, only the sequence of is used — can the model understand long videos? Does it have the ability to find a needle in a haystack, or to answer detailed questions about an hour-long video?
  2. If subtitles are not added, there is obviously an order-of-magnitude difference between the number of input visual tokens and the number of text tokens. Will such an imbalance affect the model's performance?
  3. After adding subtitles for training, can the model run inference on videos without subtitles? If so, how should inference be performed, and how should the subtitles be set up?

Thanks.
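As a rough illustration of the scale behind question 2, here is a back-of-envelope sketch. The 2-tokens-per-frame figure comes from the paper title; the 1 fps sampling rate is an assumption for illustration only, not taken from the repo.

```python
# Back-of-envelope estimate of the visual token count for a long video.
# Assumes 2 tokens per frame (from the paper title) and a hypothetical
# 1 fps sampling rate.

def visual_token_count(duration_s: float, fps: float = 1.0,
                       tokens_per_frame: int = 2) -> int:
    """Visual tokens produced for a video of the given duration."""
    return int(duration_s * fps) * tokens_per_frame

# A one-hour video under these assumptions:
print(visual_token_count(3600))  # 7200 visual tokens
```

Even at this low frame rate, a one-hour video yields thousands of visual tokens, which can dwarf the text tokens in a short question prompt when no subtitles are interleaved.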

Yxxxb commented 7 months ago

@yanwei-li