dvlab-research / LLaMA-VID

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)
Apache License 2.0

Questions about the subtitles. #66

Open Yxxxb opened 7 months ago

Yxxxb commented 7 months ago

Thank you for your great work.

Regarding adding subtitles, I still have the following questions:

  1. If you do not use subtitles for training, and do not change any other part of the model architecture or design — in other words, for video tokens, only the sequence of is used — can the model understand long videos? Does it have the ability to find a needle in a haystack, or to answer detailed questions about an hour-long video?
  2. If subtitles are not added, there is obviously an order-of-magnitude difference between the number of input visual tokens and the number of text tokens. Will such an imbalance affect the model's performance?
  3. After adding subtitles for training, can the model run inference on videos without subtitles? If so, how should inference be performed, and how should the subtitles be set up?

Thanks.
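As a rough illustration of the scale behind question 2, here is a back-of-envelope sketch. The 2-tokens-per-frame figure comes from the paper title; the 1 fps sampling rate is an assumption for illustration only, not taken from the repo.

```python
# Back-of-envelope estimate of the visual token count for a long video.
# Assumes 2 tokens per frame (from the paper title) and a hypothetical
# 1 fps sampling rate.

def visual_token_count(duration_s: float, fps: float = 1.0,
                       tokens_per_frame: int = 2) -> int:
    """Visual tokens produced for a video of the given duration."""
    return int(duration_s * fps) * tokens_per_frame

# A one-hour video under these assumptions:
print(visual_token_count(3600))  # 7200 visual tokens
```

Even at this low frame rate, a one-hour video yields thousands of visual tokens, which can dwarf the text tokens in a short question prompt when no subtitles are interleaved.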

Yxxxb commented 7 months ago

@yanwei-li