Closed Journey7331 closed 6 months ago
Thanks for your interest in our work. We concatenate the subtitle (if it exists) with the image token in the format `[<image-i><subtitle-i>]` for each frame, as presented in Stage 3 of Figure 3. In this manner, the LLM understands the characters in each movie.
We also compare against pure subtitles as input in Figure 9, to show that the effectiveness comes from the combination of image and subtitle, rather than from the subtitle alone.
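To make the interleaving concrete, here is a minimal sketch of how the `[<image-i><subtitle-i>]` format could be assembled. All names here (`build_frame_sequence`, the `subtitles` dict) are illustrative assumptions, not the repo's actual API:

```python
# Sketch: interleave per-frame image placeholders with optional subtitles,
# following the [<image-i><subtitle-i>] format described above.
# Illustrative only -- not the paper's actual implementation.

def build_frame_sequence(num_frames, subtitles):
    """subtitles: dict mapping frame index -> subtitle text (may be empty)."""
    parts = []
    for i in range(num_frames):
        parts.append(f"<image-{i}>")  # stands in for the frame's visual tokens
        if i in subtitles:            # subtitle is optional per frame
            parts.append(f"<subtitle-{i}>{subtitles[i]}")
    return "".join(parts)

seq = build_frame_sequence(3, {0: "Hello, Murph.", 2: "Stay."})
print(seq)
```

Frames without a subtitle simply contribute their image token, so the sequence degrades gracefully to pure video input.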
Thanks for replying!
So when the input is pure video without subtitles, the model will still understand the character relationships and narrative content to some extent, but may not know the names of the characters or other proper nouns mentioned in the video. In that case, the model's understanding relies only on visual cues and contextual information within the scenes.
If I've got anything wrong, please correct me. :)
Yes, of course. You can extract features from the video to replace the subtitles for better understanding.
Thank you for your great work.
Regarding adding subtitles, I still have the following questions:
Thanks.
Thank you for your great work! The idea of the `context token` and `content token` is excellent and greatly enhances the model's capabilities. However, while reading the paper, I have a question that may seem silly. :P
For instance, in the movie *Interstellar*
Furthermore, I'm curious about the extent to which what the `context token` and `content token` learn affects the model's understanding ability. Again, thanks a lot for your great work! :)