PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

mistake in joint understanding #21

Closed · Teaboy62 closed this issue 11 months ago

Teaboy62 commented 11 months ago
(screenshot of the model's response attached)

I tried to have the model understand a picture and a video at the same time, but it makes a mistake: what the video records is obviously not the flag.
In my view, after the shared projection layer the two vectors are very close in the shared feature space, so the LLM thinks "they are the same". Is this a bottleneck of the model, or does it just need some tuning of the instruction backend? If you have any ideas or improvements, please let me know; I would be very grateful. I am very interested in this topic.
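For concreteness, here is a minimal sketch of how this hypothesis could be probed. It assumes `image_feats` and `video_feats` are the per-token outputs of the shared projector for one image and one video (hypothetical variable names, not taken from the repo):

```python
# Hypothetical probe for the "features collapse after the shared projector" hypothesis.
# image_feats: [N_img, D] and video_feats: [N_vid, D] are assumed to be the projector
# outputs that get prepended to the LLM's input sequence.
import torch
import torch.nn.functional as F

def mean_pooled_cosine(image_feats: torch.Tensor, video_feats: torch.Tensor) -> float:
    """Cosine similarity between mean-pooled image and video tokens in the shared space."""
    img = F.normalize(image_feats.mean(dim=0), dim=-1)
    vid = F.normalize(video_feats.mean(dim=0), dim=-1)
    return torch.dot(img, vid).item()

# A value close to 1.0 would support the idea that the two inputs land on nearly the
# same direction in the shared space and become hard for the LLM to tell apart.
```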

LinB203 commented 11 months ago

This is an intriguing concept that we have not yet experimented with. In fact, we do not have image-video pairs to train the LLM; our data sources are either images or videos. Therefore, the LLM's ability to demonstrate even a slight understanding of both images and videos at once is already quite remarkable. I believe that incorporating image-video pairs into the training of the LLM in the next version will further enhance its visual comprehension capabilities. By the way, without pre-aligning the visual signals, the model understands neither the images nor the videos when both are input simultaneously.
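For readers who want to reproduce the joint image+video input, a minimal sketch using the later Hugging Face Transformers port is shown below. It assumes the `LanguageBind/Video-LLaVA-7B-hf` checkpoint and a recent `transformers`; the original repo ships its own inference scripts, which this does not replace, and the paths, prompt template, and dummy video are placeholders.

```python
# Sketch: feed one image and one video in the same prompt via the Transformers port.
import numpy as np
import torch
from PIL import Image
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
model = VideoLlavaForConditionalGeneration.from_pretrained(
    "LanguageBind/Video-LLaVA-7B-hf", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("flag.jpg")                      # placeholder image path
video = np.zeros((8, 224, 224, 3), dtype=np.uint8)  # 8 dummy frames; replace with a real sampled clip

prompt = "USER: <image>\n<video>\nDescribe the image and the video separately. ASSISTANT:"
inputs = processor(text=prompt, images=image, videos=video, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```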

Teaboy62 commented 11 months ago

Thank you for your reply. Of course, Video-LLaVA is remarkable, and I am amazed by your work. As you mentioned in the paper, Video-LLaVA treats the LLM as a decoder rather than a scheduler. Is there any method to train or tune only the decoder? Anyway, thank you again, and I am looking forward to the next version of this model.

LinB203 commented 11 months ago

You can refer to the training script of stage 2.
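For the general shape of such a run, here is a minimal sketch (not the repo's actual stage-2 script) of tuning only the LLM decoder while keeping the visual branch frozen; the parameter-name substrings below are assumptions and may need adjusting to Video-LLaVA's real module names.

```python
# Sketch: freeze anything that looks like a vision/video encoder or the shared projector,
# and leave the language-model weights trainable. Keyword substrings are guesses.
FROZEN_KEYWORDS = ("vision_tower", "image_tower", "video_tower", "mm_projector")

def freeze_visual_branch(model):
    """Return the number of trainable parameters after freezing the visual branch."""
    trainable = 0
    for name, param in model.named_parameters():
        if any(key in name for key in FROZEN_KEYWORDS):
            param.requires_grad = False   # keep encoders and the shared projector fixed
        else:
            param.requires_grad = True    # tune only the LLM (decoder) weights
            trainable += param.numel()
    return trainable
```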

Teaboy62 commented 11 months ago

Thank you very much!