This is an intriguing idea that we have not yet experimented with. In fact, we do not have image-video pairs to train the LLM; our data sources are either images or videos. Given that, the LLM's ability to show even a partial understanding of both images and videos is already quite remarkable. I believe that incorporating image-video pairs into the training of the next version will further enhance its visual comprehension. By the way, without pre-aligning the visual signals, the model understands neither the images nor the videos when both are input at the same time.
Thank you for your reply. Of course, Video-LLaVa is remarkable, and I am amazed by your work. As you mentioned in the paper, Video-LLaVa treats the LLM as a decoder rather than a scheduler. Is there a way to train or tune just the decoder? Anyway, thank you again, and I am looking forward to the next version of this model.
You can refer to the training script of stage 2.
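As a rough illustration (not the official stage-2 script), one way to tune only the decoder is to freeze the visual branch and leave just the language-model weights trainable. The substrings `vision_tower` and `mm_projector` below are assumptions based on LLaVA-style codebases; check the actual parameter names in this repo before relying on them:

```python
# Sketch: freeze the visual encoder and shared projector, train only the LLM decoder.
# "vision_tower" / "mm_projector" are assumed parameter-name substrings, not verified
# against this repo's code.
import torch

def freeze_visual_branch(model):
    for name, param in model.named_parameters():
        # Keep the visual encoder and the shared projector frozen;
        # leave the language-model (decoder) weights trainable.
        param.requires_grad = not ("vision_tower" in name or "mm_projector" in name)
    return model

# Then build the optimizer over the trainable (decoder) parameters only:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=2e-5
# )
```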
thank you very much!
I tried to get the model to understand both a picture and a video, but it makes a mistake: obviously, what the video shows is not the flag.
In my view, after the shared projection layer the two vectors end up very close together in the shared feature space, so the LLM thinks "they are the same". Is this a bottleneck of the model, or does it just need some instruction tuning on the backend? If you have any ideas or improvements, please let me know. I would be very grateful; I am very interested in this topic.
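For example, a quick way to check this hypothesis could be to compare the pooled image and video tokens after the shared projector. In the sketch below, `image_feats` and `video_feats` are hypothetical placeholders for the projector outputs, not actual Video-LLaVa variables:

```python
# Sanity-check sketch for the "too close in the shared space" hypothesis.
# image_feats / video_feats: hypothetical tensors of shape (num_tokens, hidden_dim)
# taken from the shared projector output for one image and one video.
import torch
import torch.nn.functional as F

def shared_space_similarity(image_feats: torch.Tensor,
                            video_feats: torch.Tensor) -> float:
    # Mean-pool over the token dimension so each modality becomes one vector.
    img_vec = image_feats.mean(dim=0)
    vid_vec = video_feats.mean(dim=0)
    # A cosine similarity near 1.0 would suggest the LLM receives
    # nearly indistinguishable visual inputs for the two modalities.
    return F.cosine_similarity(img_vec, vid_vec, dim=0).item()

# Example with random stand-ins (replace with real projector outputs):
sim = shared_space_similarity(torch.randn(256, 4096), torch.randn(2048, 4096))
print(f"cosine similarity in the shared space: {sim:.3f}")
```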