Closed by siyilingting 3 months ago
Good question! In InternVideo2-s2, we also use a matching loss to improve retrieval, which substantially improves the results but reduces speed. For InternVideo2-clip-1B, we only use the CLIP loss, which is simpler for practical applications. Besides, we adopt a multilingual LLM that supports diverse languages and longer text.
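For readers unfamiliar with the term, here is a minimal sketch of a symmetric CLIP-style contrastive loss in plain Python. It is not the InternVideo2 training code: the function names, the temperature of 0.07, and the toy embeddings are assumptions chosen for illustration only.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    video_emb[i] and text_emb[i] are assumed to be a matching pair;
    every other combination in the batch acts as a negative.
    """
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    n = len(v)

    # Cosine similarity between every video and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Video-to-text: each row should put its mass on the diagonal entry.
    v2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-video: same idea on the transposed similarity matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2v = -sum(_log_softmax(logits_t[j])[j] for j in range(n)) / n

    return 0.5 * (v2t + t2v)


# Toy batch (hypothetical values): correctly paired vs. swapped captions.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_loss(videos, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_loss(videos, [[0.0, 1.0], [1.0, 0.0]])
```

With this objective, each video and each text is encoded independently and compared by a dot product, so no extra video-text matching step is involved.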
:smile::smile: Thank you for your response. Does this indicate that, by training with only the CLIP loss, InternVideo2-clip has potentially lost some knowledge acquired during the training of InternVideo2-s2? My assumption is that the weights of InternVideo2-clip were initialized from those of the InternVideo2-s2 model. I'm not sure whether I have misunderstood something here.
I think not. The performance drops because of the lack of the matching loss, and it is higher when compared with InternVideo2-s2 without the matching loss. Besides, most of the weights are frozen, so the knowledge should be well preserved.
Thanks a lot.
Hello, thanks for the great work. I noticed in MODEl-ZOO.md that the performance of the InternVideo2-clip model is lower than that of InternVideo2-s2. For instance, the T2V metric of the InternVideo2-clip-1B model on MSRVTT is 50.0, compared to 51.9 for the InternVideo2-s2-1B model. The gap is even larger for the 6B-sized models on MSRVTT: 50.9 versus 55.9. What perplexes me is that the InternVideo2-clip model was pre-trained from InternVideo2-s2, so I would expect its performance to exceed that of InternVideo2-s2 rather than lag behind it. I would greatly appreciate your help in clearing up this confusion. Thank you very much.