OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

The performance gap between InternVideo2-clip and InternVideo2-s2 #152

Closed: siyilingting closed this issue 3 months ago

siyilingting commented 4 months ago

Hello, thanks for the great work. I noticed in MODEL-ZOO.md that the performance of the InternVideo2-clip model is lower than that of InternVideo2-s2. For instance, the T2V metric of InternVideo2-clip-1B on MSRVTT is 50.0, compared to 51.9 for InternVideo2-s2-1B. The disparity becomes even more evident for the 6B-sized models on MSRVTT: 50.9 versus 55.9. What perplexes me is that InternVideo2-clip was pre-trained from InternVideo2-s2, so its performance should exceed, rather than lag behind, that of InternVideo2-s2. I would greatly appreciate your help in clearing up this confusion. Thank you very much.

Andy1621 commented 4 months ago

Good question!

In InternVideo2-s2, we also use a matching loss to improve retrieval results, which largely improves the results but decreases the speed.

For InternVideo2-clip-1B, we only use the CLIP loss, which is simpler for practical applications. Besides, we adopt a multilingual LLM, which supports diverse languages and longer text.
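
(For intuition, here is a minimal sketch of why the matching loss trades speed for accuracy; the function and module names below are hypothetical and not the actual InternVideo2 code. CLIP-style retrieval scores every text-video pair with a single dot product of pooled embeddings, while a matching head must run a cross-modal forward pass per candidate pair to re-rank them.)

```python
# Illustrative sketch only -- not the actual InternVideo2 implementation.
import torch
import torch.nn.functional as F

def clip_style_scores(video_emb, text_emb):
    """Contrastive (CLIP-loss) retrieval: one dot product per pair.

    video_emb: [N_v, D] pooled video embeddings
    text_emb:  [N_t, D] pooled text embeddings
    Cheap: the full [N_t, N_v] score matrix is a single matmul.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    return t @ v.T  # cosine similarities

def rerank_with_matching_head(matching_head, video_tokens, text_tokens,
                              coarse_scores, k=16):
    """Matching-style re-ranking on top of the contrastive scores.

    matching_head: a cross-modal module returning a match score per pair
                   (hypothetical interface).
    video_tokens / text_tokens: per-sample token features, not pooled.
    Slow: the head runs once for each of the top-k candidate pairs.
    """
    k = min(k, coarse_scores.size(1))
    topk = coarse_scores.topk(k, dim=-1).indices  # [N_t, k]
    refined = coarse_scores.clone()
    for ti in range(coarse_scores.size(0)):
        for vi in topk[ti].tolist():
            refined[ti, vi] = matching_head(text_tokens[ti], video_tokens[vi])
    return refined
```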

siyilingting commented 4 months ago

:smile::smile: Thank you for your response. Does this indicate that, by training with only the CLIP loss, InternVideo2-clip has potentially lost some knowledge acquired during the training of InternVideo2-s2? My assumption is that the weights of InternVideo2-clip were initialized from those of the InternVideo2-s2 model. I'm not sure if I have a misunderstanding here.

Andy1621 commented 4 months ago

I think not. The performance drops because of the lack of the matching loss, and the performance is higher if it is compared with InternVideo2-s2 without the matching loss.

Besides, most of the weights are frozen, so the knowledge should be maintained well.
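
(As a rough sketch of what "most of the weights are frozen" can look like in code; the parameter names and selection rule below are hypothetical, not the actual InternVideo2 training setup.)

```python
# Hypothetical sketch: freeze the pretrained backbone and train only the
# lightweight alignment layers, so pretrained knowledge is preserved.
import torch

def freeze_backbone_train_projections(model, lr=1e-4):
    for name, p in model.named_parameters():
        # No gradients for the pretrained encoders; only the (hypothetical)
        # projection heads and temperature remain trainable.
        p.requires_grad = name.startswith(("video_proj", "text_proj", "logit_scale"))
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```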

siyilingting commented 4 months ago

Thanks a lot.