OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0
1.42k stars 85 forks

Clip model size is too small #114

Open dwsmart32 opened 6 months ago

dwsmart32 commented 6 months ago

Hello, and thank you for your great work. I was reading https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/multi_modality/MODEL_ZOO.md and saw that in your paper you wrote: "We also learn a CLIP-style InternVideo2 indicated by InternVideo2clip. It is post-pretrained from InternVideo2s2 by only preserving video and text encoders and contrastive loss."

However, I found that the [InternVideo2-CLIP-1B-224p-f8] checkpoint on Hugging Face is very small, only a few MB. From the previous issue, I understand that this .pth file contains only add-on parameters, not the full set of model parameters.

  1. So, as I understand it, the only CLIP model is the one post-trained after stage 2, right?
  2. How can I initialize that CLIP model and use it? I want to compute a CLIP score with it. I would be very grateful if you could tell me the exact way to do that. (It remains confusing no matter how many times I go through your README and demo.ipynb.)

Thank you in advance.

Andy1621 commented 6 months ago

To answer your question: for the CLIP model, we only fine-tuned the AttentionPool in the vision encoder. The main parameters were not updated.

Please check the zero-shot evaluation code for CLIP to see how to load the model. Here are the scripts.
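
For readers who only need the score itself: once the model has produced a video embedding and a text embedding, a CLIP-style score is conventionally the cosine similarity between the two. The sketch below shows only that step in plain Python. The function names `l2_normalize` and `clip_score` and the toy vectors are illustrative assumptions, not part of the InternVideo2 code, and the repository's evaluation scripts may additionally apply a learned temperature or other scaling.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def clip_score(video_emb: Sequence[float], text_emb: Sequence[float]) -> float:
    """Cosine similarity between a video embedding and a text embedding.

    Assumes both embeddings already live in the same joint space,
    which is what a CLIP-style contrastive objective trains for.
    """
    if len(video_emb) != len(text_emb):
        raise ValueError("embeddings must have the same dimension")
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    return sum(a * b for a, b in zip(v, t))


# Toy 3-dimensional embeddings standing in for real model outputs.
video = [0.2, 0.9, 0.1]
captions = {
    "a dog running": [0.1, 0.8, 0.2],
    "a city at night": [0.9, 0.1, 0.3],
}

# Zero-shot style ranking: pick the caption with the highest score.
scores = {text: clip_score(video, emb) for text, emb in captions.items()}
best_caption = max(scores, key=scores.get)
```

With real embeddings the same ranking step applies; only the source of `video` and `captions` changes.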

dwsmart32 commented 6 months ago

Thanks for your reply. So you mean I can use the CLIP model once two components are ready: the InternVideo2-s2 parameters (the main parameters, which were not updated) and the InternVideo2-clip parameters (the small additional ones), right?

I would be very grateful if you could let me know roughly when you plan to update the main parameters.

I'm looking forward to using your model in my work.

Thank you once again for your great work. @Andy1621

Andy1621 commented 6 months ago

Yes! Currently, we do not plan to update the main parameters. I tried updating more parameters, but it led to poorer performance, which may be caused by the limited post-training datasets.
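
The two-checkpoint setup confirmed in this thread can be sketched as an overlay of state dicts: start from the full stage-2 weights and overwrite only the entries that the small CLIP checkpoint provides. This is a minimal illustration with plain dictionaries, not the repository's loader. The function `overlay_state_dict` and all parameter names are hypothetical. With PyTorch, the two mappings would typically come from `torch.load(path, map_location="cpu")` and the merged result would be passed to `model.load_state_dict(...)`; the exact key layout of the real checkpoints is not shown here.

```python
from typing import Any, Dict, List, Tuple


def overlay_state_dict(
    base: Dict[str, Any], addon: Dict[str, Any]
) -> Tuple[Dict[str, Any], List[str], List[str]]:
    """Return base weights with add-on entries written on top.

    Also reports which names were replaced and which were newly added,
    so a mismatch in key names is easy to spot.
    """
    merged = dict(base)  # shallow copy; the base mapping is left untouched
    replaced: List[str] = []
    added: List[str] = []
    for name, value in addon.items():
        if name in base:
            replaced.append(name)
        else:
            added.append(name)
        merged[name] = value
    return merged, replaced, added


# Toy stand-ins for the two checkpoints (names and values are made up).
stage2_weights = {
    "vision_encoder.blocks.0.weight": [0.1, 0.2],
    "vision_encoder.attn_pool.weight": [0.0, 0.0],
    "text_encoder.embed.weight": [0.3],
}
clip_addon_weights = {
    "vision_encoder.attn_pool.weight": [0.5, 0.6],
}

merged, replaced, added = overlay_state_dict(stage2_weights, clip_addon_weights)
```

Checking `replaced` and `added` after the merge is a cheap sanity test: if every add-on key lands in `added`, the two checkpoints probably use different name prefixes.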