OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

About Framework of ViCLIP #70

Open kimsekeun opened 5 months ago

kimsekeun commented 5 months ago

Thank you for the nice work.

Regarding the training of ViCLIP, I would like to clarify my understanding of the paper.

If the vision transformer is not pre-trained (e.g., with an MAE-style method), then the model only learns to align image and text, and the vision encoder will have no visual representation power beyond aligning to text.

So my questions are:

  1. Is the vision transformer pre-trained, or does it just use ViT blocks initialized with random weights?
  2. Is the text transformer pre-trained, or does it just use transformer blocks initialized with random weights?

And if possible, could you release the training code of ViCLIP? It would help clarify the paper.

Thank you in advance. :)

shepnerd commented 5 months ago

ViCLIP's vision and text transformers are both initialized from CLIP's corresponding transformers.

Regarding the statement, "If vision transforms are not pre-trained, such as MAE method, then it means that it only aligns between image and text. The vision encoder will have no power of vision representation except to align to text," that may be an exaggeration. Empirical experiments have demonstrated that ViCLIP does learn competent video representations. In both CLIP and our paper, you can find image/video classification results in the fine-tuning setting with non-trivial performance, supporting the effectiveness of the learned vision representation.
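
A rough sketch of that initialization idea (using Hugging Face's `transformers` CLIP for illustration, not the repository's actual code; ViCLIP's real model adds temporal modeling and masking on top of this):

```python
# Illustration only: reuse CLIP's pretrained vision and text towers, and encode a
# video by running the image tower per frame and mean-pooling over time.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Dummy video: T frames of 224x224 RGB (random values stand in for real frames).
T = 8
frames = torch.rand(T, 3, 224, 224)

with torch.no_grad():
    frame_feats = model.get_image_features(pixel_values=frames)  # (T, D)
    video_feat = frame_feats.mean(dim=0, keepdim=True)           # (1, D)

    text_inputs = processor(text=["a person cooking pasta"],
                            return_tensors="pt", padding=True)
    text_feat = model.get_text_features(**text_inputs)           # (1, D)

# Cosine similarity between the pooled video embedding and the text embedding.
print(F.cosine_similarity(video_feat, text_feat))
```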

To achieve appropriate representations, the focus in the community is on compressing or reconstructing signals. In the context of multimodal contrastive learning, you can view the training process as an attempt to compress high-dimensional visual signals into semantics defined by human languages.
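
Concretely, the training objective is the standard symmetric video-text contrastive (InfoNCE) loss; here is a minimal sketch with placeholder embeddings rather than our exact implementation:

```python
# Minimal sketch of a symmetric video-text contrastive (CLIP-style) loss.
# The embeddings below are random placeholders; in practice they come from
# the video and text encoders.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    v = F.normalize(video_emb, dim=-1)            # cosine-similarity space
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Matched (video_i, text_i) pairs sit on the diagonal; average both directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

loss = video_text_contrastive_loss(torch.randn(16, 512), torch.randn(16, 512))
print(loss.item())
```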

We are open for discussion and would be glad to engage with you further on this topic.

The code release is on the way.

kimsekeun commented 5 months ago

Thank you for the clarification.

If we only have a vision encoder pre-trained with MAE, and no pre-trained text encoder, is it possible to align human language with the vision encoder using your method? That is, initialize the vision encoder from MAE and randomly initialize the text encoder, with both trainable during training. What would happen if we applied ViCLIP in this setting? I would like to know your thoughts.

Thank you.

shepnerd commented 5 months ago

Indeed, your idea seems reasonable. However, during implementation, it is crucial to ensure that the training hyperparameters are appropriately tuned. Additionally, we suggest initializing the text encoder from a pre-trained model (e.g., CLIP's) for easier training. Otherwise, you may need to experiment with the amount of data required to learn a proper text encoder from scratch. This will help improve the overall performance and convergence of the model.
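
As an illustration only (not our exact recipe), one common way to handle this asymmetric initialization is to give the two towers different learning rates via optimizer parameter groups:

```python
# Sketch: a smaller learning rate for the pre-trained (e.g., MAE) vision encoder
# and a larger one for a randomly initialized text encoder. The modules here are
# stand-ins; swap in the real encoders.
import torch
import torch.nn as nn

vision_encoder = nn.Linear(768, 512)   # stand-in for an MAE-pretrained ViT
text_encoder = nn.Linear(512, 512)     # stand-in for a from-scratch text transformer

optimizer = torch.optim.AdamW(
    [
        {"params": vision_encoder.parameters(), "lr": 1e-5},  # gentle fine-tuning
        {"params": text_encoder.parameters(), "lr": 5e-4},    # learn from scratch
    ],
    weight_decay=0.05,
)
```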

You may refer to our UMT paper for some thoughts. In the initial learning stage, we employ VideoMAE-style training and also distill from a CLIP vision encoder so the student emulates a language-friendly visual representation. We hope this information proves helpful to you.
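
As a rough illustration of the distillation idea (a simplified sketch, not UMT's actual code), a frozen CLIP vision tower serves as the teacher and the student is trained to match its features:

```python
# Simplified feature-distillation sketch: align a trainable student's output to
# a frozen CLIP vision teacher with a negative-cosine loss. The student here is
# a toy module standing in for a video encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import CLIPVisionModelWithProjection

teacher = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))  # toy student

frames = torch.rand(4, 3, 224, 224)  # dummy batch of frames

with torch.no_grad():
    target = teacher(pixel_values=frames).image_embeds  # (4, 512) frozen teacher features

pred = student(frames)                                  # (4, 512) student features
distill_loss = 1 - F.cosine_similarity(pred, target, dim=-1).mean()
distill_loss.backward()
```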