PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License

Pretraining on video dataset without LoRA #54

Open shihuai opened 5 months ago

shihuai commented 5 months ago

Great work! I am very interested in it and have recently been trying to reproduce the video-modality alignment. I initialize from OpenAI's pretrained ViT-B/32, and the visual encoder uses temporal attention to model the temporal relationships across frames. During training, the text encoder is frozen and only the embedding layer and the temporal-attention blocks of the visual encoder are updated.

With this setup the loss only drops from 5.9 to 5.2, whereas if both the visual encoder and the text encoder are fully fine-tuned, the loss goes down to about 0.3. So when only part of the visual encoder is fine-tuned, the loss converges poorly. Did you encounter this during training? Is there anything I should pay attention to when using this fine-tuning strategy?
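
For reference, here is a minimal sketch of the partial-fine-tuning setup I mean: freeze all parameters, then re-enable gradients only for parameter names matching the temporal-attention and embedding modules. The marker strings (`temporal_attn`, `embed`) and the toy model below are my own assumptions for illustration, not the actual LanguageBind module names.

```python
import torch.nn as nn

def freeze_for_partial_finetune(model: nn.Module,
                                trainable_markers=("temporal_attn", "embed")):
    """Freeze everything except parameters whose names contain a marker.

    NOTE: marker substrings are assumptions; adapt them to the real
    module names in your model (e.g. via `print(model)` or
    `model.named_parameters()`).
    """
    for name, param in model.named_parameters():
        param.requires_grad = any(m in name for m in trainable_markers)

    # Report how much of the model is actually being updated.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable}/{total} ({100 * trainable / total:.1f}%)")
    return model

if __name__ == "__main__":
    # Toy stand-in for a CLIP-style model with added temporal attention
    # (hypothetical module names, for demonstration only).
    toy = nn.ModuleDict({
        "patch_embed": nn.Linear(768, 768),                  # trainable ("embed")
        "temporal_attn": nn.MultiheadAttention(768, 8),      # trainable
        "spatial_block": nn.Linear(768, 768),                 # frozen
        "text_encoder": nn.Linear(512, 512),                  # frozen
    })
    freeze_for_partial_finetune(toy)
```

In my runs only roughly the temporal-attention and embedding parameters are trainable under this scheme, which is the configuration where the loss plateaus around 5.2.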