PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852

How to Initialize the multi-modal encoders & training from scratch #25

Closed: chen-yy20 closed this issue 5 months ago

chen-yy20 commented 5 months ago

Great work! I noticed in Figure 3 of your paper that the multi-modal encoders' weights are frozen during multi-modal joint learning. Do you mean they stay frozen for the entire training, with only LoRA used to adjust the multi-modal encoders?

If so, how do you initialize their weights? Are they also initialized from the pretrained OpenCLIP vision encoder?

Furthermore, are there any pretraining steps in your work? Can I train LanguageBind from scratch, or can I only use LoRA to fine-tune it?

LinB203 commented 5 months ago

In the paper we only use LoRA to adjust the multi-modal encoders (full-parameter fine-tuning is now also supported). They are initialized from the pretrained OpenCLIP vision encoder. We use only the VIDAL dataset for training. You can train from scratch simply by setting args.pretrained to False, but this is not recommended; I prefer LoRA fine-tuning after loading the pre-trained weights, which can be found here.
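For readers landing on this issue: below is a minimal sketch (not the repository's actual training script) of the two options described above, i.e. initializing the encoder from pretrained OpenCLIP weights versus from scratch (`args.pretrained` set to False), and wrapping it with LoRA adapters so that only the adapter weights are trained. The flag names other than `--pretrained`, the model/checkpoint tags, the LoRA hyperparameters, and the target module name are illustrative assumptions.

```python
# Hedged sketch of the two initialization/fine-tuning options discussed in this issue.
# Assumptions are marked in comments; this is not LanguageBind's own training code.
import argparse

import open_clip                              # pip install open_clip_torch
from peft import LoraConfig, get_peft_model   # pip install peft

parser = argparse.ArgumentParser()
parser.add_argument("--pretrained", action="store_true",
                    help="initialize the encoder from pretrained OpenCLIP weights")
parser.add_argument("--lora-r", type=int, default=16)       # assumed LoRA rank
parser.add_argument("--lora-alpha", type=int, default=32)   # assumed LoRA scaling
args = parser.parse_args()

# Initialize from OpenCLIP (ViT-L/14 and the laion2b tag are assumed here),
# or from random weights when --pretrained is not given.
model, _, _ = open_clip.create_model_and_transforms(
    "ViT-L-14",
    pretrained="laion2b_s32b_b82k" if args.pretrained else None,
)

# Inject LoRA adapters into the attention output projections; get_peft_model
# freezes the base parameters, so only the adapter weights remain trainable.
lora_cfg = LoraConfig(
    r=args.lora_r,
    lora_alpha=args.lora_alpha,
    target_modules=["out_proj"],   # assumed: attention output projection layers
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```

Usage would look like `python train_sketch.py --pretrained` for LoRA fine-tuning on top of OpenCLIP weights, or omitting `--pretrained` to start from scratch (again, not recommended per the reply above).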