PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852

How to Initialize the multi-modal encoders & training from scratch #25

Closed: chen-yy20 closed this issue 5 months ago

chen-yy20 commented 5 months ago

Great work! I noticed in Figure 3 of your paper that the multi-modal encoders' weights are frozen during multi-modal joint learning. Do you mean they stay frozen for the entire training, with only LoRA used to adjust the multi-modal encoders?

If so, how do you initialize their weights? Are they also initialized from the pretrained OpenCLIP vision encoder?

Furthermore, are there any pretraining steps in your work? Can I train LanguageBind from scratch, or can I only use LoRA to fine-tune it?

LinB203 commented 5 months ago

In the paper we only use LoRA to adjust the multi-modal encoders (full-parameter fine-tuning is now also supported). They are initialized from the pretrained OpenCLIP vision encoder. We use only the VIDAL dataset for training. You can train from scratch simply by setting args.pretrained to False, but this is not recommended; I prefer LoRA fine-tuning after loading the pre-trained weights, which can be found here.
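For readers landing on this issue: below is a minimal sketch (not the repository's actual training script) of the two options described above, i.e. initializing the encoder from pretrained OpenCLIP weights versus from scratch (`args.pretrained` set to False), and wrapping it with LoRA adapters so that only the adapter weights are trained. The flag names other than `--pretrained`, the model/checkpoint tags, the LoRA hyperparameters, and the target module name are illustrative assumptions.

```python
# Hedged sketch of the two initialization/fine-tuning options discussed in this issue.
# Assumptions are marked in comments; this is not LanguageBind's own training code.
import argparse

import open_clip                              # pip install open_clip_torch
from peft import LoraConfig, get_peft_model   # pip install peft

parser = argparse.ArgumentParser()
parser.add_argument("--pretrained", action="store_true",
                    help="initialize the encoder from pretrained OpenCLIP weights")
parser.add_argument("--lora-r", type=int, default=16)       # assumed LoRA rank
parser.add_argument("--lora-alpha", type=int, default=32)   # assumed LoRA scaling
args = parser.parse_args()

# Initialize from OpenCLIP (ViT-L/14 and the laion2b tag are assumed here),
# or from random weights when --pretrained is not given.
model, _, _ = open_clip.create_model_and_transforms(
    "ViT-L-14",
    pretrained="laion2b_s32b_b82k" if args.pretrained else None,
)

# Inject LoRA adapters into the attention output projections; get_peft_model
# freezes the base parameters, so only the adapter weights remain trainable.
lora_cfg = LoraConfig(
    r=args.lora_r,
    lora_alpha=args.lora_alpha,
    target_modules=["out_proj"],   # assumed: attention output projection layers
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```

Usage would look like `python train_sketch.py --pretrained` for LoRA fine-tuning on top of OpenCLIP weights, or omitting `--pretrained` to start from scratch (again, not recommended per the reply above).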