Great work! I noticed in Figure 3 of your paper that the multi-modal encoders' weights are frozen during Multi-modal Joint Learning. Do you mean they are frozen for the entire training time, and you only use LoRA to adjust the multi-modal encoders?
If so, how do you initialize their weights? Are they also initialized from the pretrained OpenCLIP vision encoder?
Furthermore, are there any pretraining steps in your work? Can I train LanguageBind from scratch, or can I only use LoRA to fine-tune it?
In the paper, we only use LoRA to adjust the multi-modal encoders (full-parameter fine-tuning is now also supported).
They are initialized from the pretrained OpenCLIP vision encoder.
We use only the VIDAL dataset for training. You can train from scratch simply by setting args.pretrained to False, but this is not recommended. I prefer LoRA fine-tuning after loading the pre-trained weights, which can be found here.
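For anyone wanting to try this, below is a minimal sketch of LoRA fine-tuning on top of a pretrained CLIP-style vision encoder using the Hugging Face transformers and peft libraries. Note this is illustrative only: the model name, the LoRA hyperparameters, and the pretrained toggle are assumptions, not LanguageBind's actual training code.

```python
# Minimal sketch: LoRA fine-tuning a pretrained CLIP vision encoder.
# Illustrative only -- model name, hyperparameters, and the `pretrained`
# toggle are assumptions, not LanguageBind's actual training code.
from transformers import CLIPVisionConfig, CLIPVisionModel
from peft import LoraConfig, get_peft_model

pretrained = True  # analogous in spirit to args.pretrained; False = from scratch

if pretrained:
    # Initialize from a pretrained CLIP-style vision encoder (recommended).
    model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
else:
    # Random initialization -- possible, but not recommended.
    model = CLIPVisionModel(CLIPVisionConfig())

# Wrap the encoder with LoRA adapters; the base weights stay frozen and
# only the low-rank update matrices are trained.
lora_config = LoraConfig(
    r=16,                # rank of the low-rank update (assumed value)
    lora_alpha=32,       # scaling factor (assumed value)
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA parameters are trainable
```

Full-parameter fine-tuning would correspond to skipping the peft wrapping and training the loaded model directly.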