PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License

pretraining details #12

xiaoen0 closed this issue 6 months ago

xiaoen0 commented 6 months ago

Great work! I'd like to learn more about the details of the pretraining process mentioned: "During the pretraining process, all modalities gradually align with the language modality through contrastive learning." Could you clarify whether this pretraining process is equivalent to LoRA fine-tuning? In other words, during the pretraining phase, are the parameters of the video, infrared, depth, and audio encoders updated through contrastive learning on the four corresponding pair types in VIDAL-10M (video-language, infrared-language, depth-language, and audio-language data)?

LinB203 commented 6 months ago

Thanks for your attention, and you are right.
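
The objective discussed above, aligning each modality encoder's output with the language encoder's text embeddings, is the standard CLIP-style symmetric InfoNCE contrastive loss. Below is a minimal NumPy sketch of that loss for one modality-text batch; it is an illustration, not code from this repo, and the names `info_nce_loss`, `modality_emb`, and `text_emb` are made up for the example (the actual training additionally applies LoRA to the modality encoders and uses PyTorch).

```python
import numpy as np

def info_nce_loss(modality_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE (CLIP-style) contrastive loss.

    modality_emb, text_emb: (batch, dim) arrays where row i of each
    forms a matched pair (e.g. a depth map and its caption embedding).
    """
    # L2-normalize so dot products are cosine similarities
    m = modality_emb / np.linalg.norm(modality_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = m @ t.T / temperature      # (batch, batch) similarity matrix
    n = logits.shape[0]                 # positives sit on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # average the modality->text and text->modality directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each matched modality-text pair together and pushes mismatched pairs in the batch apart, which is what "gradually align with the language modality" refers to; only the modality-side (LoRA) parameters receive gradients when the language encoder is kept frozen.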