Great work!
I'd like to learn more about the pretraining process mentioned in the paper: "During the pretraining process, all modalities gradually align with the language modality through contrastive learning."
Could you clarify whether this pretraining is equivalent to LoRA fine-tuning? In other words, during the pretraining phase, are the parameters of the video, infrared, depth, and audio encoders updated through contrastive learning on the four types of pairs contained in VIDAL-10M, namely video-language, infrared-language, depth-language, and audio-language data? A minimal sketch of the setup I have in mind follows.
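For concreteness, here is a rough PyTorch sketch of what I mean, assuming a frozen language tower, LoRA-style trainable adapters injected into each modality encoder, and a CLIP-style symmetric InfoNCE loss. All names here (`LoRALinear`, `train_step`, the batch keys) are hypothetical placeholders I made up for the question, not the actual LanguageBind API:

```python
import torch
import torch.nn.functional as F

class LoRALinear(torch.nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + scale * B A x."""
    def __init__(self, base: torch.nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen; only A and B train
        self.lora_a = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = torch.nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())

def contrastive_loss(modality_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE over a batch of paired embeddings."""
    modality_emb = F.normalize(modality_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = modality_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def train_step(modality_encoder, text_encoder, optimizer, batch):
    """One step on a single modality-language pair type (e.g. depth-language)."""
    with torch.no_grad():  # language encoder frozen: all modalities align to its space
        text_emb = text_encoder(batch["text"])
    modality_emb = modality_encoder(batch["modality_input"])  # only LoRA params update
    loss = contrastive_loss(modality_emb, text_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Is this roughly what happens for each of the four pair types during pretraining, or are the modality encoders fully fine-tuned (or kept frozen) instead?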