PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License

Difference from ImageBind #2

Closed lzw-lzw closed 7 months ago

lzw-lzw commented 7 months ago

Thank you for your excellent work. I would like to know what the difference is between this work and ImageBind. As I understand it, the difference mainly lies in which modality is used as the binding (central) modality, right? Thanks!

BinZhu-ece commented 7 months ago

Thank you for your question; you are touching on one of the work's key highlights. To summarize: first, a language-binding approach is better suited to the majority of language-based downstream tasks because no intermediate modality is required. Second, LanguageBind is pre-trained on the VIDAL-10M dataset, where video, infrared, depth, and language pairs are aligned with language directly, which outperforms indirect alignment through an image-based intermediate modality. Additionally, within VIDAL-10M the language modality consists of multi-view textual descriptions enhanced by models such as ChatGPT, which ensures that the central modality carries enough semantic information to bind effectively with the other modalities.
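To make the "no intermediate modality" point concrete, here is a minimal, hypothetical PyTorch sketch of language-centric contrastive alignment. It is not the repository's actual training code: the encoder classes, feature dimensions, and loss function below are illustrative assumptions. The idea it shows is that every non-language modality encoder is trained directly against a frozen language tower, so the language embedding space serves as the shared anchor.

```python
# Illustrative sketch (NOT the LanguageBind API): language-centric contrastive
# alignment. Each non-language modality binds directly to a frozen language
# encoder, so no intermediate (image) modality is needed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in for a modality tower (video/depth/infrared/language); hypothetical."""

    def __init__(self, in_dim: int, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalised embeddings, as in CLIP-style contrastive training.
        return F.normalize(self.proj(x), dim=-1)


def info_nce(modality_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss between one modality and language."""
    logits = modality_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


# Hypothetical raw feature dimensions for each modality.
text_encoder = ToyEncoder(in_dim=768)            # language tower (the anchor)
modality_encoders = {
    "video": ToyEncoder(in_dim=1024),
    "depth": ToyEncoder(in_dim=256),
    "infrared": ToyEncoder(in_dim=256),
}
for p in text_encoder.parameters():              # keep the anchor fixed
    p.requires_grad_(False)

# One toy training step: every modality is contrasted against the SAME text
# embeddings, so all modalities land in one shared, language-aligned space.
batch = 4
text_emb = text_encoder(torch.randn(batch, 768))
features = {
    "video": torch.randn(batch, 1024),
    "depth": torch.randn(batch, 256),
    "infrared": torch.randn(batch, 256),
}
loss = sum(info_nce(enc(features[m]), text_emb)
           for m, enc in modality_encoders.items())
print(f"total language-binding loss: {loss.item():.4f}")
```

At inference, any modality embedding can then be compared to a text embedding (or to another modality via text) without ever routing through an image encoder, which is the practical benefit for language-based downstream tasks mentioned above.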

lzw-lzw commented 7 months ago

Got it, thanks for your patient reply!