PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License

Difference from ImageBind #2

Closed. lzw-lzw closed this issue 1 year ago.

lzw-lzw commented 1 year ago

Thank you for your excellent work. I want to know what the difference is between this work and ImageBind. As I understand it, the main difference is the modality used as the binding (central) modality, right? Thanks!

BinZhu-ece commented 1 year ago

Thank you for your question; you are pointing at one of the key differences. To summarize: first, a language-binding approach suits the majority of language-based downstream tasks better, because no intermediate modality is required. Second, LanguageBind is pre-trained on the VIDAL-10M dataset, where video, infrared, and depth data are aligned directly with paired language; this direct alignment outperforms indirect integration through an image-based intermediate modality. Additionally, within VIDAL-10M the language modality consists of multi-view textual descriptions enhanced by models such as ChatGPT, which ensures that the central modality carries enough semantic information to bind effectively with the other modalities.
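
To illustrate what "binding to language" means in practice, here is a minimal sketch assuming CLIP-style contrastive alignment of each modality encoder against a frozen language encoder. The names (`contrastive_loss`, `video_encoder`, `text_encoder`, `batch`) are placeholders for illustration, not the actual LanguageBind API.

```python
# Minimal sketch: each modality encoder is trained to align directly with a
# frozen language encoder via a symmetric InfoNCE (CLIP-style) contrastive loss.
import torch
import torch.nn.functional as F

def contrastive_loss(modality_emb, text_emb, temperature=0.07):
    # Normalize embeddings so the dot product is cosine similarity.
    modality_emb = F.normalize(modality_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = modality_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: modality-to-text and text-to-modality directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Illustrative training step (encoders and batch are assumed placeholders):
# the language encoder stays frozen, so every modality (video, infrared, depth, ...)
# is pulled into the same language embedding space directly, without routing
# through an intermediate image modality as ImageBind does.
#
# with torch.no_grad():
#     text_emb = text_encoder(batch["text"])
# video_emb = video_encoder(batch["video"])
# loss = contrastive_loss(video_emb, text_emb)
```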

lzw-lzw commented 1 year ago

Got it, thanks for your patient reply!