Closed lzw-lzw closed 1 year ago
Thank you for your question — you are touching on one of the work's highlights. To summarize: first, a language-binding approach is more suitable for the majority of language-based downstream tasks because no intermediate modality is required. Second, LanguageBind is pre-trained on the VIDAL-10M dataset, where directly aligning video, infrared, depth, and language data pairs outperforms indirectly integrating them through an image modality. Additionally, within VIDAL-10M, the language modality consists of multi-view textual descriptions enhanced by advanced models such as ChatGPT. This ensures that the central modality carries enough semantic information to bind effectively with the other modalities.
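To make the "direct alignment with language as the central modality" idea concrete, here is a minimal sketch of a symmetric InfoNCE-style contrastive loss that pulls each non-language modality's embeddings (video, infrared, depth) toward their paired language embeddings. This is an illustrative toy in numpy, not the actual LanguageBind training code; the function name, temperature value, and random data are all assumptions for the example.

```python
import numpy as np

def info_nce_loss(modality_emb, language_emb, temperature=0.07):
    """Toy symmetric InfoNCE loss aligning a batch of modality embeddings
    (e.g. video/infrared/depth) with their paired language embeddings.
    Matched pairs sit on the diagonal of the similarity matrix."""
    # L2-normalize so dot products become cosine similarities
    m = modality_emb / np.linalg.norm(modality_emb, axis=1, keepdims=True)
    l = language_emb / np.linalg.norm(language_emb, axis=1, keepdims=True)
    logits = (m @ l.T) / temperature        # (B, B) similarity matrix
    labels = np.arange(len(logits))         # ground-truth pairs on the diagonal

    def cross_entropy(z):
        # numerically stable log-softmax over each row
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average of modality->language and language->modality directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

rng = np.random.default_rng(0)
lang = rng.normal(size=(4, 8))
# perfectly aligned modality embeddings should give a lower loss
aligned_loss = info_nce_loss(lang.copy(), lang)
# unrelated random embeddings should give a higher loss
random_loss = info_nce_loss(rng.normal(size=(4, 8)), lang)
print(aligned_loss < random_loss)
```

Because every modality is trained against the same language space, any two non-language modalities become comparable "for free" through language, with no image modality acting as an intermediary.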
Got it — thanks for your patient reply!
Thank you for your excellent work. I would like to know the difference between this work and ImageBind. As I understand it, the main difference is which modality is used as the binding (central) modality — is that right? Thanks!