PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License

Choice of ViT-L over ViT-H #10

Closed jacklishufan closed 6 months ago

jacklishufan commented 6 months ago

Hi, thanks for the great work. ImageBind uses ViT-H, so I'm surprised that you were able to achieve better performance using only ViT-L. Have you tried exploring ViT-H under your setting? I see there is some leftover code for the LAION CLIP ViT-H in the config.

LinB203 commented 6 months ago

Let me summarize the performance improvements more succinctly.

- For video, we additionally pretrain on the video-text pairs of VIDAL-3M, while ImageBind does not. We also add temporal attention to the model, whereas ImageBind simply averages over the temporal dimension.
- For audio, depth, and infrared, thanks to the VIDAL dataset and the LanguageBind method, we do not need any intermediate modality as a bridge. As shown in Figure 1 of the paper, ImageBind can be viewed as using images as the intermediate modality.

At first we used ViT-H, but it brought only limited gains on video-text, and we hypothesize that this is because the model could not learn temporal information. We therefore added temporal attention, but at that point had to fall back to ViT-L due to memory constraints. Fortunately, it worked. We are currently exploring larger datasets and stronger models, which will go live shortly.
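For illustration, here is a minimal sketch of what temporal attention over per-frame ViT features can look like. This is not the repository's actual implementation; the class name `TemporalAttention` and the tensor layout are assumptions made for the example.

```python
# Minimal sketch (assumption, not the repo's code): self-attention across the
# time axis, applied independently to each spatial token of per-frame features.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, num_tokens, dim) per-frame ViT token features
        b, t, n, d = x.shape
        # Fold spatial tokens into the batch so attention runs over frames only.
        x = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out  # residual connection
        # Restore the original (batch, num_frames, num_tokens, dim) layout.
        return x.reshape(b, n, t, d).permute(0, 2, 1, 3)


if __name__ == "__main__":
    # 8 frames, 257 tokens and width 1024 as in CLIP ViT-L/14 at 224px input.
    frames = torch.randn(2, 8, 257, 1024)
    out = TemporalAttention(dim=1024)(frames)
    print(out.shape)  # torch.Size([2, 8, 257, 1024])
```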

LinB203 commented 5 months ago

Now available! Check our model zoo. We have released the LanguageBind-HUGE model for video.