jacklishufan closed this issue 6 months ago
Let me summarize the performance improvements more succinctly. For video, we additionally pretrain on the video-text pairs of VIDAL-3M, while ImageBind does not. We also add temporal attention to the model, whereas ImageBind simply averages over the temporal dimension. For audio, depth, and infrared, thanks to the VIDAL dataset and the LanguageBind method, we do not need any intermediate modality as a bridge; as shown in Figure 1 of the paper, ImageBind can be viewed as using images as the intermediate modality. Initially we used ViT-H, but it brought only limited gains for video-text; we hypothesize that this is because the model could not learn timing-related information. We therefore added temporal attention, but at that point had to fall back to ViT-L due to memory constraints. Fortunately, it worked. We are currently exploring larger datasets and stronger models, which will be released shortly.
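To make the temporal-attention point concrete, here is a minimal NumPy sketch contrasting mean pooling over frames (the ImageBind-style baseline mentioned above) with self-attention across the temporal axis. All shapes, projection matrices, and function names are hypothetical illustrations, not the actual LanguageBind implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mean_pool_time(frames):
    # ImageBind-style: average frame features over the temporal axis,
    # discarding any ordering information between frames
    return frames.mean(axis=0)

def temporal_attention(frames, Wq, Wk, Wv):
    # Temporal-attention sketch: each frame attends to all other frames,
    # so the model can weight time steps instead of treating them uniformly
    q, k, v = frames @ Wq, frames @ Wk, frames @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

T, D = 8, 16                       # hypothetical: 8 frames, 16-dim features
rng = np.random.default_rng(0)
frames = rng.normal(size=(T, D))   # per-frame features from the image encoder
Wq = Wk = Wv = np.eye(D)           # identity projections, for illustration only

pooled = mean_pool_time(frames)                    # shape (D,): one vector, no time
attended = temporal_attention(frames, Wq, Wk, Wv)  # shape (T, D): per-frame, time-aware
print(pooled.shape, attended.shape)
```

The key design difference is that mean pooling collapses the temporal axis before the model can use it, while the attention variant keeps per-frame outputs whose mixing weights depend on frame content.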
Hi, thanks for the great work. ImageBind uses ViT-H, so I'm surprised that you were able to achieve better performance using only ViT-L. Have you tried ViT-H under your setting? I see in the config there is some leftover code for the LAION CLIP ViT-H.