PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License

Vision encoder version #34

Closed. JosephPai closed this issue 3 months ago.

JosephPai commented 4 months ago

Hi authors,

Thanks for releasing the code. I noticed that you mentioned "Note that our image encoder is the same as OpenCLIP. Not as fine-tuned as other modalities." Could you tell me which exact version of the CLIP weights you are using?

Thanks!

LinB203 commented 3 months ago

For CLIP-L: https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K

For CLIP-H: https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K
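For reference, here is a minimal sketch of how one might pull the vision tower from either of the checkpoints above using the `transformers` library. The repo IDs come from the reply; the helper function name and the choice of `transformers` (rather than `open_clip`) are illustrative assumptions, not LanguageBind's own loading code.

```python
# Hugging Face Hub repo IDs quoted in the reply above.
CLIP_L = "laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K"  # ViT-L/14, DataComp-XL
CLIP_H = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"      # ViT-H/14, LAION-2B

def load_vision_encoder(repo_id: str):
    """Return only the vision tower of a full CLIP checkpoint.

    Requires `transformers`; downloads the weights from the Hugging Face
    Hub on first call (hypothetical helper, for illustration only).
    """
    from transformers import CLIPVisionModel  # deferred: heavy optional dep
    return CLIPVisionModel.from_pretrained(repo_id)

# Example: vision_encoder = load_vision_encoder(CLIP_L)
```

Loading `CLIPVisionModel` from a full CLIP checkpoint discards the text tower, which matches the use of these weights purely as an image encoder.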