ExplainableML / Vision_by_Language

[ICLR 2024] Official repository for "Vision-by-Language for Training-Free Compositional Image Retrieval"
MIT License

A question about CLIP #8

Open Jian-Lang opened 15 hours ago

Jian-Lang commented 15 hours ago

In the paper, you mention that "We experiment with different ViT-variants (Dosovitskiy et al., 2021) of CLIP, with weights taken from the official implementation in (Radford et al., 2021)", and the FashionIQ results table labels the models as ViT-B/32 and ViT-L/14. However, the script for the FashionIQ experiments loads ViT-B-32 and ViT-L-14, which is the OpenCLIP naming. Why the difference? Could you clarify? @sgk98 @Confusezius

Jian-Lang commented 15 hours ago

To put it differently: do the FashionIQ weights come from OpenAI's released checkpoints or from OpenCLIP?

sgk98 commented 5 hours ago

Hi Jian, in our paper the results for CLIP ViT-B/32 and ViT-L/14 in all tables are indeed obtained with the OpenAI checkpoints (and you can also run those using the appropriate flags). Later on, we experimented with the OpenCLIP variants (in particular to use the ViT-H and ViT-bigG models), and that is what you currently see in the codebase. If I remember correctly, the OpenCLIP variants score substantially higher than the OpenAI models (e.g. CIRR R@1 is 24.5 in the paper, while the OpenCLIP ViT-L-14 achieves an R@1 of 33.5). I hope this helps.
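
For reference, a minimal sketch of how the two checkpoint families can be loaded through the open_clip library; the pretrained tags below are illustrative and not necessarily the exact flags used by the repository's scripts (open_clip.list_pretrained() shows the valid model/tag combinations):

```python
import open_clip

# OpenAI checkpoint (the weights behind the numbers reported in the paper).
# open_clip names models with dashes ("ViT-B-32") rather than the slashes
# used by the original OpenAI clip package ("ViT-B/32"); the "openai"
# pretrained tag selects the original OpenAI weights.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)

# OpenCLIP checkpoint (what the codebase currently uses); larger variants
# such as ViT-H-14 and ViT-bigG-14 are only available this way.
# The LAION tag below is illustrative -- check open_clip.list_pretrained().
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")
```

So the ViT-B/32 vs. ViT-B-32 difference is purely a naming convention between the two libraries; which weights you actually get is controlled by the pretrained tag.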