altndrr / vic

Code implementation of our NeurIPS 2023 paper: Vocabulary-free Image Classification
https://alessandroconti.me/papers/2306.00917
MIT License

What is the CLIP upper bound? How did you get the model? #19

Open PowerKaly opened 1 month ago

PowerKaly commented 1 month ago

Thank you for your work. I have some questions and hope you can answer them despite your busy schedule. What is the CLIP upper bound? How did you get the model?

We consider three main groups of baselines for our comparisons. The most straightforward baselines consist of using CLIP with large vocabularies, such as WordNet [41] (117k names) or the English Words (234k names [16]). As an upper bound, we also consider CLIP with the perfect vocabulary, i.e. the ground-truth names of the target dataset (CLIP upper bound). Due to lack of space, we only report results for CLIP with ViT-L [13].

altndrr commented 1 month ago

As you reported, "As an upper bound, we also consider CLIP with the perfect vocabulary, i.e. the ground-truth names of the target dataset (CLIP upper bound).", which means we test CLIP performance on the standard "generalized" zero-shot setting. This means we have a manually-annotated pre-defined list of class names per dataset, not a generated one per image as in the other approaches. Let me know if it is clearer now.