Closed Adasunnylily closed 11 months ago

Hi, thanks for your great work! I am a little confused about the code. I noticed that two parameters in the cfg, use_img_norm and use_txt_norm, are set to False, which is unusual in a normal CLIP classification pipeline. Is there a specific reason for not normalizing the image and text features? Wouldn't normalization be a better way to get image-concept alignment and better interpretability based on CLIP? Thanks a lot!

In the paper, we mentioned that we used the dot product instead of cosine similarity. We found that using normalized features hurts the final performance (including the linear probe performance). The cosine similarity scores tend to be very close together (most fall between 0 and 0.4), which makes them not discriminative enough to help the classification. In terms of alignment and interpretability, we found that the dot product can still capture the semantics between the image and text very well.
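For concreteness, here is a minimal sketch of the two scoring choices. The helper below is hypothetical (not code from the repo), but its flag names mirror the use_img_norm / use_txt_norm cfg options discussed in this thread: with both flags False the score is a plain dot product, and with both True it reduces to cosine similarity.

```python
import torch
import torch.nn.functional as F

def alignment_scores(img_feats: torch.Tensor, txt_feats: torch.Tensor,
                     use_img_norm: bool = False,
                     use_txt_norm: bool = False) -> torch.Tensor:
    """Score every image against every concept embedding (hypothetical helper).

    img_feats: (n_images, d)   CLIP image features
    txt_feats: (n_concepts, d) CLIP text (concept) features
    """
    if use_img_norm:
        img_feats = F.normalize(img_feats, dim=-1)  # unit-norm image features
    if use_txt_norm:
        txt_feats = F.normalize(txt_feats, dim=-1)  # unit-norm text features
    # Plain dot product when no normalization is applied; cosine
    # similarity when both sides are normalized.
    return img_feats @ txt_feats.t()  # (n_images, n_concepts)
```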
Got it. Thank you for your response. I also wanted to ask whether there are specific methods to measure how well the model captures the semantics between images and text, or whether the approach relies on classification accuracy or traditional CBM methods to identify important concepts?
If you have some ground-truth data, i.e., image-concept pairs, you can measure the quality of the alignment using retrieval metrics. For example, given an image, compute the alignment scores with all candidate concepts and check whether the ground-truth concepts of that image are within the top-k concepts.
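As a concrete illustration, here is a minimal sketch of such a top-k retrieval check. The function name, the shapes, and the boolean annotation matrix are illustrative assumptions, not code from the repo:

```python
import torch

def topk_concept_recall(scores: torch.Tensor, gt_mask: torch.Tensor,
                        k: int = 5) -> float:
    """Fraction of images whose ground-truth concepts appear in the top-k.

    scores:  (n_images, n_concepts) alignment scores (e.g. dot products)
    gt_mask: (n_images, n_concepts) boolean; True where a concept is a
             ground-truth annotation for that image
    """
    topk_idx = scores.topk(k, dim=-1).indices       # (n_images, k)
    hits = gt_mask.gather(1, topk_idx).any(dim=-1)  # (n_images,)
    return hits.float().mean().item()
```

In practice you would report this recall for a few values of k (e.g. k = 1, 5, 10) to get a fuller picture of the alignment quality.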
I understand, thanks again!