YueYANG1996 / LaBo

CVPR 2023: Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification
https://arxiv.org/abs/2211.11158

the unuse of use_img_norm and use_txt_norm #20

Closed Adasunnylily closed 11 months ago

Adasunnylily commented 11 months ago

Hi, thanks for your great work! I am a little confused about the code. I noticed that the two cfg parameters use_img_norm and use_txt_norm are set to False, which is unusual for a standard CLIP classification pipeline. Is there a specific reason for not normalizing the image and text features? Wouldn't normalization give better image-concept alignment and better interpretability with CLIP? Thanks a lot!

YueYANG1996 commented 11 months ago

In the paper, we mention that we use the dot product instead of cosine similarity. We found that using normalized features hurts the final performance (including the linear probe performance). The cosine similarity scores tend to be very close to each other (most are between 0 and 0.4), which is not discriminative enough to help the classification. In terms of alignment and interpretability, we found that the dot product can still capture the semantics between the image and text very well.
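
For readers following along, the difference between the two scoring modes described above can be sketched in a few lines. This is only an illustration, not the LaBo implementation: the flag names use_img_norm and use_txt_norm come from the cfg mentioned in this thread, but the function `alignment_scores`, the helper names, and the toy 2-D vectors are assumptions made up for the example. Real CLIP features would be high-dimensional tensors handled with a tensor library.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def alignment_scores(image_feat, concept_feats,
                     use_img_norm=False, use_txt_norm=False):
    """Score one image feature against every concept feature.

    With both flags False this is the raw dot product; with both
    flags True it reduces to cosine similarity. (Hypothetical helper,
    not taken from the LaBo code base.)
    """
    if use_img_norm:
        image_feat = l2_normalize(image_feat)
    if use_txt_norm:
        concept_feats = [l2_normalize(c) for c in concept_feats]
    return [dot(image_feat, c) for c in concept_feats]


# Toy features (assumed values, chosen so the arithmetic is easy to check).
image = [3.0, 4.0]
concepts = [[6.0, 8.0], [0.0, 5.0]]

raw = alignment_scores(image, concepts)              # dot product
cos = alignment_scores(image, concepts, True, True)  # cosine similarity

print("dot product:", raw)
print("cosine:     ", cos)
```

In this toy case the dot-product scores are 50 and 20, while the cosine scores are roughly 1.0 and 0.8. Normalization removes the feature magnitudes and squeezes the scores into a narrower range, which is the effect the reply above points to.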

Adasunnylily commented 11 months ago

Got it. Thank you for your response. I also wanted to ask if there are specific methods to measure the ability to capture semantics between images and text, or if the approach involves using classification accuracy or traditional CBM methods to identify important concepts?

YueYANG1996 commented 11 months ago

If you have some ground truth data, i.e., some image-concept pairs, you can measure the alignment performance using retrieval metrics. For example, given an image, compute the alignment scores with all candidate concepts and check whether the ground truth concepts of this image are within the top-k concepts.
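
The top-k check described above could be sketched as a recall@k computation. This is a minimal sketch under stated assumptions, not code from the repository: the function names `top_k_indices` and `recall_at_k`, the score values, and the ground-truth annotations are all hypothetical. It assumes each image already has one alignment score per candidate concept and a non-empty list of ground-truth concept indices.

```python
def top_k_indices(scores, k):
    """Indices of the k highest-scoring concepts, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def recall_at_k(scores, gt_indices, k):
    """Fraction of an image's ground-truth concepts found in its top-k."""
    top = set(top_k_indices(scores, k))
    return len(top & set(gt_indices)) / len(gt_indices)


# Hypothetical alignment scores: one row per image, one column per concept.
scores_per_image = [
    [0.9, 0.1, 0.5, 0.3],
    [0.2, 0.8, 0.4, 0.6],
]
# Hypothetical ground-truth concept indices for each image.
ground_truth = [
    [0, 2],
    [0, 1],
]

k = 2
per_image = [
    recall_at_k(scores, gt, k)
    for scores, gt in zip(scores_per_image, ground_truth)
]
mean_recall = sum(per_image) / len(per_image)

print("recall@%d per image:" % k, per_image)
print("mean recall@%d:" % k, mean_recall)
```

Here the first image retrieves both of its ground-truth concepts and the second retrieves one of two, so the mean recall@2 is 0.75. A stricter variant would count an image as a hit only if all of its ground-truth concepts appear in the top-k.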

Adasunnylily commented 11 months ago

I understand, thanks again!