anguyen8 / gScoreCAM


gradCAM in CLIP ViT #7

Open royg27 opened 1 year ago

royg27 commented 1 year ago

Hi,

Thanks for the great work. When I run gradCAM on ViT, it completely fails (unlike the results reported in other works). Is there a bug, or are there different implementation details?

Thanks!

royg27 commented 1 year ago

More information: I am running it using the provided Colab notebook. When I run the COCO evaluation with gradCAM I get 0 matches, whereas HilaCAM reproduces the reported numbers. Here are some visual examples of gradCAM on ImageNet images:

[Attached: two gradCAM visualization examples on ImageNet images]
Chanfeechen commented 1 year ago

@royg27 While GradCAM leverages the output of the last layer for visualizations, it's not as effective with ViT models, which are better served by attention-based methods. Methods like gScoreCAM, designed for CNNs, might not fully capture ViT's nuances. In contrast, HilaCAM, specifically crafted for ViT, utilizes both attention and its gradients for more appropriate visualizations. Consequently, we don't recommend GradCAM for ViT-based models. A side note: visualizing the attention rather than the final layer outputs in ViT can yield more intuitive results.
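For anyone looking for a starting point, here is a minimal sketch of what "visualizing the attention" could look like with the OpenAI CLIP package: capture the input to the last block's MultiheadAttention with a forward pre-hook, re-run that attention layer asking for the weights, and reshape the CLS token's row into a patch-grid heatmap. The module paths assume the OpenAI CLIP implementation of ViT-B/32, and cat.jpg is a placeholder image; this is plain last-layer attention, not HilaCAM's method or anything from this repo.

```python
import torch
import clip  # OpenAI CLIP package (assumed)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

block = model.visual.transformer.resblocks[-1]
captured = {}

def pre_hook(module, inputs):
    # inputs[0] is the LayerNorm'ed token sequence that feeds MultiheadAttention,
    # shaped (seq_len, batch, dim) because CLIP's ViT runs in LND order.
    captured["x"] = inputs[0].detach()

handle = block.attn.register_forward_pre_hook(pre_hook)

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)  # placeholder image
with torch.no_grad():
    model.encode_image(image)
handle.remove()

x = captured["x"]
with torch.no_grad():
    # Re-run the same attention layer, this time asking for the weights
    # (averaged over heads by default).
    _, attn_weights = block.attn(x, x, x, need_weights=True)

# attn_weights: (batch, seq_len, seq_len); row 0 is the CLS token's attention.
cls_attn = attn_weights[0, 0, 1:]         # drop the CLS-to-CLS entry
side = int(cls_attn.numel() ** 0.5)       # 7x7 patch grid for ViT-B/32 at 224px input
heatmap = cls_attn.reshape(side, side).float().cpu()
print(heatmap)
```

Upsampling this heatmap to the input resolution and overlaying it on the image gives a rough saliency map; for visualizations faithful to the model's decision, HilaCAM's relevancy propagation remains the better choice for ViT.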

royg27 commented 1 year ago

Thank you for your reply. How can I modify the code to look at the attention rather than the final-layer outputs? I tried changing target_layer = clip_model.visual.transformer.resblocks[-1] to several layers inside ResidualAttentionBlock, for instance target_layer = clip_model.visual.transformer.resblocks[-1].ln_2, but they all result in meaningless outputs.
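For reference, the two target_layer choices mentioned above, written out (assuming the OpenAI CLIP package; the exact backbone used in the notebook is a guess):

```python
import torch
import clip  # OpenAI CLIP package (assumed)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)  # backbone is a guess

# The notebook's default target layer: the last residual block of CLIP's ViT.
target_layer = clip_model.visual.transformer.resblocks[-1]

# One of the alternatives tried: the second LayerNorm inside that block.
target_layer = clip_model.visual.transformer.resblocks[-1].ln_2
```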

Thanks again.