Open royg27 opened 1 year ago
More information - I am running it using the provided Colab notebook. When I run evaluation on COCO, I get 0 matches, whereas with HilaCAM I get the reported numbers. Here are some visual examples on ImageNet images using gradCAM:
@royg27 While GradCAM leverages the output of the last layer for visualizations, it's not as effective with ViT models, which are better served by attention-based methods. Methods like gScoreCAM, designed for CNNs, might not fully capture ViT's nuances. In contrast, HilaCAM, specifically crafted for ViT, utilizes both attention and its gradients for more appropriate visualizations. Consequently, we don't recommend GradCAM for ViT-based models. A side note: visualizing the attention rather than the final layer outputs in ViT can yield more intuitive results.
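To make the "visualize the attention" suggestion concrete, here is a minimal, self-contained sketch of the general idea using a toy block built on `torch.nn.MultiheadAttention`. It is not the CLIP code itself: in the real model the analogous module would be `clip_model.visual.transformer.resblocks[-1].attn`, and CLIP's own forward calls attention with `need_weights=False`, so you would need to patch that call (or recompute attention from the hooked inputs) to get the weights out.

```python
import torch
import torch.nn as nn

# Toy stand-in for one ViT residual attention block. In the real CLIP
# model the analogous module is
# clip_model.visual.transformer.resblocks[-1].attn (name assumed from
# the OpenAI CLIP implementation).
class ToyBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln = nn.LayerNorm(dim)

    def forward(self, x):
        y = self.ln(x)
        # need_weights=True makes the module return the (head-averaged)
        # attention map; CLIP's own forward passes need_weights=False,
        # so in the real model this call would have to be patched.
        out, attn = self.attn(y, y, y, need_weights=True)
        return x + out, attn

block = ToyBlock()
tokens = torch.randn(1, 1 + 49, 64)  # [CLS] token + 7x7 patch tokens
_, attn = block(tokens)              # attn: (batch, query, key) = (1, 50, 50)

# Attention from the [CLS] query to each patch key, reshaped into a
# 7x7 spatial heatmap that can be upsampled over the input image.
cls_attn = attn[0, 0, 1:]            # shape (49,)
heatmap = cls_attn.reshape(7, 7)
```

The key point is that the heatmap comes from the attention weights of the [CLS] token over the patch tokens, not from the block's output activations, which is why targeting `resblocks[-1]` or its sub-layers with a plain GradCAM hook gives noise.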
Thank you for your reply. How can I modify the code to look at the attention rather than the final layer outputs?
I tried to change -
target_layer = clip_model.visual.transformer.resblocks[-1]
to several layers inside ResidualAttentionBlock, for instance
target_layer = clip_model.visual.transformer.resblocks[-1].ln_2
but they all result in meaningless outputs.
Thanks again.
Hi,
Thanks for the great work. When I run gradCAM on ViT, it fails completely (unlike the results reported in other works). Is there a bug, or a different implementation detail?
Thanks!