Sense-X / UniFormer

[ICLR2022] official implementation of UniFormer
Apache License 2.0

About video attention visualization #53

Closed yliu1229 closed 2 years ago

yliu1229 commented 2 years ago

As I see it, you use Grad-CAM to visualize the matrix A from the last layer. Could you please provide a demo showing how to use the Grad-CAM code in your repo? If possible, could you please upload the demo to the repo? Many thanks!

Andy1621 commented 2 years ago

Thanks for your question. I have not saved the visualization demo. However, the demo for UniFormer is adapted from my previous repo, CT-Net. The main difference is that in UniFormer I downsample the temporal dimension, so I have to use interpolation to align it with the input clip (see the sketch below). If you maintain the temporal dimension, you can simply use my previous repo.
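For reference, here is a minimal sketch of the alignment step (the 8x7x7 token map and the 16x224x224 clip are only assumed example shapes, not the exact values from my code):

    import torch
    import torch.nn.functional as F

    # Assumed shapes: a token-level map of (T', H', W') = (8, 7, 7) from the last block,
    # and an input clip of 16 frames at 224x224 (adjust both to your own config).
    attn_map = torch.rand(8, 7, 7)

    # Trilinear interpolation aligns the temporal and spatial dimensions with the clip.
    cam = F.interpolate(
        attn_map[None, None],            # add batch and channel dims -> (1, 1, T', H', W')
        size=(16, 224, 224),
        mode='trilinear',
        align_corners=False,
    )[0, 0]                              # back to (T, H, W) = (16, 224, 224)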

yliu1229 commented 2 years ago

Thank you for your swift reply! :) I've managed to get the visualization code working. However, the attention matrix A that I compute is strange, which results in the following figure despite the very accurate classification result...

[attached figure: the resulting attention visualization]

I compute the global A as the all-heads-concatenated Q @ K (without softmax) in the last attention block, and then do the following to get the summed importance score for each token:

    import torch
    import torch.nn.functional as F
    from scipy.ndimage import zoom
    A = F.softmax(torch.sum(A, dim=-1), dim=0)                                  # summed importance score per token, shape (392,)
    A_maps = torch.stack([A[i*49:(i+1)*49].reshape(7, 7) for i in range(8)])    # (8, 7, 7): one 7x7 map per temporal position
    cam = zoom(A_maps.detach().cpu().numpy(), (2, 32, 32))                      # interpolation as suggested to get 16x224x224 (original clip size)
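For context, overlaying such a map on the frames can be done roughly like this (a sketch rather than my exact code; `frames` is a hypothetical (16, 224, 224, 3) uint8 array of the input clip):

    import matplotlib.pyplot as plt

    # Hypothetical input: frames is a (16, 224, 224, 3) uint8 array of the original clip,
    # cam is the (16, 224, 224) array from the snippet above.
    cam_norm = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
    heat = plt.cm.jet(cam_norm)[..., :3]                            # per-frame RGB heatmaps, (16, 224, 224, 3)
    overlay = 0.5 * frames / 255.0 + 0.5 * heat                     # blend heatmap and frame

    fig, axes = plt.subplots(2, 8, figsize=(16, 4))
    for ax, img in zip(axes.flat, overlay):
        ax.imshow(img)
        ax.axis('off')
    plt.show()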

I realize there must be something wrong here, but I could not figure it out myself. Hope you can kindly shed some light on it... Thanks again!

Andy1621 commented 2 years ago

In my opinion, it is difficult for attention visualization to show interpretability. The phenomenon you show also exists in image transformers such as DeiT.

I suggest you use the code in Transformer-Explainability to explore this further.

yliu1229 commented 2 years ago

Thanks, will do, appreciate the help!

DoranLyong commented 2 years ago

@yliu1229 Hi, did you solve this issue?

I wonder how to do this, could you share your code? :)

yliu1229 commented 2 years ago

@DoranLyong I tried several methods to visualize the attention map extracted from the last block, but it is hard to get figures like the ones in the authors' paper. After deeper investigation, I agree with Andy's comment above that "attention visualization is difficult to show its interpretability". If you want a nice visualization figure, you may need techniques like attention rollout (a rough sketch is below) or Transformer-Explainability, which is also mentioned above. If you find out more, please share :)
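For anyone who lands here later, attention rollout is roughly the following (a minimal sketch following Abnar & Zuidema's attention rollout, assuming you have already collected the per-layer attention matrices yourself; this is not code from this repo):

    import torch

    def attention_rollout(attentions):
        """Accumulate attention across layers (attention rollout).

        attentions: list of per-layer attention tensors, each of shape
        (num_heads, num_tokens, num_tokens), already softmax-normalized.
        Returns a (num_tokens, num_tokens) matrix of accumulated attention.
        """
        num_tokens = attentions[0].shape[-1]
        rollout = torch.eye(num_tokens)
        for attn in attentions:
            attn = attn.mean(dim=0)                        # average over heads
            attn = attn + torch.eye(num_tokens)            # add identity for the residual connection
            attn = attn / attn.sum(dim=-1, keepdim=True)   # re-normalize rows
            rollout = attn @ rollout                       # propagate from the first layer to the last
        return rollout

The rows of the result can then be reshaped back to the token grid and interpolated to the frame size, as in the snippet I posted earlier.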