cmsflash / efficient-attention

An implementation of the efficient attention module.
https://arxiv.org/abs/1812.01243
MIT License

How to replicate attention maps in object detection #7

Closed: chandlerbing65nm closed this issue 1 year ago

chandlerbing65nm commented 2 years ago

Can you share the code on how to visualize attention maps in object detection like the one shown in your paper?

[Figure 3 from the paper: attention map visualizations from the efficient attention module]

cmsflash commented 2 years ago

Hi Chandler,

The visualization code was inside the code base of my company at that time. Because it was not part of this open-source project, I believe they will not release it. (I also no longer have access to it since I left the company.)

The logic is very simple, though. We were visualizing each channel in the keys. For keys of shape [n, d_k, h, w], we slice them into n * d_k tensors, each of shape [1, 1, h, w]. Since we were visualizing the softmax variant, each element is in the range (0, 1), which is easy to paint as a greyscale image.
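
Not the original company code, but a minimal sketch of that logic in PyTorch, assuming `keys` is the [n, d_k, h, w] key tensor from the softmax variant (so values are already in (0, 1)); the output directory and file names are just illustrative:

```python
import os

import torch
from torchvision.utils import save_image


def visualize_key_channels(keys: torch.Tensor, out_dir: str = "attention_maps") -> None:
    """Save each channel of each sample in `keys` as a greyscale image."""
    os.makedirs(out_dir, exist_ok=True)
    n, d_k, h, w = keys.shape
    for i in range(n):
        for c in range(d_k):
            # Slice out one [h, w] map; add a channel dim so save_image treats it as greyscale.
            attention_map = keys[i, c].unsqueeze(0)
            # With the softmax variant the values are already in (0, 1), so they
            # can be written directly as pixel intensities.
            save_image(attention_map, os.path.join(out_dir, f"sample{i}_channel{c}.png"))
```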

chandlerbing65nm commented 2 years ago

> Hi Chandler,
>
> The visualization code was inside the code base of my company at that time. Because it was not part of this open-source project, I believe they will not release it. (I also no longer have access to it since I left the company.)
>
> The logic is very simple, though. We were visualizing each channel in the keys. For keys of shape [n, d_k, h, w], we slice them into n * d_k tensors, each of shape [1, 1, h, w]. Since we were visualizing the softmax variant, each element is in the range (0, 1), which is easy to paint as a greyscale image.

@cmsflash In the image above (Figure 3 in the paper), the caption says it is a visualization of attention maps from the efficient attention module. Yet you mentioned here that the visualization is done only on the keys.

I thought you visualized the attention maps from the output of the module.

cmsflash commented 2 years ago

> @cmsflash In the image above (Figure 3 in the paper), the caption says it is a visualization of attention maps from the efficient attention module. Yet you mentioned here that the visualization is done only on the keys.
>
> I thought you visualized the attention maps from the output of the module.

The caption says the figure visualizes the "global attention maps" from the efficient attention module. The "global attention maps" are the individual channels in the keys.
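
For context, the paper's softmax variant factorizes attention roughly as follows (paraphrasing the formulation; see the paper for the exact notation):

```latex
E(Q, K, V) = \rho_q(Q)\,\bigl(\rho_k(K)^{\top} V\bigr)
```

where \rho_q applies a softmax over the feature dimension of each query and \rho_k applies a softmax over the n spatial positions of each key channel. Each channel of \rho_k(K) is therefore a distribution over all positions of the image, which is what the paper calls a global attention map.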

chandlerbing65nm commented 2 years ago

> the visualization is done only on the keys

@cmsflash I'm confused here. Can't the global attention only be extracted when we use softmax(QK^T / sqrt(d_k)) V (or a variation of it)?

If only the channels of the keys are visualized, then they are just spatial information from the input image; no attention has been extracted yet.

cmsflash-pony commented 2 years ago

The attention maps generated from QK are the pixel-wise attention maps. In our terminology, the "global attention maps" are the individual channels in K. Please check Section 3.4 in the paper for the reasoning behind the terminology.
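
To make the distinction concrete, here is a minimal sketch (my own, not from the repo) in PyTorch, assuming flattened single-head tensors q and k of shape [n, d_k] with n = h * w:

```python
import torch


def pixelwise_attention_maps(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Standard dot-product attention: softmax(QK^T / sqrt(d_k)) gives one
    attention map per query pixel, i.e. an [n, n] tensor."""
    d_k = q.shape[-1]
    return torch.softmax(q @ k.transpose(0, 1) / d_k ** 0.5, dim=-1)


def global_attention_maps(k: torch.Tensor) -> torch.Tensor:
    """Softmax variant of efficient attention: a softmax over the n spatial
    positions turns each key channel into a distribution over the whole image,
    i.e. d_k global attention maps in an [n, d_k] tensor."""
    return torch.softmax(k, dim=0)
```

The key-channel visualization described earlier corresponds to the second function, with each of the d_k columns reshaped back to [h, w] and painted as a greyscale image.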