The attention map is calculated right here:
https://github.com/facebookresearch/segment-anything/blob/6fdee8f2727f4506cfbbe553e23b895e27956588/segment_anything/modeling/image_encoder.py#L231
If you don't care about model speed, then simply add something like this right below `attn`:

```python
import numpy as np

# attn holds the softmaxed attention weights for this block
attn_map = attn.detach().cpu().numpy()
np.savetxt('attn_map.dat', attn_map)
```
However, SAM uses global and local attention. You likely want to look at the global attention maps. In that case, the indices of the global attention are set here:
The code snippet above will likely not work as is, because I think `np.savetxt` only accepts 1D or 2D arrays. So you need to reshape `attn` to a 2D shape first.
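For example, a minimal sketch of what that could look like right after the softmax in `Attention.forward` (this assumes `attn` has shape `(B * num_heads, N, N)` as in SAM's image encoder, and the file names are just placeholders):

```python
import numpy as np

# attn: attention weights after softmax, shape (B * num_heads, N, N)
attn_map = attn.detach().cpu().numpy()

# np.savetxt needs a 1D/2D array, so fold the batch/head dimension away...
np.savetxt('attn_map.dat', attn_map.reshape(-1, attn_map.shape[-1]))

# ...or keep the full 3D shape and use the binary .npy format instead
np.save('attn_map.npy', attn_map)
```

If you only care about the global blocks, one option (again just a sketch) is to save only from the blocks whose `window_size` is 0, since that is how the encoder distinguishes global from windowed attention.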
Hi, thank you for your answer. By the way, I was looking in more detail at this "global attention" in the referenced paper "Exploring Plain Vision Transformer Backbones for Object Detection", but I'm not sure I really understand how it is done here. In the original paper they talk about using only the final feature maps, whereas in the unofficial PyTorch code (from which SAM takes the same implementation: https://github.com/ViTAE-Transformer/ViTDet) they use window attention in different layers. Could you explain this part in more detail, or point me to the right source to understand it better? Thank you!
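For what it's worth, I tried to check which blocks actually use global attention with something like this (just a sketch; the checkpoint path is a placeholder):

```python
from segment_anything import sam_model_registry

# Load a pre-trained SAM (the checkpoint path below is a placeholder)
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# Blocks built with window_size == 0 attend over the whole feature map
# (global attention); the others use windowed (local) attention.
for i, block in enumerate(sam.image_encoder.blocks):
    kind = "global" if block.window_size == 0 else f"windowed ({block.window_size})"
    print(f"block {i}: {kind} attention")
```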
Hi @alexcbb, did you find any way to do so?
Hello,
I'm searching for a way to visualize the attention maps of the pre-trained models, but I haven't found a solution yet. Has someone already done this successfully?
Thank you!