facebookresearch / detr

End-to-End Object Detection with Transformers
Apache License 2.0

The paper mentions visualizing the attention map of the last layer. How is this operation done? #593

Closed: notfacezhi closed this issue 11 months ago

notfacezhi commented 11 months ago

[image] I don't understand what the points in this figure represent, or how the attention map associated with each point is visualized. In the self-attention step, the input of shape (b, c, h, w) is flattened to (b, h*w, c), so the attention map has shape (h*w, h*w). How is this visualized on the original image?
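For concreteness, here is a minimal sketch of the flattening I mean (all dimensions hypothetical; DETR itself permutes to sequence-first (h*w, b, c) before the attention call):

```python
import torch

b, c, h, w = 1, 256, 25, 34        # hypothetical batch, channels, feature-map size
x = torch.rand(b, c, h, w)

# Flatten the spatial grid into a sequence of h*w tokens
x = x.flatten(2).permute(0, 2, 1)  # (b, c, h, w) -> (b, h*w, c)

# Self-attention over these tokens yields weights of shape (b, h*w, h*w)
```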

fmassa commented 11 months ago

Hi,

The first notebook in https://github.com/facebookresearch/detr#notebooks has the code for the visualizations we used in the paper, including the attention matrix.

Each point in the original image corresponds to a row (or column) of the attention matrix, and that row can be reshaped back into an h x w image.
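For example, a minimal sketch of that reshaping, with a hypothetical feature-map size:

```python
import torch

# Hypothetical feature-map size from the CNN backbone
h, w = 25, 34

# Stand-in for the real (h*w, h*w) self-attention weights from the encoder
attn = torch.rand(h * w, h * w)

# Pick a query point (y, x) on the feature map; its row index in the matrix
y, x = 12, 20
idx = y * w + x

# That point's row of weights, reshaped back onto the feature-map grid;
# upsample it to the input resolution to overlay on the original image
attn_map = attn[idx].reshape(h, w)
```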

I believe I've answered your question, so I'm closing this issue.

tldrafael commented 11 months ago

hey @fmassa, thanks for the great detr work! I've been trying to replicate some of the paper's visualizations.

I'd expect the self-attention weights to come from the operation attn = (q * scale) @ k.T that weighs the values. However, looking at the Transformer class definitions in the detr repo (https://github.com/facebookresearch/detr/blob/main/models/transformer.py#L127), the forward pass only returns the final tensor of shape (b, h*w, c).
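For reference, the operation I mean is standard scaled dot-product attention; a minimal sketch with hypothetical dimensions:

```python
import torch

# Single head, batch omitted; n = h*w tokens, d = model dimension (hypothetical)
n, d = 850, 256
q, k, v = torch.rand(n, d), torch.rand(n, d), torch.rand(n, d)

scale = d ** -0.5
attn = ((q * scale) @ k.T).softmax(dim=-1)  # (n, n) attention weights
out = attn @ v                              # (n, d) weighted values
```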

I don't see how to get the hook's output from the Colab notebook. Is there any other code that the Colab model used?

MLDeS commented 7 months ago


@tldrafael Did you figure this out?

tldrafael commented 7 months ago

@MLDeS I just used the model straight from detr = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True). A forward hook on detr.transformer.encoder.layers[-1].self_attn then yields two outputs: the attended feature map and the attention weights.
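A minimal sketch of that setup (the image path is hypothetical; nn.MultiheadAttention returns a tuple, and output[1] is the attention weights, averaged over heads):

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Load the pretrained DETR model from torch hub
detr = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
detr.eval()

# Standard DETR preprocessing (ImageNet mean/std normalization)
transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Capture the attention weights from the last encoder layer:
# the hook's `output` is (attended features, attention weights)
enc_attn_weights = []
hook = detr.transformer.encoder.layers[-1].self_attn.register_forward_hook(
    lambda module, inputs, output: enc_attn_weights.append(output[1])
)

img = transform(Image.open('example.jpg').convert('RGB')).unsqueeze(0)  # hypothetical path
with torch.no_grad():
    detr(img)
hook.remove()

attn = enc_attn_weights[0]  # shape (1, h*w, h*w)
```

With the ResNet-50 backbone, h and w are the padded input height and width divided by 32, so a row attn[0, idx] can be reshaped to (h, w) and overlaid on the image, as in fmassa's description above.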