Closed notfacezhi closed 11 months ago
I don't understand what the points in this graph represent, or how the attention map associated with a point is visualized. In the self-attention process, the input shape goes (b, c, h, w) -> (b, h*w, c), and the attention map is (h*w, h*w). How can this be visualized on the original image?
Hi,
In https://github.com/facebookresearch/detr#notebooks the first notebook has the code to visualize the images that we used in the paper, including the attention matrix.
Each point in the original image corresponds to a row (or column) of the attention matrix, and that row can be reshaped back into an (h, w) image.
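A minimal sketch of that reshaping step, using a random matrix in place of real attention weights and hypothetical feature-map dimensions (h = 25, w = 34):

```python
import torch

# Hypothetical feature-map size for illustration.
h, w = 25, 34

# Stand-in for the encoder self-attention matrix over the flattened
# feature map: shape (h*w, h*w), rows normalized by softmax.
attn = torch.softmax(torch.randn(h * w, h * w), dim=-1)

# Pick a query point (y, x) in the feature map; its flattened index
# selects one row of the attention matrix.
y, x = 12, 20
idx = y * w + x

# That row holds the attention this point pays to every location, and
# reshapes back into an (h, w) image that can be upsampled and
# overlaid on the input image.
attn_map = attn[idx].reshape(h, w)
print(attn_map.shape)  # torch.Size([25, 34])
```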
I believe I've answered your question, so I'm closing this issue.
hey @fmassa, thanks for the great DETR work! I've been trying to replicate some of the illustrations from the paper.
I'd expect the self-attention weights to come from the operation `attn = (q * scale) @ k.T`, which weighs the values. However, looking at the Transformer class definitions in the DETR repo (https://github.com/facebookresearch/detr/blob/main/models/transformer.py#L127), the forward pass only yields the final tensor of shape (b, h*w, c).
I don't know how you could get the hook's output used in the Colab notebook. Is there any other code that the Colab model used?
Did you figure this out?
@MLDeS I just used the model straight from `detr = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)`. Then, a hook on `detr.transformer.encoder.layers[-1].self_attn` returns two outputs: one is the feature map and the other is the attention map.
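A sketch of that hook mechanism on a standalone `nn.MultiheadAttention` (the same module type DETR's encoder uses for self-attention), so it runs without downloading the model; in the real model you would register the same hook on `detr.transformer.encoder.layers[-1].self_attn`. The embedding size, head count, and input shapes here are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Illustrative sizes; DETR's encoder uses d_model=256 with 8 heads.
embed_dim, num_heads = 256, 8
self_attn = nn.MultiheadAttention(embed_dim, num_heads)

attn_weights = []
# nn.MultiheadAttention.forward returns a tuple
# (attn_output, attn_output_weights); the forward hook captures the
# second element, the (b, h*w, h*w) attention map.
hook = self_attn.register_forward_hook(
    lambda module, inputs, output: attn_weights.append(output[1])
)

# Fake encoder input: h*w = 25*34 flattened locations, batch of 1,
# laid out as (sequence, batch, channels) as nn.MultiheadAttention expects.
hw, b = 25 * 34, 1
x = torch.randn(hw, b, embed_dim)
with torch.no_grad():
    self_attn(x, x, x)
hook.remove()

print(attn_weights[0].shape)  # torch.Size([1, 850, 850])
```

Each row of that captured map can then be reshaped to (25, 34) and overlaid on the image, as described earlier in the thread.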