Closed — nicolasugrinovic closed this issue 4 years ago
Hi @nicolasugrinovic Thank you for your interest in DETR.
You are right about the design of the panoptic head. Intuitively, the transformer should already have done all the work of understanding which objects are in the scene and separating the instances. It then returns a set of object embeddings, which are very versatile: we use `MHAttentionMap` to check the cosine similarity between each object embedding and each pixel in the image. This gives a rough, low-resolution mask, which is then upsampled by the CNN. Upsampling and cleaning the mask is a relatively simple task and doesn't require attention per se; CNNs are just perfect for that. As an example, here is the output of the attention layer in the panoptic head (averaged over heads): [attention map image] And here is the final mask after the panoptic layers: [final mask image]
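To make the idea concrete, here is a minimal sketch of the kind of computation described above: object embeddings act as queries and image features as keys, and a per-head dot product yields one coarse attention map per object. All class/parameter names and dimensions below are assumptions for illustration, not DETR's actual `MHAttentionMap` code.

```python
import torch
import torch.nn as nn

class AttentionMapSketch(nn.Module):
    """Hypothetical sketch: one low-resolution attention map per object and head."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)          # project object embeddings
        self.k_proj = nn.Conv2d(dim, dim, 1)       # project image features

    def forward(self, obj_embeds, features):
        # obj_embeds: (batch, num_queries, dim); features: (batch, dim, H, W)
        b, n, _ = obj_embeds.shape
        q = self.q_proj(obj_embeds).view(b, n, self.num_heads, self.head_dim)
        k = self.k_proj(features)
        h, w = k.shape[-2:]
        k = k.view(b, self.num_heads, self.head_dim, h, w)
        # scaled dot product between every object embedding and every pixel
        attn = torch.einsum("bnhd,bhdyx->bnhyx", q, k) / self.head_dim ** 0.5
        # softmax over all pixels -> a coarse, normalized mask per object/head
        return attn.flatten(3).softmax(-1).view(b, n, self.num_heads, h, w)

masks = AttentionMapSketch()(torch.randn(2, 100, 256), torch.randn(2, 256, 32, 32))
```

Averaging `masks` over the head dimension gives the kind of rough per-object mask shown above, which the CNN then upsamples and cleans.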
I believe I have answered your question, and as such I'm closing this. Feel free to reach out if you have further concerns.
Hi, thanks for the great work!
I have more of a philosophical question. Why do you use the `bbox_mask` variable (the output weights of the `MHAttentionMap` block) as an input to the FPN-like network for instance segmentation? Is it because those weights/maps resemble a very coarse form of the segmentation masks? This does not enforce the use of attention for the segmentation task; it just uses a convenient input for the transposed convolutions. So the object detection task benefits directly from the use of attention, but segmentation does so more indirectly. Am I correct? Thank you
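The wiring being asked about could be sketched roughly as follows: the coarse attention maps are concatenated with backbone features as extra channels and progressively upsampled by a small conv decoder. Names, layer sizes, and the nearest-neighbor upsampling here are illustrative assumptions, not the exact layers of DETR's mask head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskUpsamplerSketch(nn.Module):
    """Hypothetical sketch of the FPN-like refinement of the coarse masks."""

    def __init__(self, feat_dim=256, num_heads=8):
        super().__init__()
        self.conv1 = nn.Conv2d(feat_dim + num_heads, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64, 32, 3, padding=1)
        self.out = nn.Conv2d(32, 1, 3, padding=1)

    def forward(self, features, bbox_mask):
        # features:  (batch * num_queries, feat_dim, H, W) backbone features
        # bbox_mask: (batch * num_queries, num_heads, H, W) coarse attention maps
        x = torch.cat([features, bbox_mask], dim=1)  # masks as extra channels
        x = F.relu(self.conv1(x))
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        x = F.relu(self.conv2(x))
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        return self.out(x)  # one refined mask logit map per object

logits = MaskUpsamplerSketch()(torch.randn(4, 256, 16, 16),
                               torch.randn(4, 8, 16, 16))
```

In this reading, attention supplies the coarse "what and where" per object, and plain convolutions only do the easier resolution-recovery step.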