facebookresearch / detr

End-to-End Object Detection with Transformers
Apache License 2.0

Why the Multihead Attention Map bbox_mask for instance/panoptic segmentation? #163

Closed nicolasugrinovic closed 4 years ago

nicolasugrinovic commented 4 years ago

Hi, thanks for the great work!

I have more of a philosophical question. Why do you use the bbox_mask variable (the output attention weights of the MHAttentionMap block) as the input to the FPN-like network for instance segmentation? Is it because those weights/maps already resemble a very coarse form of the segmentation masks? This design does not enforce the use of attention for the segmentation task itself; it simply provides a convenient input for the transposed convolutions. So the object detection task benefits directly from attention, while segmentation benefits only indirectly.
Am I correct? Thank you.

alcinos commented 4 years ago

Hi @nicolasugrinovic Thank you for your interest in DETR.

You are right about the design of the panoptic head. Intuitively, the transformer should already have done all the work of understanding which objects are in the scene and separating the instances. It then returns a set of object embeddings, which are very versatile.

As an example, here is the output of the attention layer in the panoptic head (averaged over heads):

[image: 1993_detr_R101_attn_3]

And here is the final mask after the panoptic layers:

[image: 1993_detr_R101_mask_3]
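
For anyone curious about the mechanics being discussed, here is a minimal, self-contained sketch of the idea: per-query multi-head attention weights over the encoder feature map are computed, then concatenated with the features and refined by a small convolutional head. This is not the actual DETR code; the module names, shapes, and hyperparameters (hidden_dim, num_heads, the tiny two-conv head) are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionMapSketch(nn.Module):
    """Per-query, per-head attention weights over a feature map
    (the role played by the bbox_mask discussed above)."""

    def __init__(self, hidden_dim=256, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        self.q_proj = nn.Linear(hidden_dim, hidden_dim)
        self.k_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, queries, memory):
        # queries: (batch, num_queries, hidden_dim) -- decoder object embeddings
        # memory:  (batch, hidden_dim, H, W)        -- encoder feature map
        b, q, _ = queries.shape
        h, w = memory.shape[-2:]
        qh = self.q_proj(queries).view(b, q, self.num_heads, self.head_dim)
        kh = self.k_proj(memory.flatten(2).transpose(1, 2))
        kh = kh.view(b, h * w, self.num_heads, self.head_dim)
        # Dot-product logits per head, softmax over spatial positions.
        logits = torch.einsum("bqnc,bsnc->bqns", qh, kh) / self.head_dim ** 0.5
        weights = logits.softmax(dim=-1).view(b, q, self.num_heads, h, w)
        return weights  # (batch, num_queries, num_heads, H, W)


class TinyMaskHead(nn.Module):
    """Toy stand-in for the FPN-like head: fuses the feature map with the
    attention maps and upsamples to a per-query mask."""

    def __init__(self, hidden_dim=256, num_heads=8):
        super().__init__()
        self.conv1 = nn.Conv2d(hidden_dim + num_heads, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, memory, attn):
        b, q, n, h, w = attn.shape
        # Repeat the shared feature map for every query, then fuse with the maps.
        feats = memory.unsqueeze(1).expand(-1, q, -1, -1, -1).flatten(0, 1)
        x = torch.cat([feats, attn.flatten(0, 1)], dim=1)
        x = F.relu(self.conv1(x))
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return self.conv2(x).view(b, q, 2 * h, 2 * w)


if __name__ == "__main__":
    b, q, d, h, w = 2, 100, 256, 25, 34
    queries = torch.randn(b, q, d)
    memory = torch.randn(b, d, h, w)
    attn = AttentionMapSketch(d)(queries, memory)  # coarse "masks" from attention
    masks = TinyMaskHead(d)(memory, attn)          # refined per-query masks
    print(attn.shape, masks.shape)
```

The point is that the object embeddings already localize each instance; the attention maps expose that information spatially, and the convolutional head only has to sharpen them into the final masks.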

I believe I have answered your question, and as such I'm closing this. Feel free to reach out if you have further concerns.