Closed — nicolasugrinovic closed this issue 4 years ago
Hi @nicolasugrinovic Thank you for your interest in DETR.
You are right about the design of the panoptic head. Intuitively, the transformer should already have done all the work of understanding which objects are in the scene and separating the instances. It then returns a set of object embeddings, which are very versatile: we use `MHAttentionMap` to check the cosine similarity between each object embedding and each pixel in the image. This gives a rough, low-resolution mask, which is then upsampled by the CNN. Upsampling and cleaning the mask is a relatively simple task and doesn't require attention per se; CNNs are just perfect for that. As an example, here is the output of the attention layer in the panoptic head (averaged over heads): [attention map image] And here is the final mask after the panoptic layers: [final mask image]
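To make the idea concrete, here is a minimal sketch of the kind of computation described above: object embeddings act as queries and image features as keys, and a per-head dot product yields one coarse attention map per object. All class/parameter names and dimensions below are assumptions for illustration, not DETR's actual `MHAttentionMap` code.

```python
import torch
import torch.nn as nn

class AttentionMapSketch(nn.Module):
    """Hypothetical sketch: one low-resolution attention map per object and head."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)          # project object embeddings
        self.k_proj = nn.Conv2d(dim, dim, 1)       # project image features

    def forward(self, obj_embeds, features):
        # obj_embeds: (batch, num_queries, dim); features: (batch, dim, H, W)
        b, n, _ = obj_embeds.shape
        q = self.q_proj(obj_embeds).view(b, n, self.num_heads, self.head_dim)
        k = self.k_proj(features)
        h, w = k.shape[-2:]
        k = k.view(b, self.num_heads, self.head_dim, h, w)
        # scaled dot product between every object embedding and every pixel
        attn = torch.einsum("bnhd,bhdyx->bnhyx", q, k) / self.head_dim ** 0.5
        # softmax over all pixels -> a coarse, normalized mask per object/head
        return attn.flatten(3).softmax(-1).view(b, n, self.num_heads, h, w)

masks = AttentionMapSketch()(torch.randn(2, 100, 256), torch.randn(2, 256, 32, 32))
```

Averaging `masks` over the head dimension gives the kind of rough per-object mask shown above, which the CNN then upsamples and cleans.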
I believe I have answered your question, and as such I'm closing this. Feel free to reach out if you have further concerns.
Hi, thanks for the great work!
I have more of a philosophical question. Why do you use the `bbox_mask` variable (the output weights of the `MHAttentionMap` block) as an input to the FPN-like network for instance segmentation? Is it because those weights/maps resemble a very coarse form of the segmentation masks? This does not enforce the use of attention for the segmentation task; it just uses a convenient input for the transposed convolutions. So the object detection task benefits directly from the use of attention, but segmentation does so more indirectly. Am I correct? Thank you
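The wiring being asked about could be sketched roughly as follows: the coarse attention maps are concatenated with backbone features as extra channels and progressively upsampled by a small conv decoder. Names, layer sizes, and the nearest-neighbor upsampling here are illustrative assumptions, not the exact layers of DETR's mask head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskUpsamplerSketch(nn.Module):
    """Hypothetical sketch of the FPN-like refinement of the coarse masks."""

    def __init__(self, feat_dim=256, num_heads=8):
        super().__init__()
        self.conv1 = nn.Conv2d(feat_dim + num_heads, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64, 32, 3, padding=1)
        self.out = nn.Conv2d(32, 1, 3, padding=1)

    def forward(self, features, bbox_mask):
        # features:  (batch * num_queries, feat_dim, H, W) backbone features
        # bbox_mask: (batch * num_queries, num_heads, H, W) coarse attention maps
        x = torch.cat([features, bbox_mask], dim=1)  # masks as extra channels
        x = F.relu(self.conv1(x))
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        x = F.relu(self.conv2(x))
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        return self.out(x)  # one refined mask logit map per object

logits = MaskUpsamplerSketch()(torch.randn(4, 256, 16, 16),
                               torch.randn(4, 8, 16, 16))
```

In this reading, attention supplies the coarse "what and where" per object, and plain convolutions only do the easier resolution-recovery step.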