Epiphqny / VisTR

[CVPR2021 Oral] End-to-End Video Instance Segmentation with Transformers
https://arxiv.org/abs/2011.14503
Apache License 2.0

The visualization of decoder attention_weight #30

Open sally1913105 opened 3 years ago

sally1913105 commented 3 years ago

[image] I want to visualize the attention weights of the decoder module, so I take the output of `multihead_attn` in the last layer of the decoder, but its shape is (bs, 360, 36\*h\*w), where h\*w is the size of the feature map. I don't understand why there are 36 different attention weights for the same instance of the same frame, as the picture shows.
Can you explain what this means?
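
For reference, this is a minimal sketch (not part of the repo) of how I capture those weights with a forward hook; the module path `model.transformer.decoder.layers[-1].multihead_attn` is an assumption based on the DETR-style layout VisTR builds on, so it may need adjusting:

```python
import torch

attn_maps = []

def save_attn(module, inputs, outputs):
    # nn.MultiheadAttention returns (attn_output, attn_weights); the weights
    # are averaged over heads and here have shape
    # (bs, num_queries, num_keys) = (bs, 360, 36*h*w).
    attn_maps.append(outputs[1].detach().cpu())

# `model` and `samples` are assumed to be set up already.
hook = model.transformer.decoder.layers[-1].multihead_attn.register_forward_hook(save_attn)
with torch.no_grad():
    model(samples)
hook.remove()
```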

Epiphqny commented 3 years ago

Hi @sally1913105, we compute spatial and temporal attention jointly, so for a 36-frame sequence there are 36 attention weights for each prediction, even though the prediction is for a specific frame. In this way, features from the other frames can help the segmentation of that frame.
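
Roughly, the weight tensor factors like this (a sketch only; the frame-major ordering of the 360 queries is an assumption, not verified against the code):

```python
# attn: (bs, 360, 36*h*w) from the hook above.
# Assumed factorization: 360 queries = 36 frames x 10 instances (frame-major),
# and the key axis = 36 frames x (h*w) spatial positions.
T, num_ins = 36, 10
bs, num_q, num_k = attn.shape
hw = num_k // T
attn = attn.view(bs, T, num_ins, T, hw)  # (bs, query_frame, instance, key_frame, h*w)
# attn[:, t, i, s] is how the prediction for (frame t, instance i) attends to frame s.
```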

sally1913105 commented 3 years ago

Thank you for your answer! Can I think of it as: within the 36 attention weights of the i-th prediction, only the i-th attention weight attends to the i-th frame's features, and the other attention weights attend to the other frames' features? But then how are these 36 attention weights combined?

Epiphqny commented 3 years ago

Hi @sally1913105, for each prediction we only use the attention weights of the corresponding frame at this stage. The weights do not need to be combined; interaction with other frames is realized by the subsequent 3D convolutions.
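
For completeness, a sketch of selecting only the corresponding frame's weights, i.e. the diagonal of the (query_frame, key_frame) axes; it continues the reshape sketch above, and `h`, `w` (with h\*w == hw) are assumed known from the feature map:

```python
import torch

# attn: (bs, T, num_ins, T, hw) from the reshape sketch above.
per_frame = torch.diagonal(attn, dim1=1, dim2=3)  # (bs, num_ins, hw, T)
per_frame = per_frame.permute(0, 3, 1, 2)         # (bs, T, num_ins, hw)
maps = per_frame.reshape(bs, T, num_ins, h, w)    # per-frame attention maps
```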