facebookresearch / Mask2Former

Code release for "Masked-attention Mask Transformer for Universal Image Segmentation"
MIT License

Learnable query features are directly supervised before being used in the Transformer decoder ??? #161

Open haozhi1817 opened 2 years ago

haozhi1817 commented 2 years ago

In Mask2Former, the learnable query features are directly supervised before being used in the Transformer decoder to predict masks and class labels. However, the learnable query features are independent of the input; in other words, during inference they will predict fixed class labels. This supervision seems unreasonable.

DemonsHunter commented 2 years ago

Could you show where the learnable query features are created and used?

haozhi1817 commented 1 year ago

> Could you show where the learnable query features are created and used?

https://github.com/facebookresearch/Mask2Former/blob/main/mask2former/modeling/transformer_decoder/mask2former_transformer_decoder.py#L392

The subsequent <output_class, output> pairs get information from input_data through the Transformer blocks, so it makes sense that they are supervised by the auxiliary loss function. But the first <output_class, output> has nothing to do with input_data, and I do not understand why it should also be supervised by the aux loss.

haozhi1817 commented 1 year ago

> Could you show where the learnable query features are created and used?

https://github.com/facebookresearch/Mask2Former/blob/main/mask2former/modeling/transformer_decoder/mask2former_transformer_decoder.py#L392

The initially created 'output' is independent of the input data, and the forward_prediction_heads function does not establish any relation between 'output' and the input x; that relation is only established through the transformer_block. The subsequent output and output_class have passed through the transformer_block, so they contain information from the input x, and output_class can therefore be supervised by the label y corresponding to x. But, as stated above, the first 'output' and 'output_class' have not passed through the transformer_block and are unrelated to x, so I do not understand why they can also be supervised by the label y corresponding to x.

RubenS02 commented 1 year ago

Having the same question @haozhi1817. Did you find an answer? I came here after looking at Figure 3 of the paper, where they show "mask predictions of four selected learnable queries BEFORE feeding them into the Transformer decoder". But before the first cross-attention layer, the image features and the learned queries do not share any information, so how can you make meaningful predictions based on that? My thought was that learnable queries were just better than randomly initialized ones, but apparently I'm wrong.

bowenc0221 commented 1 year ago

Hi @haozhi1817, you are right: the class prediction before feeding into the Transformer decoder predicts "random" stuff. I added supervision on the class prediction just for consistency (to reuse the same loss function). But because it predicts random classes, the matching cost based on class is random, so the final assignment for these learnable queries before the Transformer is based on the mask prediction only; it does not matter whether or not the class is included in the supervision.

@RubenS02 For the mask prediction: because it is generated by a dot product between the queries and the per-pixel features, it is actually image dependent. In Figure 3, we show mask predictions WITHOUT any class labels (like region proposals).

I hope this answers your question.
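The asymmetry described above can be demonstrated with a minimal sketch. This is not the repository's code: the head `W_class`, the shapes, and all names are illustrative assumptions; only the structure of the computation (class logits from the query alone, mask logits from a query/per-pixel-feature dot product, as in forward_prediction_heads) follows the discussion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not Mask2Former's real config).
num_queries, embed_dim, num_classes = 4, 8, 3
H = W = 5

# Learnable query features: fixed after training, input-independent.
query_feat = rng.normal(size=(num_queries, embed_dim))

# Hypothetical linear class head (+1 logit for "no object").
W_class = rng.normal(size=(embed_dim, num_classes + 1))

def predict(query, pixel_features):
    """Class logits depend only on the query; mask logits are a dot
    product between query and per-pixel features, so they are image
    dependent even before any Transformer layer."""
    class_logits = query @ W_class                              # (Q, K+1)
    mask_logits = np.einsum("qc,chw->qhw", query, pixel_features)  # (Q, H, W)
    return class_logits, mask_logits

img_a = rng.normal(size=(embed_dim, H, W))
img_b = rng.normal(size=(embed_dim, H, W))

cls_a, mask_a = predict(query_feat, img_a)
cls_b, mask_b = predict(query_feat, img_b)

# Before the decoder: identical ("random") class logits for every image...
assert np.allclose(cls_a, cls_b)
# ...but different, image-dependent mask predictions.
assert not np.allclose(mask_a, mask_b)
```

This is why the class-based part of the matching cost is meaningless for these pre-decoder predictions, while the mask-based part still yields an image-dependent assignment.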

RubenS02 commented 1 year ago

@bowenc0221 I understand now, thanks a lot.

haozhi1817 commented 1 year ago

@bowenc0221 I understand now, thanks a lot.