cjfcsjt closed this issue 1 year ago
Hi @cjfcsjt, thanks for your interest in our work.
Thanks for raising this point. Your understanding of the positional embeddings and input features is correct. We can use either `pos` or `feats` as the positional embeddings and the other as the input features, since both are passed through the `TransformerDecoder` to update the task-conditioned queries. In practice, we use convolution-mapped positional embeddings and sinusoidal features to update the task-conditioned queries, since this performs empirically slightly better:
- conv-mapped pos + sinusoidal feats (used): PQ: 49.8, AP: 35.9, mIoU (ss/ms): 57.0/57.7
- sinusoidal pos + conv-mapped feats (swapped): PQ: 49.8, AP: 35.3, mIoU (ss/ms): 56.1/57.4
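One reason the two assignments are not perfectly interchangeable: in DETR-style attention the positional embedding is added only to the queries and keys, while the values receive the raw tensor, so swapping the roles changes what flows through the value path. A minimal sketch of that asymmetry (hypothetical module, not the OneFormer code):

```python
import torch
import torch.nn as nn

def with_pos_embed(tensor, pos):
    # DETR-style helper: add the positional embedding if one is given.
    return tensor if pos is None else tensor + pos

class SketchSelfAttn(nn.Module):
    """Minimal self-attention step illustrating the pos/feats asymmetry."""
    def __init__(self, d_model=8, nhead=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)

    def forward(self, src, pos):
        q = k = with_pos_embed(src, pos)     # pos is added to queries/keys...
        out, _ = self.attn(q, k, value=src)  # ...but NOT to the values
        return out

src = torch.randn(4, 1, 8)  # (sequence, batch, dim)
pos = torch.randn(4, 1, 8)
out = SketchSelfAttn()(src, pos)
```

Because only `src` reaches the value projection, which tensor plays the `src` role can shift the metrics slightly, as in the numbers above.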
We would like to also point out that we do not use any encoder layers in our `class_transformer`, as `enc_layers` is zero. Please let me know if you have more questions.
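With zero encoder layers, the encoder stack is effectively an identity and `src` reaches the decoder unchanged. A quick sketch of that behavior (assumed structure, not the actual OneFormer code):

```python
import torch
import torch.nn as nn

class SketchEncoder(nn.Module):
    """A stack of num_layers identical layers; zero layers means a no-op."""
    def __init__(self, make_layer, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(make_layer() for _ in range(num_layers))

    def forward(self, src):
        for layer in self.layers:
            src = layer(src)
        return src  # with num_layers == 0, src is returned as-is

src = torch.randn(4, 1, 8)
enc = SketchEncoder(lambda: nn.Linear(8, 8), num_layers=0)
identity_out = enc(src)
```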
Closing due to inactivity. Feel free to reopen if you have any more questions.
Thanks for your great work! But I found something that confuses me.
To make things easier, let's first look at the logic of the `Transformer` in the code. The `self.class_transformer` is an instance of `Transformer`, and its forward should be here: https://github.com/SHI-Labs/OneFormer/blob/761189909f392a110a4ead574d85ed3a17fbc8a7/oneformer/modeling/transformer_decoder/transformer.py#L64

Here, the `src` will be fed into the `transformer_encoder` layers (an instance of `TransformerEncoderLayer`): https://github.com/SHI-Labs/OneFormer/blob/761189909f392a110a4ead574d85ed3a17fbc8a7/oneformer/modeling/transformer_decoder/transformer.py#L99-L104

which will be further fed into `self.with_pos_embed`: https://github.com/SHI-Labs/OneFormer/blob/761189909f392a110a4ead574d85ed3a17fbc8a7/oneformer/modeling/transformer_decoder/transformer.py#L196

And the function `with_pos_embed` is here: https://github.com/SHI-Labs/OneFormer/blob/761189909f392a110a4ead574d85ed3a17fbc8a7/oneformer/modeling/transformer_decoder/transformer.py#L186-L187

In my understanding, `tensor` denotes the input features of the transformer encoder, and `pos` denotes the positional embeddings.

However, it seems that the tensor `feats` is actually the positional embeddings but is treated as the input features, while the tensor `self.class_input_proj(mask_features)` is actually the input features but is treated as the positional embeddings: https://github.com/SHI-Labs/OneFormer/blob/761189909f392a110a4ead574d85ed3a17fbc8a7/oneformer/modeling/transformer_decoder/oneformer_transformer_decoder.py#L432-L437

Am I misunderstanding here?
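For the queries and keys specifically, the two assignments are equivalent, because `with_pos_embed` in that lineage of code is just an elementwise addition and addition commutes. A quick sketch with made-up tensors (assuming the helper is `tensor + pos`, as the linked lines suggest):

```python
import torch

def with_pos_embed(tensor, pos):
    # Assumed body of the linked helper: a plain elementwise add.
    return tensor if pos is None else tensor + pos

feats = torch.randn(4, 1, 8)
proj = torch.randn(4, 1, 8)  # stands in for self.class_input_proj(mask_features)

# Swapping which tensor is "input" and which is "pos" gives identical q/k:
qk_a = with_pos_embed(feats, proj)
qk_b = with_pos_embed(proj, feats)
```

The choice only matters where the positional embedding is *not* added (e.g. the attention values), which is where the roles of the two tensors actually diverge.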