SHI-Labs / OneFormer

OneFormer: One Transformer to Rule Universal Image Segmentation, arXiv 2022 / CVPR 2023
https://praeclarumjj3.github.io/oneformer
MIT License

Questions about the input of the class_transformer #29

Closed · cjfcsjt closed this issue 1 year ago

cjfcsjt commented 1 year ago

Thanks for your great work! I found something that confuses me, though.

To make things easier, let's first walk through the logic of the Transformer in the code.

The self.class_transformer is an instance of Transformer, and its forward is defined here: https://github.com/SHI-Labs/OneFormer/blob/761189909f392a110a4ead574d85ed3a17fbc8a7/oneformer/modeling/transformer_decoder/transformer.py#L64 There, src is fed into the encoder layers (instances of TransformerEncoderLayer): https://github.com/SHI-Labs/OneFormer/blob/761189909f392a110a4ead574d85ed3a17fbc8a7/oneformer/modeling/transformer_decoder/transformer.py#L99-L104
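For context, the overall flow of that forward looks roughly like this (a paraphrased sketch of the linked file following the DETR convention, not a verbatim copy):

```python
# Paraphrased sketch of Transformer.forward from
# oneformer/modeling/transformer_decoder/transformer.py (requires torch).
def forward(self, src, mask, query_embed, pos_embed, task_token=None):
    # Flatten NxCxHxW feature maps to HWxNxC for the attention layers.
    bs, c, h, w = src.shape
    src = src.flatten(2).permute(2, 0, 1)
    pos_embed = pos_embed.flatten(2).permute(2, 0, 1)
    query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)
    if mask is not None:
        mask = mask.flatten(1)

    # The task token, when given, initializes the decoder targets.
    if task_token is None:
        tgt = torch.zeros_like(query_embed)
    else:
        tgt = task_token.repeat(query_embed.shape[0], 1, 1)

    # src goes through the encoder stack; pos_embed rides along as `pos`.
    memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
    hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,
                      pos=pos_embed, query_pos=query_embed)
    return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)
```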

Inside each encoder layer, src is then passed to self.with_pos_embed: https://github.com/SHI-Labs/OneFormer/blob/761189909f392a110a4ead574d85ed3a17fbc8a7/oneformer/modeling/transformer_decoder/transformer.py#L196 The with_pos_embed function itself is defined here: https://github.com/SHI-Labs/OneFormer/blob/761189909f392a110a4ead574d85ed3a17fbc8a7/oneformer/modeling/transformer_decoder/transformer.py#L186-L187
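For reference, the helper is just an additive merge of the two tensors; a paraphrase of the two linked lines:

```python
from typing import Optional
from torch import Tensor

def with_pos_embed(self, tensor: Tensor, pos: Optional[Tensor]) -> Tensor:
    # Add the positional embedding to the input features when one is given.
    return tensor if pos is None else tensor + pos
```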

Here, in my understanding, tensor denotes the input features of the transformer encoder, and pos denotes the positional embeddings.

However, at the call site it seems that the tensor feats actually holds the positional embeddings but is treated as the input features, while the tensor self.class_input_proj(mask_features) actually holds the input features but is treated as the positional embeddings: https://github.com/SHI-Labs/OneFormer/blob/761189909f392a110a4ead574d85ed3a17fbc8a7/oneformer/modeling/transformer_decoder/oneformer_transformer_decoder.py#L432-L437
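The call site reads roughly like this (paraphrased from the linked lines, with the surrounding decoder code omitted):

```python
# feats are sinusoidal position encodings of mask_features, yet they go into
# the `src` (features) slot, while the conv-projected mask_features go into
# the `pos_embed` slot.
feats = self.pe_layer(mask_features, None)
out_t, _ = self.class_transformer(
    feats,                                  # -> src
    None,                                   # -> mask
    self.query_embed.weight[:-1],           # -> query_embed
    self.class_input_proj(mask_features),   # -> pos_embed
    tasks if self.use_task_norm else None,  # -> task_token
)
```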

Am I misunderstanding here?

praeclarumjj3 commented 1 year ago

Hi @cjfcsjt, thanks for your interest in our work.

Thanks for raising this point. Your understanding of the positional embeddings and input feats is correct. We can use either pos or feats as the positional embeddings and the other as the input features, since both are passed through the TransformerDecoder to update the task-conditioned queries. In practice, we use the convolution-mapped tensor as the positional embeddings and the sinusoidal tensor as the features to update the task-conditioned queries, since this performs slightly better empirically.
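One reason the choice is not perfectly symmetric (and so can differ slightly in practice): in DETR-style attention layers, the positional embedding is added to the queries and keys but not to the values, so whichever tensor is passed as src also serves as the value stream. A paraphrased sketch of the cross-attention step in such a decoder layer:

```python
# `pos` is added to the key and `query_pos` to the query via with_pos_embed,
# but the value is the raw memory tensor, with no positional embedding added.
tgt2 = self.multihead_attn(
    query=self.with_pos_embed(tgt, query_pos),
    key=self.with_pos_embed(memory, pos),
    value=memory,
    attn_mask=memory_mask,
    key_padding_mask=memory_key_padding_mask,
)[0]
```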

We would also like to point out that we do not use any encoder layers in our class_transformer, since enc_layers is zero, so src reaches the decoder unchanged (see the sketch below). Please let me know if you have any more questions.
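For illustration, a minimal sketch of the DETR-style TransformerEncoder behavior this relies on (assuming the standard DETR encoder structure; with zero layers, the loop body never runs):

```python
import torch.nn as nn

class TransformerEncoder(nn.Module):
    def forward(self, src, mask=None, src_key_padding_mask=None, pos=None):
        output = src
        # With enc_layers == 0, self.layers is empty, so this loop never
        # runs and src passes through untouched (modulo an optional norm).
        for layer in self.layers:
            output = layer(output, src_mask=mask,
                           src_key_padding_mask=src_key_padding_mask, pos=pos)
        if self.norm is not None:
            output = self.norm(output)
        return output
```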

praeclarumjj3 commented 1 year ago

Closing due to inactivity. Feel free to reopen if you have any more questions.