Atten4Vis / ConditionalDETR

This repository is an official implementation of the ICCV 2021 paper "Conditional DETR for Fast Training Convergence". (https://arxiv.org/abs/2108.06152)
Apache License 2.0

Some questions about your code and the cross-attention module in your paper #3

Closed Huzhen757 closed 3 years ago

Huzhen757 commented 3 years ago

Hi, thank you for this excellent work on Transformers in object detection; I'm extremely interested in it. But I have some questions after reading the paper and code, and I hope you can give me some answers.

  1. In your paper, does 'conditional' merely mean that the query matrix (Q) of the cross-attention module is formed from the output embedding of the self-attention module (same as DETR) and p_q (proposed in Section 3.3), while the key matrix (K) and value matrix (V) are formed the same way as in DETR? The difference being that your work concatenates while DETR adds. Here I would like to ask: is the reference point s generated from the object queries? How is the conditional spatial query obtained from the embedding f? In the first decoder layer, are the decoder embeddings also initialized by nn.Embedding(), and in the later decoder layers, are the decoder embeddings the outputs of the previous decoder layer? (See the first sketch below.)

    1. Is p_q formed from the reference point s and the decoder embeddings f only in the first decoder layer, while in the later decoder layers (layers 2-6) p_q is generated from the object queries (same as DETR)? I ask because in the source code the init function of the decoder module contains self.layers[layer_id + 1].ca_qpos_proj = None (layer_id runs from 0 to 4, i.e., the 2nd-6th decoder layers). However, in the init function of TransformerDecoderLayer, ca_qpos_proj is defined as a linear layer: self.ca_qpos_proj = nn.Linear(d_model, d_model).

    2. When I debug the code, the model I chose is ConditionalDETR-res50dc5. On entering forward propagation, the sample contains the input image 'tensors' (batch, 3, 800, 1096) and a boolean mask (batch, 800, 1096). Where does this mask come from? I don't see any relevant definition in the init function. I know the role of this mask: it is used to generate the positional encoding for the encoder and decoder via PositionEmbeddingSine.

    3. The shape of the input images is (batch, 3, 800, 1096) and the shape after the backbone is (batch, 2048, 50, 69), so the downsampling rate is 16 rather than 32. I guess the convolution stride in the last bottleneck is changed to 1, but I can't find that change in your code, and where are the initialization and the forward pass of the deformable convolution? (See the second sketch below.)

The above are all my questions. I sincerely hope I can get your help. Thanks!
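For reference, a minimal sketch of the query formation asked about in question 1, using assumed shapes and placeholder tensors rather than the repository code: Conditional DETR concatenates the content query with a conditional spatial query p_q = T * PE(s) (element-wise product), whereas DETR adds a positional embedding to the content query.

import torch

d_model, num_queries, bsz = 256, 300, 2

f = torch.randn(num_queries, bsz, d_model)     # decoder (content) embeddings
pe_s = torch.randn(num_queries, bsz, d_model)  # placeholder for the sinusoidal embedding PE(s) of the reference point s
T = torch.randn(num_queries, bsz, d_model)     # placeholder for the transformation predicted from f by a small FFN

p_q = T * pe_s                                 # conditional spatial query

# Conditional DETR: concatenate content and spatial parts for the cross-attention query
q_conditional = torch.cat([f, p_q], dim=-1)    # [num_queries, bsz, 2 * d_model]

# DETR: add the query positional embedding instead
query_pos = torch.randn(num_queries, bsz, d_model)
q_detr = f + query_pos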
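
And a minimal sketch of how a DC5 (dilated C5) ResNet-50 backbone is typically built with torchvision, assuming the standard DETR-style construction: the stride-2 convolution of the last stage is replaced with a dilated convolution, which yields the overall stride of 16 observed in question 3; ordinary dilated convolution is used here, not deformable convolution.

import torchvision

# Replace the stride-2 convolution in the last ResNet stage (C5) with a dilated
# convolution, so the output stride becomes 16 instead of 32.
backbone = torchvision.models.resnet50(
    pretrained=False,
    replace_stride_with_dilation=[False, False, True],  # dilate only stage C5
)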

Huzhen757 commented 3 years ago

Hi, after carefully reading the source code, there is only one thing I still have doubts about: the sample contains input image 'tensors' (batch, 3, 800, 1096) and a boolean mask (batch, 800, 1096). Where does this mask come from, and what is its role? I only see this mask used in the cross-attention module of the decoder layer, like this:

if key_padding_mask is not None:
    attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len)
    attn_output_weights = attn_output_weights.masked_fill(
        key_padding_mask.unsqueeze(1).unsqueeze(2),  # broadcast to [bsz, 1, 1, src_len]
        float('-inf'),
    )
    attn_output_weights = attn_output_weights.view(bsz * num_heads, tgt_len, src_len)

According to the last dimension (H*W) of the mask: if a value in the mask is True, the value at the corresponding position of the attention weight map is set to '-inf'. Why do you do this? Is it because the positions where the mask is True correspond to positions on the attention weight map that no longer need to be attended to? In other words, we need not pay attention to some pixels of the feature map (batch_size, 2048, H, W) output by the backbone and should focus only on the places that matter. Is that one of the reasons it is called 'conditional'?

charlesCXK commented 3 years ago

Hi, if two images in a batch have different resolutions, the smaller one will be padded with zeros. Usually, we do not compute attention values over the padded regions because it makes no sense to do so. This has nothing to do with 'conditional'; other methods (including DETR) do the same.
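
For illustration, a minimal sketch of how such a padding mask is typically built when batching images of different resolutions (a simplified stand-in for DETR-style NestedTensor batching, not the repository code):

import torch

def pad_batch(images):  # images: list of [3, H_i, W_i] tensors
    # Pad every image with zeros to the largest height/width in the batch and
    # mark the padded pixels with True in a boolean mask.
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    batch = torch.zeros(len(images), 3, max_h, max_w)
    mask = torch.ones(len(images), max_h, max_w, dtype=torch.bool)  # True = padding
    for i, img in enumerate(images):
        _, h, w = img.shape
        batch[i, :, :h, :w] = img
        mask[i, :h, :w] = False  # valid pixels
    return batch, mask

imgs = [torch.randn(3, 800, 1096), torch.randn(3, 768, 1024)]
tensors, mask = pad_batch(imgs)  # tensors: [2, 3, 800, 1096], mask: [2, 800, 1096]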

Huzhen757 commented 3 years ago

OK, thank you very much for your reply. That is to say, the mask just filters out the padded area of the smaller-scale image after data augmentation, because it contains no image features. Another question: when I trained the Conditional-DETR-res50dc5 model, I used a single 24 GB 3090, froze the weights of the backbone and the Transformer encoder, and only fine-tuned the Transformer decoder weights, but the batch_size could only be set to 1. Can the batch size only be set to 1 for the DC5 model, no matter what kind of graphics card I use or how much memory it has?
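
For reference, a minimal sketch of such freezing, assuming hypothetical parameter-name prefixes rather than the repository's exact names:

import torch.nn as nn

def freeze_all_but_decoder(model: nn.Module):
    # Freeze backbone and Transformer encoder parameters; leave the decoder trainable.
    for name, param in model.named_parameters():
        if name.startswith('backbone') or '.encoder.' in name:
            param.requires_grad_(False)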

DeppMeng commented 3 years ago

Hi, @Huzhen757. I guess you are asking about the training batch_size of the R50-DC5 model. As far as I remember, even a 32 GB V100 GPU does not have enough memory for DC5 models with batch_size=2, so we train all DC5 models with batch_size=1. Of course, the DC5 model can be trained with a larger batch_size as long as you have a GPU with more memory, e.g., an A100.

And some clarification about the padding mask:

  1. We re-use this part of the code from DETR; it is not related to the conditional mechanism.
  2. Its function is to make the padded region (mask=1) contribute nothing in the attention layer.
  3. Why fill with '-inf'? Because this operation happens before the softmax, so a value of '-inf' becomes 0 after the softmax (the attention score), which is what we want (see the small example below).
  4. The mask is not only used in the decoder cross-attention; it is also used in the encoder self-attention: https://github.com/Atten4Vis/ConditionalDETR/blob/0b04a859c7fac33a866fcdea06f338610ba6e9d8/models/transformer.py#L221
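
A tiny example (made-up scores, not the repository code) of point 3, showing that keys at masked positions receive exactly zero attention weight:

import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, 1.0, 0.5, 0.1]])                  # raw attention scores for 4 keys
key_padding_mask = torch.tensor([[False, False, True, True]])  # last two keys are padding

scores = scores.masked_fill(key_padding_mask, float('-inf'))
weights = F.softmax(scores, dim=-1)
print(weights)  # tensor([[0.7311, 0.2689, 0.0000, 0.0000]])
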
Huzhen757 commented 3 years ago

OK, I fully understand. Thanks for your careful reply.