facebookresearch / detr

End-to-End Object Detection with Transformers
Apache License 2.0

First decoder layer input differs from the paper #580

Open ehdrndd opened 1 year ago

ehdrndd commented 1 year ago

I've been digging deeper into the code, and I found something odd.

(figure: decoder block diagram from the DETR paper)

Look at the decoder input in the diagram: the object queries appear to go directly into the value of self-attention.

But in the TransformerDecoderLayer code:

# self-attention: query_pos (the object queries) is added to q and k only, not to the value
q = k = self.with_pos_embed(tgt, query_pos)
tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,
                      key_padding_mask=tgt_key_padding_mask)[0]
tgt = tgt + self.dropout1(tgt2)
tgt = self.norm1(tgt)
# cross-attention over the encoder memory
tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),
                           key=self.with_pos_embed(memory, pos),
                           value=memory, attn_mask=memory_mask,
                           key_padding_mask=memory_key_padding_mask)[0]
tgt = tgt + self.dropout2(tgt2)
tgt = self.norm2(tgt)
# feed-forward network
tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
tgt = tgt + self.dropout3(tgt2)
tgt = self.norm3(tgt)

In the first decoder layer, tgt is a zero tensor of size (num_queries=100, batch_size, hidden_dim=256), and query_pos holds the learned object queries. So the value of self-attention is tgt (all zeros), not the object queries as the diagram suggests.
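To make this concrete, here is a minimal, self-contained sketch (not repo code; shapes match the repo, and bias is disabled in the attention so the zero propagates exactly). It shows that with a zero tgt, the first layer's self-attention output is all zeros, because the object queries only shape the attention weights, never the values:

import torch
import torch.nn as nn

num_queries, bs, d = 100, 2, 256

with torch.no_grad():
    # learned object queries (what the diagram calls "object queries")
    query_pos = nn.Embedding(num_queries, d).weight.unsqueeze(1).repeat(1, bs, 1)
    # the decoder's content input: all zeros at the first layer
    tgt = torch.zeros_like(query_pos)

    self_attn = nn.MultiheadAttention(d, num_heads=8, bias=False)
    q = k = tgt + query_pos  # with_pos_embed: queries added to q and k only
    out, _ = self_attn(q, k, value=tgt)

    print(out.abs().max())  # tensor(0.) -- with value == 0, self-attention outputs 0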

Zhong-Zi-Zeng commented 11 months ago

I have the same confusion about this issue.

Zhong-Zi-Zeng commented 11 months ago

@ehdrndd On the last page of the original paper they give a simplified DETR implementation, and there the decoder's input is just a randomly initialized learned embedding of size (100, 256).
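For context, the relevant lines of that simplified listing look roughly like this (condensed and with a dummy encoder input for illustration; names follow the paper's appendix):

import torch
from torch import nn

hidden_dim = 256

# in __init__: 100 learned query embeddings, randomly initialized
query_pos = nn.Parameter(torch.rand(100, hidden_dim))

transformer = nn.Transformer(hidden_dim, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6)

# in forward: the queries are fed directly as the decoder input (tgt);
# src stands in for the positionally-encoded CNN features
src = torch.rand(25 * 34, 1, hidden_dim)  # dummy feature map, flattened
h = transformer(src, query_pos.unsqueeze(1))

So the simplified version feeds the object queries in as the decoder input itself, whereas the repo keeps them separate as query_pos, re-added to q and k at every layer while tgt starts from zeros. That difference in parameterization is presumably the source of the confusion.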