[Open] darewolf007 opened this issue 3 months ago
Hello, in your DETR code, you use the transformer to get an output of shape [bs, hidden_dim, feature_dim]; the code is
self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]
the transformer code is
hs = self.decoder(tgt, memory, memory_key_padding_mask=mask, pos=pos_embed, query_pos=query_embed)
hs = hs.transpose(1, 2)
return hs
Based on my understanding, your code only selects the first decoder layer's output as the feature for predicting the action. However, in the original DETR code, the transformer output is:
hs = self.decoder(tgt, memory, memory_key_padding_mask=mask, pos=pos_embed, query_pos=query_embed)
return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)
The original DETR code uses the same feature-processing code:
hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]
outputs_class = self.class_embed(hs)
I would like to ask why only the first-layer output is chosen as the feature. Would selecting the seventh (last) layer's output be a better choice? Thank you!
I think the authors made a mistake when they cherry-picked the original DETR code. In the original, the transformer's forward returns a tuple (hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)), so indexing the result with [0] selects the whole stack of per-layer decoder outputs. In this repo the transformer returns hs alone, so the same [0] instead indexes along the layer axis and picks out only the first decoder layer's output.
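The indexing difference can be reproduced with dummy tensors. This is a minimal sketch using numpy stand-ins and hypothetical shapes (7 decoder layers, batch 2, 100 queries, hidden dim 256 are assumptions for illustration, not values from the repo):

```python
import numpy as np

# Hypothetical shapes for illustration only.
num_layers, num_queries, bs, hidden_dim = 7, 100, 2, 256

# With return_intermediate=True, the DETR decoder stacks one output per
# layer: shape [num_layers, num_queries, bs, hidden_dim].
hs = np.zeros((num_layers, num_queries, bs, hidden_dim))
# Mimic hs.transpose(1, 2) -> [num_layers, bs, num_queries, hidden_dim].
hs = hs.swapaxes(1, 2)

# Original DETR: forward returns a TUPLE, so [0] selects the whole
# per-layer stack (the string stands in for the memory tensor here).
original_return = (hs, "memory")
hs_all = original_return[0]
assert hs_all.shape == (num_layers, bs, num_queries, hidden_dim)

# This repo: forward returns hs alone, so the same [0] now indexes the
# layer axis and yields only decoder layer 0's output.
modified_return = hs
hs_first_layer = modified_return[0]
assert hs_first_layer.shape == (bs, num_queries, hidden_dim)
```

So the same `[0]` subscript means "first tuple element" in the original code but "first decoder layer" after the cherry-pick, which is exactly the discrepancy raised above.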