MarkFzp / act-plus-plus

Imitation learning algorithms with Co-training for Mobile ALOHA: ACT, Diffusion Policy, VINN
https://mobile-aloha.github.io/
MIT License

About DETR code problem (transformer part) #44

Open darewolf007 opened 3 months ago

darewolf007 commented 3 months ago

Hello, in your DETR code you use the transformer to get an output of shape [bs, hidden_dim, feature_dim]; the code is:

```python
self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]
```

The transformer code is:

```python
hs = self.decoder(tgt, memory, memory_key_padding_mask=mask, pos=pos_embed, query_pos=query_embed)
hs = hs.transpose(1, 2)
return hs
```
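For reference, here is a minimal, runnable sketch of what that indexing implies (the shapes are assumptions based on a DETR-style decoder with `return_intermediate=True`, not values read from the repository). When the decoder stacks every layer's output, the leading axis is the layer index, so `[0]` selects the first decoder layer rather than the final one:

```python
# Minimal sketch with hypothetical shapes, not the repository's tensors.
# With return_intermediate=True, a DETR-style decoder stacks all layer
# outputs along a leading num_layers axis.
import torch

num_layers, num_queries, bs, hidden_dim = 7, 100, 2, 256

# Shape inside the transformer before the transpose:
hs = torch.randn(num_layers, num_queries, bs, hidden_dim)
hs = hs.transpose(1, 2)  # -> [num_layers, bs, num_queries, hidden_dim]

first_layer = hs[0]      # [bs, num_queries, hidden_dim] -- decoder layer 1
last_layer = hs[-1]      # [bs, num_queries, hidden_dim] -- decoder layer 7
print(first_layer.shape, last_layer.shape)
```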

Based on my understanding, in your code you only take the first decoder layer's output as the feature for predicting the action. However, in the original DETR code the transformer output is:

```python
hs = self.decoder(tgt, memory, memory_key_padding_mask=mask, pos=pos_embed, query_pos=query_embed)
return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)
```
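The practical difference is what `[0]` selects. In the original DETR, `forward` returns a tuple, so `[0]` unpacks the tuple and keeps the outputs of all decoder layers; when `forward` returns `hs` directly, the same `[0]` indexes the layer axis instead. A small sketch with made-up tensors standing in for the real outputs (shapes are assumptions):

```python
# Stand-in tensors with assumed shapes, not the real transformer outputs.
import torch

num_layers, bs, num_queries, hidden_dim = 7, 2, 100, 256
hs = torch.randn(num_layers, bs, num_queries, hidden_dim)
memory = torch.randn(bs, hidden_dim, 15, 20)

# Original DETR: forward returns a tuple, so [0] unpacks the tuple and
# keeps ALL decoder layers.
detr_out = (hs, memory)
hs_all = detr_out[0]   # [7, 2, 100, 256]

# A forward that returns hs directly: [0] now indexes the layer axis and
# silently keeps only the FIRST decoder layer.
act_out = hs
hs_first = act_out[0]  # [2, 100, 256]
print(hs_all.shape, hs_first.shape)
```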

The original DETR code uses the same feature-processing code:

```python
hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]
outputs_class = self.class_embed(hs)
```
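Downstream, the original DETR applies the prediction head to every layer's output at once and uses the last layer's predictions at inference (earlier layers feed the auxiliary losses). A hedged sketch with assumed dimensions:

```python
# Sketch of how the stacked hs is consumed in the original DETR
# (assumed dimensions; auxiliary-loss bookkeeping omitted).
import torch
import torch.nn as nn

num_layers, bs, num_queries, hidden_dim, num_classes = 7, 2, 100, 256, 91
hs = torch.randn(num_layers, bs, num_queries, hidden_dim)

class_embed = nn.Linear(hidden_dim, num_classes + 1)  # +1 for the background class
outputs_class = class_embed(hs)   # [7, 2, 100, 92] -- one prediction per layer
final_logits = outputs_class[-1]  # [2, 100, 92] -- the LAST decoder layer is used
print(final_logits.shape)
```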

I would like to ask why only the first-layer output is chosen as the feature. Would selecting the seventh layer be a better choice? Thank you!!!

ka2hyeon commented 3 months ago

I think the authors made a mistake when they cherry-picked the original DETR code.
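If that diagnosis is right, a plausible fix (a sketch under that assumption, not a confirmed patch from the authors) is to index the last decoder layer instead of the first at the call site. With a stub standing in for the real `self.transformer(...)` call:

```python
# transformer_stub is a hypothetical stand-in for self.transformer(...),
# assuming it returns the stacked per-layer hs directly.
import torch

def transformer_stub(num_layers=7, bs=2, num_queries=100, hidden_dim=256):
    return torch.randn(num_layers, bs, num_queries, hidden_dim)

hs_stacked = transformer_stub()

feature_current = hs_stacked[0]    # what [0] selects today: decoder layer 1
feature_proposed = hs_stacked[-1]  # last decoder layer, matching DETR's usage
print(feature_current.shape, feature_proposed.shape)  # both [2, 100, 256]
```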