IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0

Some questions on DAB-DETR #160

Closed: xyupeng closed this issue 1 year ago

xyupeng commented 1 year ago

Hi there, thanks for the great repo and the nice compilation of so many powerful DETR methods. While reading the code, I had trouble understanding a few parts of it and hope to get some advice here.

The first is https://github.com/IDEA-Research/detrex/blob/75318d139157ea2dbf0b3100a3dc3623d9740878/projects/dab_detr/modeling/dab_transformer.py#L208-L220 where the first layer is indicated by is_first_layer. Only in the first decoder layer does the cross_attn's query_content get query_pos added to it. Why is that? I thought query_pos was already added to the query in self_attn, before cross_attn, in the first layer. What happens if I remove the addition of query_pos in the first cross_attn?
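For readers following along, the pattern in question can be sketched roughly as below. This is a simplified illustration and not the detrex code: the projection names (sa_q, sa_k, sa_v, ca_qcontent, ca_qpos), the tensor shapes, and the single-head attention are all hypothetical, and residual connections and normalization are left out. It assumes the content queries start as zeros and that the self-attention value path carries no positional term. Under those assumptions, adding query_pos to the self-attention query and key changes only the attention weights, so every query still ends up with the same content vector.

```python
import torch
from torch import nn

torch.manual_seed(0)
num_queries, dim = 4, 8

# Assumption: content queries start as zeros; query_pos differs per query.
content = torch.zeros(num_queries, dim)
query_pos = torch.randn(num_queries, dim)

# Hypothetical stand-ins for the layer's learned projections.
sa_q, sa_k, sa_v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
ca_qcontent, ca_qpos = nn.Linear(dim, dim), nn.Linear(dim, dim)


def self_attn(content, pos):
    # Position enters the query and key, but not the value (assumption).
    q = sa_q(content + pos)
    k = sa_k(content + pos)
    v = sa_v(content)
    weights = torch.softmax(q @ k.T / dim ** 0.5, dim=-1)
    return weights @ v


def cross_attn_query(content, pos, is_first_layer):
    # Only the first layer adds the projected query_pos to the content query.
    q = ca_qcontent(content)
    if is_first_layer:
        q = q + ca_qpos(pos)
    return q


def rows_identical(x):
    # True when every row of x equals the first row (up to float error).
    return torch.allclose(x, x[0].expand_as(x), atol=1e-5)


with torch.no_grad():
    after_self_attn = self_attn(content, query_pos)
    q_without_pos = cross_attn_query(after_self_attn, query_pos, is_first_layer=False)
    q_with_pos = cross_attn_query(after_self_attn, query_pos, is_first_layer=True)

print(rows_identical(after_self_attn))  # are all content vectors still the same?
print(rows_identical(q_without_pos))    # cross-attn queries without query_pos
print(rows_identical(q_with_pos))       # cross-attn queries with query_pos
```

In this toy setup the value rows are all equal to the same bias vector, and a weighted average of identical rows is that same row, which is why the self-attention output cannot tell the queries apart.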

The second is https://github.com/IDEA-Research/detrex/blob/75318d139157ea2dbf0b3100a3dc3623d9740878/projects/dab_detr/modeling/dab_transformer.py#L223-L231 where reference_boxes for the next layer is a detached version of new_reference_boxes. What is the purpose of detach() here? Does removing it affect performance much?
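As a point of reference, the mechanical effect of detach() in this kind of layer-by-layer box refinement can be sketched as follows. This is a toy illustration under stated assumptions, not the detrex implementation: refine, layer2_grad_wrt_delta1, and the delta tensors are hypothetical names, and the per-layer box offsets are plain tensors rather than the output of a prediction head. It shows only how gradients flow, and says nothing about how much accuracy changes.

```python
import torch


def inverse_sigmoid(x, eps=1e-5):
    # Map values in (0, 1) back to logit space, clamped for numerical safety.
    x = x.clamp(min=eps, max=1 - eps)
    return torch.log(x / (1 - x))


def refine(reference_boxes, deltas_per_layer, detach=True):
    # Each layer predicts an offset in logit space relative to the current boxes.
    outputs = []
    for delta in deltas_per_layer:
        new_reference_boxes = torch.sigmoid(delta + inverse_sigmoid(reference_boxes))
        outputs.append(new_reference_boxes)
        # With detach, the next layer treats these boxes as constants.
        reference_boxes = new_reference_boxes.detach() if detach else new_reference_boxes
    return outputs


def layer2_grad_wrt_delta1(detach):
    # Gradient that a loss on layer 2's boxes sends back to layer 1's offsets.
    delta1 = torch.zeros(2, 4, requires_grad=True)
    delta2 = torch.zeros(2, 4, requires_grad=True)
    init_boxes = torch.full((2, 4), 0.5)
    outputs = refine(init_boxes, [delta1, delta2], detach=detach)
    outputs[1].sum().backward()
    return delta1.grad


grad_with_detach = layer2_grad_wrt_delta1(detach=True)
grad_without_detach = layer2_grad_wrt_delta1(detach=False)
print(grad_with_detach)
print(grad_without_detach)
```

With detach, a loss on a later layer's boxes updates only that layer's offsets; without it, the same loss also reaches back through every earlier layer's box prediction.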

The third is about initialization: https://github.com/IDEA-Research/detrex/blob/75318d139157ea2dbf0b3100a3dc3623d9740878/projects/dab_detr/modeling/dab_detr.py#L129-L130 What is the idea behind initializing the last fc layer of the MLP to all zeros?
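To make the setup concrete, here is a minimal sketch of what zeroing the last layer does at initialization. The MLP class, the dimensions, and the variable names are assumptions made for illustration, and both the weight and the bias of the last layer are zeroed here. It shows only the immediate consequence: the predicted offset is exactly zero, so the refined boxes start out equal to the reference boxes.

```python
import torch
from torch import nn


class MLP(nn.Module):
    """Small feed-forward network (hypothetical stand-in for a box head)."""

    def __init__(self, in_dim, hidden_dim, out_dim, num_layers):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (num_layers - 1) + [out_dim]
        self.layers = nn.ModuleList(
            nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:])
        )

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i < len(self.layers) - 1:
                x = torch.relu(x)
        return x


torch.manual_seed(0)
bbox_embed = MLP(256, 256, 4, 3)

# Zero the last fully connected layer so the head outputs zeros at the start.
nn.init.constant_(bbox_embed.layers[-1].weight, 0)
nn.init.constant_(bbox_embed.layers[-1].bias, 0)

hidden = torch.randn(5, 256)                      # decoder output (made-up data)
reference = torch.rand(5, 4).clamp(0.05, 0.95)    # normalized boxes (made-up data)

with torch.no_grad():
    delta = bbox_embed(hidden)
    refined = torch.sigmoid(delta + torch.logit(reference))

print(delta.abs().max())
print((refined - reference).abs().max())
```

Because only the last layer is zeroed, the earlier layers keep their random initialization and the last layer's weights still receive nonzero gradients, so the head can move away from zero once training begins.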

Sorry for asking so many questions at once. It's totally OK if there's no 'standard' answer. Any advice that helps in understanding this great model would be very much appreciated.

Thanks in advance!

SlongLiu commented 1 year ago

Thanks for your questions.

  1. It is borrowed from Conditional DETR. As the first layer's query_content vectors are all zeros, we add query_pos to query_content to distinguish different queries.
  2. It is borrowed from Deformable DETR. The detach operation helps stabilize training.
  3. It is borrowed from Deformable DETR. Zero initialization gives better results.

xyupeng commented 1 year ago

Got them. Thanks!