IDEA-Research / DAB-DETR

[ICLR 2022] Official implementation of the paper "DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR"
Apache License 2.0

Temperature tuning and positional embeddings for DAB-DETR #21

Open YellowPig-zp opened 2 years ago

YellowPig-zp commented 2 years ago

In the paper, a section states that the optimal temperature for the positional embedding in your model is 20. However, this line in gen_sineembed_for_position shows that a temperature of 10000 is used. Is there something I missed while trying to understand the code?

Besides, the paper also says that only the x and y coordinates are used to generate the positional embedding for cross-attention, but this line, despite being commented as num_queries x batch_size x 2, actually operates on a num_queries x batch_size x 4 tensor, as printing the tensor shape shows. Does this perform better than using only x and y, or is the performance similar?

SlongLiu commented 2 years ago

For the first question, you are right; it seems to be a bug in our implementation. For the second, we only use PE(xy) as positional queries; see this line, which slices PE(xywh) to PE(xy). By the way, we use the projected PE(xywh) as the pos query for self-attention.
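
To make the slicing step concrete, here is a minimal pure-Python sketch, not the repository's code. It assumes each coordinate gets a 128-channel sinusoidal embedding and that the four embeddings are concatenated in the order (y, x, w, h), so the first 256 channels depend only on the box center. The names sine_embed, embed_box, NUM_FEATS and D_MODEL are hypothetical and exist only for this example.

```python
import math

NUM_FEATS = 128          # assumed embedding width per coordinate
D_MODEL = 2 * NUM_FEATS  # assumed width of the positional query, PE(xy)


def sine_embed(value, num_feats=NUM_FEATS, temperature=10000.0):
    """Sinusoidal embedding of one normalized coordinate in [0, 1]."""
    angle = value * 2 * math.pi
    embedding = []
    for i in range(num_feats):
        # Channels 2k and 2k+1 share one divisor: sin on even, cos on odd.
        divisor = temperature ** (2 * (i // 2) / num_feats)
        phase = angle / divisor
        embedding.append(math.sin(phase) if i % 2 == 0 else math.cos(phase))
    return embedding


def embed_box(box):
    """PE(xywh): concatenate per-coordinate embeddings (assumed order y, x, w, h)."""
    x, y, w, h = box
    return sine_embed(y) + sine_embed(x) + sine_embed(w) + sine_embed(h)


pe_xywh = embed_box((0.4, 0.6, 0.2, 0.3))  # 4 * 128 = 512 channels
pe_xy = pe_xywh[:D_MODEL]                  # keep only the center part, PE(xy)

# Changing only the width and height leaves the sliced part untouched.
pe_xy_other_size = embed_box((0.4, 0.6, 0.5, 0.9))[:D_MODEL]
print(len(pe_xywh), len(pe_xy), pe_xy == pe_xy_other_size)
```

Under these assumptions, the full 512-channel vector would be what gets projected for the self-attention query, while only the sliced 256 channels act as the positional query in cross-attention.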

vanche commented 2 years ago

Hi, I am confused about the temperature for the positional embedding. Is the code available on GitHub different from the code used for the paper? Or did you use a temperature of 10000?

helq2612 commented 2 years ago

I think temp=20 is currently used only in the image position encoding; see main.py and position_encoding.py. I agree with @YellowPig-zp that temp=20 should also be applied to the box position in the cross-attention, which also ranges from 0 ~ 1.
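
As a rough illustration of why the temperature matters for coordinates in the 0 ~ 1 range, here is a minimal pure-Python sketch of a sinusoidal embedding with a configurable temperature. It is not the repository's gen_sineembed_for_position; the names sine_embed and last_channel_gap, the 128-channel width, and the 2π scaling are assumptions made for the example.

```python
import math


def sine_embed(value, num_feats=128, temperature=10000.0):
    """Sinusoidal embedding of one normalized coordinate in [0, 1]."""
    angle = value * 2 * math.pi
    embedding = []
    for i in range(num_feats):
        # Channels 2k and 2k+1 share one divisor: sin on even, cos on odd.
        divisor = temperature ** (2 * (i // 2) / num_feats)
        phase = angle / divisor
        embedding.append(math.sin(phase) if i % 2 == 0 else math.cos(phase))
    return embedding


def last_channel_gap(a, b, temperature):
    """Absolute difference between two coordinates in the slowest channel."""
    ea = sine_embed(a, temperature=temperature)
    eb = sine_embed(b, temperature=temperature)
    return abs(ea[-1] - eb[-1])


# Compare two well-separated positions under both temperatures.
gap_10000 = last_channel_gap(0.25, 0.75, temperature=10000.0)
gap_20 = last_channel_gap(0.25, 0.75, temperature=20.0)
print(f"temperature=10000: {gap_10000:.2e}")
print(f"temperature=20:    {gap_20:.2e}")
```

In this sketch, the slowest channels barely change across the whole 0 ~ 1 range when the temperature is 10000, so they carry little information about the box position; a smaller temperature spreads the variation over more channels.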