Atten4Vis / ConditionalDETR

This repository is an official implementation of the ICCV 2021 paper "Conditional DETR for Fast Training Convergence". (https://arxiv.org/abs/2108.06152)
Apache License 2.0

Issues about Positional Embedding and Reference Point #32

Open tae-mo opened 1 year ago

tae-mo commented 1 year ago

Hi, thanks for sharing your wonderful work.

I have a question about this line, https://github.com/Atten4Vis/ConditionalDETR/blob/ead865cbcf88be10175b79165df0836c5fcfc7e3/models/transformer.py#L33 which embeds positional information into the query_pos.

However, I don't understand why 2*(dim_t//2) has to be divided by 128 instead of the actual dimension of pos_tensor (e.g., 256 by default). https://github.com/Atten4Vis/ConditionalDETR/blob/ead865cbcf88be10175b79165df0836c5fcfc7e3/models/transformer.py#L38 Does it still work correctly when dim_t is divided by 128?

I would appreciate being corrected!

And another question: when computing equation (1) in the paper, https://github.com/Atten4Vis/ConditionalDETR/blob/ead865cbcf88be10175b79165df0836c5fcfc7e3/models/conditional_detr.py#L89 can I understand it as the model learning "offsets" from the corresponding reference points? What is the precise role of the reference points?
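For context, my current understanding of that part is roughly the following (a paraphrased sketch, not the repo's exact code; `bbox_head` here stands in for the box-prediction MLP, and only the (x, y) channels are treated as offsets):

```python
import torch
import torch.nn as nn

def predict_boxes(decoder_output, reference_points, bbox_head):
    """Sketch: add predicted (x, y) offsets to the reference points in logit space.

    decoder_output:   (num_queries, hidden_dim) decoder embeddings
    reference_points: (num_queries, 2) normalized (x, y) in (0, 1)
    """
    tmp = bbox_head(decoder_output)                # (num_queries, 4) raw deltas
    ref = torch.logit(reference_points, eps=1e-6)  # inverse sigmoid of reference (x, y)
    tmp = tmp.clone()
    tmp[..., :2] = tmp[..., :2] + ref              # offsets are relative to the reference
    return tmp.sigmoid()                           # normalized boxes in [0, 1]

# example with dummy shapes (300 queries, hidden_dim 256)
bbox_head = nn.Linear(256, 4)
boxes = predict_boxes(torch.randn(300, 256), torch.rand(300, 2), bbox_head)
```

So my reading is that the head never predicts absolute coordinates directly; it predicts a delta that is anchored at the reference point before the final sigmoid. Please correct me if that is wrong.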

Thank you!

Run542968 commented 12 months ago

Hi, for question (1): 2*(dim_t//2) is divided by 128 because the position embedding is computed separately along the x and y directions, each producing 128 dimensions, and the two halves are then concatenated to give the full 256.
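In other words, the logic is roughly this (a minimal paraphrase of the repo's `gen_sineembed_for_position`, with names shortened; the 128 is the per-axis width, not the total embedding size):

```python
import math
import torch

def sine_embed_xy(pos, num_feats=128, temperature=10000):
    """Sketch: sinusoidal embedding of normalized (x, y) positions.

    pos: (N, 2) coordinates in [0, 1]; returns (N, 2 * num_feats).
    """
    scale = 2 * math.pi
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    # 2*(dim_t//2)/num_feats pairs each sin with a cos at the same frequency;
    # dividing by 128 (not 256) is right because each axis gets its own 128 dims
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)
    pos_x = (pos[:, 0] * scale)[:, None] / dim_t
    pos_y = (pos[:, 1] * scale)[:, None] / dim_t
    # interleave sin/cos within each 128-dim half
    pos_x = torch.stack((pos_x[:, 0::2].sin(), pos_x[:, 1::2].cos()), dim=2).flatten(1)
    pos_y = torch.stack((pos_y[:, 0::2].sin(), pos_y[:, 1::2].cos()), dim=2).flatten(1)
    return torch.cat((pos_y, pos_x), dim=1)  # (N, 256): y-half then x-half

emb = sine_embed_xy(torch.rand(5, 2))
```

So each of x and y is embedded into a 128-dim sinusoid, and the concatenation is what produces the 256-dim query_pos.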