Atten4Vis / ConditionalDETR

This repository is an official implementation of the ICCV 2021 paper "Conditional DETR for Fast Training Convergence". (https://arxiv.org/abs/2108.06152)
Apache License 2.0

Issues about Positional Embedding and Reference Point #32

Open tae-mo opened 1 year ago

tae-mo commented 1 year ago

Hi, thanks for sharing your wonderful work.

I have a question about this line, https://github.com/Atten4Vis/ConditionalDETR/blob/ead865cbcf88be10175b79165df0836c5fcfc7e3/models/transformer.py#L33 which embeds positional information into the query_pos.

However, I don't understand why 2*(dim_t//2) has to be divided by 128 instead of the actual dimension of pos_tensor (e.g., 256 by default). https://github.com/Atten4Vis/ConditionalDETR/blob/ead865cbcf88be10175b79165df0836c5fcfc7e3/models/transformer.py#L38 Does it still work correctly when dim_t is divided by 128?

I would appreciate being corrected!

And another question: when computing equation (1) in the paper, https://github.com/Atten4Vis/ConditionalDETR/blob/ead865cbcf88be10175b79165df0836c5fcfc7e3/models/conditional_detr.py#L89 can I understand it as the model learning "offsets" from the corresponding reference points? What is the precise role of the reference points?
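For context, my current understanding of that part is roughly the following (a paraphrased sketch, not the repo's exact code; `bbox_head` here stands in for the box-prediction MLP, and only the (x, y) channels are treated as offsets):

```python
import torch
import torch.nn as nn

def predict_boxes(decoder_output, reference_points, bbox_head):
    """Sketch: add predicted (x, y) offsets to the reference points in logit space.

    decoder_output:   (num_queries, hidden_dim) decoder embeddings
    reference_points: (num_queries, 2) normalized (x, y) in (0, 1)
    """
    tmp = bbox_head(decoder_output)                # (num_queries, 4) raw deltas
    ref = torch.logit(reference_points, eps=1e-6)  # inverse sigmoid of reference (x, y)
    tmp = tmp.clone()
    tmp[..., :2] = tmp[..., :2] + ref              # offsets are relative to the reference
    return tmp.sigmoid()                           # normalized boxes in [0, 1]

# example with dummy shapes (300 queries, hidden_dim 256)
bbox_head = nn.Linear(256, 4)
boxes = predict_boxes(torch.randn(300, 256), torch.rand(300, 2), bbox_head)
```

So my reading is that the head never predicts absolute coordinates directly; it predicts a delta that is anchored at the reference point before the final sigmoid. Please correct me if that is wrong.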

Thank you!

Run542968 commented 12 months ago

Hi, for question (1): 2*(dim_t//2) is divided by 128 because the position embedding is computed separately along the x and y directions, each producing 128 dimensions, and the two halves are then concatenated to give the full 256.
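In other words, the logic is roughly this (a minimal paraphrase of the repo's `gen_sineembed_for_position`, with names shortened; the 128 is the per-axis width, not the total embedding size):

```python
import math
import torch

def sine_embed_xy(pos, num_feats=128, temperature=10000):
    """Sketch: sinusoidal embedding of normalized (x, y) positions.

    pos: (N, 2) coordinates in [0, 1]; returns (N, 2 * num_feats).
    """
    scale = 2 * math.pi
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    # 2*(dim_t//2)/num_feats pairs each sin with a cos at the same frequency;
    # dividing by 128 (not 256) is right because each axis gets its own 128 dims
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)
    pos_x = (pos[:, 0] * scale)[:, None] / dim_t
    pos_y = (pos[:, 1] * scale)[:, None] / dim_t
    # interleave sin/cos within each 128-dim half
    pos_x = torch.stack((pos_x[:, 0::2].sin(), pos_x[:, 1::2].cos()), dim=2).flatten(1)
    pos_y = torch.stack((pos_y[:, 0::2].sin(), pos_y[:, 1::2].cos()), dim=2).flatten(1)
    return torch.cat((pos_y, pos_x), dim=1)  # (N, 256): y-half then x-half

emb = sine_embed_xy(torch.rand(5, 2))
```

So each of x and y is embedded into a 128-dim sinusoid, and the concatenation is what produces the 256-dim query_pos.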