fundamentalvision/Deformable-DETR

Deformable DETR: Deformable Transformers for End-to-End Object Detection.

Why is the last dimension of query_embed 2*hidden_dim instead of hidden_dim? #179

Open Sampson-Lee opened 1 year ago

Sampson-Lee commented 1 year ago

The paper mentions:

For each object query, the 2-d normalized coordinate of the reference point p_q is predicted from its object query embedding via a learnable linear projection followed by a sigmoid function.

Based on this description, I would expect the last dimension of query_embed to be hidden_dim, but line 58 shows it is 2*hidden_dim.

Could you share the interpretation? Many thanks.
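
For context, a minimal standalone sketch of the two pieces being compared, assuming hidden_dim=256 and num_queries=300 (variable names are illustrative; only the shapes matter here):

```python
import torch
import torch.nn as nn

hidden_dim, num_queries = 256, 300

# The 2 * hidden_dim query embedding the question asks about.
query_embed = nn.Embedding(num_queries, hidden_dim * 2)

# The mechanism quoted from the paper: reference points p_q come from a
# learnable linear projection of a hidden_dim-sized embedding, then a sigmoid.
reference_points = nn.Linear(hidden_dim, 2)
p_q = reference_points(torch.randn(num_queries, hidden_dim)).sigmoid()
print(p_q.shape)  # torch.Size([300, 2]): normalized 2-d coordinates in (0, 1)
```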

monstre0731 commented 1 year ago

I think it's because the reference point p_q is a 2-d coordinate: one set of hidden_dim values for the x axis and another set of the same size for the y axis.

shubham83183 commented 1 year ago

This is so that the decoder can split query_embed along its last dimension, using the first half as the positional embedding and the second half as tgt. In Deformable DETR, tgt is a learnable parameter of the decoder, not the decoder output. The decoder queries are obtained by adding tgt and the positional embedding.
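
A minimal sketch of that split, following the pattern in deformable_transformer.py (the tensor here is random; in the model it is the learned embedding weight):

```python
import torch

hidden_dim, num_queries = 256, 300

# Stand-in for the learned (num_queries, 2 * hidden_dim) embedding table.
query_embeds = torch.randn(num_queries, hidden_dim * 2)

# Split along the last dimension: positional embedding first, tgt second.
query_pos, tgt = torch.split(query_embeds, hidden_dim, dim=1)

# Decoder queries combine both halves; reference points are predicted
# from query_pos via a linear layer followed by a sigmoid.
decoder_query = tgt + query_pos
print(query_pos.shape, tgt.shape)  # torch.Size([300, 256]) torch.Size([300, 256])
```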

Tonsty commented 2 months ago

@shubham83183 So you mean that Deformable DETR uses extra parameters to initialize tgt? I was confused about the purpose of that, since in DETR tgt is initialized as a zero tensor: tgt = torch.zeros_like(query_embed) (https://github.com/facebookresearch/detr/blob/29901c51d7fe8712168b8d0d64351170bc0f83e0/models/transformer.py#L55)
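
To make the contrast concrete, a small side-by-side sketch (shapes are illustrative, not taken from either repo's config):

```python
import torch
import torch.nn as nn

hidden_dim, num_queries = 256, 300

# DETR: only the positional query embedding is learned; tgt starts at zero.
query_embed_detr = nn.Embedding(num_queries, hidden_dim).weight
tgt_detr = torch.zeros_like(query_embed_detr)

# Deformable DETR: a table twice as wide is learned and split in two,
# so tgt is a learned tensor rather than a fixed zero initialization.
query_embeds = nn.Embedding(num_queries, hidden_dim * 2).weight
query_pos, tgt_deform = torch.split(query_embeds, hidden_dim, dim=1)
```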