Sampson-Lee opened this issue 1 year ago
I think it's because the reference point p_q is a 2-d coordinate, so one hidden_dim-sized set of channels could encode the x-axis and another set of the same size the y-axis.
This is so that the decoder can split the embedding in two: one half of the channels is used as the position embedding and the other half as tgt (in the official code, the positional part comes first in the split). Tgt is a learnable parameter of the deformable decoder, not the decoder output. To obtain the decoder queries, tgt and the position embedding are added together.
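A minimal sketch of that split, assuming the names and single-stage path of the official Deformable-DETR repo (`models/deformable_transformer.py`); the shapes and inline modules here are illustrative:

```python
import torch
import torch.nn as nn

# Sketch of the single-stage query path in Deformable-DETR; names follow the
# official repo, but shapes and the inline modules are illustrative assumptions.
hidden_dim, num_queries, bs = 256, 300, 2

# the query embedding table is twice as wide as the model dimension
query_embeds = nn.Embedding(num_queries, hidden_dim * 2).weight  # (300, 512)

# inside DeformableTransformer.forward: split into positional and content parts
query_embed, tgt = torch.split(query_embeds, hidden_dim, dim=1)
query_embed = query_embed.unsqueeze(0).expand(bs, -1, -1)  # positional query
tgt = tgt.unsqueeze(0).expand(bs, -1, -1)                  # learnable content query

# the 2-d reference point p_q is predicted from the positional part
reference_points = nn.Linear(hidden_dim, 2)(query_embed).sigmoid()

print(query_embed.shape, tgt.shape, reference_points.shape)
# torch.Size([2, 300, 256]) torch.Size([2, 300, 256]) torch.Size([2, 300, 2])
```

So the extra width is not two coordinate banks for x and y; it is two hidden_dim-sized vectors packed into one embedding table.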
@shubham83183 So you mean that Deformable DETR uses an extra learnable parameter to initialize tgt? I was confused about the purpose of that, since in DETR I found tgt initialized as a zero tensor: `tgt = torch.zeros_like(query_embed)` https://github.com/facebookresearch/detr/blob/29901c51d7fe8712168b8d0d64351170bc0f83e0/models/transformer.py#L55
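For reference, here is what I mean on the DETR side; a minimal sketch of the linked `transformer.py` logic (shapes assumed):

```python
import torch
import torch.nn as nn

# In vanilla DETR, query_embed is only a positional embedding of width
# hidden_dim, and the decoder input tgt simply starts as zeros.
hidden_dim, num_queries, bs = 256, 100, 2
query_embed = nn.Embedding(num_queries, hidden_dim).weight
query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)  # (num_queries, bs, hidden_dim)
tgt = torch.zeros_like(query_embed)                      # content query starts at zero
```

So the extra hidden_dim channels in Deformable DETR seem to exist only so that this initial content vector can be learned instead of fixed at zero.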
The paper mentions:

> For each object query, the 2-d normalized coordinate of the reference point p̂_q is predicted from its object query embedding via a learnable linear projection followed by a sigmoid function.

Based on this description, I guessed the last dimension of query_embed would be hidden_dim, but line 58 shows it is 2 * hidden_dim.
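For context, the line in question (quoting from `models/deformable_detr.py` as I remember it, so treat it as approximate):

```python
self.query_embed = nn.Embedding(num_queries, hidden_dim * 2)
```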
Could you share your interpretation? Many thanks.