Atten4Vis / ConditionalDETR

This repository is an official implementation of the ICCV 2021 paper "Conditional DETR for Fast Training Convergence". (https://arxiv.org/abs/2108.06152)
Apache License 2.0

Some questions about design choices #2

Closed weii41392 closed 3 years ago

weii41392 commented 3 years ago

Hi, thanks for your excellent work! Although the idea is similar to reference points in Deformable-DETR and SMCA, your extension is quite powerful and may trigger more attempts to optimize this object query setting. It is exciting to see more advances in DETR.

However, I notice some differences from the DETR implementation, and I wonder how these design choices were made. Hopefully you can provide some insights and maybe experimental results.

  1. Projections of content and position embeddings are separated.

All projections are removed from MultiheadAttention and introduced in the decoder layers, and the purpose seems to be separating the projections of the content and position embeddings. Given that this increases the number of parameters, I wonder whether it helps. If so, to what extent?

  2. Concatenation of content and position embeddings.

In the decoder cross-attention, you concatenate the content and position embeddings instead of adding them element-wise. Why? Is this crucial for encoding the coordinate embeddings predicted from the object queries? If so, to what extent?

  3. More about coordinate embedding.

Although object queries in DETR already have the ability to represent object centers (or at least some reference points), I wonder why reference points can be encoded this way (sinusoidal embedding & concatenation).

Specifically, why a sinusoidal position embedding? In the DETR implementation, the object queries themselves are already learnable position embeddings. It is reasonable to apply some projections to the outputs of the sigmoid function, but given that the position embedding has separate projection weights in your implementation, is the sinusoidal position embedding necessary?

From my point of view, these choices seem to be empirical. If so, could you share more about which attempts you have tried? If there is a principled reason behind them, please elaborate.

Thanks a lot!

DeppMeng commented 3 years ago

Hi, thank you for your interest in our work. I hope the following explanations address your concerns.

Q1: Projections of content and position embeddings are separated
Q2: Concatenation of content and position embeddings

Q1 and Q2 are related; both concern the separation of content and position in cross-attention.

(1) Why concatenation: we want to stress that using addition or concatenation is just a form and is not our key contribution. Concatenation instead of addition in cross-attention separates the roles of the content and spatial queries, which is concise and easy to analyze. Please see Section 3.3 of our paper for more details.

(2) We ran a DETR-with-concatenation experiment (the same setup as ours: the projections are separated, and the content and position embeddings are concatenated), and the result is 34.5, which is not higher than DETR with addition (34.9). The lower performance is reasonable: with concatenation, the object queries (learned positional embeddings) do not depend on the image, so the spatial attention weight map does not provide enough help. This experiment illustrates that choosing addition or concatenation by itself does not have much influence on performance.
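For illustration, here is a minimal sketch of these two points (names and shapes are mine, not the repository code): the content and position embeddings get their own projections, and concatenation makes the attention score decompose into a content term plus a spatial term, whereas addition entangles the two.

```python
import torch
import torch.nn as nn

# Illustrative sketch, not the repository code: content and position embeddings
# get separate projections, and concatenation makes the attention score
# decompose into a content term plus a spatial term.
d = 256
proj_content_q, proj_spatial_q = nn.Linear(d, d), nn.Linear(d, d)  # separate query projections
proj_content_k, proj_spatial_k = nn.Linear(d, d), nn.Linear(d, d)  # separate key projections

dec_embed, spatial_query = torch.randn(d), torch.randn(d)  # decoder content / spatial query inputs
enc_feat, pos_embed = torch.randn(d), torch.randn(d)       # encoder content / 2D positional inputs

cq, sq = proj_content_q(dec_embed), proj_spatial_q(spatial_query)
ck, sk = proj_content_k(enc_feat), proj_spatial_k(pos_embed)

# Addition entangles the two roles in a single dot product ...
score_add = (cq + sq) @ (ck + sk)
# ... while concatenation keeps them separate: score = content term + spatial term.
score_cat = torch.cat([cq, sq]) @ torch.cat([ck, sk])
assert torch.allclose(score_cat, cq @ ck + sq @ sk, atol=1e-4)
```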

In our Conditional DETR, by contrast, the learned conditional spatial query is computed from the embedding output by the previous decoder layer, and it contains the displacements of the distinct regions w.r.t. the reference point, so separating the content and position embeddings in cross-attention makes sense.
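As a rough sketch of how such a conditional spatial query can be formed (module and variable names here are illustrative, not the repository's): a vector \lambda_q predicted from the previous decoder embedding f scales the sinusoidal embedding p_s of the reference point s, so the spatial query depends on the decoder content.

```python
import math
import torch
import torch.nn as nn

# Rough sketch with made-up names (not the repository's modules): lambda_q is
# predicted from the previous decoder embedding f and scales the sinusoidal
# embedding p_s of the reference point s to form the conditional spatial query.

def sine_embed(coords, num_feats=128, temperature=10000):
    # coords: (N, 2) normalized reference points in [0, 1] -> (N, 2 * num_feats)
    dim_t = temperature ** (2 * (torch.arange(num_feats) // 2) / num_feats)
    pos = coords.unsqueeze(-1) * 2 * math.pi / dim_t               # (N, 2, num_feats)
    pos = torch.cat((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
    return pos.flatten(1)                                          # (N, 2 * num_feats)

d_model, num_queries = 256, 300
f = torch.randn(num_queries, d_model)   # embeddings from the previous decoder layer
s = torch.rand(num_queries, 2)          # reference points (x, y) predicted from object queries

ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
lambda_q = ffn(f)                            # conditional transformation, one vector per query
p_s = sine_embed(s, num_feats=d_model // 2)  # sinusoidal embedding of the reference points
p_q = lambda_q * p_s                         # conditional spatial query, (num_queries, d_model)
```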

Q3: More about coordinate embedding, is sinusoidal position embedding necessary?

The coordinate embedding is used to form the conditional spatial query. Since the sinusoidal position embedding is used to encode the spatial key in cross-attention, it is natural and necessary to encode the spatial query into the same embedding space. If you visualize the dot product between the sinusoidal embeddings of points in a 2D map and the sinusoidal embedding of a single point P, you will find that the regions around P have a stronger response. Meanwhile, \lambda_q carries the displacement information of the distinct regions w.r.t. the reference point, helping the spatial attention locate the regions of interest.
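If helpful, here is a small, self-contained way to check that dot-product claim numerically (illustrative code, not from the repository; the embedding function is a simplified DETR-style sinusoidal encoding):

```python
import math
import torch

# Quick numerical check of the claim above (illustrative code, not from the
# repository): the dot product between the sinusoidal embeddings of all points
# in a 2D map and the embedding of one point P peaks in the region around P.

def sine_embed_1d(x, num_feats=128, temperature=10000):
    # x: (N,) coordinates in [0, 1] -> (N, num_feats) DETR-style sinusoidal embedding
    dim_t = temperature ** (2 * (torch.arange(num_feats) // 2) / num_feats)
    pos = x.unsqueeze(-1) * 2 * math.pi / dim_t
    return torch.cat((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)

def sine_embed_2d(xy):
    # concatenate per-axis embeddings for (x, y) points
    return torch.cat((sine_embed_1d(xy[..., 0]), sine_embed_1d(xy[..., 1])), dim=-1)

H = W = 32
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (H * W, 2) normalized (x, y) coords
P = torch.tensor([[0.3, 0.7]])                        # a single reference point

response = (sine_embed_2d(grid) @ sine_embed_2d(P).T).reshape(H, W)
row, col = divmod(response.argmax().item(), W)
print(row / (H - 1), col / (W - 1))                   # close to (0.7, 0.3), i.e. (P_y, P_x)
```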

Q4: More attempts

We provide an ablation study on more ways of forming the conditional spatial query in Table 3 of our paper, which is helpful for understanding the design of the conditional spatial query.

I hope this answers your questions.