What are the best content and position/anchor query pairs for DETR decoder?

IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.

https://detrex.readthedocs.io/en/latest/

Apache License 2.0


What are the best content and position/anchor query pairs for DETR decoder? #226

Open smartbarbarian opened 1 year ago

smartbarbarian commented 1 year ago

DN-DETR employs static queries, while DINO uses mixed query selection. Later, Mask DINO reverted to the pure query selection of Deformable DETR. For this family of DETR architectures, is there any further research or explanation on which content and position (anchor) query pairs should be used in the decoder?
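For readers following along, the three initialization schemes named above can be sketched roughly as follows. This is a toy illustration in plain Python lists, not code from detrex or from any of the papers. The function `init_queries`, its arguments, and the helper `top_k_indices` are hypothetical names. It assumes "static" means both content and anchor queries are learned, "pure" means both come from the top-k encoder outputs, and "mixed" means learned content paired with selected anchors. The real models apply extra projections and layer norms that are omitted here.

```python
from typing import List, Tuple

Vector = List[float]


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def init_queries(
    mode: str,
    enc_feats: List[Vector],        # per-token encoder features
    enc_boxes: List[Vector],        # per-token box proposals (x, y, w, h)
    enc_scores: List[float],        # per-token objectness scores
    learned_content: List[Vector],  # stand-in for learned content embeddings
    learned_anchors: List[Vector],  # stand-in for learned anchor boxes
    k: int,
) -> Tuple[List[Vector], List[Vector]]:
    """Return (content_queries, anchor_queries) for the decoder (hypothetical API)."""
    if mode == "static":
        # Both parts are learned and independent of the input image.
        return learned_content[:k], learned_anchors[:k]

    # Both selection modes take anchors from the top-k encoder proposals.
    idx = top_k_indices(enc_scores, k)
    anchors = [enc_boxes[i] for i in idx]

    if mode == "pure":
        # Content also comes from the selected encoder tokens.
        content = [enc_feats[i] for i in idx]
        return content, anchors

    if mode == "mixed":
        # Content stays learned; only the anchors are selected.
        return learned_content[:k], anchors

    raise ValueError(f"unknown mode: {mode!r}")
```

The only difference between "pure" and "mixed" in this sketch is where the content half of each pair comes from, which is the choice the question is about.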

FengLi-ust commented 1 year ago

For detection, a learnable content query may work better. Mask DINO mainly focuses on segmentation, which is closely tied to the content query, so we use selected content queries there.

smartbarbarian commented 1 year ago

May I ask if you have any follow-up research on the topic? For example, content and anchor queries from the encoder, along with some learnable embeddings, can be integrated in a variety of ways.

smartbarbarian commented 1 year ago

In DAB-DETR, a fairly complex anchor-query design is used in both self-attention and cross-attention. In DINO, however, you dropped that design and only compared no query selection, pure query selection, and mixed query selection. Could you explain why this change was made?
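As background for the anchor-query design mentioned here, the sketch below shows one way a 4D anchor box can be turned into a sinusoidal positional encoding, which is the first step of that design. It is a simplified illustration in plain Python, not the DAB-DETR implementation. The names `sine_embed` and `anchor_positional_encoding`, the default `num_feats=8`, and the exact frequency schedule are assumptions. The MLP that maps the encoding to the positional query is left out, as is the width/height modulation of cross-attention.

```python
import math
from typing import List


def sine_embed(value: float, num_feats: int = 8, temperature: float = 10000.0) -> List[float]:
    """Sinusoidal embedding of one normalized coordinate in [0, 1] (assumed scheme)."""
    out = []
    for i in range(num_feats):
        # Each (sin, cos) pair shares one frequency; frequencies grow geometrically.
        freq = temperature ** (2 * (i // 2) / num_feats)
        angle = 2 * math.pi * value / freq
        out.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return out


def anchor_positional_encoding(anchor: List[float], num_feats: int = 8) -> List[float]:
    """Concatenate the embeddings of every anchor coordinate, e.g. (x, y, w, h)."""
    encoding: List[float] = []
    for coord in anchor:
        encoding.extend(sine_embed(coord, num_feats))
    return encoding
```

For a 4D anchor this yields a vector of length `4 * num_feats`. A full model would then project it to the decoder dimension before adding it to or concatenating it with the content query.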