dddzg / up-detr

[TPAMI 2022 & CVPR2021 Oral] UP-DETR: Unsupervised Pre-training for Object Detection with Transformers
Apache License 2.0

Deformable DETR support #7

Closed FrancescoCappio closed 3 years ago

FrancescoCappio commented 3 years ago

Hello! I am really interested in your work, as I think it is necessary for successfully exploiting DETR in real-world applications. At ICLR 2021 an improvement of DETR called "Deformable-DETR" was proposed, with a number of modifications in the transformer part of the network that improve performance and reduce computational complexity. Are you planning to support Deformable DETR and provide a pre-trained model for it as well? I think this could broaden the impact of your pre-training approach, as more people could make use of it.

Code for Deformable DETR is available: https://github.com/fundamentalvision/Deformable-DETR

Thanks in advance

dddzg commented 3 years ago

Thanks for your attention! We noticed the awesome work, Deformable-DETR. Here are my opinions on it.

  1. In my opinion, deformable attention is not a global attention mechanism (see more discussion at https://openreview.net/forum?id=gZ9hCDWe6ke&noteId=x1VT5henOtF). It is closer to a sparsely sampled deformable convolution: each query attends to only a few sampled points rather than the whole feature map (see the sketch after this list). Deformable attention can replace the self-attention in the encoder and the cross-attention in the decoder, so it converges much faster than DETR and extends to multi-scale feature maps thanks to the sparse sampling. However, it is hard to replace the self-attention in the decoder with deformable attention, because that self-attention needs global attention to perform an NMS-like suppression of duplicate predictions.
  2. As discussed in 1, the two kinds of attention in Deformable-DETR are sparsely connected. We suspect the improvement from pre-training would be very limited for Deformable-DETR (similar to the findings of https://arxiv.org/abs/1811.08883), so we may not provide Deformable DETR support. If it does work, the improvement may come from the pre-trained self-attention in the decoder. You can have a try.
  3. Comparisons with Deformable-DETR: as far as we observe, UP-DETR still performs a little better on large objects with a single-scale feature map, while Deformable-DETR performs better on small and medium objects by making full use of multi-scale features.
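
To make point 1 concrete, here is a minimal, single-scale, single-head PyTorch sketch of the sparse-sampling idea. This is not the actual Deformable-DETR implementation (which uses multi-scale features, multiple heads, and a custom CUDA kernel); the module name, signature, and the fixed number of sampling points are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttentionSketch(nn.Module):
    """Illustrative single-scale, single-head deformable attention.

    Each query attends to only n_points sampled locations (reference
    point + predicted offsets) instead of all H*W positions, which is
    why it behaves more like a sparsely sampled deformable convolution
    than like global attention.
    """
    def __init__(self, d_model=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.sampling_offsets = nn.Linear(d_model, n_points * 2)
        self.attention_weights = nn.Linear(d_model, n_points)
        self.value_proj = nn.Conv2d(d_model, d_model, kernel_size=1)
        self.output_proj = nn.Linear(d_model, d_model)

    def forward(self, query, reference_points, feat):
        # query: (N, Lq, C); reference_points: (N, Lq, 2) in [0, 1] as (x, y)
        # feat: (N, C, H, W) single-scale feature map
        N, Lq, _ = query.shape
        H, W = feat.shape[-2:]
        value = self.value_proj(feat)  # (N, C, H, W)

        # Predict per-query pixel offsets and weights over the K sampled points.
        offsets = self.sampling_offsets(query).reshape(N, Lq, self.n_points, 2)
        weights = self.attention_weights(query).softmax(-1)  # (N, Lq, K)

        # Normalized sampling locations, mapped to [-1, 1] for grid_sample.
        scale = torch.tensor([W, H], dtype=query.dtype, device=query.device)
        locs = reference_points.unsqueeze(2) + offsets / scale  # (N, Lq, K, 2)
        grid = 2.0 * locs - 1.0
        sampled = F.grid_sample(value, grid, align_corners=False)  # (N, C, Lq, K)

        # Weighted sum over only the K sampled points -- no H*W attention matrix.
        out = (sampled * weights.unsqueeze(1)).sum(-1)  # (N, C, Lq)
        return self.output_proj(out.transpose(1, 2))  # (N, Lq, C)
```

For example, with hypothetical shapes matching a DETR-style decoder:

```python
attn = DeformableAttentionSketch(d_model=256, n_points=4)
feat = torch.randn(2, 256, 32, 32)   # backbone feature map
queries = torch.randn(2, 100, 256)   # decoder queries
refs = torch.rand(2, 100, 2)         # normalized reference points
out = attn(queries, refs, feat)      # (2, 100, 256)
```

Note that nothing here compares queries against each other, which is exactly why this kind of module cannot substitute for the decoder's global self-attention and its NMS-like role.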