dddzg / up-detr

[TPAMI 2022 & CVPR2021 Oral] UP-DETR: Unsupervised Pre-training for Object Detection with Transformers
Apache License 2.0

Deformable DETR support #7

Closed FrancescoCappio closed 3 years ago

FrancescoCappio commented 3 years ago

Hello! I am really interested in your work, as I think it is necessary for successfully exploiting DETR in real-world applications. At ICLR 2021 an improvement of DETR called "Deformable-DETR" was proposed, with a number of modifications in the transformer part of the network that improve performance and reduce computational complexity. Are you planning to support Deformable DETR and provide a pre-trained model for it as well? I think this could broaden the impact of your pre-training approach, as more people could make use of it.

Code for Deformable DETR is available: https://github.com/fundamentalvision/Deformable-DETR

Thanks in advance

dddzg commented 3 years ago

Thanks for your attention! We noticed the awesome work, Deformable-DETR. Here are my opinions on it.

  1. In my opinion, deformable attention is not a global attention mechanism (see more discussion at https://openreview.net/forum?id=gZ9hCDWe6ke&noteId=x1VT5henOtF). It is closer to a sparsely sampled deformable convolution: each query attends to only a few sampled points rather than the whole feature map (see the sketch after this list). Deformable attention can replace the self-attention in the encoder and the cross-attention in the decoder, so it converges much faster than DETR and extends to multi-scale feature maps thanks to the sparse sampling. However, it is hard to replace the self-attention in the decoder with deformable attention, because that self-attention needs global attention to perform an NMS-like suppression of duplicate predictions.
  2. As discussed in 1, the two kinds of attention in Deformable-DETR are sparsely connected. We suspect the improvement from pre-training would be very limited for Deformable-DETR (similar to the findings of https://arxiv.org/abs/1811.08883), so we may not provide Deformable DETR support. If it does work, the improvement may come from the pre-trained self-attention in the decoder. You can have a try.
  3. Comparisons with Deformable-DETR: as far as we observe, UP-DETR still performs a little better on large objects with a single-scale feature map, while Deformable-DETR performs better on small and medium objects by making full use of multi-scale features.
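
To make point 1 concrete, here is a minimal, single-scale, single-head PyTorch sketch of the sparse-sampling idea. This is not the actual Deformable-DETR implementation (which uses multi-scale features, multiple heads, and a custom CUDA kernel); the module name, signature, and the fixed number of sampling points are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttentionSketch(nn.Module):
    """Illustrative single-scale, single-head deformable attention.

    Each query attends to only n_points sampled locations (reference
    point + predicted offsets) instead of all H*W positions, which is
    why it behaves more like a sparsely sampled deformable convolution
    than like global attention.
    """
    def __init__(self, d_model=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.sampling_offsets = nn.Linear(d_model, n_points * 2)
        self.attention_weights = nn.Linear(d_model, n_points)
        self.value_proj = nn.Conv2d(d_model, d_model, kernel_size=1)
        self.output_proj = nn.Linear(d_model, d_model)

    def forward(self, query, reference_points, feat):
        # query: (N, Lq, C); reference_points: (N, Lq, 2) in [0, 1] as (x, y)
        # feat: (N, C, H, W) single-scale feature map
        N, Lq, _ = query.shape
        H, W = feat.shape[-2:]
        value = self.value_proj(feat)  # (N, C, H, W)

        # Predict per-query pixel offsets and weights over the K sampled points.
        offsets = self.sampling_offsets(query).reshape(N, Lq, self.n_points, 2)
        weights = self.attention_weights(query).softmax(-1)  # (N, Lq, K)

        # Normalized sampling locations, mapped to [-1, 1] for grid_sample.
        scale = torch.tensor([W, H], dtype=query.dtype, device=query.device)
        locs = reference_points.unsqueeze(2) + offsets / scale  # (N, Lq, K, 2)
        grid = 2.0 * locs - 1.0
        sampled = F.grid_sample(value, grid, align_corners=False)  # (N, C, Lq, K)

        # Weighted sum over only the K sampled points -- no H*W attention matrix.
        out = (sampled * weights.unsqueeze(1)).sum(-1)  # (N, C, Lq)
        return self.output_proj(out.transpose(1, 2))  # (N, Lq, C)
```

For example, with hypothetical shapes matching a DETR-style decoder:

```python
attn = DeformableAttentionSketch(d_model=256, n_points=4)
feat = torch.randn(2, 256, 32, 32)   # backbone feature map
queries = torch.randn(2, 100, 256)   # decoder queries
refs = torch.rand(2, 100, 2)         # normalized reference points
out = attn(queries, refs, feat)      # (2, 100, 256)
```

Note that nothing here compares queries against each other, which is exactly why this kind of module cannot substitute for the decoder's global self-attention and its NMS-like role.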