IDEA-Research / DAB-DETR

[ICLR 2022] Official implementation of the paper "DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR"
Apache License 2.0

Understanding the role of refpoint_embed #13

Closed YellowPig-zp closed 2 years ago

YellowPig-zp commented 2 years ago

I am having some trouble understanding the role of refpoint_embed in the DABDeformableDETR module, particularly what the 4 values represent. Do they represent x, y, w, h, or do they correspond to the 4 levels of the input feature maps? From line 391 in models/dab_deformable_detr/deformable_transformer.py, the four values seem to be multiplied with the valid ratios from the four levels and broadcast along the xywh dimension. Also, when they are fed into the deformable attention, the ordering of the dimensions suggests that the 4 values correspond to the 4 levels. On the other hand, when random_refpoints_xy is used, the first two values seem to represent xy instead? It's a bit confusing.
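For concreteness, a minimal shape sketch of the broadcast in question (the tensor names, sizes, and the torch.cat of the valid ratios are editorial assumptions based on the standard Deformable-DETR decoder, not code copied from the repo):

```python
# Editor's sketch (illustrative shapes, not code copied from the repo).
import torch

bs, num_queries, num_levels = 2, 300, 4

# First decoder layer: the reference points come in without a batch dim, (300, 4), as x, y, w, h.
reference_points = torch.rand(num_queries, 4)

# Per-level valid ratios, (bs, num_levels, 2), duplicated to 4 values for xywh boxes.
src_valid_ratios = torch.rand(bs, num_levels, 2)
ratios = torch.cat([src_valid_ratios, src_valid_ratios], -1)  # (bs, num_levels, 4)

# The indexing at line 391 assumes reference_points already has a batch dim.
# With a (300, 4) tensor, reference_points[:, :, None] is (300, 4, 1), so the
# xywh axis of the boxes accidentally aligns with the level axis of ratios[:, None].
mixed = reference_points[:, :, None] * ratios[:, None]
print(mixed.shape)  # torch.Size([2, 300, 4, 4]) -- shape looks right, but xywh and levels are mixed
```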

SlongLiu commented 2 years ago

Thanks for your issue. It seems to be a bug in our implementation. I think the code at line 391 should be changed to reference_points_input = reference_points[:, None] \ , so better results can be expected once it is fixed. I will run more experiments and fix it later.
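A small sketch of how the proposed indexing changes the broadcast (same assumed shapes as in the sketch above; illustrative only, not the repo's code):

```python
# Editor's sketch of the proposed indexing (assumed shapes, not the repo's code).
import torch

bs, num_queries, num_levels = 2, 300, 4
reference_points = torch.rand(num_queries, 4)                 # first layer: (300, 4), xywh
src_valid_ratios = torch.rand(bs, num_levels, 2)
ratios = torch.cat([src_valid_ratios, src_valid_ratios], -1)  # (bs, num_levels, 4)

# reference_points[:, None] is (300, 1, 4): xywh stays on the last axis,
# so it lines up with the xywh axis of ratios[:, None], which is (bs, 1, num_levels, 4).
reference_points_input = reference_points[:, None] * ratios[:, None]
print(reference_points_input.shape)  # torch.Size([2, 300, 4, 4]) -> [bs, query, level, xywh]
```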

YellowPig-zp commented 2 years ago

Thank you for the reply! It seems the bug only affects the first decoder layer. The shape of reference_points at the line in question is 300x4 for the first layer, while later layers have batchsize x 300 x 4, which broadcasts correctly with the last dimension as xywh. I guess a simple fix is to unsqueeze and expand/repeat the first reference_points so that it has a batch dimension :)
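A sketch of this alternative fix under the same assumed shapes (unsqueeze/expand to add a batch dimension; illustrative only):

```python
# Editor's sketch of the unsqueeze/expand fix (assumed shapes, illustrative only).
import torch

bs, num_queries, num_levels = 2, 300, 4
reference_points = torch.rand(num_queries, 4)                 # first layer: (300, 4), xywh
src_valid_ratios = torch.rand(bs, num_levels, 2)
ratios = torch.cat([src_valid_ratios, src_valid_ratios], -1)  # (bs, num_levels, 4)

# Add an explicit batch dimension so the original indexing broadcasts correctly.
reference_points = reference_points.unsqueeze(0).expand(bs, -1, -1)      # (bs, 300, 4)
reference_points_input = reference_points[:, :, None] * ratios[:, None]  # (bs, 300, 1, 4) * (bs, 1, 4, 4)
print(reference_points_input.shape)  # torch.Size([2, 300, 4, 4]) with the last dim as xywh
```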

SlongLiu commented 2 years ago

Hello, I fixed the bug and reran the DAB-Deformable-DETR models. We achieved a better result of 48.7 AP on COCO with an R50 backbone. Thanks for pointing out the problem.