fundamentalvision / Deformable-DETR

Deformable DETR: Deformable Transformers for End-to-End Object Detection.
Apache License 2.0

Clarification about the feature maps used in Deformable DETR #201

Open mburges-cvl opened 1 year ago

mburges-cvl commented 1 year ago

Hello,

I am slightly confused about the number of feature maps used in the Deformable DETR code.

In the ArgumentParser, you specify that 4 feature maps (feature levels) are used by default:

https://github.com/fundamentalvision/Deformable-DETR/blob/11169a60c33333af00a4849f1808023eba96a931/main.py#L64
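
To make clear which setting I mean, here is a minimal sketch of such an option. It is not the repository's actual parser: `build_parser` is a hypothetical helper, and the flag name `--num_feature_levels` with a default of 4 is my reading of the linked line.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the parser in main.py; only the single
    # option discussed here is reproduced.
    parser = argparse.ArgumentParser("feature-level sketch")
    # Assumption: the flag name and its default of 4 follow the linked line.
    parser.add_argument("--num_feature_levels", default=4, type=int,
                        help="number of feature levels fed to the transformer")
    return parser


# With no command-line override, the default number of levels is used.
args = build_parser().parse_args([])
print(args.num_feature_levels)
```

Passing `--num_feature_levels 3` on the command line would override the default.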

However, the backbone code returns "only" 3 feature maps from ResNet:

https://github.com/fundamentalvision/Deformable-DETR/blob/11169a60c33333af00a4849f1808023eba96a931/models/backbone.py#L76

so you generate an additional "artificial" feature map in the code:

https://github.com/fundamentalvision/Deformable-DETR/blob/11169a60c33333af00a4849f1808023eba96a931/models/deformable_detr.py#L140-L152

which takes the last feature map obtained from ResNet and passes it through an additional input projection, separate from the projections applied to the three ResNet feature maps.
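
If I read it correctly, the resulting spatial sizes can be sketched as follows. This is only shape arithmetic, not the repository's code: `feature_level_sizes` and `conv_out` are hypothetical helpers, and I am assuming backbone strides of 8, 16 and 32 and that each extra level is a 3x3 convolution with stride 2 and padding 1.

```python
def conv_out(size, kernel, stride, padding):
    # Standard output-size formula for a convolution without dilation.
    return (size + 2 * padding - kernel) // stride + 1


def feature_level_sizes(height, width, num_feature_levels=4,
                        backbone_strides=(8, 16, 32)):
    # Assumption: each backbone map is the input downsampled by its stride
    # (rounded up). A 1x1 projection would leave these sizes unchanged.
    levels = [(-(-height // s), -(-width // s)) for s in backbone_strides]
    # Assumption: every extra level is a 3x3 conv with stride 2 and padding 1,
    # starting from the last backbone map.
    h, w = levels[-1]
    for _ in range(num_feature_levels - len(backbone_strides)):
        h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
        levels.append((h, w))
    return levels


sizes = feature_level_sizes(512, 512)
print(sizes)  # [(64, 64), (32, 32), (16, 16), (8, 8)]
```

Under these assumptions, the fourth map is simply a further 2x downsampling of the coarsest ResNet map, not a finer one.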

  1. Do I understand this correctly?
  2. Why exactly did you add this "artificial" feature map?
  3. Why did you choose this over using earlier feature maps, such as layer1 or even the layer before the maxpool? These are part of the code, but commented out:

https://github.com/fundamentalvision/Deformable-DETR/blob/11169a60c33333af00a4849f1808023eba96a931/models/backbone.py#L75

Thanks for the great model!