Clarification about the feature maps used in Deformable DETR

Hello,

I am slightly confused about the number of feature maps used in the Deformable DETR code.

In the ArgumentParser you define that you will use 4 feature maps:

However, in the code you "only" return 3 feature maps from ResNet:

So you generate an additional "artificial" feature map in the code:

As far as I can tell, this extra feature map is computed from the last ResNet feature map using an additional input projection, and it is then used together with the feature maps obtained from ResNet.
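To make my reading concrete, here is a minimal shape-only sketch in plain Python. It is not the repository's code: the function names are hypothetical, the channel counts and strides assume a ResNet-50 backbone returning layer2 to layer4, `hidden_dim=256` is an assumed value, and I assume the extra level is a 3x3 convolution with stride 2 and padding 1 applied to the last ResNet feature map. Input sizes are assumed to be divisible by 32 to keep the arithmetic exact.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution along one spatial axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_level_shapes(height, width, hidden_dim=256, num_feature_levels=4):
    """Return (source, channels, height, width) for each level after projection."""
    if height % 32 or width % 32:
        raise ValueError("this sketch assumes input sizes divisible by 32")

    # Assumed backbone outputs as (name, channels, stride) for ResNet-50.
    backbone = [("layer2", 512, 8), ("layer3", 1024, 16), ("layer4", 2048, 32)]

    levels = []
    for name, _channels, stride in backbone:
        # Assumed 1x1 projection: channels -> hidden_dim, spatial size unchanged.
        levels.append((name, hidden_dim, height // stride, width // stride))

    # Extra "artificial" levels: assumed 3x3 conv, stride 2, padding 1,
    # starting from the spatial size of the last backbone feature map.
    h, w = levels[-1][2], levels[-1][3]
    for _ in range(len(backbone), num_feature_levels):
        h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
        levels.append(("extra", hidden_dim, h, w))
    return levels


for level in feature_level_shapes(800, 1216):
    print(level)
```

Under these assumptions, an 800x1216 input gives three projected ResNet levels plus one extra level at roughly stride 64, for a total of 4 feature maps.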

Do I understand this correctly?
Why exactly did you add this "artificial" feature map?
Why did you choose this over using earlier feature maps, such as layer1 or even the layer before the maxpool (both are part of the code, but commented out):

Thanks for the great model!
