facebookresearch / detr

End-to-End Object Detection with Transformers
Apache License 2.0

Does anyone know the performance of DETR without the Transformer part (only with ResNet)? #559

seaman1900 closed this issue 1 year ago

seaman1900 commented 1 year ago

In the paper I don't find any explanation of the benefit of the Transformer part; maybe DETR would still work well without the Transformer (just guessing).

NielsRogge commented 1 year ago

Hi,

The design of DETR inherently requires a Transformer to perform self-attention between object queries. The YOLOS model, for instance, removes the convolutional backbone and the Transformer decoder, keeping only a plain ViT encoder, trains with the same set-prediction loss, and obtains comparable average precision (AP) on COCO.

ResNet by itself is just a convolutional backbone, so you still need a detection head on top, as in Mask R-CNN.
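
For intuition, here is a minimal sketch in the spirit of the simplified DETR implementation given in the paper's appendix (class name, hyperparameters, and shapes here are illustrative, not this repo's actual code). The ResNet supplies a feature map, a 1x1 conv projects it to the Transformer width, and a set of learned object queries is decoded into class and box predictions:

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    """Sketch of DETR: CNN backbone -> Transformer -> set-prediction heads."""

    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        # ResNet-50 used purely as a feature extractor (drop avgpool + fc)
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 conv to project 2048-d ResNet features to the Transformer width
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        # standard PyTorch Transformer (encoder + decoder)
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        # learned object queries: one slot per potential detection
        self.query_pos = nn.Parameter(torch.rand(num_queries, hidden_dim))
        # simple learned 2D positional encodings for the flattened feature map
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        # prediction heads: class logits (+1 for "no object") and box coords
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)

    def forward(self, inputs):
        # local features from the CNN: [B, 2048, H/32, W/32]
        x = self.backbone(inputs)
        h = self.conv(x)  # [B, hidden_dim, H', W']
        H, W = h.shape[-2:]
        # build a 2D positional encoding and flatten the grid into a sequence
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)       # [H'*W', 1, hidden_dim]
        src = pos + h.flatten(2).permute(2, 0, 1)   # [H'*W', B, hidden_dim]
        # the decoder attends the object queries to the encoded image sequence
        tgt = self.query_pos.unsqueeze(1).repeat(1, inputs.shape[0], 1)
        h = self.transformer(src, tgt)              # [num_queries, B, hidden_dim]
        return self.linear_class(h), self.linear_bbox(h).sigmoid()
```

Dropping `self.transformer` from this sketch leaves nothing that turns the spatial feature grid into a fixed-size set of per-object predictions, which is why the Transformer (or some other decoder head) is required.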

seaman1900 commented 1 year ago


Thank you for your reply, I now have a deeper understanding of the DETR architecture: the ResNet extracts image features (local features), and the Transformer encodes and decodes those features (this step gives the model a global view of the image).
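
As a shape-level illustration of that global view (hypothetical tensors, not code from this repo), the CNN output is a grid of local features; flattening it into a sequence lets self-attention relate every grid cell to every other one:

```python
import torch
from torch import nn

feats = torch.rand(1, 256, 25, 25)       # CNN output: local features on a 25x25 grid
seq = feats.flatten(2).permute(2, 0, 1)  # [625, 1, 256]: one token per grid cell
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)
out, weights = attn(seq, seq, seq)       # every token attends to all 625 positions
print(weights.shape)                     # torch.Size([1, 625, 625])
```

Each row of the attention matrix spans the whole image, whereas a 3x3 convolution only ever mixes a local neighborhood.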

seaman1900 commented 1 year ago

The image feature extractor (ResNet, or a Transformer such as ViT, ...) is necessary, and the encoder and decoder are also necessary and very important.
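
For reference, the complete pipeline (ResNet backbone + Transformer encoder-decoder + prediction heads) can be loaded pretrained through Torch Hub, as documented in this repo's README; the random input below is just a dummy example:

```python
import torch

# DETR with a ResNet-50 backbone, pretrained on COCO (this repo's Torch Hub entry)
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

with torch.no_grad():
    outputs = model(torch.rand(1, 3, 800, 800))  # dummy image

print(outputs['pred_logits'].shape)  # torch.Size([1, 100, 92]): 100 queries, 91 classes + no-object
print(outputs['pred_boxes'].shape)   # torch.Size([1, 100, 4]): normalized (cx, cy, w, h)
```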