facebookresearch / detr

End-to-End Object Detection with Transformers
Apache License 2.0

Does anyone know the performance of DETR without the Transformer part (only with ResNet)? #559

seaman1900 closed this issue 1 year ago

seaman1900 commented 1 year ago

In the paper I don't find any explanation of the benefit of the Transformer part; maybe DETR would still work well without the Transformer (just guessing).

NielsRogge commented 1 year ago

Hi,

The design of DETR inherently requires a Transformer to perform self-attention between object queries. The YOLOS model, for instance, removes the convolutional backbone and the Transformer decoder, keeping only a plain ViT encoder, trains with the same set-prediction loss, and obtains comparable average precision (AP) on COCO.

ResNet by itself is just a convolutional backbone, so you still need a detection head on top, as in Mask R-CNN.
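
For intuition, here is a minimal sketch in the spirit of the simplified DETR implementation given in the paper's appendix (class name, hyperparameters, and shapes here are illustrative, not this repo's actual code). The ResNet supplies a feature map, a 1x1 conv projects it to the Transformer width, and a set of learned object queries is decoded into class and box predictions:

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    """Sketch of DETR: CNN backbone -> Transformer -> set-prediction heads."""

    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        # ResNet-50 used purely as a feature extractor (drop avgpool + fc)
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 conv to project 2048-d ResNet features to the Transformer width
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        # standard PyTorch Transformer (encoder + decoder)
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        # learned object queries: one slot per potential detection
        self.query_pos = nn.Parameter(torch.rand(num_queries, hidden_dim))
        # simple learned 2D positional encodings for the flattened feature map
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        # prediction heads: class logits (+1 for "no object") and box coords
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)

    def forward(self, inputs):
        # local features from the CNN: [B, 2048, H/32, W/32]
        x = self.backbone(inputs)
        h = self.conv(x)  # [B, hidden_dim, H', W']
        H, W = h.shape[-2:]
        # build a 2D positional encoding and flatten the grid into a sequence
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)       # [H'*W', 1, hidden_dim]
        src = pos + h.flatten(2).permute(2, 0, 1)   # [H'*W', B, hidden_dim]
        # the decoder attends the object queries to the encoded image sequence
        tgt = self.query_pos.unsqueeze(1).repeat(1, inputs.shape[0], 1)
        h = self.transformer(src, tgt)              # [num_queries, B, hidden_dim]
        return self.linear_class(h), self.linear_bbox(h).sigmoid()
```

Dropping `self.transformer` from this sketch leaves nothing that turns the spatial feature grid into a fixed-size set of per-object predictions, which is why the Transformer (or some other decoder head) is required.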

seaman1900 commented 1 year ago


Thank you for your reply, I now have a deeper understanding of the DETR architecture: the ResNet extracts image features (local features), and the Transformer encodes and decodes those features (this step gives the model a global view of the image).
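
As a shape-level illustration of that global view (hypothetical tensors, not code from this repo), the CNN output is a grid of local features; flattening it into a sequence lets self-attention relate every grid cell to every other one:

```python
import torch
from torch import nn

feats = torch.rand(1, 256, 25, 25)       # CNN output: local features on a 25x25 grid
seq = feats.flatten(2).permute(2, 0, 1)  # [625, 1, 256]: one token per grid cell
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)
out, weights = attn(seq, seq, seq)       # every token attends to all 625 positions
print(weights.shape)                     # torch.Size([1, 625, 625])
```

Each row of the attention matrix spans the whole image, whereas a 3x3 convolution only ever mixes a local neighborhood.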

seaman1900 commented 1 year ago

The image feature extractor (ResNet, or a Transformer such as ViT, ...) is necessary, and the encoder and decoder are also necessary and very important.
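
For reference, the complete pipeline (ResNet backbone + Transformer encoder-decoder + prediction heads) can be loaded pretrained through Torch Hub, as documented in this repo's README; the random input below is just a dummy example:

```python
import torch

# DETR with a ResNet-50 backbone, pretrained on COCO (this repo's Torch Hub entry)
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

with torch.no_grad():
    outputs = model(torch.rand(1, 3, 800, 800))  # dummy image

print(outputs['pred_logits'].shape)  # torch.Size([1, 100, 92]): 100 queries, 91 classes + no-object
print(outputs['pred_boxes'].shape)   # torch.Size([1, 100, 4]): normalized (cx, cy, w, h)
```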