Hi,
The design of DETR inherently requires a Transformer to perform self-attention between object queries. The YOLOS model, for instance, removes the convolutional backbone + Transformer decoder and trains the model with the same loss function, obtaining the same average precision (AP) on COCO.
ResNet itself is just a convolutional backbone, so you still need a decoder head on top, like Mask R-CNN.
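To make the comparison concrete, here is a minimal sketch (an illustration, not taken from this thread) that assumes the 🤗 Transformers library and the `facebook/detr-resnet-50` and `hustvl/yolos-small` checkpoints: DETR keeps a ResNet backbone plus a Transformer encoder-decoder with object queries, while YOLOS is a plain ViT encoder whose detection tokens play the role of the queries; both are trained with the same bipartite-matching loss and emit a fixed set of (class, box) predictions.

```python
# Minimal sketch, assuming `transformers`, `torch`, `PIL` and `requests` are installed
# and the two checkpoints below are available.
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, DetrForObjectDetection, YolosForObjectDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# DETR: CNN backbone (ResNet-50) + Transformer encoder-decoder with learned object queries.
detr_processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50")
detr = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

# YOLOS: ViT encoder only, no CNN backbone and no Transformer decoder;
# detection tokens take over the role of the object queries.
yolos_processor = AutoImageProcessor.from_pretrained("hustvl/yolos-small")
yolos = YolosForObjectDetection.from_pretrained("hustvl/yolos-small")

with torch.no_grad():
    detr_out = detr(**detr_processor(images=image, return_tensors="pt"))
    yolos_out = yolos(**yolos_processor(images=image, return_tensors="pt"))

# Both output one (class, box) prediction per query / detection token, no NMS needed.
print(detr_out.logits.shape, detr_out.pred_boxes.shape)
print(yolos_out.logits.shape, yolos_out.pred_boxes.shape)
```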
Thank you for your reply, I now have a deeper understanding of the DETR architecture. The ResNet is used to extract image features (local features), while the Transformer encodes and decodes those features (this step gives the model a global view of the image).
So the image feature extractor (ResNet, or a Transformer such as ViT) is necessary, and the encoder and decoder are also necessary and very important.
However, in the paper I don't find any explanation of the benefit of the Transformer part; maybe DETR would still work well without the Transformer (just guessing).
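To make those two roles concrete, below is a minimal DETR-style forward pass in plain PyTorch (a sketch only, with illustrative hyperparameters and learned positional embeddings instead of the sine embeddings; it is not the official implementation and has no matching loss). The ResNet produces the grid of local features, the encoder mixes them globally with self-attention, and the decoder's learned object queries attend to that global context to each emit one (class, box) prediction, which is the part a backbone alone cannot do.

```python
# Minimal DETR-style forward pass, for illustration only.
import torch
import torch.nn as nn
import torchvision

class MiniDETR(nn.Module):
    def __init__(self, num_classes=91, hidden_dim=256, num_queries=100, nheads=8):
        super().__init__()
        # 1) CNN backbone: extracts a grid of local features.
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)          # project to hidden_dim

        # 2) Transformer: encoder = global self-attention over all image tokens,
        #    decoder = learned object queries attending to the image and to each other.
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers=6, num_decoder_layers=6)
        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))

        # Simple learned 2D positional embeddings (the real model uses sine embeddings).
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

        # 3) Prediction heads: one class + one box per query (set prediction, no NMS).
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no object"
        self.bbox_head = nn.Linear(hidden_dim, 4)

    def forward(self, images):                        # images: (B, 3, H, W)
        feats = self.proj(self.backbone(images))      # (B, hidden_dim, h, w) local features
        B, C, h, w = feats.shape
        pos = torch.cat([
            self.col_embed[:w].unsqueeze(0).repeat(h, 1, 1),
            self.row_embed[:h].unsqueeze(1).repeat(1, w, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)          # (h*w, 1, hidden_dim)
        src = feats.flatten(2).permute(2, 0, 1) + pos  # (h*w, B, hidden_dim) image tokens
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)  # (num_queries, B, hidden_dim)
        hs = self.transformer(src, tgt)                # one output vector per object query
        return self.class_head(hs), self.bbox_head(hs).sigmoid()

model = MiniDETR()
logits, boxes = model(torch.randn(1, 3, 800, 800))
print(logits.shape, boxes.shape)  # (100, 1, 92) and (100, 1, 4)
```

So the queries (and their self-attention in the decoder) are what turn the local feature grid into a fixed-size set of per-object predictions, which is why the Transformer part cannot simply be dropped from DETR.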