Closed by kuozhijishou 3 years ago
Yes. The authors also used it for object detection by replacing the backbone of the respective object detector with the Swin Transformer. Depending on the detector, I assume you need to either extract the feature map after the last stage (before the global average pooling and linear layer) or, for an FPN-based approach, extract the feature maps from each stage, as is done for ResNet. In general, though, just swapping the backbone and leaving everything else the same should work.
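To make the two options concrete, here is a rough sketch of the feature map shapes each stage would hand to a detector. It is not code from the repository: `swin_stage_shapes` is a hypothetical helper, and the defaults assume the Swin-T settings from the paper (patch size 4, embedding dimension 96, four stages, with each patch merging step halving the resolution and doubling the channels).

```python
def swin_stage_shapes(img_size=224, patch_size=4, embed_dim=96, num_stages=4):
    """Return (channels, height, width) of the feature map after each stage.

    Hypothetical helper; defaults assume Swin-T and a square input image.
    """
    shapes = []
    for i in range(num_stages):
        stride = patch_size * 2 ** i   # total downsampling: 4, 8, 16, 32
        channels = embed_dim * 2 ** i  # channel width: 96, 192, 384, 768
        side = img_size // stride
        shapes.append((channels, side, side))
    return shapes


shapes = swin_stage_shapes()
fpn_inputs = shapes        # FPN-style detector: one feature map per stage
single_input = shapes[-1]  # single-scale detector: last stage only

print(fpn_inputs)    # [(96, 56, 56), (192, 28, 28), (384, 14, 14), (768, 7, 7)]
print(single_input)  # (768, 7, 7)
```

The strides (4, 8, 16, 32) match what a ResNet exposes from its four stages, which is why swapping the backbone mostly comes down to telling the FPN the new channel counts.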
Can this be used for object detection? I couldn't get it to work.