A question about MaskRCNN with PatchConvNet?

facebookresearch / deit

Official DeiT repository

Apache License 2.0


A question about MaskRCNN with PatchConvNet? #138

Closed by xiaohu2015 2 years ago

xiaohu2015 commented 2 years ago

The paper "Augmenting Convolutional networks with attention-based aggregation" presents a simple PatchConvNet. However, PatchConvNet only outputs a single feature map at 1/16 of the original image size, while Mask R-CNN needs multi-level features, e.g. P2, P3, P4, P5. How can PatchConvNet be adapted to Mask R-CNN? Do we need to downsample or upsample the output of PatchConvNet to get multi-level features? @jegou @TouvronHugo @Celebio

jegou commented 2 years ago

Dear @xiaohu2015,

You can use the same method as in the XCiT paper (https://arxiv.org/abs/2106.09681), where we up- or down-scaled intermediate feature maps to match the resolutions that Mask R-CNN expects.

The following papers have actually adopted this technique:

"Benchmarking Detection Transfer Learning with Vision Transformers" https://arxiv.org/pdf/2111.11429.pdf
The BEiT paper: https://arxiv.org/pdf/2106.08254.pdf (the paper does not explain how this was done, but the code shows that they adopted the method from XCiT).
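For readers who want a concrete picture, the up-/down-scaling mentioned above could be sketched roughly as follows in PyTorch. This is only an illustration, not the XCiT or BEiT code: the class name `SimplePyramid`, the choice of transposed convolutions for upsampling and max pooling for downsampling, the channel width of 64, and the assumption that four intermediate block outputs (all at stride 16) are available are mine.

```python
import torch
import torch.nn as nn


class SimplePyramid(nn.Module):
    """Hypothetical helper: rescale four stride-16 maps to strides 4, 8, 16, 32."""

    def __init__(self, dim: int):
        super().__init__()
        # Stride 16 -> 4: two learned 2x upsampling steps (assumed design).
        self.to_p2 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2),
            nn.BatchNorm2d(dim),
            nn.GELU(),
            nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2),
        )
        # Stride 16 -> 8: one learned 2x upsampling step.
        self.to_p3 = nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2)
        # Stride 16 is kept unchanged.
        self.to_p4 = nn.Identity()
        # Stride 16 -> 32: 2x downsampling by max pooling.
        self.to_p5 = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, feats):
        # feats: four tensors of shape (B, C, H/16, W/16), assumed to come
        # from intermediate blocks of the backbone.
        heads = (self.to_p2, self.to_p3, self.to_p4, self.to_p5)
        return [head(f) for head, f in zip(heads, feats)]


# Dummy inputs standing in for backbone outputs on a 224x224 image
# (224 / 16 = 14); the channel width 64 is arbitrary.
feats = [torch.randn(2, 64, 14, 14) for _ in range(4)]
pyramid = SimplePyramid(dim=64)
outs = pyramid(feats)
print([tuple(o.shape[-2:]) for o in outs])
# -> [(56, 56), (28, 28), (14, 14), (7, 7)]
```

The four outputs would then be passed to the FPN neck of a Mask R-CNN implementation in place of the usual C2 to C5 features of a hierarchical backbone.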

xiaohu2015 commented 2 years ago

OK, thanks