How would you apply this to a ViT with no clear ERF / TRF?

Chasel-Tsui / mmdet-rfla

ECCV22: RFLA

MIT License


This work has been very helpful for my research. I am training detectors with a ViT backbone. I have used RFLA with both a ResNet and a ViT backbone, and in both cases it improves small-object detection accuracy compared to NWD-RKA.

However, RFLA is built on the ERF / TRF of a ResNet, which is computed from a Gaussian model of the series of convolutional layers in the ResNet. ViTs don't offer as clear a way to attribute a receptive field to each pyramid level of an FPN built on the ViT output (e.g. https://openreview.net/pdf?id=Gl8FHfMVTZu). Do you have any suggestions for modifying the ERF calculation for a ViT?
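To make the gap concrete, here is a minimal sketch of the standard receptive-field recurrence for a plain stack of convolution / pooling layers. It is not code from the RFLA repository: the function name `theoretical_receptive_field` and the `(kernel_size, stride)` layer lists are hypothetical, and dilation and padding are ignored.

```python
from typing import Iterable, Tuple


def theoretical_receptive_field(layers: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (trf, jump) in input pixels after a stack of (kernel_size, stride) layers.

    Hypothetical helper for illustration only; assumes no dilation.
    """
    trf, jump = 1, 1  # a single input pixel sees itself; adjacent pixels are 1 apart
    for kernel_size, stride in layers:
        # Each layer widens the field by (kernel_size - 1) steps of the current jump.
        trf += (kernel_size - 1) * jump
        # The stride multiplies the spacing between neighbouring feature points.
        jump *= stride
    return trf, jump


# Illustrative ResNet-style stem: 7x7 conv (stride 2), then 3x3 max-pool (stride 2).
stem = [(7, 2), (3, 2)]
print(theoretical_receptive_field(stem))

# A ViT patch embedding treated as a single 16x16 conv with stride 16.
patch_embed = [(16, 16)]
print(theoretical_receptive_field(patch_embed))
```

For the patch embedding this recurrence only describes a single patch. After that, global self-attention lets every token attend to the whole image in principle, so every FPN level built on the ViT output would get the same nominal TRF, which is why I'm unsure what to plug in.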

Thanks!


How would you apply this to a ViT with no clear ERF / TRF? #68