Chasel-Tsui / mmdet-rfla

ECCV22: RFLA
MIT License

How would you apply this to a ViT with no clear ERF / TRF? #68

Open JohnMBrandt opened 1 month ago

JohnMBrandt commented 1 month ago

This work is very helpful for my research. I am training detectors using a ViT backbone. I have used RFLA for both a ResNet and a ViT backbone and I find that in either case it improves the detection accuracy of small objects compared to NWD RKA.

However, this work is built on the ERF / TRF of the ResNet, which is computed from a Gaussian model of the series of convolutional layers in a ResNet. ViTs don't have as clear a way of attributing a receptive field to each pyramid level in an FPN built on the ViT output (e.g. https://openreview.net/pdf?id=Gl8FHfMVTZu). I'm curious whether you have any suggestions for modifying the ERF calculations for a ViT.

Thanks!

Chasel-Tsui commented 2 weeks ago

Very interesting question. At present, it is hard to estimate the effective receptive field for ViTs. If you want to adapt the pipeline to ViT-based methods, a simple solution may be to directly use the receptive fields from this repo (from bottom to top) for the calculation and discard the redundant ones. For example, if you only have 4 FPN levels in the ViT, you can use the receptive field calculation for the lowest 4 levels from our code. However, I am not sure whether this will perform well.
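To make the suggestion concrete, here is a minimal sketch of the idea, not the code from this repo. It computes a theoretical receptive field (TRF) for a stack of convolution-like layers using the standard recursion (each layer adds `(kernel - 1) * accumulated_stride`), then keeps only the lowest levels to match a ViT FPN with fewer levels. The function names `theoretical_rf` and `select_lowest_levels` and the toy layer stack are assumptions made for illustration; the actual per-level values should come from the repo's own receptive field calculation.

```python
from typing import List, Tuple


def theoretical_rf(layers: List[Tuple[int, int]]) -> int:
    """Theoretical receptive field of a stack of (kernel_size, stride) layers."""
    rf, jump = 1, 1  # jump = product of the strides of all earlier layers
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


def select_lowest_levels(rf_per_level: List[int], num_levels: int) -> List[int]:
    """Keep the receptive fields of the lowest `num_levels` pyramid levels.

    `rf_per_level` is ordered from bottom (finest) to top (coarsest).
    """
    if num_levels > len(rf_per_level):
        raise ValueError("requested more levels than are available")
    return rf_per_level[:num_levels]


# Hypothetical toy backbone (NOT the ResNet used in the repo):
# a 7x7 stride-2 conv and a 3x3 stride-2 pool, then one extra
# 3x3 stride-2 layer per additional pyramid level.
stem = [(7, 2), (3, 2)]
rf_per_level = [theoretical_rf(stem + [(3, 2)] * i) for i in range(5)]

# A ViT FPN with only 4 levels reuses the 4 lowest entries.
vit_rf = select_lowest_levels(rf_per_level, 4)
print(rf_per_level, vit_rf)
```

Whether these CNN-derived values are a good prior for ViT features is exactly the open question above, so the result should be validated empirically.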