weigq closed this issue 2 years ago
I am also interested in this claim, but did not find an ablation study on it.
Hello, and thank you for your interest.
Generally, we observed on-par or worse performance with zero padding, and the gap widened as we scaled up or moved to downstream tasks. I should also note that with zero padding, the module is no longer as expressive as Swin's SWA, because of the reduced receptive field at the borders. Additionally, with zero padding, the attention mechanism is no longer equivalent to self-attention when the neighborhood size matches the window size. In other words, zero padding is simply less expressive, and at best saves a negligible amount of compute, even with the CUDA kernel.
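To make the receptive-field difference concrete, here is a minimal 1-D sketch of the two neighbor-selection schemes, assuming the border behavior described above (the function names are illustrative, not from the codebase). With edge/corner selection, the window is shifted inward at the borders so every query always attends to exactly `k` real positions; with zero padding, the window stays centered and border queries lose neighbors to padding.

```python
def clamped_neighborhood(i, length, k):
    """Edge/corner-style selection: shift the window inward at the
    borders so every query has exactly k real (in-bounds) neighbors."""
    start = min(max(i - k // 2, 0), length - k)
    return list(range(start, start + k))

def zero_padded_neighborhood(i, length, k):
    """Zero-padding-style selection: keep the window centered on i;
    out-of-bounds positions carry no information, so border queries
    effectively see fewer real neighbors."""
    return [j for j in range(i - k // 2, i - k // 2 + k) if 0 <= j < length]

# Border query (i = 0) on a length-6 feature map with k = 3:
print(clamped_neighborhood(0, 6, 3))      # 3 real neighbors
print(zero_padded_neighborhood(0, 6, 3))  # only 2 real neighbors

# When k equals the feature-map size, clamped selection makes every
# query attend to every position, i.e. it reduces to self-attention;
# the zero-padded variant does not.
print(clamped_neighborhood(0, 6, 6))
print(zero_padded_neighborhood(0, 6, 6))
```

This is also why the zero-padded variant loses the self-attention equivalence: at the borders, part of its attention window is spent on padding rather than real tokens.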
We may add our findings regarding the zero padding version in our supplementary materials in future releases.
I hope this helps.
Thanks
Excellent work! BTW, the paper claims that the proposed edge/corner neighborhood selection has stronger performance than the zero-padding version. I wonder about the performance of the latter, which is not reported in the paper?