SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022
MIT License

Comparison with zero-padding version. #3

Closed weigq closed 2 years ago

weigq commented 2 years ago

Excellent work! BTW, the paper claims that the proposed edge/corner neighborhood selection performs better than the zero-padding version. I am curious about the performance of the latter, which is not reported in the paper.

XiaoyuShi97 commented 2 years ago

I am also interested in this claim, but did not find an ablation study on it.

alihassanijr commented 2 years ago

Hello, and thank you for your interest.

Generally we observed on-par or worse performance when using zero padding, and the gap widened as we scaled up or moved to downstream tasks. I should also note that with zero padding, the module would no longer be as expressive as Swin's SWA, because of the reduced receptive field size. Additionally, with zero padding, the attention mechanism would not be equivalent to self-attention when the neighborhood size matches the feature map size. In other words, zero padding is simply less expressive, and at best saves a negligible amount of compute, even with the CUDA kernel.
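
For intuition, here is a minimal 1D sketch of the two neighborhood definitions; the helper names are hypothetical and this is not the repository's CUDA kernel. The NA-style window shifts inward at the edges so every query attends to exactly `k` real tokens, whereas a zero-padded window stays centered and loses real context near the borders.

```python
# Minimal 1D illustration (hypothetical helpers, not the repo's CUDA kernel)
# of the two ways to pick a query's neighborhood near sequence boundaries.

def shifted_neighborhood(i, length, k):
    # NA-style selection: near the edges the window shifts inward,
    # so every query attends to exactly k real tokens.
    start = min(max(i - k // 2, 0), length - k)
    return list(range(start, start + k))

def zero_padded_neighborhood(i, length, k):
    # Zero-padding-style selection: the window stays centered on i;
    # out-of-bounds positions would be zeros, so fewer real tokens remain.
    return [j for j in range(i - k // 2, i - k // 2 + k) if 0 <= j < length]

if __name__ == "__main__":
    length, k = 6, 5
    for i in range(length):
        print(i, shifted_neighborhood(i, length, k),
              zero_padded_neighborhood(i, length, k))
    # With k == length, shifted_neighborhood returns every position
    # (equivalent to self-attention); zero_padded_neighborhood still
    # drops real context at the boundaries.
```

Setting `k` equal to the sequence length in this sketch shows the shifted version attending to every token, matching self-attention, while the zero-padded version still drops context at the edges, which is the loss of expressivity described above.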

We may add our findings regarding the zero padding version in our supplementary materials in future releases.

I hope this helps.

weigq commented 2 years ago

Thanks