SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022
MIT License

About the size of neighborhood #27

Closed wangning7149 closed 2 years ago

wangning7149 commented 2 years ago

Hi, for a neighborhood of size L × L, is L here equal to 3?

qwopqwop200 commented 2 years ago

According to the paper, the overall setup follows Swin: Swin uses L = 7, and NAT uses the same. https://github.com/SHI-Labs/Neighborhood-Attention-Transformer/blob/main/classification/nat.py#L259

wangning7149 commented 2 years ago

Isn't NAT computed pixel by pixel? So why are its FLOPs lower than Swin's?


alihassanijr commented 2 years ago

Hello and thank you for your interest.

Firstly, L x L is the term we use to denote kernel (window) size in the paper. Neighborhood size would technically be half the window size, because in theory, each query has L // 2 neighbors on each side of it across each axis, thus L // 2 * 2 neighbors plus itself yields L total pixels across each axis. That's actually why we force kernel size to be specifically odd numbers, so that query pixels can be centered.
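
To make that counting concrete, here is a minimal one-axis sketch (illustrative only; the repository's actual implementation is a CUDA extension, and the function name here is made up): with an odd kernel size L, each query gets L // 2 neighbors per side, and near the borders the window is shifted so every query still attends to exactly L pixels per axis.

```python
def neighborhood_1d(i: int, length: int, kernel_size: int = 7) -> range:
    """Return the indices the query at position i attends to along one axis.

    kernel_size (L) is assumed odd so the query can be centered; near the
    borders the window is shifted (not shrunk), so every query still has
    exactly L neighbors, matching the corner behavior described for NA.
    """
    assert kernel_size % 2 == 1, "kernel size must be odd so queries can be centered"
    half = kernel_size // 2                        # L // 2 neighbors per side
    start = min(max(i - half, 0), length - kernel_size)
    return range(start, start + kernel_size)


# With L = 7 each query sees 7 keys per axis, i.e. 7 x 7 = 49 keys in 2D,
# the same count as one Swin window.
print(list(neighborhood_1d(0, 14)))   # [0, 1, 2, 3, 4, 5, 6]   (shifted at the border)
print(list(neighborhood_1d(7, 14)))   # [4, 5, 6, 7, 8, 9, 10]  (centered on 7)
```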

We followed Swin in setting the window size to 7x7 so that both end up having the same sized receptive fields. In other words, in every attention module, both NA and SWSA limit each query to exactly 7x7 keys and values.

As for the models, we used a configuration that differs from Swin's. We first found overlapping convolutions to be more effective than patched (non-overlapping) convolutions for both tokenization and downsampling. We also found that slightly deeper models, but with thinner inverted bottlenecks, achieve even better performance. That's why our final models end up with fewer FLOPs than their Swin counterparts.
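
For the tokenizer part, here is a rough PyTorch sketch of the two styles being compared (channel counts and layer shapes are illustrative and not taken from the repo; see classification/nat.py for the actual modules): a patched tokenizer uses one non-overlapping strided convolution, while an overlapping tokenizer stacks small strided convolutions whose receptive fields overlap, at the same output resolution.

```python
import torch
import torch.nn as nn

# Patched (Swin-style) tokenization: a single non-overlapping 4x4 conv, stride 4.
patched_tokenizer = nn.Conv2d(3, 96, kernel_size=4, stride=4)

# Overlapping tokenization (in the spirit of NAT's tokenizer; exact layers may
# differ): two 3x3 convs with stride 2 and padding 1, so neighboring tokens
# share overlapping receptive fields.
overlapping_tokenizer = nn.Sequential(
    nn.Conv2d(3, 48, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(48, 96, kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)
print(patched_tokenizer(x).shape)      # torch.Size([1, 96, 56, 56])
print(overlapping_tokenizer(x).shape)  # torch.Size([1, 96, 56, 56])
```

Both map a 224×224 image to the same 56×56 token grid; the overlap between neighboring tokens is the difference referred to above.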

We've done an ablation study on these changes, which is presented in the paper.

I hope this answers both of your questions.

alihassanijr commented 2 years ago

Closing this due to inactivity. If you still have questions feel free to open it back up.