SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022
MIT License

Motivation on choosing NAT depth #67

Closed: oksanadanilova closed this issue 2 years ago

oksanadanilova commented 2 years ago

Hi

Great generalizing research and inspiring results, thank you for the work.

Could you please explain your motivation for choosing a depth of [3, 4, 18, 5] for NAT (link), while the depth for DiNAT is [2, 2, 18, 2] (link)?

The paper A ConvNet for the 2020s (FAIR) introduces an intuition by which the classical ResNet depth of [3, 4, 6, 3] is replaced with a depth proportional to that of Swin-T (1:1:3:1), setting it to [3, 3, 9, 3]. The paper showed that adjusting the depths has a significant impact.

I see that your DiNAT depth is proportional to the larger Swin variants (1:1:9:1), while the NAT depth is proportional to neither the ResNet depth nor the Swin depth. Have you conducted any research on this issue?

I can't sleep without understanding the motivation for choosing such unusual depths for NAT.

alihassanijr commented 2 years ago

Hello and thank you for your interest.

To clarify, DiNAT is identical to NAT in terms of depth, dimensions, and pretty much everything else. The only difference is that DiNAT dilates the NA module in half of the layers: Check these lines for NAT models, and these lines for their DiNAT equivalents.
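
For illustration, here is a minimal Python sketch of what an alternating dilation schedule could look like; the per-stage maximum dilation values below are assumptions for illustration, not the repository's actual configuration.

```python
# Hypothetical sketch: build a per-layer dilation schedule in which half of the
# layers in each stage use plain neighborhood attention (dilation 1) and the
# other half use dilated neighborhood attention.
# The max_dilations values are illustrative only.
def dilation_schedule(depths, max_dilations):
    schedule = []
    for depth, max_dilation in zip(depths, max_dilations):
        # Alternate: even-indexed layers stay local, odd-indexed layers dilate.
        schedule.append([1 if i % 2 == 0 else max_dilation for i in range(depth)])
    return schedule

# Example with NAT-Tiny-like depths and assumed per-stage max dilations.
print(dilation_schedule(depths=[3, 4, 18, 5], max_dilations=[8, 4, 2, 1]))
```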

The model you referenced that has [2, 2, 18, 2] layers is DiNAT_s, which is the alternative DiNAT model we introduced. DiNAT_s is identical in terms of structure to Swin Transformer. We discuss why we explored this alternative model in the DiNAT paper.

The main difference is that the NAT/DiNAT architecture utilizes overlapping convolutions instead of non-overlapping ones to downsample images. Therefore, the single 4x4 convolution in the tokenizer (PatchEmbed) is replaced with two 3x3 convolutions with 2x2 strides (which yield same-sized outputs, but are slightly more expensive). Additionally, the 2x2 non-overlapping convolutions between the levels (PatchMerge) are also replaced with 3x3 overlapping convolutions.
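
As a rough PyTorch sketch of that difference (the function names and defaults here are mine for illustration, not the repository's actual implementation):

```python
import torch.nn as nn

# Swin-style non-overlapping tokenizer: one 4x4 convolution with stride 4.
def nonoverlapping_tokenizer(in_chans=3, embed_dim=96):
    return nn.Conv2d(in_chans, embed_dim, kernel_size=4, stride=4)

# NAT/DiNAT-style overlapping tokenizer: two 3x3 convolutions with stride 2,
# which also downsample by 4x overall, producing the same output resolution
# but with overlapping receptive fields (and a slightly higher cost).
def overlapping_tokenizer(in_chans=3, embed_dim=96):
    return nn.Sequential(
        nn.Conv2d(in_chans, embed_dim // 2, kernel_size=3, stride=2, padding=1),
        nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
    )

# Between stages: instead of merging 2x2 non-overlapping patches, downsample
# with an overlapping 3x3 convolution with stride 2.
def overlapping_downsampler(dim):
    return nn.Conv2d(dim, 2 * dim, kernel_size=3, stride=2, padding=1)
```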

If only this change is applied to the model (which we discuss in the NAT paper), you'll end up with a model that's roughly 30M parameters and 4.9 GFLOPs (ImageNet, 224x224), which is not really a fair comparison to Swin-Tiny and ConvNeXt-Tiny as far as compute requirements go. Therefore, we had to change the number of layers and inverted bottleneck sizes to adjust for that. We tried a few options but finally settled on this design choice, which is:

[3, 4, 6X, 5] instead of [2, 2, 6X, 2] in Swin, or [3, 3, 9X, 3] in ConvNeXt.
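
To make the notation concrete, here is a small sketch that expands those depth templates for a third-stage multiplier X; pairing X=3 with the [3, 4, 18, 5] and [2, 2, 18, 2] configurations mentioned above is my reading of the formula, not a claim from the paper.

```python
# Sketch: expand the depth templates above for a stage-3 multiplier X.
def expand_depths(template, x):
    a, b, c, d = template
    return [a, b, c * x, d]

SWIN_TEMPLATE     = (2, 2, 6, 2)   # [2, 2, 6X, 2]
CONVNEXT_TEMPLATE = (3, 3, 9, 3)   # [3, 3, 9X, 3]
NAT_TEMPLATE      = (3, 4, 6, 5)   # [3, 4, 6X, 5]

print(expand_depths(NAT_TEMPLATE, 3))       # [3, 4, 18, 5], the NAT depth from the question
print(expand_depths(SWIN_TEMPLATE, 3))      # [2, 2, 18, 2]
print(expand_depths(CONVNEXT_TEMPLATE, 3))  # [3, 3, 27, 3]
```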

I hope that answers your question.

oksanadanilova commented 2 years ago

Thank you for your patience! I carefully studied both papers, except for the appendix where DiNAT_s was introduced :) A lot of work has been done.

Earlier I was discussing the point that ConvNeXt's intuition of changing the number of layers per stage to make them proportional to the Swin configuration isn't entirely clear. I was surprised to see a similar proportion here, but I was not attentive enough when reading your code and the paper's appendix.

However, it is very helpful that you confirmed here that you experimented with the number of layers! Thank you :)