A CNN that performs at or above the level of Swin Transformer. It is fast not only in terms of FLOPs but also in actual inference speed.

At a time when a lot of people were asking what exactly makes transformers so good (#47), I think this is the paper that settled the question.

ConvNeXts

Start from ResNet50 and apply the "modern techniques" used up through Swin, one at a time. The very first step: match the training procedure.

training procedure

As RSB (#46) showed, this matters a great deal. Here they use a training procedure similar to DeiT and Swin. 76.1 => 78.8

The final settings are as follows (precious material... truly precious....!)

Optimizer / Scheduler
- AdamW
- 300 epochs
- LR 4e-3
- 20-epoch linear warmup
- cosine decay
- weight decay 0.05
Batch size 4096
EMA used ==> reportedly prevents overfitting in larger models.
Augmentation
- Mixup
- Cutmix
- RandAugment
- RandomErasing
Regularization
- Stochastic Depth
- Label Smoothing
Layer Scale used
- https://arxiv.org/pdf/2103.17239.pdf
- multiplies the residual block output by a learnable diagonal matrix
- initial value set to 1e-6
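
Two items from the recipe above are easy to pin down in a few lines: the warmup plus cosine schedule, and Layer Scale. This is only a minimal sketch under my own assumptions. The function names are made up, the schedule is evaluated per epoch rather than per step, and `min_lr` is an assumed floor that the notes do not specify.

```python
import math

BASE_LR = 4e-3       # from the recipe above
WARMUP_EPOCHS = 20   # linear warmup length from the recipe above
TOTAL_EPOCHS = 300


def lr_at_epoch(epoch: float, min_lr: float = 0.0) -> float:
    """Linear warmup, then cosine decay. min_lr is an assumption, not from the notes."""
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return min_lr + 0.5 * (BASE_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


def layer_scale_residual(x, branch_out, gamma):
    """x + diag(gamma) * branch_out, written per channel.

    gamma is the learnable diagonal; it starts at 1e-6 per channel,
    so every block begins very close to an identity mapping.
    """
    return [xi + g * bi for xi, bi, g in zip(x, branch_out, gamma)]


if __name__ == "__main__":
    for epoch in (0, 10, 20, 160, 300):
        print(epoch, lr_at_epoch(epoch))
    print(layer_scale_residual([1.0, 2.0], [10.0, -10.0], [1e-6, 1e-6]))
```

With gamma at 1e-6 the residual branch contributes almost nothing at the start of training, which is presumably why it helps with stability in deeper models.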

After writing all of this up, I noticed they also made a table lol. The tables below should be enough for reference.

Macro Design

Two macro design choices are taken from Swin.

Changing stage compute ratio.
Changing stem to “Patchify”.

ResNet-50 has (3, 4, 6, 3) blocks per stage, whereas Swin uses a stage compute ratio of 1:1:3:1 (Swin-T) or 1:1:9:1 (larger models).

Changing the block counts to follow Swin, (3, 4, 6, 3) to (3, 3, 9, 3): 78.8 => 79.4. Switching the stem to patch input (4×4 conv, stride 4): 79.4 => 79.5
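
A quick arithmetic check of the two macro changes. It is a sketch with helper names of my own choosing, assuming a 224×224 input and the standard ResNet stem (7×7 conv with stride 2 followed by a 3×3 max-pool with stride 2).

```python
RESNET50_DEPTHS = (3, 4, 6, 3)
SWIN_LIKE_DEPTHS = (3, 3, 9, 3)  # block counts after matching the 1:1:3:1 ratio


def stage_ratio(depths):
    """Express block counts relative to the first stage."""
    base = depths[0]
    return tuple(d / base for d in depths)


def conv_out_size(size, kernel, stride, padding=0):
    """Output side length of a conv or pooling layer (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


# ResNet stem: 7x7 conv, stride 2, padding 3, then 3x3 max-pool, stride 2, padding 1
resnet_stem_out = conv_out_size(conv_out_size(224, 7, 2, 3), 3, 2, 1)
# "Patchify" stem: non-overlapping 4x4 conv with stride 4
patchify_out = conv_out_size(224, 4, 4)

if __name__ == "__main__":
    print(stage_ratio(RESNET50_DEPTHS))
    print(stage_ratio(SWIN_LIKE_DEPTHS))
    print(resnet_stem_out, patchify_out)
```

Both stems downsample by 4x in total (224 to 56), so the patchify change only swaps how that reduction is done. That may be part of why the gain is so small.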

ResNextify

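Using depthwise conv cuts computation, and the network width is then increased to match Swin-T (64 to 96 channels). Depthwise conv is very similar to the weighted-sum operation in self-attention: both mix information only in the spatial dimension, one channel at a time.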
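
To see how much computation depthwise conv saves, count multiply-accumulates (MACs) per output position. This is a rough sketch: the helper name is hypothetical, and the 96-channel 3×3 layer is just an illustrative choice.

```python
def conv_macs_per_position(c_in, c_out, k, groups=1):
    """MACs needed to produce all output channels at one spatial position."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k


channels = 96
dense = conv_macs_per_position(channels, channels, 3)                       # regular conv
depthwise = conv_macs_per_position(channels, channels, 3, groups=channels)  # one filter per channel

if __name__ == "__main__":
    print("dense 3x3     :", dense)
    print("depthwise 3x3 :", depthwise)
    print("reduction     :", dense // depthwise, "x")
```

The saving equals the channel count, which is what leaves room to widen the network afterwards.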

Inverted BottleNeck

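Transformer blocks already use something similar to an inverted bottleneck: the hidden dimension of the MLP is 4x wider than its input. Adopting it here lowered the FLOPs of the whole network, even though the depthwise layer itself gets more expensive, because the 1×1 shortcut convs in the downsampling residual blocks become much cheaper.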

Large Kernel Sizes

Self-attention (as in ViT) has a receptive field covering the whole image. Even Swin uses windows of at least 7x7, and that is way bigger than 3x3, isn't it?!

Moving up depthwise conv layer

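Could we save computation by moving the depthwise conv, the layer that will carry the large receptive field, up to the top of the block? The authors say this design is also used in transformers. ~The MSA comes before the MLP layer, doesn't it...!!~ In a way it is obvious. Why bother running the depthwise conv while the channels are expanded inside the inverted bottleneck?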

(a) ResNeXt block (b) ResNeXt-like block turned into an inverted bottleneck (c) the newly proposed block, with the depthwise conv moved up
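
The saving from moving the depthwise conv up can be checked by counting MACs per spatial position for variants (b) and (c). This is only a sketch: the width of 96 and the 4x expansion are illustrative values, and the helper names are mine.

```python
def pointwise_macs(c_in, c_out):
    """MACs of a 1x1 conv at one spatial position."""
    return c_in * c_out


def depthwise_macs(channels, k):
    """MACs of a k x k depthwise conv at one spatial position."""
    return channels * k * k


DIM, HIDDEN = 96, 96 * 4  # illustrative width and 4x expansion

# (b) inverted bottleneck: expand -> depthwise on the WIDE tensor -> project
block_b = pointwise_macs(DIM, HIDDEN) + depthwise_macs(HIDDEN, 3) + pointwise_macs(HIDDEN, DIM)
# (c) depthwise moved up: depthwise on the NARROW tensor -> expand -> project
block_c = depthwise_macs(DIM, 3) + pointwise_macs(DIM, HIDDEN) + pointwise_macs(HIDDEN, DIM)

if __name__ == "__main__":
    print("(b) total:", block_b, "depthwise part:", depthwise_macs(HIDDEN, 3))
    print("(c) total:", block_c, "depthwise part:", depthwise_macs(DIM, 3))
```

The two 1×1 convs cost the same in both variants. Only the depthwise part shrinks, by the expansion factor of 4.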

Increasing the kernel size

With the computation saved, they tried a range of kernel sizes. The 7x7 kernel worked quite well at roughly the same FLOPs.
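
Why a bigger kernel barely moves the FLOPs: once the depthwise conv sits on the narrow tensor, it is a small share of the block. A rough sketch, again assuming width 96 with 4x expansion; the list of kernel sizes is just for illustration.

```python
DIM, HIDDEN = 96, 96 * 4  # illustrative width and 4x expansion


def depthwise_macs(channels, k):
    """MACs of a k x k depthwise conv at one spatial position."""
    return channels * k * k


# Cost of the two 1x1 convs (expand + project), independent of kernel size
POINTWISE_PAIR = DIM * HIDDEN + HIDDEN * DIM


def depthwise_share(k):
    """Fraction of the block's MACs spent in the depthwise conv."""
    dw = depthwise_macs(DIM, k)
    return dw / (dw + POINTWISE_PAIR)


if __name__ == "__main__":
    for k in (3, 5, 7, 9, 11):
        print(f"{k}x{k}: dw MACs = {depthwise_macs(DIM, k):6d}, share = {depthwise_share(k):.1%}")
    # For comparison: 3x3 depthwise on the wide (expanded) tensor
    print("3x3 on wide tensor:", depthwise_macs(HIDDEN, 3))
```

Under these assumptions a 7x7 depthwise conv on 96 channels (4704 MACs) costs about as much as the 3x3 one on 384 channels did (3456 MACs), which fits the "roughly the same FLOPs" observation.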

Micro Design

The figure with all micro design changes applied.

It helps to look at the figure while reading the notes below.
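
As a textual stand-in for the figure, here is how the finished block might be laid out, with a parameter count per layer. Treat it as a sketch under assumptions: width 96, 4x expansion, biases on every conv, and a hypothetical function name. The residual add comes after Layer Scale.

```python
def convnext_block_layers(dim, expansion=4, kernel=7):
    """Ordered (layer name, parameter count) pairs for one block."""
    hidden = dim * expansion
    return [
        (f"depthwise conv {kernel}x{kernel}", dim * kernel * kernel + dim),  # weights + bias
        ("LayerNorm", 2 * dim),                                              # scale + shift
        ("1x1 conv (expand)", dim * hidden + hidden),
        ("GELU", 0),                                                         # the only activation
        ("1x1 conv (project)", hidden * dim + dim),
        ("Layer Scale", dim),                                                # diagonal gamma
    ]


if __name__ == "__main__":
    layers = convnext_block_layers(96)
    for name, params in layers:
        print(f"{name:24s} {params:6d}")
    print("total", sum(p for _, p in layers))
```

Note there is exactly one normalization layer and one activation in the list, and nearly all parameters sit in the two 1×1 convs.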

Replacing ReLU with GELU

Swapping in GELU works fine, but accuracy stays the same (80.6%, no gain)
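
For reference, the exact (erf-based) GELU next to ReLU. A minimal sketch using only the standard library.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}")
```

GELU is a smooth version of ReLU that lets small negative values through, and the two nearly coincide for large |x|.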

Fewer activation functions

Come to think of it, a transformer block only uses one activation, so let's use just one too! (+0.7%)

Fewer normalization layers

Keeping just one BN, placed before the 1x1 conv layers, works better (+0.1%)

BN => LN

Slightly better again (+0.1%)
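
The difference between the two normalizations comes down to which axis the statistics are taken over. A minimal pure-Python sketch, assuming no learnable affine parameters and training-mode batch statistics (no running averages); the toy batches are made up.

```python
import math


def layer_norm(channels, eps=1e-6):
    """Normalize one sample across its own channels."""
    mean = sum(channels) / len(channels)
    var = sum((c - mean) ** 2 for c in channels) / len(channels)
    return [(c - mean) / math.sqrt(var + eps) for c in channels]


def batch_norm(batch, eps=1e-6):
    """Normalize each channel across the samples in the batch."""
    n, c = len(batch), len(batch[0])
    out = [[0.0] * c for _ in range(n)]
    for j in range(c):
        column = [sample[j] for sample in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / math.sqrt(var + eps)
    return out


if __name__ == "__main__":
    batch_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    batch_b = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [70.0, 80.0, 90.0]]  # only the last sample differs
    print("LN sample 0:", layer_norm(batch_a[0]), layer_norm(batch_b[0]))
    print("BN sample 0:", batch_norm(batch_a)[0], batch_norm(batch_b)[0])
```

The LN output for sample 0 is identical in both batches, while the BN output shifts when another sample in the batch changes.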

Separate downsampling layers

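ResNet downsamples at the start of each stage with a 3×3 conv with stride 2, whereas Swin puts a separate downsampling layer between stages. Doing the same with separate 2×2 conv layers with stride 2 works quite well (+0.5%), as long as normalization layers are added wherever the resolution changes. Without them training diverges.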
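
A small shape walk-through of the separate downsampling layers. It is a sketch assuming a 224×224 input, a 4×4 stride-4 patchify stem, and channel widths (96, 192, 384, 768) picked for illustration; the function names are mine.

```python
def conv_out_size(size, kernel, stride, padding=0):
    """Output side length of a conv layer (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_shapes(input_size=224, dims=(96, 192, 384, 768)):
    """(channels, side length) at each stage, with a 2x2 stride-2 conv between stages."""
    size = conv_out_size(input_size, 4, 4)   # patchify stem
    shapes = [(dims[0], size)]
    for dim in dims[1:]:
        size = conv_out_size(size, 2, 2)     # separate downsampling layer
        shapes.append((dim, size))
    return shapes


if __name__ == "__main__":
    for stage, (channels, size) in enumerate(stage_shapes(), start=1):
        print(f"stage {stage}: {channels} x {size} x {size}")
```

Each downsampling layer halves the resolution and changes the channel count, so the blocks inside a stage never have to change shape.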

Model comparison

Result

ImageNet

Throughput is measured on a V100.

Isotropic ConvNeXt vs. ViT

They mimicked the ViT-style architecture that has no downsampling. Same feature size as ViT, no downsampling! ViT vibes!!

dhkim0225 / 1day_1paper

[61] A ConvNet for the 2020s (ConvNeXts) #89