Hierarchical Transformer whose representation is computed with Shifted windows
Shifted windowing scheme: Greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection
ViT: applies a Transformer architecture to non-overlapping image patches, with global self-attention that is quadratic in the number of patches
Swin Transformer block (used as successive pairs of blocks): replaces standard multi-head self-attention with shifted-window based self-attention, giving complexity linear in image size (the paper's comparison is reproduced below)
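For reference, the paper quantifies this difference for an h x w grid of patch tokens with channel dimension C and window size M (the first term covers the projections, the second the attention maps):

```latex
\begin{aligned}
\Omega(\mathrm{MSA})          &= 4hwC^{2} + 2(hw)^{2}C \\
\Omega(\mathrm{W\text{-}MSA}) &= 4hwC^{2} + 2M^{2}hwC
\end{aligned}
```

The (hw)^2 term makes global MSA quadratic in the number of patches, while W-MSA stays linear in hw for a fixed window size M.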
W-MSA: the input feature map is divided into non-overlapping windows (e.g., 4 windows) and attention is computed only among the patches inside each window (see the sketch below)
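A minimal PyTorch sketch of the idea, assuming square windows and a single head with no learned projections (illustrative only, not the official implementation):

```python
import torch

def window_partition(x, window_size):
    # (B, H, W, C) feature map -> (num_windows*B, window_size*window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def window_self_attention(windows):
    # plain scaled dot-product attention computed independently per window;
    # the attention map is only (M*M) x (M*M), never (H*W) x (H*W)
    q = k = v = windows  # real blocks use learned q/k/v projections and multiple heads
    attn = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
    return attn.softmax(dim=-1) @ v

x = torch.randn(1, 8, 8, 32)                                    # toy 8x8 patch grid, C=32
out = window_self_attention(window_partition(x, window_size=4))
print(out.shape)                                                # torch.Size([4, 16, 32]): 4 windows of 4x4 patches
```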
SW-MSA: the window partition is shifted relative to the previous layer, creating connections between neighboring non-overlapping windows of that layer; alternating the partition across blocks emphasizes connectivity between windows
Shifted Window based Self-attention: attention is still computed only within each window; as the partition shifts, self-attention crosses the previous window boundaries. A cyclic shift is applied so the shifted sub-windows can still be batched as regular MxM windows (Cyclic-Shift; see the sketch below)
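A minimal sketch of the cyclic shift, assuming PyTorch: the feature map is rolled so that the shifted partition still produces regular M x M windows. The attention mask the paper adds to block interactions between patches that are not actually adjacent is omitted here for brevity:

```python
import torch

def cyclic_shift(x, shift_size):
    # roll a (B, H, W, C) feature map up and to the left by shift_size
    return torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

def reverse_cyclic_shift(x, shift_size):
    # undo the roll after window attention has been computed
    return torch.roll(x, shifts=(shift_size, shift_size), dims=(1, 2))

x = torch.arange(64, dtype=torch.float32).reshape(1, 8, 8, 1)  # toy 8x8 grid
shifted = cyclic_shift(x, shift_size=2)                        # shift_size = window_size // 2
restored = reverse_cyclic_shift(shifted, shift_size=2)
assert torch.equal(x, restored)                                # the shift is exactly reversible
```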
Self-attention: a measurement of a specific word's effect on all other words of the same sentence --> applied to images, a way of relating one patch to every other patch (so the amount of computation depends on the image size)
--> this paper keeps that cost under control by applying the concept of a window (a rough size comparison is sketched below)
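A back-of-the-envelope comparison (the arithmetic is mine, not a table from the paper) using the paper's default sizes for the first stage:

```python
h = w = 56   # patch grid for a 224x224 image with 4x4 patches
M = 7        # window size used in the paper

global_pairs   = (h * w) ** 2        # every patch attends to every other patch
windowed_pairs = (h * w) * M ** 2    # every patch attends only to its own M x M window

print(global_pairs, windowed_pairs, global_pairs / windowed_pairs)
# 9834496 153664 64.0 -> windowing shrinks the pairwise term by a factor of (h*w)/(M*M)
```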
Basic
Reason for using a Transformer instead of a CNN: CNNs are much more localized; because convolutions do not consider relations between distant pixels, they lack the spatial information needed for many tasks such as instance recognition
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows