Hierarchical Transformer whose representation is computed with Shifted windows
Shifted windowing scheme: Greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection
ViT: applies a Transformer architecture to non-overlapping image patches, with global self-attention that is quadratic in the number of patches
Swin Transformer block (used as successive pairs of blocks): replaces standard multi-head self-attention with shifted-window based self-attention, giving complexity linear in image size (the paper's comparison is reproduced below)
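For reference, the paper quantifies this difference for an h x w grid of patch tokens with channel dimension C and window size M (the first term covers the projections, the second the attention maps):

```latex
\begin{aligned}
\Omega(\mathrm{MSA})          &= 4hwC^{2} + 2(hw)^{2}C \\
\Omega(\mathrm{W\text{-}MSA}) &= 4hwC^{2} + 2M^{2}hwC
\end{aligned}
```

The (hw)^2 term makes global MSA quadratic in the number of patches, while W-MSA stays linear in hw for a fixed window size M.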
W-MSA: the input feature map is divided into non-overlapping windows (e.g., 4 windows) and attention is computed only among the patches inside each window (see the sketch below)
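A minimal PyTorch sketch of the idea, assuming square windows and a single head with no learned projections (illustrative only, not the official implementation):

```python
import torch

def window_partition(x, window_size):
    # (B, H, W, C) feature map -> (num_windows*B, window_size*window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def window_self_attention(windows):
    # plain scaled dot-product attention computed independently per window;
    # the attention map is only (M*M) x (M*M), never (H*W) x (H*W)
    q = k = v = windows  # real blocks use learned q/k/v projections and multiple heads
    attn = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
    return attn.softmax(dim=-1) @ v

x = torch.randn(1, 8, 8, 32)                                    # toy 8x8 patch grid, C=32
out = window_self_attention(window_partition(x, window_size=4))
print(out.shape)                                                # torch.Size([4, 16, 32]): 4 windows of 4x4 patches
```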
SW-MSA: the window partition is shifted relative to the previous layer, creating connections between neighboring non-overlapping windows of that layer; alternating the partition across blocks emphasizes connectivity between windows
Shifted Window based Self-attention: attention is still computed only within each window; as the partition shifts, self-attention crosses the previous window boundaries. A cyclic shift is applied so the shifted sub-windows can still be batched as regular MxM windows (Cyclic-Shift; see the sketch below)
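A minimal sketch of the cyclic shift, assuming PyTorch: the feature map is rolled so that the shifted partition still produces regular M x M windows. The attention mask the paper adds to block interactions between patches that are not actually adjacent is omitted here for brevity:

```python
import torch

def cyclic_shift(x, shift_size):
    # roll a (B, H, W, C) feature map up and to the left by shift_size
    return torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

def reverse_cyclic_shift(x, shift_size):
    # undo the roll after window attention has been computed
    return torch.roll(x, shifts=(shift_size, shift_size), dims=(1, 2))

x = torch.arange(64, dtype=torch.float32).reshape(1, 8, 8, 1)  # toy 8x8 grid
shifted = cyclic_shift(x, shift_size=2)                        # shift_size = window_size // 2
restored = reverse_cyclic_shift(shifted, shift_size=2)
assert torch.equal(x, restored)                                # the shift is exactly reversible
```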
Self-attention: a measurement of a specific word's effect on all other words of the same sentence --> applied to images, a way of relating one patch to every other patch (so the amount of computation depends on the image size)
--> this paper keeps that cost under control by applying the concept of a window (a rough size comparison is sketched below)
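A back-of-the-envelope comparison (the arithmetic is mine, not a table from the paper) using the paper's default sizes for the first stage:

```python
h = w = 56   # patch grid for a 224x224 image with 4x4 patches
M = 7        # window size used in the paper

global_pairs   = (h * w) ** 2        # every patch attends to every other patch
windowed_pairs = (h * w) * M ** 2    # every patch attends only to its own M x M window

print(global_pairs, windowed_pairs, global_pairs / windowed_pairs)
# 9834496 153664 64.0 -> windowing shrinks the pairwise term by a factor of (h*w)/(M*M)
```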
Basic
Reason for using a Transformer instead of a CNN: CNNs are much more localized; because convolutions do not consider relations between distant pixels, they lack the spatial information needed for many tasks such as instance recognition
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows