JisuHann / One-day-One-paper

Review paper

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows #30

Closed JisuHann closed 2 years ago

JisuHann commented 2 years ago

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

reference

Idea

  1. Patch Partition layer: each patch is treated as a token (e.g., a 4x4 patch with 3 channels gives a 48-dimensional token; see the patch-embedding sketch after this list)
  2. Stage 1-4
    a. Stage 1
      • Linear Embedding Layer
      • Swin Transformer Block x2 (two successive Swin Transformer blocks)
    b. Stage 2-4
      • Patch Merging Layer (see the patch-merging sketch after this list)
      • Swin Transformer Block x2
      • Hierarchical Transformer whose representation is computed with shifted windows
      • Shifted windowing scheme: greater efficiency by limiting self-attention computation to non-overlapping local windows while still allowing cross-window connections
      • ViT: applies a Transformer on non-overlapping image patches with global self-attention (quadratic computational complexity in the number of patches)
      • Swin Transformer block (two successive blocks): modifies self-attention to operate on shifted windows (linear computational complexity); see the complexity comparison after this list
        1. W-MSA: the input feature map is divided into non-overlapping windows (4 in the paper's illustration) and attention is computed only among the patches within each window
        2. SW-MSA: creates connections between neighboring non-overlapping windows of the previous layer; the window partition is shifted in the following block, so different windows are connected and cross-window connectivity is strengthened
          • Shifted-window self-attention: still computed only within each window; as the partition shifts, self-attention is recomputed on the new windows; the irregular sub-windows are packed back into M×M windows via a cyclic shift (see the cyclic-shift sketch after this list)
      • Self-attention: a measurement of a specific word's effect on all other words in the same sentence --> here, a way of relating one patch to every other patch (so the amount of computation depends on the image size)
      • --> this paper makes this tractable by applying the concept of a window
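A minimal sketch of the Patch Partition + Linear Embedding step (one token per 4x4x3 patch), assuming a PyTorch-style implementation; the class name `PatchEmbed` and the default `embed_dim=96` are illustrative choices, not the official code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping 4x4 patches and project each
    flattened 48-dim patch (4*4*3) to the embedding dimension C."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A conv with kernel = stride = patch_size is equivalent to cutting
        # 4x4 patches and applying one shared linear layer to each of them.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)     # (B, H/4 * W/4, C): one token per patch

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96]) -> 56x56 patch tokens of dim 96
```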
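A matching sketch of the Patch Merging layer that starts stages 2-4: each 2x2 group of neighboring tokens is concatenated (4C dims) and linearly reduced to 2C, halving the spatial resolution. Again a PyTorch-style assumption; the spatial (B, H, W, C) layout is a simplification of the token-sequence layout used in practice.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring tokens (4C dims) and project
    to 2C, halving H and W while doubling the channel dimension."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                         # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                  # top-left token of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                  # bottom-left
        x2 = x[:, 0::2, 1::2, :]                  # top-right
        x3 = x[:, 1::2, 1::2, :]                  # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)

out = PatchMerging(96)(torch.randn(1, 56, 56, 96))
print(out.shape)  # torch.Size([1, 28, 28, 192])
```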
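On the complexity point above: for an h×w feature map with channel dimension C and window size M, the paper gives the following costs, where global MSA is quadratic in the number of tokens hw while W-MSA is linear once M is fixed (M = 7 in the paper):

```latex
% Global multi-head self-attention vs. window-based self-attention
\Omega(\mathrm{MSA})    = 4hwC^{2} + 2(hw)^{2}C   % quadratic in hw
\Omega(\text{W-MSA})    = 4hwC^{2} + 2M^{2}hwC    % linear in hw for fixed M
```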
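Finally, a minimal sketch of the W-MSA window partition and the SW-MSA cyclic shift, assuming a PyTorch tensor layout of (B, H, W, C); the helper `window_partition` and the sizes used are illustrative (the paper uses M = 7 and a shift of M // 2 = 3).

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping
    (num_windows * B, window_size, window_size, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

x = torch.randn(1, 56, 56, 96)          # stage-1 feature map: 56x56 tokens, C = 96
window_size, shift_size = 7, 3

# W-MSA: attention is computed independently inside each 7x7 window.
windows = window_partition(x, window_size)                  # (64, 7, 7, 96)

# SW-MSA: cyclically shift the feature map so the next block's windows
# straddle the previous windows' boundaries, then partition again.
shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
shifted_windows = window_partition(shifted, window_size)    # (64, 7, 7, 96)
# (In the real block, an attention mask hides tokens that were wrapped around
#  by the roll, and a reverse roll restores the original layout afterwards.)
print(windows.shape, shifted_windows.shape)
```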

Basic