Seyoung9304 / Seyoung9304.github.io


RAFT #2

Open Seyoung9304 opened 2 years ago

Seyoung9304 commented 2 years ago

Abstract

Seyoung9304 commented 2 years ago

Structure

All stages are differentiable, so the whole architecture is end-to-end trainable.

  1. Feature encoder: extracts per-pixel features from both input images (I1, I2). Performed once.

  2. Context encoder: extracts per-pixel features from the first input image only (I1). Performed once.

  3. Correlation layer

    • Correlation volume
      • Computes visual similarity by constructing a full 4D correlation volume between all pairs of pixels
      • Correlation volume (C) = dot product between all pairs of feature vectors
    • Correlation pyramid
      • Constructs a 4-layer pyramid C1, C2, C3, C4 by pooling the last 2 dimensions of the correlation volume with kernel sizes 1, 2, 4, 8 and equivalent stride
    • Correlation lookup
      • Given the current estimate (𝑓^1, 𝑓^2) of optical flow, map each pixel 𝑥 = (𝑢, 𝑣) in 𝐼_1 to its estimated correspondence in 𝐼_2: 𝑥′ = (𝑢 + 𝑓^1(𝑢), 𝑣 + 𝑓^2(𝑣))
      • 𝑥′ is generally real-valued, so the correlation volume is indexed with bilinear sampling
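The three parts of the correlation layer can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the reference implementation: the function names are made up for this note, average pooling is assumed as the pooling type, 𝑢 is assumed to be the column index and 𝑣 the row index, and dividing 𝑥′ by 2^k at pyramid level k is an assumed way of matching the pooled resolution. The lookup samples a single value per level at 𝑥′ and does not include any neighborhood around it.

```python
import numpy as np

def correlation_volume(feat1, feat2):
    """All-pairs dot product: (H, W, D) x (H, W, D) -> (H, W, H, W)."""
    return np.einsum("ijd,kld->ijkl", feat1, feat2)

def pool_last_two(corr, k):
    """Average-pool the last two dims with kernel size k and stride k (assumed pooling type)."""
    H, W, h, w = corr.shape
    hk, wk = h // k, w // k
    cropped = corr[:, :, : hk * k, : wk * k]  # drop any remainder rows/cols
    return cropped.reshape(H, W, hk, k, wk, k).mean(axis=(3, 5))

def correlation_pyramid(corr, kernels=(1, 2, 4, 8)):
    """4-layer pyramid C1..C4; the first two dims keep full resolution."""
    return [pool_last_two(corr, k) for k in kernels]

def bilinear_sample(img, y, x):
    """Bilinearly sample a 2D array at a real-valued (y, x), clamped to the borders."""
    h, w = img.shape
    y = float(np.clip(y, 0, h - 1))
    x = float(np.clip(x, 0, w - 1))
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bottom = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bottom

def lookup(pyramid, u, v, flow):
    """Sample every pyramid level at x' = (u + f1, v + f2) for pixel x = (u, v)."""
    f1, f2 = flow[v, u]  # flow: (H, W, 2), channel 0 = horizontal (assumed layout)
    x_new, y_new = u + f1, v + f2
    return np.array([
        bilinear_sample(level[v, u], y_new / 2**k, x_new / 2**k)
        for k, level in enumerate(pyramid)
    ])
```

With zero flow, the level-0 lookup for a pixel simply returns that pixel's correlation with the same location in the second feature map, which is a quick sanity check for the indexing.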
  4. Update operator: estimates a sequence of flows {𝑓_1, 𝑓_2, 𝑓_3, …, 𝑓_𝑁} from an initial starting point 𝑓_0 = 0

    • Input: flow, correlation, latent hidden state
    • Output: update ∆𝑓, updated hidden state
    • Each iteration applies 𝑓_(𝑘+1) = 𝑓_𝑘 + ∆𝑓
    • Process
      1. Initialization – Zero init / Warm start
      2. Inputs
      3. Update
      4. Flow Prediction
      5. Upsampling
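The iterative refinement can be sketched as a plain loop. This is a minimal sketch under stated assumptions: `refine_flow`, `update_fn`, and `lookup_fn` are hypothetical names, `update_fn` stands in for the learned update operator (whose internals are not spelled out in these notes), and the upsampling step is omitted. Passing `flow_init` corresponds to a warm start; leaving it as `None` gives zero initialization.

```python
import numpy as np

def refine_flow(update_fn, lookup_fn, hidden, shape, num_iters=12, flow_init=None):
    """Run the update operator num_iters times and return every flow estimate.

    update_fn(flow, corr_feats, hidden) -> (delta_flow, new_hidden)  # hypothetical
    lookup_fn(flow) -> correlation features for the current flow     # hypothetical
    """
    H, W = shape
    # Zero init (f_0 = 0) unless a warm-start flow is supplied.
    flow = np.zeros((H, W, 2)) if flow_init is None else flow_init.copy()
    estimates = [flow.copy()]
    for _ in range(num_iters):
        corr_feats = lookup_fn(flow)                         # inputs
        delta, hidden = update_fn(flow, corr_feats, hidden)  # update
        flow = flow + delta                                  # f_{k+1} = f_k + delta f
        estimates.append(flow.copy())
    return estimates, hidden

# Toy stand-ins, only to show the calling convention (not a learned model).
target = np.full((4, 5, 2), 2.0)

def toy_update(flow, corr_feats, hidden):
    # Moves halfway toward a fixed target and counts the steps in the hidden state.
    return 0.5 * (target - flow), hidden + 1

def toy_lookup(flow):
    return None  # correlation features are unused by the toy update

estimates, hidden = refine_flow(toy_update, toy_lookup, hidden=0, shape=(4, 5), num_iters=3)
```

Keeping every intermediate estimate rather than only the last one makes the sequence 𝑓_0, 𝑓_1, …, 𝑓_𝑁 available to the caller.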