Structure

All stages are differentiable, so the architecture is end-to-end trainable.

Feature encoder: extracts per-pixel features from both input images (I1, I2). Performed once.
Context encoder: extracts per-pixel features from the first input image (I1) only. Performed once.
Correlation layer
- Constructs 4D correlation volume
- Computes visual similarity by constructing the full correlation volume between all pairs
- Correlation volume (C) = dot product between all pairs of feature vectors
- Correlation Pyramid
  - Construct 4-layer pyramid C1, C2, C3, C4 by pooling the last 2 dimensions of the correlation volume w/ kernel sizes 1, 2, 4, 8 and equivalent stride
- Correlation Lookup
  - Given current estimate (𝑓^1,𝑓^2) of optical flow, map each pixel 𝑥=(𝑢, 𝑣) in 𝐼_1 to its estimated correspondence in 𝐼_2: 𝑥′=(𝑢+𝑓^1 (𝑢), 𝑣+𝑓^2 (𝑣))
  - Use bilinear sampling
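
The correlation layer steps above can be sketched in NumPy. This is a minimal illustration, not the reference implementation: the function names are hypothetical, average pooling is assumed for the pyramid, u is treated as the column index and v as the row index, and the lookup samples only the single point 𝑥′ at each level (scaled by the level's stride) rather than a neighborhood around it.

```python
import numpy as np

def correlation_volume(feat1, feat2):
    """All-pairs dot products: (H, W, D) x (H, W, D) -> (H, W, H, W)."""
    return np.einsum("ijd,kld->ijkl", feat1, feat2)

def correlation_pyramid(corr, num_levels=4):
    """Pool the last two dimensions with kernel = stride = 1, 2, 4, 8.

    Average pooling is an assumption of this sketch.
    """
    H, W = corr.shape[:2]
    pyramid = []
    for level in range(num_levels):
        k = 2 ** level
        h2, w2 = corr.shape[2] // k, corr.shape[3] // k
        # Crop so the pooled dimensions divide evenly, then average k x k tiles.
        cropped = corr[:, :, :h2 * k, :w2 * k]
        pooled = cropped.reshape(H, W, h2, k, w2, k).mean(axis=(3, 5))
        pyramid.append(pooled)
    return pyramid

def bilinear_sample(grid, x, y):
    """Sample a 2D array at real-valued (x = column, y = row).

    Coordinates outside the array are clamped to the border.
    """
    H, W = grid.shape
    x = min(max(x, 0.0), W - 1.0)
    y = min(max(y, 0.0), H - 1.0)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * grid[y0, x0] + wx * grid[y0, x1]
    bottom = (1 - wx) * grid[y1, x0] + wx * grid[y1, x1]
    return (1 - wy) * top + wy * bottom

def lookup(pyramid, u, v, flow_u, flow_v):
    """Correlation of pixel (u, v) with its estimated match x' at every level."""
    values = []
    for level, corr in enumerate(pyramid):
        scale = 2 ** level
        # x' = (u + f^1, v + f^2), rescaled to the pooled resolution.
        xp, yp = (u + flow_u) / scale, (v + flow_v) / scale
        values.append(bilinear_sample(corr[v, u], xp, yp))
    return values
```

Because only the last two dimensions are pooled, every level keeps one correlation map per pixel of 𝐼_1 at full resolution, while the map itself gets coarser, so larger displacements stay within reach at the higher levels.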
Update operator: estimates a sequence of flow fields {𝑓_0, 𝑓_1, 𝑓_2, 𝑓_3, …, 𝑓_𝑁} from an initial starting point 𝑓_0 = 0
- Input: flow, correlation, latent hidden state
- Output: update ∆𝑓 (applied as 𝑓_(𝑘+1) = 𝑓_𝑘 + ∆𝑓), updated hidden state
- Process
  1. Initialization – Zero init / Warm start
  2. Inputs
  3. Update
  4. Flow Prediction
  5. Upsampling
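
The iterative refinement can be sketched as a plain loop. This is a rough illustration under stated assumptions: `iterate_flow`, `toy_update`, and `lookup_fn` are hypothetical names, and `toy_update` is a hand-written stand-in for the learned update operator so that the loop is runnable. The upsampling step is left out.

```python
import numpy as np

def toy_update(flow, corr, hidden):
    """Hypothetical stand-in for the learned update operator.

    Takes flow, correlation features, and hidden state; returns the flow
    update (delta f) and the new hidden state.
    """
    delta_flow = 0.5 * corr
    new_hidden = 0.9 * hidden + 0.1 * corr
    return delta_flow, new_hidden

def iterate_flow(update_fn, lookup_fn, hidden, shape, num_iters=12, flow_init=None):
    """Produce the sequence f_0, f_1, ..., f_N.

    flow_init=None gives zero initialization; passing an array gives a warm start.
    """
    flow = np.zeros(shape) if flow_init is None else flow_init.copy()
    estimates = [flow.copy()]
    for _ in range(num_iters):
        corr = lookup_fn(flow)                      # correlation at the current estimate
        delta_flow, hidden = update_fn(flow, corr, hidden)
        flow = flow + delta_flow                    # f_{k+1} = f_k + delta f
        estimates.append(flow.copy())
    return estimates, hidden

# Toy usage: pretend the lookup reports the remaining error to a known target flow.
target = np.full((2, 3, 2), 4.0)
estimates, _ = iterate_flow(toy_update, lambda f: target - f,
                            hidden=np.zeros(target.shape), shape=target.shape)
```

In this toy setup each iteration halves the remaining error, which mirrors the idea that the estimate is refined step by step from 𝑓_0 rather than predicted in one shot.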
