This paper received an Outstanding Paper Award at ICLR 2023.

Summary

Dense prediction tasks suffer from high labeling cost, so there is a strong need for a framework that can perform any dense task from only a few labeled images. Existing few-shot learning methods, however, only work on a restricted set of tasks. The paper therefore proposes VTM (Visual Token Matching), a unified few-shot learner. VTM matches the visual tokens of a query image against those of the support-set images and predicts the query label by weighting the support labels with the resulting match scores (similarities). It is a few-shot learning framework that can perform dense prediction even on arbitrary unseen tasks while using only a very small number of task-specific parameters. In practice, it reaches performance comparable to fully supervised methods with very few labeled images (10 shots), and in some cases it outperforms fully supervised baselines with only 0.1% of the full supervision.

Method Highlights

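Any arbitrary dense prediction task $\mathcal{T}$ can be expressed as a mapping from an RGB image to a pixel-wise label map:

$$\mathcal{T}: \mathbb{R}^{H \times W \times 3} \rightarrow \mathbb{R}^{H \times W \times C_{\mathcal{T}}}, \quad C_{\mathcal{T}} \in \mathbb{N}$$

where $C_{\mathcal{T}}$ is the number of output channels of the task.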

With tasks defined this generally, designing a unified few-shot learner has been difficult for two reasons: each task has its own output structure (the dimensionality differs, and labels can be discrete or continuous), and the prior knowledge each task requires also differs.

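To make the architecture task-agnostic, each output channel is treated as its own task: a task with $C_{\mathcal{T}}$ output channels is decomposed into $C_{\mathcal{T}}$ single-channel tasks. The model therefore always performs single-channel prediction, unlike conventional multi-channel prediction.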
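The channel-wise decomposition can be illustrated with a small NumPy sketch. The function names and the toy 3-channel label are assumptions made for illustration and are not taken from the paper's code.

```python
import numpy as np


def split_into_single_channel_tasks(label):
    """Split an (H, W, C) label map into C separate (H, W, 1) label maps."""
    if label.ndim != 3:
        raise ValueError("expected a label map of shape (H, W, C)")
    return [label[..., c:c + 1] for c in range(label.shape[-1])]


def merge_single_channel_predictions(predictions):
    """Concatenate per-channel predictions back into one (H, W, C) map."""
    return np.concatenate(predictions, axis=-1)


# Toy example: a 3-channel label (e.g. a surface-normal map) becomes
# three independent single-channel tasks.
surface_normal = np.random.default_rng(0).random((4, 4, 3))
channels = split_into_single_channel_tasks(surface_normal)
restored = merge_single_channel_predictions(channels)
```

Each element of `channels` would be predicted independently by the same single-channel model, and the per-channel outputs are stacked again at the end.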

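As mentioned above, VTM performs few-shot learning based on similarity through Visual Token Matching, which can be written as

$$g(\mathbf{y}^q_j) = \sum_{i \le N} \sum_{k \le M} \sigma\big(f_{\mathcal{T}}(\mathbf{x}^q_j),\, f_{\mathcal{T}}(\mathbf{x}^i_k)\big)\, g(\mathbf{y}^i_k)$$

where $\mathbf{x}^q_j$ is the $j$-th token of the query image, $\mathbf{x}^i_k$ and $\mathbf{y}^i_k$ are the $k$-th image and label tokens of the $i$-th of $N$ support pairs (each with $M$ tokens), $f_{\mathcal{T}}$ is the image encoder, $g$ is the label encoder, and $\sigma$ is the similarity function.

Put simply, the query and support-set images are tokenized, the similarity between their tokens is computed, and those similarities are combined with the support-set labels through a dot product.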
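The matching step described above can be sketched in a few lines of NumPy. This is a minimal single-head version that assumes a scaled dot-product softmax as the similarity function; the function names, token shapes, and random inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)


def match_tokens(query_tokens, support_tokens, support_label_tokens):
    """Predict query label tokens as a similarity-weighted sum of support label tokens.

    query_tokens:         (num_query_tokens, d)         encoded query image tokens
    support_tokens:       (num_support_tokens, d)       encoded support image tokens
    support_label_tokens: (num_support_tokens, d_label) encoded support label tokens
    """
    d = query_tokens.shape[-1]
    # Assumed similarity: scaled dot product followed by a softmax over support tokens.
    similarity = softmax(query_tokens @ support_tokens.T / np.sqrt(d), axis=-1)
    # Dot product between the similarities and the support label tokens.
    return similarity @ support_label_tokens


rng = np.random.default_rng(0)
support_tokens = rng.normal(size=(8, 16))
support_label_tokens = rng.normal(size=(8, 4))
query_tokens = rng.normal(size=(5, 16))
query_label_tokens = match_tokens(query_tokens, support_tokens, support_label_tokens)
```

In the full model the resulting `query_label_tokens` would still have to pass through the label decoder to become a pixel-wise prediction.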

Architecture overview

The model consists of four main components: an image encoder (applied to both query and support images), a label encoder, a label decoder, and a token matching module. The image encoder includes task-specific parameters so that it can adapt easily to different tasks. It is initialized with BEiT pre-trained weights, while the label encoder and decoder are trained from scratch. The label decoder follows the architecture of DPT (Vision Transformers for Dense Prediction, Ranftl et al., 2021). Token matching can be implemented with multi-head attention.
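The overview does not spell out which encoder parameters are task-specific. As a rough sketch, assuming a bias-only scheme in which the weights are shared across tasks and each task owns its own bias vector, a single encoder layer could look as follows. The class name, task names, and sizes are hypothetical.

```python
import numpy as np


class TaskAdaptiveLinear:
    """Linear layer with weights shared across tasks and one bias vector per task.

    Assumption for illustration: only the bias is task-specific.
    """

    def __init__(self, in_dim, out_dim, task_names, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = rng.normal(scale=0.02, size=(in_dim, out_dim))  # shared
        self.biases = {name: np.zeros(out_dim) for name in task_names}  # per task

    def add_task(self, task):
        """Register an unseen task; only a new bias vector is allocated."""
        self.biases[task] = np.zeros(self.weight.shape[1])

    def num_task_specific_parameters(self, task):
        return self.biases[task].size

    def __call__(self, tokens, task):
        return tokens @ self.weight + self.biases[task]


layer = TaskAdaptiveLinear(16, 16, task_names=["depth", "surface_normal"])
layer.add_task("unseen_task")
tokens = np.ones((5, 16))
out_depth = layer(tokens, task="depth")
out_unseen = layer(tokens, task="unseen_task")
```

Under this assumption, adapting to a new task touches 16 bias values per layer while the 256 shared weights stay fixed, which is one way to keep the task-specific parameter count very small.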

Training minimizes a task-specific loss between the predicted and ground-truth labels of the query images.
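As a hedged sketch, an episodic objective consistent with this description could be written as below. The symbols are assumptions introduced for illustration: $\mathcal{S}_{\mathcal{T}}$ and $\mathcal{Q}_{\mathcal{T}}$ are a support set and a query set sampled for a training task $\mathcal{T}$, $\mathcal{F}$ is the whole model (image encoder $f_{\mathcal{T}}$, label encoder $g$, label decoder $h$, matching function $\sigma$), and $\mathcal{L}$ is the task-specific loss.

```latex
\min_{f_{\mathcal{T}},\, g,\, h,\, \sigma}\;
\mathbb{E}_{\mathcal{S}_{\mathcal{T}},\, \mathcal{Q}_{\mathcal{T}} \sim \mathcal{D}_{\text{train}}}
\left[
  \frac{1}{|\mathcal{Q}_{\mathcal{T}}|}
  \sum_{(X^q,\, Y^q) \in \mathcal{Q}_{\mathcal{T}}}
  \mathcal{L}\big(Y^q,\; \mathcal{F}(X^q;\, \mathcal{S}_{\mathcal{T}})\big)
\right]
```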

The loss function is defined per task: semantic segmentation uses a cross-entropy loss, and all other tasks use an L1 loss.
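The per-task choice of loss can be sketched as a small dispatch function. This sketch assumes that, for a single-channel prediction, the cross-entropy takes its binary form; the task name string and function names are hypothetical.

```python
import numpy as np


def binary_cross_entropy(pred_prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    p = np.clip(pred_prob, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))))


def l1_loss(pred, target):
    """Mean absolute error, used here for every continuous-valued task."""
    return float(np.mean(np.abs(pred - target)))


def task_loss(task, pred, target):
    """Pick the loss by task: cross-entropy for segmentation, L1 otherwise."""
    if task == "semantic_segmentation":
        return binary_cross_entropy(pred, target)
    return l1_loss(pred, target)


pred = np.array([0.5, 0.5])
target = np.array([1.0, 0.0])
segmentation_loss = task_loss("semantic_segmentation", pred, target)
depth_loss = task_loss("depth", pred, target)
```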

Experiments Highlights

The experiments use the part of the Taskonomy dataset that is suitable for dense prediction, and evaluation covers the following dense prediction tasks.

semantic segmentation (SS), surface normal (SN), Euclidean distance (ED), Z-buffer depth (ZD), texture edge (TE), occlusion edge (OE), 2D keypoints (K2), 3D keypoints (K3), reshading (RS), and principal curvature (PC)

Quantitative and qualitative results (comparison with fully supervised baselines and few-shot learning baselines)
Results with a varying number of shots (dotted lines indicate fully supervised baselines)
Ablation

Strengths

Pioneering work on universal few-shot learning for dense prediction
Strong experimental results

Weaknesses

This feels somewhat different from what I had in mind as "general purpose". Is there some other way to improve performance while avoiding task-specific parameters as much as possible?


UNIVERSAL FEW-SHOT LEARNING OF DENSE PREDICTION TASKS WITH VISUAL TOKEN MATCHING (2023) #50
