Going beyond multi-modal pre-training on image-language pairs, the idea is to improve data efficiency by combining language supervision with image self-supervision.
Let's try CLIP + SimCLR (image self-supervision).
Glossary
Linear Classification (Linear probing): a way to evaluate the representation of an encoder trained in an unsupervised or self-supervised manner, by freezing the encoder and attaching a learnable final classification layer.
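The probing recipe above can be sketched end to end. This is a minimal illustration, not the paper's setup: the "frozen encoder" is a fixed random projection standing in for a pretrained model, and the data is synthetic; only the linear head's parameters are trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen encoder: a fixed random projection standing in for a
# pretrained (self-)supervised model. Its weights are never updated.
D_IN, D_FEAT, N_CLASSES = 8, 16, 3
W_enc = rng.normal(size=(D_IN, D_FEAT))

def encode(x):
    # Frozen forward pass: no gradient ever touches W_enc.
    return np.tanh(x @ W_enc)

# Toy labeled data: one Gaussian cluster per class in input space.
N = 300
y = rng.integers(0, N_CLASSES, size=N)
centers = rng.normal(scale=2.0, size=(N_CLASSES, D_IN))
x = centers[y] + rng.normal(scale=0.3, size=(N, D_IN))
feats = encode(x)

# Learnable linear head: the ONLY trained parameters in linear probing.
W = np.zeros((D_FEAT, N_CLASSES))
b = np.zeros(N_CLASSES)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

onehot = np.eye(N_CLASSES)[y]
for _ in range(200):  # plain gradient descent on cross-entropy
    p = softmax(feats @ W + b)
    grad = (p - onehot) / N
    W -= 1.0 * feats.T @ grad
    b -= 1.0 * grad.sum(axis=0)

acc = (np.argmax(feats @ W + b, axis=1) == y).mean()
print(f"linear-probe accuracy: {acc:.2f}")
```

The resulting accuracy is a proxy for how linearly separable the frozen representation is, which is exactly what the protocol measures.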
Baseline
CLIP: 텍스트와 이미지의 semantic 관계에 대해 학습했으나, self-supervision이 더 적용될 여지가 있음
SimCLR: ResNet + SSL
(Ours) SLIP: CLIP + ViT + SSL
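The combined objective can be sketched as the sum of a CLIP-style image-text contrastive loss and a SimCLR-style loss between two augmented views. This is a simplified sketch with random stand-in embeddings, not the paper's implementation: the `info_nce` helper, the batch of 4 pairs, and the `ssl_weight` value are all illustrative choices, and the SSL term here is a simplified symmetric variant of SimCLR's NT-Xent.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(a, b, temp=0.1):
    """Symmetric contrastive loss between two aligned embedding batches.

    Matching rows of `a` and `b` are positives; every other row is a
    negative. With image and text embeddings this is the CLIP loss; with
    two augmented views it approximates the SimCLR objective."""
    logits = l2norm(a) @ l2norm(b).T / temp
    n = len(a)
    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()
    return 0.5 * (xent(logits) + xent(logits.T))

# Stand-in embeddings for a batch of 4 image-text pairs plus two
# SimCLR-style augmented views of each image (names are illustrative).
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 32))    # image embeddings (CLIP branch)
txt = rng.normal(size=(4, 32))    # text embeddings (CLIP branch)
view1 = rng.normal(size=(4, 32))  # augmented view 1 (SSL branch)
view2 = rng.normal(size=(4, 32))  # augmented view 2 (SSL branch)

# SLIP-style total objective: language supervision + image self-supervision.
# The weight on the SSL term is a hyperparameter choice.
ssl_weight = 1.0
loss = info_nce(img, txt) + ssl_weight * info_nce(view1, view2)
print(f"total loss: {loss:.3f}")
```

The point of the sketch is the shape of the objective: one contrastive term ties images to their captions, a second ties two views of the same image together, and training minimizes their weighted sum.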
Data details
| name | abbr | type | format | size | description | related tasks |
| --- | --- | --- | --- | --- | --- | --- |
| ImageNet-1K | | image | (image, class) | | 1K classes, no-labels, highly-curated | classification, captioning |
| YFCC15M | | image | | 15M | English titles & descriptions only | image-text pretraining |
| Conceptual Captions | CC3M | image | (image, caption) | 3M | | image-text pretraining |
| Conceptual 12M | CC12M | image | (image, caption) | 12M | | image-text pretraining |
| DTD | | image | | | downstream transferability; little overlap with the semantic distribution of YFCC15M | classification |
| SST2 | | image | | | downstream transferability; little overlap with the semantic distribution of YFCC15M | classification |
| KITTI | | image | | | downstream transferability; little overlap with the semantic distribution of YFCC15M | |
Evaluation
Evaluation datasets:
Three evaluation protocols (zero-shot transfer, linear probing, end-to-end fine-tuning)
Results
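Among the evaluation protocols, zero-shot classification is the one specific to CLIP-style models: embed a text prompt per class, then pick the class whose prompt embedding is most similar to the image embedding. A minimal sketch with stand-in embeddings (in practice these come from the trained text and image encoders, with prompts like "a photo of a {class}"):

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in embeddings: one text-prompt embedding per class, and an image
# embedding constructed near class 2's prompt so the toy prediction is clear.
rng = np.random.default_rng(1)
class_text_emb = l2norm(rng.normal(size=(3, 64)))
image_emb = l2norm(class_text_emb[2] + 0.1 * rng.normal(size=64))

# Zero-shot prediction: nearest class prompt by cosine similarity.
scores = class_text_emb @ image_emb
pred = int(np.argmax(scores))
print(pred)  # should recover class 2, whose prompt we perturbed
```

No labeled training data for the target dataset is used at any point, which is what makes the protocol "zero-shot".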
Limitations
(Helpfully, anticipated questions and answers are written out in a separate section.)