Going beyond multi-modal pre-training on image-language pairs, the idea is to improve data efficiency by combining language supervision with image self-supervision.
Let's try CLIP + SimCLR (image self-supervision).
Glossary
Linear Classification (Linear probing): a way to evaluate the representation of an encoder trained in an unsupervised or self-supervised manner, by freezing the encoder and attaching a learnable final classification layer.
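The probing recipe above can be sketched end to end. This is a minimal illustration, not the paper's setup: the "frozen encoder" is a fixed random projection standing in for a pretrained model, and the data is synthetic; only the linear head's parameters are trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen encoder: a fixed random projection standing in for a
# pretrained (self-)supervised model. Its weights are never updated.
D_IN, D_FEAT, N_CLASSES = 8, 16, 3
W_enc = rng.normal(size=(D_IN, D_FEAT))

def encode(x):
    # Frozen forward pass: no gradient ever touches W_enc.
    return np.tanh(x @ W_enc)

# Toy labeled data: one Gaussian cluster per class in input space.
N = 300
y = rng.integers(0, N_CLASSES, size=N)
centers = rng.normal(scale=2.0, size=(N_CLASSES, D_IN))
x = centers[y] + rng.normal(scale=0.3, size=(N, D_IN))
feats = encode(x)

# Learnable linear head: the ONLY trained parameters in linear probing.
W = np.zeros((D_FEAT, N_CLASSES))
b = np.zeros(N_CLASSES)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

onehot = np.eye(N_CLASSES)[y]
for _ in range(200):  # plain gradient descent on cross-entropy
    p = softmax(feats @ W + b)
    grad = (p - onehot) / N
    W -= 1.0 * feats.T @ grad
    b -= 1.0 * grad.sum(axis=0)

acc = (np.argmax(feats @ W + b, axis=1) == y).mean()
print(f"linear-probe accuracy: {acc:.2f}")
```

The resulting accuracy is a proxy for how linearly separable the frozen representation is, which is exactly what the protocol measures.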
Baseline
CLIP: 텍스트와 이미지의 semantic 관계에 대해 학습했으나, self-supervision이 더 적용될 여지가 있음
SimCLR: ResNet + SSL
(Ours) SLIP: CLIP + ViT + SSL
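The combined objective can be sketched as the sum of a CLIP-style image-text contrastive loss and a SimCLR-style loss between two augmented views. This is a simplified sketch with random stand-in embeddings, not the paper's implementation: the `info_nce` helper, the batch of 4 pairs, and the `ssl_weight` value are all illustrative choices, and the SSL term here is a simplified symmetric variant of SimCLR's NT-Xent.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(a, b, temp=0.1):
    """Symmetric contrastive loss between two aligned embedding batches.

    Matching rows of `a` and `b` are positives; every other row is a
    negative. With image and text embeddings this is the CLIP loss; with
    two augmented views it approximates the SimCLR objective."""
    logits = l2norm(a) @ l2norm(b).T / temp
    n = len(a)
    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()
    return 0.5 * (xent(logits) + xent(logits.T))

# Stand-in embeddings for a batch of 4 image-text pairs plus two
# SimCLR-style augmented views of each image (names are illustrative).
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 32))    # image embeddings (CLIP branch)
txt = rng.normal(size=(4, 32))    # text embeddings (CLIP branch)
view1 = rng.normal(size=(4, 32))  # augmented view 1 (SSL branch)
view2 = rng.normal(size=(4, 32))  # augmented view 2 (SSL branch)

# SLIP-style total objective: language supervision + image self-supervision.
# The weight on the SSL term is a hyperparameter choice.
ssl_weight = 1.0
loss = info_nce(img, txt) + ssl_weight * info_nce(view1, view2)
print(f"total loss: {loss:.3f}")
```

The point of the sketch is the shape of the objective: one contrastive term ties images to their captions, a second ties two views of the same image together, and training minimizes their weighted sum.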
Data details
| name | abbr | type | format | size | description | related tasks |
| --- | --- | --- | --- | --- | --- | --- |
| ImageNet-1K | | image | (image, class) | | 1K classes, no-labels, highly-curated | classification, captioning |
| YFCC15M | | image | | 15M | English titles & descriptions only | image-text pretraining |
| Conceptual Captions | CC3M | image | (image, caption) | 3M | | image-text pretraining |
| Conceptual 12M | CC12M | image | (image, caption) | 12M | | image-text pretraining |
| DTD | | image | | | downstream transferability; little overlap with the semantic distribution of YFCC15M | classification |
| SST2 | | image | | | downstream transferability; little overlap with the semantic distribution of YFCC15M | classification |
| KITTI | | image | | | downstream transferability; little overlap with the semantic distribution of YFCC15M | |
Evaluation
Evaluation datasets:
Three evaluation protocols (zero-shot transfer, linear probing, end-to-end fine-tuning)
Results
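Among the evaluation protocols, zero-shot classification is the one specific to CLIP-style models: embed a text prompt per class, then pick the class whose prompt embedding is most similar to the image embedding. A minimal sketch with stand-in embeddings (in practice these come from the trained text and image encoders, with prompts like "a photo of a {class}"):

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in embeddings: one text-prompt embedding per class, and an image
# embedding constructed near class 2's prompt so the toy prediction is clear.
rng = np.random.default_rng(1)
class_text_emb = l2norm(rng.normal(size=(3, 64)))
image_emb = l2norm(class_text_emb[2] + 0.1 * rng.normal(size=64))

# Zero-shot prediction: nearest class prompt by cosine similarity.
scores = class_text_emb @ image_emb
pred = int(np.argmax(scores))
print(pred)  # should recover class 2, whose prompt we perturbed
```

No labeled training data for the target dataset is used at any point, which is what makes the protocol "zero-shot".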
Limitations
(Helpfully, anticipated questions and answers are written out in a separate section.)