[DeCLIP] Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Problem statement

CLIP
- Computes image-text similarity with a contrastive loss
- Each image-text pair in the dataset is used only once, which makes CLIP a data-hungry model that needs 400M pairs => DeCLIP aims to get more than one use out of each pair
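
A minimal NumPy sketch of the symmetric contrastive objective described above may help make "one supervision signal per pair" concrete. The function names and the temperature value are illustrative assumptions, not taken from the paper or from any CLIP codebase.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of N matched (image, text) pairs.

    `temperature=0.07` is an illustrative value (assumption).
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature      # (N, N) scaled cosine similarities
    diag = np.arange(logits.shape[0])       # row k should match column k
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


# Toy check: correctly aligned pairs should score a lower loss than misaligned ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned = clip_contrastive_loss(emb, emb)
misaligned = clip_contrastive_loss(emb, np.roll(emb, 1, axis=0))
print(aligned < misaligned)
```

Each pair contributes exactly one positive per direction here, which is the single use of the data that DeCLIP tries to go beyond.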

DeCLIP Open-source data (29M) = CC3M + CC12M + YFCC15M
DeCLIP Full-data (88M) = DeCLIP Open-source data (29M) + DeCLIP Web-crawled data (59M)

name	abbr	type	format	size	use	task
Conceptual Captions	CC3M	image	(image, caption)	3M	image-text pretraining	-
Conceptual 12M	CC12M	image	(image, caption)	12M	image-text pretraining	-
YFCC15M	-	image	-	15M	image-text pretraining	-
DeCLIP web-crawled data	-	image	-	59M	image-text pretraining	-
ImageNet	-	image	(image, class)	-	-	classification, captioning
Pets	-	image	-	-	downstream transferability	classification
CIFAR10	-	image	-	-	downstream transferability	classification
CIFAR100	-	image	-	-	downstream transferability	classification
SUN	-	image	-	-	downstream transferability	classification
Food101	-	image	-	-	downstream transferability	classification
Flowers	-	image	-	-	downstream transferability	classification
Caltech	-	image	-	-	downstream transferability	classification
Aircraft	-	image	-	-	downstream transferability	classification
DTD	-	image	-	-	downstream transferability	classification

Supervision within data (objectives for squeezing more out of the data)
- Supervision using sample-level information
  - TSS (Text self-supervision)
    - MLM (same setting as BERT)
  - ISS (Image self-supervision)
    - A variant of SSL (self-supervised learning: contrastive learning with positive samples created by augmentation)
    - Learns the similarity between the augmented sample $\tilde{Z}^i$ and $P^i$, obtained by passing the original image through a 2-layer MLP
    - The encoder that produces $\tilde{Z}^i$ and $Z^i$ shares weights, and gradients are not back-propagated through the augmented-image branch (that side is not trained)
- Supervision using information across samples
  - MVS (Multi-view supervision)
    - $Z^i$, $\tilde{Z}^i$, $Z^t$, $\tilde{Z}^t$ turn the original single pair into 4 pairs
    - $\tilde{Z}^t$ is generated by replacing certain words in $Z^t$ (e.g. 'cat' <-> 'kitty')
  - NNS (Nearest-Neighbor Supervision)
    - Among all text candidates, the ${Z^{t\prime}}$ most similar to $Z^t$ is used as a positive sample for $Z^i$ and $\tilde{Z}^i$ (adds 2 pairs)
    - Acts as a semantic-level augmentation
    - Implemented with a 64K queue used in FIFO fashion
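
The NNS lookup can be sketched roughly as follows: a fixed-size FIFO buffer of text features, queried by cosine similarity. The class name `TextFeatureQueue`, its methods, and the default capacity of 65536 (one reading of "64K") are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


class TextFeatureQueue:
    """Fixed-size FIFO buffer of L2-normalized text features (hypothetical helper)."""

    def __init__(self, dim, capacity=65536):
        self.buffer = np.zeros((capacity, dim), dtype=np.float32)
        self.capacity = capacity
        self.size = 0   # number of valid rows currently stored
        self.ptr = 0    # next write position; wraps around, overwriting the oldest row

    def enqueue(self, feats):
        """Insert features one by one; the oldest entries are evicted first."""
        for f in feats:
            self.buffer[self.ptr] = f
            self.ptr = (self.ptr + 1) % self.capacity
            self.size = min(self.size + 1, self.capacity)

    def nearest(self, queries):
        """Return, for each query row, the stored feature with the highest dot product."""
        if self.size == 0:
            raise ValueError("queue is empty")
        bank = self.buffer[: self.size]
        sims = queries @ bank.T             # cosine similarity if rows are unit-norm
        return bank[sims.argmax(axis=1)]


# Toy usage: capacity 3, so the first of four inserted features is evicted.
queue = TextFeatureQueue(dim=4, capacity=3)
queue.enqueue(np.eye(4, dtype=np.float32))
z_t = np.array([[0.1, 0.0, 0.0, 0.9]], dtype=np.float32)
z_t_prime = queue.nearest(z_t)   # would then be paired with both image views
print(z_t_prime)
```

The returned neighbor is what the notes call ${Z^{t\prime}}$; pairing it with the original and augmented image features gives the 2 extra pairs.
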
$L_{DeCLIP} = (1 - \alpha - \beta - \gamma) L_{CLIP} + \alpha (L_{ISS} + L_{TSS}) + \beta L_{MVS} + \gamma L_{NNS}$
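
As a quick sanity check on the weighting, the combination above can be written as a small function. This is only a sketch: the function name is hypothetical, the individual loss terms are assumed to be computed elsewhere, and the weights in the usage line are placeholder values rather than the paper's settings.

```python
def declip_total_loss(l_clip, l_iss, l_tss, l_mvs, l_nns, alpha, beta, gamma):
    """Weighted sum of the five loss terms; the CLIP term gets the leftover weight."""
    w_clip = 1.0 - alpha - beta - gamma
    # Small tolerance so that weights summing to exactly 1 are not rejected
    # because of floating-point rounding.
    if min(alpha, beta, gamma) < 0 or w_clip < -1e-12:
        raise ValueError("weights must be non-negative and sum to at most 1")
    return (
        w_clip * l_clip
        + alpha * (l_iss + l_tss)   # both self-supervision terms share alpha
        + beta * l_mvs
        + gamma * l_nns
    )


# Placeholder weights and loss values, chosen only for illustration.
total = declip_total_loss(1.0, 1.0, 1.0, 1.0, 1.0, alpha=0.2, beta=0.2, gamma=0.2)
print(total)
```

Note that ISS and TSS are added before being scaled by $\alpha$, so with equal per-term losses the self-supervision pair carries twice the weight of MVS or NNS alone.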

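On ImageNet zero-shot top-1 accuracy, DeCLIP scores 0.8% higher than CLIP built from the same components (while using 7.1x less data)
Better transferability than CLIP: when going from pretraining to fine-tuning, DeCLIP does better on 8 of 11 downstream tasks
The authors speculate that DeCLIP trails on SUN and Food101 because their distribution differs from the pretraining data ~(an unfalsifiable excuse)~
For Pets the gap between ResNet and ViT is small, whereas for Aircraft it is large. The authors speculate this is because each model has different feature extraction capacities ~(unfalsifiable excuse #2)~
MVS adds 4.2%p, while SS and NNS add slightly less. The authors expect the SS gain could grow further as the SSL method is tuned. ~(unfalsifiable excuse #3)~
NNS can introduce some noise (mismatched pairs), but the end result was still better.
DeCLIP's CAM (class activation map) can be seen to focus on more meaningful regions
Training time also appears to be reduced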

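Even granting that TSS is there to keep the text encoder's previously learned linguistic knowledge from being diluted, the distinct contribution of ISS or NNS remains unclear. Rather than a meaningful distinction among ISS / MVS / NNS, the conclusion reads like 'we threw in every sample-level augmentation we could and it worked'.
- For all the heavy data augmentation, MVS and NNS do not actually look very generic.
- Personal opinion: had the authors not dressed up MVS and NNS for the sake of novelty and instead admitted 'augmentation simply helps, and for multi-modal pretraining pouring in more data is what matters most', the work might have headed toward a more general framework, like SimCLR.
TSS is also somewhat dubious. The input text prompts are short sequences of fewer than 10 words; can MLM on such sequences really learn meaningful linguistic features?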
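
A back-of-the-envelope calculation gives a feel for how little MLM signal such short captions carry. It assumes BERT's 15% masking rate applied independently to each token, and treats words and tokens as interchangeable; both are simplifications, and the helper names are made up for this sketch.

```python
def expected_masked_tokens(seq_len, mask_prob=0.15):
    """Expected number of masked positions under independent per-token masking."""
    return seq_len * mask_prob


def prob_no_token_masked(seq_len, mask_prob=0.15):
    """Probability that a sequence ends up with no masked position at all."""
    return (1.0 - mask_prob) ** seq_len


for n in (5, 8, 10):
    print(n, expected_masked_tokens(n), round(prob_no_token_masked(n), 3))
```

Under these assumptions a 10-token caption yields only about 1.5 masked positions on average, which is consistent with the doubt raised above, though it does not settle whether the signal is useful.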