[SimCLR] A Simple Framework for Contrastive Learning of Visual Representations

Problem statement

Experiments measuring the individual effect of recently proposed techniques on contrastive learning
- data augmentation: crop, cutout, color, sobel filtering, noise, blur, rotate
- projection layer: linear vs. non-linear
- supervised vs. unsupervised

Baseline

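Models proposed with the supervised ResNet-50 architecture
- Same architecture and same dataset, with the intent of minimizing any difference other than whether contrastive learning (and the setup that goes with it) is applied
- MoCo, PIRL, CPC v2
Other ResNet-family models with slight architectural modifications
- BigBiGAN, AMDIM, CMC, etc.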

Data details

Pretraining: ImageNet ILSVRC-2012 (main), CIFAR-10 (some experiments)

Approach

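Architecture used:
- 1. Apply two operators sampled from the same augmentation settings (i.e. the same augmentation with different arguments) to an image, producing ${x_i}$ and ${x_j}$
- 2. With batch size $N$ this yields $2N$ samples; ${x_i}$ and ${x_j}$ form the positive pair, and the remaining $2(N-1)$ samples are treated as negatives
- 3. With f = vision encoder and g = MLP w/ ReLU, ${h_i} = f({x_i})$ and ${z_i} = g({h_i})$; training operates on ${z_k}$, while ${h_k}$ is what is used as the representation vector
- 4. Training uses large batches with a cross-entropy loss (NT-Xent, the normalized temperature-scaled cross entropy)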
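The four steps above can be sketched as a loss function. For a positive pair $(i, j)$ the loss is $\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$, where the sum runs over the other $2N-1$ samples and $\mathrm{sim}$ is cosine similarity. The code below is a minimal pure-Python sketch, not the authors' implementation. The function names, the default temperature of 0.5, and the layout in which rows $2k$ and $2k+1$ hold the two views of image $k$ are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def nt_xent_loss(z, temperature=0.5):
    """Mean NT-Xent loss over 2N projections z.

    Assumed layout (illustrative): rows 2k and 2k+1 are the two views of image k.
    """
    z = [l2_normalize(v) for v in z]
    n = len(z)
    total = 0.0
    for i in range(n):
        j = i ^ 1  # positive partner: 0<->1, 2<->3, ...
        # Temperature-scaled cosine similarity to every sample except i itself.
        logits = {k: sum(a * b for a, b in zip(z[i], z[k])) / temperature
                  for k in range(n) if k != i}
        log_denominator = math.log(sum(math.exp(s) for s in logits.values()))
        total += log_denominator - logits[j]
    return total / n

# Toy projections for N = 2 images (4 views).
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mismatched = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(nt_xent_loss(aligned), nt_xent_loss(mismatched))
```

When each positive pair points in the same direction (`aligned`), the loss is lower than when each sample's nearest neighbor is a negative (`mismatched`).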
Empirical findings:
- Three augmentations applied in sequence: 'random cropping and resize back', 'random color distortions', and 'random Gaussian blur'
  - Chosen by testing the ${}_7P_2$ ordered pairs of the 7 augmentations above and combining those with the best synergy
    - Mixing cropping and color distortion is effective (a single augmentation was not effective enough; only with two or more combined did the task become hard enough to learn a representation)
  - For cropping, sampling two operators lets the model learn the relationship between cropped images
    - (a) Global and local views (b) Adjacent views
- An MLP w/ a non-linear function is added on top of the model output and trained with cross entropy; at inference time, the model output that has not passed through the MLP is used
  - normalized embeddings + cross entropy are effective for representation learning
    - Cross entropy reflects the relative hardness of negative samples (hard negs receive a higher pseudo prob)
    - Without L2 normalization before cross entropy, contrastive acc went up but the quality of the representation went down
  - Effectiveness: non-linear > linear >> no projection layer
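- Findings from comparing against the supervised setting (stating the obvious by now, but still)
  - Benefits much more from larger batches (the update is list-wise rather than point-wise)
  - The longer the training, the higher the performance
  - Color distortion did not help in the supervised setting but was effective in the unsupervised one
  - As the model gets larger (deeper and wider), the gap between supervised and unsupervised shrinks and overall performance increases
    - Unlike this paper, which experimented with stacking layers deeper, SimCLR v2 experimented with wider models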
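The point above about cross entropy reflecting the relative hardness of negatives can be illustrated numerically. For the loss $-s_{pos}/\tau + \log \sum_k \exp(s_k/\tau)$, the gradient with respect to a negative's similarity $s_k$ is its softmax probability divided by $\tau$, so negatives that look more like the anchor are pushed away harder. The sketch below uses made-up similarity values and a hypothetical helper name; it shows the weighting only and is not a training loop.

```python
import math

def softmax_weights(similarities, temperature):
    """Softmax over temperature-scaled similarities (the 'pseudo probabilities')."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one anchor and its candidates:
# index 0 is the positive, indices 1..3 are negatives from easy to hard.
sims = [0.8, -0.2, 0.3, 0.7]

for tau in (1.0, 0.1):
    weights = softmax_weights(sims, tau)
    # The gradient on negative k is weights[k] / tau, so a larger weight
    # means a stronger push away from the anchor.
    print(tau, [round(w, 3) for w in weights])
```

Lowering the temperature widens the gap between the hard and the easy negative. Cosine similarities are bounded in $[-1, 1]$ only because the embeddings are L2-normalized, which is why normalization and temperature are used together.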

Evaluation

1) linear evaluation
- The encoder is frozen and a linear classifier is trained on top of it for evaluation
- ResNet-50 (4x) w/ SimCLR performs on par with supervised ResNet-50
2) semi-supervised (trained with 1% or 10% of the data)
- After unsupervised pretraining w/ SimCLR, fine-tuning the whole encoder on 1% or 10% of the data also outperforms the existing supervised baseline
3) transfer learning (a linear classifier is attached, then hparams are tuned)
- Evaluated on 12 classification tasks (Food, CIFAR10, Birdsnap, Cars, Aircraft, DTD, Pets, Caltech-101, Flowers, etc.)
- SimCLR is better or nearly on par
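The linear-evaluation protocol in 1), which is also the probe attached for transfer learning in 3), can be sketched as follows. This is a toy illustration under loud assumptions: `frozen_encoder` is a hypothetical stand-in for a pretrained $f$, the data are hand-made, and binary logistic regression replaces the multi-class linear classifier. The one point it demonstrates is that only the linear layer's weights are updated.

```python
import math

def frozen_encoder(x):
    """Hypothetical stand-in for a pretrained encoder f; never updated here."""
    return [x[0] + x[1], x[0] - x[1]]

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a binary logistic-regression classifier on fixed features."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for h, y in zip(features, labels):
            logit = sum(wi * hi for wi, hi in zip(w, h)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            # Gradient of the log loss with respect to the logit is (p - y).
            w = [wi - lr * (p - y) * hi for wi, hi in zip(w, h)]
            b -= lr * (p - y)
    return w, b

def predict(w, b, h):
    """Return class 1 if the linear score is positive, else class 0."""
    return 1 if sum(wi * hi for wi, hi in zip(w, h)) + b > 0 else 0

# Hand-made toy inputs and labels (illustrative only).
inputs = [[2.0, 1.0], [1.5, 0.5], [-1.0, -2.0], [-0.5, -1.5]]
labels = [1, 1, 0, 0]

# Features are computed once; the encoder receives no gradient updates.
features = [frozen_encoder(x) for x in inputs]
w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, h) == y for h, y in zip(features, labels)) / len(labels)
print(accuracy)
```

Because the encoder is fixed, the probe's accuracy depends only on how linearly separable the features ${h_k}$ already are, which is what makes it a measure of representation quality.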
Other
- On tasks that predict the applied data transformation, using ${h_k}$ outperformed using $g({h_k})$
  - Presumed reason: $g$ is trained to be invariant to data transformations, so information useful for data representation is lost when passing through $g$
  - Evidence: on tasks irrelevant to classification but relevant to data representation (e.g. Color vs grayscale, Rotation, Org vs corrupted, Org vs Sobel filtered), accuracy with ${h_k}$ was higher than with $g({h_k})$

Limitations

Weaknesses of the proposed architecture
Parts the authors would rather hide, or logical gaps
