Momentum Update for the Key Encoder; Outperforms Exemplar-CNN, Context Prediction, Jigsaw Puzzles, RotNet/Image Rotations, Colorization, DeepCluster, Instance Discrimination, CPCv1, CPCv2, and CMC.
Momentum Contrast (MoCo).
Momentum Contrast for Unsupervised Visual Representation Learning (MoCo), by Facebook AI Research (FAIR), 2020 CVPR, Over 2400 Citations (Self-Supervised Learning, Contrastive Learning, Image Classification, Object Detection, Segmentation)
This tackles the inconsistent-dictionary problem that arises in Instance Discrimination, where the memory-bank keys are encoded by encoders from many different training steps.
Left: Contrastive Learning Without Dictionary Lookup; Right: Contrastive Learning With Dictionary Lookup.
Contrastive learning since DrLIM, and its recent developments, can be thought of as training an encoder for a dictionary look-up task.
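Concretely, the paper uses the InfoNCE loss: for a query $q$, one positive key $k_{+}$, $K$ negative keys, and a temperature $\tau$, the look-up is cast as a $(K+1)$-way softmax classification:

$$\mathcal{L}_{q} = -\log\frac{\exp(q\cdot k_{+}/\tau)}{\sum_{i=0}^{K}\exp(q\cdot k_{i}/\tau)}$$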
Momentum Contrast (MoCo).
The dictionary is dynamic in the sense that the keys are randomly sampled and the key encoder evolves during training.
The hypothesis is that good features can be learned by a large dictionary that covers a rich set of negative samples, while the encoder for the dictionary keys is kept as consistent as possible despite its evolution.
The samples in the dictionary are progressively replaced. The current mini-batch is enqueued to the dictionary, and the oldest mini-batch in the queue is removed.
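A minimal PyTorch sketch of this queue update (the helper name and tensor shapes are illustrative, not from the official code):

```python
import torch

def dequeue_and_enqueue(queue: torch.Tensor, batch_keys: torch.Tensor) -> torch.Tensor:
    """Enqueue the current mini-batch of keys and drop the oldest keys.

    queue:      (K, C) encoded keys acting as the dictionary.
    batch_keys: (N, C) keys from the current mini-batch, with N << K.
    """
    # Append the newest keys at the tail and cut the oldest N keys at the head,
    # so the dictionary keeps a fixed size K.
    return torch.cat([queue, batch_keys.detach()], dim=0)[batch_keys.shape[0]:]
```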
Formally, denoting the parameters of $f_{k}$ as $\theta_{k}$ and those of $f_{q}$ as $\theta_{q}$, $\theta_{k}$ is updated by:

$$\theta_{k} \leftarrow m\theta_{k} + (1-m)\theta_{q}$$
Here, $m\in [0,1)$ is a momentum coefficient. Only the parameters $\theta_{q}$ are updated by back-propagation.
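A minimal PyTorch sketch of this momentum update, assuming `encoder_q` and `encoder_k` share the same architecture (function and variable names are illustrative):

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q: torch.nn.Module, encoder_k: torch.nn.Module, m: float = 0.999) -> None:
    """theta_k <- m * theta_k + (1 - m) * theta_q (exponential moving average)."""
    for param_q, param_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        # encoder_k is never updated by back-propagation; it slowly tracks encoder_q.
        param_k.data.mul_(m).add_(param_q.data, alpha=1.0 - m)
```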
So, is this better than the Proximal Regularization used in Instance Discrimination?
MoCo Algorithm.
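Below is a simplified PyTorch-style sketch of one training step in the spirit of the paper's pseudocode, reusing the two helpers sketched above; the shuffling-BN trick and multi-GPU details are omitted, and all names (`moco_train_step`, `x_q`, `x_k`, `tau`, ...) are illustrative:

```python
import torch
import torch.nn.functional as F

def moco_train_step(x_q, x_k, encoder_q, encoder_k, queue, optimizer, m=0.999, tau=0.07):
    """One simplified MoCo training step.

    x_q, x_k: two augmented views of the same mini-batch of images.
    queue:    (K, C) dictionary of negative keys.
    """
    q = F.normalize(encoder_q(x_q), dim=1)                 # queries: (N, C)
    with torch.no_grad():
        momentum_update(encoder_q, encoder_k, m)           # key encoder follows slowly
        k = F.normalize(encoder_k(x_k), dim=1)             # keys: (N, C), no gradient

    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(1)    # positive logits: (N, 1)
    l_neg = torch.einsum("nc,kc->nk", q, queue)            # negative logits: (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)  # positive at index 0

    loss = F.cross_entropy(logits, labels)                 # InfoNCE as a (K+1)-way classification
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    queue = dequeue_and_enqueue(queue, k)                  # refresh the dictionary
    return loss.item(), queue
```

Note that both the queue and the key encoder are excluded from back-propagation, which is what allows the dictionary to be large while its keys stay consistent.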
Comparison of three contrastive loss mechanisms (end-to-end, memory bank, and MoCo) under the ImageNet linear classification protocol.
Study of Momentum $m$.
It performs reasonably well when $m$ is in $0.99\sim 0.9999$, showing that a slowly progressing (i.e., relatively large momentum) key encoder is beneficial.
When $m$ is too small (e.g., $0.9$), the accuracy drops considerably.
Comparison under the linear classification protocol on ImageNet.
MoCo with R50 performs competitively and achieves 60.6% accuracy, better than all competitors of similar model sizes (~24M).
MoCo benefits from larger models and achieves 68.6% accuracy with R50×4, outperforming methods such as Exemplar-CNN, Context Prediction, Jigsaw Puzzles, RotNet/Image Rotations, Colorization, DeepCluster, Instance Discrimination, LocalAgg, CPCv1, CPCv2, and CMC.
Object detection fine-tuned on PASCAL VOC trainval07+12. In the brackets are the gaps to the ImageNet supervised pre-training counterpart.
Comparison with previous methods on object detection fine-tuned on PASCAL VOC trainval2007.
MoCo pre-trained on any of IN-1M, IN-14M (full ImageNet), YFCC-100M [55], and IG-1B can outperform the supervised baseline.
Object detection and instance segmentation fine-tuned on COCO.
With the 2× schedule, MoCo is better than its ImageNet supervised counterpart in all metrics in both backbones.
MoCo vs. ImageNet supervised pre-training, fine-tuned on various tasks.
In sum, MoCo can outperform its ImageNet supervised pre-training counterpart in 7 detection or segmentation tasks.
Remarkably, in all these tasks, MoCo pre-trained on IG-1B is consistently better than MoCo pre-trained on IN-1M. This shows that MoCo can perform well on this large-scale, relatively uncurated dataset. This represents a scenario towards real-world unsupervised learning.
Sik-Ho Tsang. Review — MoCo: Momentum Contrast for Unsupervised Visual Representation Learning.