Momentum Update for the Key Encoder; Outperforms Exemplar-CNN, Context Prediction, Jigsaw Puzzles, RotNet/Image Rotations, Colorization, DeepCluster, Instance Discrimination, CPCv1, CPCv2, and CMC.
Momentum Contrast (MoCo).
Momentum Contrast for Unsupervised Visual Representation Learning (MoCo), by Facebook AI Research (FAIR), 2020 CVPR, Over 2400 Citations (Self-Supervised Learning, Contrastive Learning, Image Classification, Object Detection, Segmentation)
This tackles the inconsistent-dictionary problem that arises in Instance Discrimination, where the memory-bank keys are encoded by encoders from many different training steps.
Left: Contrastive Learning Without Dictionary Lookup; Right: Contrastive Learning With Dictionary Lookup.
Contrastive learning since DrLIM, and its recent developments, can be thought of as training an encoder for a dictionary look-up task.
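Concretely, the paper uses the InfoNCE loss: for a query $q$, one positive key $k_{+}$, $K$ negative keys, and a temperature $\tau$, the look-up is cast as a $(K+1)$-way softmax classification:

$$\mathcal{L}_{q} = -\log\frac{\exp(q\cdot k_{+}/\tau)}{\sum_{i=0}^{K}\exp(q\cdot k_{i}/\tau)}$$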
Momentum Contrast (MoCo).
The dictionary is dynamic in the sense that the keys are randomly sampled and the key encoder evolves during training.
The hypothesis is that good features can be learned by a large dictionary that covers a rich set of negative samples, while the encoder for the dictionary keys is kept as consistent as possible despite its evolution.
The samples in the dictionary are progressively replaced. The current mini-batch is enqueued to the dictionary, and the oldest mini-batch in the queue is removed.
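A minimal PyTorch sketch of this queue update (the helper name and tensor shapes are illustrative, not from the official code):

```python
import torch

def dequeue_and_enqueue(queue: torch.Tensor, batch_keys: torch.Tensor) -> torch.Tensor:
    """Enqueue the current mini-batch of keys and drop the oldest keys.

    queue:      (K, C) encoded keys acting as the dictionary.
    batch_keys: (N, C) keys from the current mini-batch, with N << K.
    """
    # Append the newest keys at the tail and cut the oldest N keys at the head,
    # so the dictionary keeps a fixed size K.
    return torch.cat([queue, batch_keys.detach()], dim=0)[batch_keys.shape[0]:]
```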
Formally, denoting the parameters of $f_{k}$ as $\theta_{k}$ and those of $f_{q}$ as $\theta_{q}$, $\theta_{k}$ is updated by:

$$\theta_{k} \leftarrow m\theta_{k} + (1-m)\theta_{q}$$
Here, $m\in [0,1)$ is a momentum coefficient. Only the parameters $\theta_{q}$ are updated by back-propagation.
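A minimal PyTorch sketch of this momentum update, assuming `encoder_q` and `encoder_k` share the same architecture (function and variable names are illustrative):

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q: torch.nn.Module, encoder_k: torch.nn.Module, m: float = 0.999) -> None:
    """theta_k <- m * theta_k + (1 - m) * theta_q (exponential moving average)."""
    for param_q, param_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        # encoder_k is never updated by back-propagation; it slowly tracks encoder_q.
        param_k.data.mul_(m).add_(param_q.data, alpha=1.0 - m)
```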
So, is this better than the Proximal Regularization used in Instance Discrimination?
MoCo Algorithm.
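Below is a simplified PyTorch-style sketch of one training step in the spirit of the paper's pseudocode, reusing the two helpers sketched above; the shuffling-BN trick and multi-GPU details are omitted, and all names (`moco_train_step`, `x_q`, `x_k`, `tau`, ...) are illustrative:

```python
import torch
import torch.nn.functional as F

def moco_train_step(x_q, x_k, encoder_q, encoder_k, queue, optimizer, m=0.999, tau=0.07):
    """One simplified MoCo training step.

    x_q, x_k: two augmented views of the same mini-batch of images.
    queue:    (K, C) dictionary of negative keys.
    """
    q = F.normalize(encoder_q(x_q), dim=1)                 # queries: (N, C)
    with torch.no_grad():
        momentum_update(encoder_q, encoder_k, m)           # key encoder follows slowly
        k = F.normalize(encoder_k(x_k), dim=1)             # keys: (N, C), no gradient

    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(1)    # positive logits: (N, 1)
    l_neg = torch.einsum("nc,kc->nk", q, queue)            # negative logits: (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)  # positive at index 0

    loss = F.cross_entropy(logits, labels)                 # InfoNCE as a (K+1)-way classification
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    queue = dequeue_and_enqueue(queue, k)                  # refresh the dictionary
    return loss.item(), queue
```

Note that both the queue and the key encoder are excluded from back-propagation, which is what allows the dictionary to be large while its keys stay consistent.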
Comparison of three contrastive loss mechanisms (end-to-end, memory bank, and MoCo) under the ImageNet linear classification protocol.
Study of Momentum $m$.
It performs reasonably well when $m$ is in $0.99\sim 0.9999$, showing that a slowly progressing (i.e., relatively large momentum) key encoder is beneficial.
When $m$ is too small (e.g., $0.9$), the accuracy drops considerably.
Comparison under the linear classification protocol on ImageNet.
MoCo with R50 performs competitively and achieves 60.6% accuracy, better than all competitors of similar model sizes (~24M).
MoCo benefits from larger models and achieves 68.6% accuracy with R50×4, outperforming methods such as Exemplar-CNN, Context Prediction, Jigsaw Puzzles, RotNet/Image Rotations, Colorization, DeepCluster, Instance Discrimination, LocalAgg, CPCv1, CPCv2, and CMC.
Object detection fine-tuned on PASCAL VOC trainval07+12. In the brackets are the gaps to the ImageNet supervised pre-training counterpart.
Comparison with previous methods on object detection fine-tuned on PASCAL VOC trainval2007.
MoCo pre-trained on any of IN-1M, IN-14M (full ImageNet), YFCC-100M [55], and IG-1B can outperform the supervised baseline.
Object detection and instance segmentation fine-tuned on COCO.
With the 2× schedule, MoCo is better than its ImageNet supervised counterpart in all metrics in both backbones.
MoCo vs. ImageNet supervised pre-training, fine-tuned on various tasks.
In sum, MoCo can outperform its ImageNet supervised pre-training counterpart in 7 detection or segmentation tasks.
Remarkably, in all these tasks, MoCo pre-trained on IG-1B is consistently better than MoCo pre-trained on IN-1M. This shows that MoCo can perform well on this large-scale, relatively uncurated dataset. This represents a scenario towards real-world unsupervised learning.
Sik-Ho Tsang. Review — MoCo: Momentum Contrast for Unsupervised Visual Representation Learning.