Instability Study of ViT for Self-Supervised Learning. Vision Transformer (ViT) Network Architecture (Figure from ViT).
An Empirical Study of Training Self-Supervised Vision Transformers (MoCo v3), by Facebook AI Research (FAIR), 2021 ICCV, over 100 citations. Self-Supervised Learning, Unsupervised Learning, Contrastive Learning, Representation Learning, Image Classification, Vision Transformer (ViT).
MoCo v3 is an incremental improvement of MoCo v1/MoCo v2, studying the instability issue when ViT is used for self-supervised learning.
MoCo v3: PyTorch-like Pseudocode.
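The paper summarizes its training loop as PyTorch-like pseudocode. Below is a minimal runnable PyTorch sketch of that step (not the official implementation): `encoder_q` is the backbone plus projection MLP, `predictor` is the extra prediction MLP, and `encoder_k` is the momentum encoder initialized as a copy of `encoder_q`; `m = 0.99` and `tau = 0.2` follow the paper's stated defaults but should be treated as illustrative here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k, tau=0.2):
    """InfoNCE-style loss: the i-th query's positive key is the i-th key."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau                           # [N, N] similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return 2 * tau * F.cross_entropy(logits, labels)

def moco_v3_step(encoder_q, predictor, encoder_k, optimizer, x1, x2, m=0.99):
    """One training step on two augmented views x1, x2 of the same batch."""
    # queries: backbone + projection MLP + prediction MLP
    q1 = predictor(encoder_q(x1))
    q2 = predictor(encoder_q(x2))
    # keys: momentum encoder, no gradients
    with torch.no_grad():
        k1 = encoder_k(x1)
        k2 = encoder_k(x2)
    # symmetrized contrastive loss
    loss = contrastive_loss(q1, k2) + contrastive_loss(q2, k1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # momentum update of the key encoder
    with torch.no_grad():
        for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
            p_k.mul_(m).add_(p_q, alpha=1.0 - m)
    return loss.item()
```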
The linear probing accuracy with ResNet-50 (R50) on ImageNet.
With a ResNet-50 backbone, the improvement comes mainly from the extra prediction head and large-batch (4096) training.
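As a hedged illustration of what such heads look like, the sketch below builds a 3-layer projection MLP and a 2-layer prediction MLP; the hidden width (4096), output dimension (256), and BatchNorm placement are assumptions loosely following the paper's description, not the official code.

```python
import torch.nn as nn

def projection_mlp(in_dim, hidden_dim=4096, out_dim=256):
    # 3-layer projection MLP appended to the backbone (query and key branches)
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim), nn.BatchNorm1d(out_dim, affine=False),
    )

def prediction_mlp(dim=256, hidden_dim=4096):
    # 2-layer prediction MLP applied on the query branch only
    return nn.Sequential(
        nn.Linear(dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, dim),
    )
```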
It is straightforward to replace a ResNet backbone with a ViT backbone. But in practice, a main challenge is the instability of training.
Training curves of different batch sizes.
A larger batch is also beneficial for accuracy. Batches of 1k and 2k produce reasonably smooth curves, with 71.5% and 72.6% linear probing accuracy respectively.
The curve of a 4k batch becomes noticeably unstable: see the "dips". The curve of a 6k batch has worse failure patterns.
Training curves of different learning rates.
When $lr$ is smaller, the training is more stable, but it is prone to under-fitting.
A larger $lr$ = 1.5e-4 produces more dips in the curve for this setting, and its accuracy is lower. In this regime, the accuracy is determined by stability.
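The quoted learning rates are base values that get scaled linearly with the batch size. A small sketch of that schedule is below; the linear scaling rule ($lr$ × BatchSize/256) and warm-up followed by cosine decay follow the paper's description, while the exact 40-epoch warm-up length is an assumption here.

```python
import math

def moco_v3_lr(base_lr, batch_size, epoch, total_epochs, warmup_epochs=40):
    """Effective learning rate per epoch: linear scaling, warm-up, then cosine decay."""
    lr = base_lr * batch_size / 256                    # linear scaling rule
    if epoch < warmup_epochs:                          # linear warm-up
        return lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return lr * 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay
```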
Training curves of LAMB optimizer.
LAMB turns out to be sensitive to the choice of learning rate, so the authors opt to use AdamW instead.
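A minimal sketch of the corresponding optimizer setup, assuming a PyTorch model; the weight decay of 0.1 follows the paper's ViT setting but is illustrative here.

```python
import torch

def build_optimizer(model, batch_size, base_lr=1.5e-4, weight_decay=0.1):
    # AdamW with the linearly scaled base learning rate discussed above
    return torch.optim.AdamW(model.parameters(),
                             lr=base_lr * batch_size / 256,
                             weight_decay=weight_decay)
```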
Gradient magnitude, shown as relative values for the layer.
It is found that a sudden change of gradients (a "spike") causes a "dip" in the training curve.
The gradient spikes happen earlier in the first layer (patch projection), and are delayed by dozens of iterations in the last layers.
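This diagnosis comes from plotting per-layer gradient magnitudes. Below is a hedged sketch of how one might log such per-layer gradient norms after `loss.backward()`; the layer names assume a timm-style ViT and are illustrative, not the paper's actual logging code.

```python
def layer_grad_norms(model, layer_names=("patch_embed", "blocks.11")):
    """L2 norm of gradients grouped by layer prefix, useful for spotting spikes."""
    sq_norms = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        for key in layer_names:
            if name.startswith(key):
                sq_norms[key] = sq_norms.get(key, 0.0) + p.grad.norm().item() ** 2
    return {k: v ** 0.5 for k, v in sq_norms.items()}
```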
Would a residual connection around the patch projection help???
Random vs. learned patch projection.
Since the instability happens earlier in the shallower layers, the patch projection layer is frozen during training, i.e., a fixed random patch projection is used to embed the patches.
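A minimal sketch of this trick, assuming a timm-style ViT where the patch embedding convolution lives at `model.patch_embed.proj`: the layer keeps its random initialization and simply receives no gradient updates.

```python
def freeze_patch_projection(model):
    # Keep the randomly initialized patch projection fixed throughout training
    for p in model.patch_embed.proj.parameters():
        p.requires_grad_(False)
    return model
```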
Random vs. learned patch projection on SimCLR and BYOL.
Random patch projection improves stability in both SimCLR and BYOL, and increases their accuracy by 0.8% and 1.3%, respectively.
Configurations of ViT models.
Training ViT-B for 100 epochs takes 2.1 hours; ViT-H takes 9.8 hours per 100 epochs using 512 TPUs.
ViT-S/16 and ViT-B/16 in different self-supervised learning frameworks (ImageNet, linear probing).
Different backbones and different frameworks.
Sik-Ho Tsang. Review — MoCo v3: An Empirical Study of Training Self-Supervised Vision Transformers.