Due to limited resources, we only test the model on CIFAR-10. We mainly want to reproduce the result that pre-training a ViT with MAE achieves better accuracy than training it directly with supervised labels. This serves as evidence that self-supervised learning is more data-efficient than supervised learning.
We mainly follow the implementation details in the paper. However, due to the differences between CIFAR-10 and ImageNet, we make some modifications.
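For readers unfamiliar with the pretraining objective, here is a minimal, illustrative sketch of the MAE idea: patchify an image, keep a random 25% of the patches for the encoder, and compute an MSE reconstruction loss only on the masked patches. The 0.75 mask ratio follows the MAE paper; the patch size of 2 for 32×32 inputs and the shapes below are assumptions for illustration, not necessarily this repo's exact settings.

```python
# Illustrative sketch of the MAE pretraining objective (not this repo's actual code).
import torch

def patchify(imgs, patch_size=2):
    """Split (B, C, H, W) images into (B, N, patch_size*patch_size*C) patches."""
    B, C, H, W = imgs.shape
    p = patch_size
    x = imgs.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)
    return x

def random_masking(x, mask_ratio=0.75):
    """Keep a random subset of patches per sample; return kept patches and a binary mask."""
    B, N, D = x.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)          # uniform noise per patch
    ids_shuffle = noise.argsort(dim=1)                 # random permutation of patch indices
    ids_keep = ids_shuffle[:, :num_keep]
    x_kept = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=x.device)           # 1 = masked, 0 = kept
    mask.scatter_(1, ids_keep, 0)
    return x_kept, mask

# Toy usage on CIFAR-10-sized inputs (32x32 RGB, patch size 2 -> 256 patches).
imgs = torch.randn(8, 3, 32, 32)
patches = patchify(imgs)                               # (8, 256, 12)
kept, mask = random_masking(patches, mask_ratio=0.75)  # keep 64 of 256 patches

# The loss is computed only on masked patches; `pred` stands in for the decoder output.
pred = torch.randn_like(patches)
loss = (((pred - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
```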
```bash
pip install -r requirements.txt

# pretrain with MAE
python mae_pretrain.py

# train the classifier from scratch
python train_classifier.py

# train the classifier from the pretrained model
python train_classifier.py --pretrained_model_path vit-t-mae.pt --output_model_path vit-t-classifier-from_pretrained.pt
```
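A hedged sketch of what training the classifier from the pretrained checkpoint might look like internally: reuse the MAE encoder as the backbone and attach a linear head. The attribute names (`encoder`, the embedding size of 192, and a CLS token at index 0) are assumptions for illustration, not necessarily what `train_classifier.py` uses.

```python
import torch
import torch.nn as nn

class ViTClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, emb_dim: int = 192, num_classes: int = 10):
        super().__init__()
        self.encoder = encoder                  # reuse the pretrained MAE encoder backbone
        self.head = nn.Linear(emb_dim, num_classes)

    def forward(self, imgs):
        features = self.encoder(imgs)           # assumed shape (B, N+1, emb_dim) with a CLS token
        return self.head(features[:, 0])        # classify from the CLS token

# `vit-t-mae.pt` is the checkpoint produced by mae_pretrain.py; how it stores the
# encoder (a pickled module with an `encoder` attribute) is an assumption here.
mae = torch.load('vit-t-mae.pt', map_location='cpu')
classifier = ViTClassifier(mae.encoder)
```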
See the logs with `tensorboard --logdir logs`.
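For reference, a minimal sketch of how metrics could be written under `./logs` so the command above picks them up; the tag names and run directory are assumptions, not necessarily what the training scripts use.

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='logs/vit-t-classifier')   # hypothetical run directory
for epoch in range(3):                                     # placeholder training loop
    train_loss, val_acc = 1.0 / (epoch + 1), 0.5 + 0.1 * epoch  # dummy values
    writer.add_scalar('loss/train', train_loss, epoch)
    writer.add_scalar('acc/val', val_acc, epoch)
writer.close()
```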
| Model | Validation Acc (%) |
|---|---|
| ViT-T w/o pretrain | 74.13 |
| ViT-T w/ pretrain | 89.77 |
Weights are available in the GitHub release. You can also view the TensorBoard logs on tensorboard.dev.
Visualization of the first 16 images in the CIFAR-10 validation set:
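A minimal sketch of how such a visualization could be produced: take the first 16 validation images, run them through the pretrained MAE, and save the originals and reconstructions side by side. The assumption that calling the loaded model returns reconstructed images and a mask is for illustration only.

```python
import torch
import torchvision
from torchvision import transforms

val_set = torchvision.datasets.CIFAR10(
    'data', train=False, download=True, transform=transforms.ToTensor())
imgs = torch.stack([val_set[i][0] for i in range(16)])    # (16, 3, 32, 32)

model = torch.load('vit-t-mae.pt', map_location='cpu')    # assumed pickled MAE module
model.eval()
with torch.no_grad():
    recon, mask = model(imgs)   # assumed to return reconstructed images and the patch mask

# Two rows of 16: originals on top, reconstructions below.
grid = torchvision.utils.make_grid(torch.cat([imgs, recon]), nrow=16)
torchvision.utils.save_image(grid, 'mae_visualization.png')
```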