This repository is the official implementation of TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models (CVPR 2023), by Sucheng Ren, Fangyun Wei, Zheng Zhang, and Han Hu.
Small models are critical for real-world applications, yet they cannot benefit, or benefit only marginally, from MIM pre-training. In this paper, we explore distillation techniques to transfer the success of large MIM pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, inputs, network regularization, and sequential distillation, revealing that: 1) distilling token relations is more effective than CLS-token- and feature-based distillation; 2) using an intermediate layer of the teacher as the target performs better than using the last layer when the depth of the student does not match that of the teacher; 3) weak regularization is preferred.
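To make the first finding concrete, below is a minimal PyTorch sketch of relational distillation on token features: the student is trained to match the teacher's softmax-normalized pairwise token-similarity map with a KL divergence. This is an illustration of the general idea only; the exact relation targets and loss used in the paper differ in details (see the paper and main_pretrain.py).

```python
import torch
import torch.nn.functional as F


def token_relation_kd_loss(student_tokens: torch.Tensor,
                           teacher_tokens: torch.Tensor,
                           tau: float = 1.0) -> torch.Tensor:
    """Match softmax-normalized pairwise token relations via KL divergence.

    student_tokens / teacher_tokens: [B, N, C_s] / [B, N, C_t] patch tokens
    taken from an intermediate block of the student / teacher. Because the
    relation map is N x N, the two models may have different embedding
    widths without any projection head. Illustrative only, not the exact
    TinyMIM loss.
    """

    def relation_logits(x: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x, dim=-1)                         # cosine similarities
        return torch.matmul(x, x.transpose(-1, -2)) / tau  # [B, N, N]

    log_p_student = F.log_softmax(relation_logits(student_tokens), dim=-1)
    p_teacher = F.softmax(relation_logits(teacher_tokens), dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")


# Example: ViT-L teacher tokens (1024-d) distilled into a ViT-B student (768-d).
if __name__ == "__main__":
    student = torch.randn(2, 196, 768)
    teacher = torch.randn(2, 196, 1024)
    print(token_relation_kd_loss(student, teacher))
```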
This repository is built on top of MAE.
We pre-train TinyMIM on 32 V100 GPUs (4 nodes × 8 GPUs) with an overall batch size of 4096 (128 per GPU × 32 GPUs), identical to MAE. To pre-train a TinyMIM ViT-Base student with an MAE ViT-Large teacher, run:
python -m torch.distributed.launch \
--nnodes 4 --node_rank $noderank \
--nproc_per_node 8 --master_addr $ip --master_port $port \
main_pretrain.py \
--batch_size 128 \
--model tinymim_vit_base_patch16 \
--epochs 300 \
--warmup_epochs 15 \
--blr 1.5e-4 --weight_decay 0.05 \
--teacher_path /path/to/teacher_ckpt \
--teacher_model mae_vit_large \
--data_path /path/to/imagenet
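For reference, the sketch below shows how the frozen MAE ViT-Large teacher specified by --teacher_path / --teacher_model might be constructed. This is a hedged illustration, not the repository's actual loading code (which lives in main_pretrain.py); the models_mae module, the mae_vit_large_patch16 constructor, and the 'model' checkpoint key follow the MAE codebase and are assumptions here.

```python
import torch
import models_mae  # from the MAE codebase this repo builds on (assumption)

# Build the frozen MAE ViT-Large teacher (constructor name follows MAE's
# models_mae.py; an assumption about this repo's internals).
teacher = models_mae.mae_vit_large_patch16()
ckpt = torch.load("/path/to/teacher_ckpt", map_location="cpu")
teacher.load_state_dict(ckpt.get("model", ckpt), strict=False)
teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False  # the teacher is never updated during distillation
```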
To fine-tune the pre-trained model on ImageNet-1K with 8 GPUs, run:
python -m torch.distributed.launch --nproc_per_node=8 main_finetune.py \
--batch_size 128 \
--model vit_base_patch16 \
--finetune ./output_dir/checkpoint-299.pth \
--epochs 100 \
--output_dir ./out_finetune/ \
--blr 5e-4 --layer_decay 0.65 \
--weight_decay 0.05 --drop_path 0.2 --reprob 0.25 --mixup 0.8 --cutmix 1.0 \
--dist_eval --data_path /path/to/imagenet
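After fine-tuning, the resulting checkpoint can be loaded into a standard ViT for evaluation. The snippet below is a minimal sketch, assuming a timm backbone and an MAE-style checkpoint dictionary with a 'model' key; both are assumptions rather than guarantees about this repository's exact format.

```python
import torch
import timm

# Load a fine-tuned TinyMIM ViT-B classifier for evaluation (illustrative).
model = timm.create_model("vit_base_patch16_224", num_classes=1000)
ckpt = torch.load("/path/to/finetuned_ckpt.pth", map_location="cpu")
# strict=False because MAE-style fine-tuned ViTs may use global pooling and
# therefore differ slightly from the default timm head/norm layout.
missing, unexpected = model.load_state_dict(ckpt.get("model", ckpt), strict=False)
model.eval()
```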
For semantic segmentation on ADE20K, please refer to Segmentation/README.md.
The pre-trained and fine-tuned models on ImageNet-1K are available at:
Performance comparison on ImageNet-1K classification (Top-1, %) and ADE20K semantic segmentation (mIoU, %).

| Method | Model Size | Top-1 | mIoU |
|---|---|---|---|
| MAE | ViT-T | 71.6 | 37.6 |
| TinyMIM | ViT-T | 75.8 | 44.0 |
| TinyMIM* | ViT-T | 79.6 | 45.0 |
| MAE | ViT-S | 80.6 | 42.8 |
| TinyMIM | ViT-S | 83.0 | 48.4 |
| MAE | ViT-B | 83.6 | 48.1 |
| TinyMIM | ViT-B | 85.0 | 52.2 |
Generalization comparison on out-of-domain datasets (ImageNet-A/R/C).

| Method | Model Size | ImageNet-1K | ImageNet-Adversarial $\uparrow$ | ImageNet-Rendition $\uparrow$ | ImageNet-Corruption $\downarrow$ |
|---|---|---|---|---|---|
| MAE | ViT-T | 71.6 | 7.0 | 36.5 | 55.2 |
| TinyMIM | ViT-T | 75.8 | 11.0 | 39.8 | 50.1 |
| MAE | ViT-S | 80.6 | 20.1 | 45.6 | 40.6 |
| TinyMIM | ViT-S | 83.0 | 27.5 | 48.8 | 35.8 |
| MAE | ViT-B | 83.6 | 33.6 | 50.0 | 37.8 |
| TinyMIM | ViT-B | 85.0 | 43.0 | 54.6 | 32.7 |
If you have any questions, feel free to contact Sucheng Ren :)
@InProceedings{Ren_2023_CVPR,
author = {Ren, Sucheng and Wei, Fangyun and Zhang, Zheng and Hu, Han},
title = {TinyMIM: An Empirical Study of Distilling MIM Pre-Trained Models},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {3687-3697}
}