* We have changed the project name from ConvMAE to MCMAE.

This repo is the official implementation of MCMAE: Masked Convolution Meets Masked Autoencoders. It currently includes code and models for the following tasks:
* ImageNet Pretrain: See PRETRAIN.md.
* ImageNet Finetune: See FINETUNE.md.
* Object Detection: See DETECTION.md.
* Semantic Segmentation: See SEGMENTATION.md.
* Video Classification: See VideoConvMAE.
* 14/Mar/2023: MR-MCMAE (a.k.a. ConvMAE-v2) paper released: Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking.
* 15/Sep/2022: Paper accepted at NeurIPS 2022.
* 09/Sep/2022: ConvMAE-v2 pretrained checkpoints are released.
* 21/Aug/2022: Official-ConvMAE-Det, which follows the official ViTDet codebase, is released.
* 08/Jun/2022: 🚀FastConvMAE🚀 significantly accelerates pretraining (4000 single-GPU hours => 200 single-GPU hours). The code will be released at FastConvMAE.
* 27/May/2022
* 20/May/2022: Updated results on video classification.
* 16/May/2022: The supported code and models for COCO object detection and instance segmentation are available.
* 11/May/2022
* 08/May/2022: The preprint version is public at arXiv.
The ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer architecture can learn more discriminative representations via the masked auto-encoding scheme.
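To make that one-sentence summary concrete, below is a minimal, hypothetical sketch of the idea: a token-level random mask is upsampled to each convolutional stage so that masked regions are zeroed out (an approximation of masked convolution), and the transformer stage then encodes only the visible tokens. All class and variable names, layer sizes, and the toy reconstruction head are illustrative assumptions, not this repo's implementation; see PRETRAIN.md for the actual pretraining code.

```python
# Toy sketch of the masked conv-transformer idea (hypothetical names/sizes, not the repo's code).
import torch
import torch.nn as nn


class ToyConvMAE(nn.Module):
    def __init__(self, embed_dim=96, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Two conv stages, each downsampling by 2 (64 -> 32 -> 16).
        self.stage1 = nn.Conv2d(3, embed_dim, kernel_size=4, stride=2, padding=1)
        self.stage2 = nn.Conv2d(embed_dim, embed_dim, kernel_size=4, stride=2, padding=1)
        # Final patchify to the transformer resolution (16 -> 8, i.e. 8x8 = 64 tokens).
        self.patchify = nn.Conv2d(embed_dim, embed_dim, kernel_size=2, stride=2)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True), num_layers=2)
        self.decoder_pred = nn.Linear(embed_dim, 3 * 8 * 8)  # toy pixel-prediction head

    def random_token_mask(self, batch, num_tokens, device):
        # 1 = masked, 0 = visible; per-sample random masking at the token level.
        num_keep = int(num_tokens * (1 - self.mask_ratio))
        noise = torch.rand(batch, num_tokens, device=device)
        ids_shuffle = noise.argsort(dim=1)
        mask = torch.ones(batch, num_tokens, device=device)
        mask.scatter_(1, ids_shuffle[:, :num_keep], 0.0)
        return mask

    def forward(self, imgs):
        B = imgs.shape[0]
        mask = self.random_token_mask(B, 8 * 8, imgs.device)           # (B, 64)
        mask_2d = mask.reshape(B, 1, 8, 8)                              # token-resolution mask
        x = self.stage1(imgs)                                           # (B, C, 32, 32)
        # Zero out masked regions after each conv stage (approximation of masked convolution).
        x = x * (1 - nn.functional.interpolate(mask_2d, size=x.shape[-2:]))
        x = self.stage2(x)                                              # (B, C, 16, 16)
        x = x * (1 - nn.functional.interpolate(mask_2d, size=x.shape[-2:]))
        tokens = self.patchify(x).flatten(2).transpose(1, 2)            # (B, 64, C)
        visible = tokens[mask == 0].reshape(B, -1, tokens.shape[-1])    # encode visible tokens only
        encoded = self.encoder(visible)
        # Toy stand-in: the real decoder adds mask tokens and reconstructs the masked patches.
        pred = self.decoder_pred(encoded)
        return pred, mask


model = ToyConvMAE()
pred, mask = model(torch.randn(2, 3, 64, 64))
print(pred.shape, mask.shape)
```

Running the snippet only prints the shapes of the toy predictions and the token mask; the real model differs in depth, masking details, and decoder design.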
The following table provides pretrained checkpoints and logs used in the paper.

| | ConvMAE-Base |
| --- | --- |
| pretrained checkpoints | download |
| logs | download |
The following results are for ConvMAE-v2 (pretrained for 200 epochs on ImageNet-1k).

| model | pretrained checkpoints | ft. acc. on ImageNet-1k |
| --- | --- | --- |
| ConvMAE-v2-Small | download | 83.6 |
| ConvMAE-v2-Base | download | 85.7 |
| ConvMAE-v2-Large | download | 86.8 |
| ConvMAE-v2-Huge | download | 88.0 |
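As a rough sketch of how such pretrained checkpoints are commonly loaded before finetuning, the snippet below reads one into a state dict. The file name and the "model" key are assumptions about the checkpoint layout rather than guarantees; the authoritative instructions are in PRETRAIN.md and FINETUNE.md.

```python
# Hypothetical loading sketch: the file name and the "model" key are assumptions,
# not guaranteed to match the released checkpoints (see FINETUNE.md for the real flow).
import torch

ckpt = torch.load("convmae_base_pretrain.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # MAE-style checkpoints often nest weights under "model"
print(f"checkpoint holds {len(state_dict)} tensors")
# model.load_state_dict(state_dict, strict=False)  # strict=False lets finetuning drop decoder-only weights
```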
Main results on ImageNet-1K (FT = finetuning top-1 accuracy, LIN = linear probing top-1 accuracy):

| Models | #Params(M) | Supervision | Encoder Ratio | Pretrain Epochs | FT acc@1(%) | LIN acc@1(%) | FT logs/weights | LIN logs/weights |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BEiT | 88 | DALLE | 100% | 300 | 83.0 | 37.6 | - | - |
| MAE | 88 | RGB | 25% | 1600 | 83.6 | 67.8 | - | - |
| SimMIM | 88 | RGB | 100% | 800 | 84.0 | 56.7 | - | - |
| MaskFeat | 88 | HOG | 100% | 300 | 83.6 | N/A | - | - |
| data2vec | 88 | RGB | 100% | 800 | 84.2 | N/A | - | - |
| ConvMAE-B | 88 | RGB | 25% | 1600 | 85.0 | 70.9 | log/weight | |
Object detection and instance segmentation on COCO:

| Models | Pretrain | Pretrain Epochs | Finetune Epochs | #Params(M) | FLOPs(T) | box AP | mask AP | logs/weights |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-B | IN21K w/ labels | 90 | 36 | 109 | 0.7 | 51.4 | 45.4 | - |
| Swin-L | IN21K w/ labels | 90 | 36 | 218 | 1.1 | 52.4 | 46.2 | - |
| MViTv2-B | IN21K w/ labels | 90 | 36 | 73 | 0.6 | 53.1 | 47.4 | - |
| MViTv2-L | IN21K w/ labels | 90 | 36 | 239 | 1.3 | 53.6 | 47.5 | - |
| Benchmarking-ViT-B | IN1K w/o labels | 1600 | 100 | 118 | 0.9 | 50.4 | 44.9 | - |
| Benchmarking-ViT-L | IN1K w/o labels | 1600 | 100 | 340 | 1.9 | 53.3 | 47.2 | - |
| ViTDet | IN1K w/o labels | 1600 | 100 | 111 | 0.8 | 51.2 | 45.5 | - |
| MIMDet-ViT-B | IN1K w/o labels | 1600 | 36 | 127 | 1.1 | 51.5 | 46.0 | - |
| MIMDet-ViT-L | IN1K w/o labels | 1600 | 36 | 345 | 2.6 | 53.3 | 47.5 | - |
| ConvMAE-B | IN1K w/o labels | 1600 | 25 | 104 | 0.9 | 53.2 | 47.1 | log/weight |
Semantic segmentation on ADE20K:

| Models | Pretrain | Pretrain Epochs | Finetune Iters | #Params(M) | FLOPs(T) | mIoU | logs/weights |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeiT-B | IN1K w/ labels | 300 | 16K | 163 | 0.6 | 45.6 | - |
| Swin-B | IN1K w/ labels | 300 | 16K | 121 | 0.3 | 48.1 | - |
| MoCo V3 | IN1K | 300 | 16K | 163 | 0.6 | 47.3 | - |
| DINO | IN1K | 400 | 16K | 163 | 0.6 | 47.2 | - |
| BEiT | IN1K+DALLE | 1600 | 16K | 163 | 0.6 | 47.1 | - |
| PeCo | IN1K | 300 | 16K | 163 | 0.6 | 46.7 | - |
| CAE | IN1K+DALLE | 800 | 16K | 163 | 0.6 | 48.8 | - |
| MAE | IN1K | 1600 | 16K | 163 | 0.6 | 48.1 | - |
| ConvMAE-B | IN1K | 1600 | 16K | 153 | 0.6 | 51.7 | log/weight |
Video classification on Kinetics-400:

| Models | Pretrain Epochs | Finetune Epochs | #Params(M) | Top1 | Top5 | logs/weights |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMAE-B | 200 | 100 | 87 | 77.8 | | |
| VideoMAE-B | 800 | 100 | 87 | 79.4 | | |
| VideoMAE-B | 1600 | 100 | 87 | 79.8 | | |
| VideoMAE-B | 1600 | 100 (w/ Repeated Aug) | 87 | 80.7 | 94.7 | |
| SpatioTemporalLearner-B | 800 | 150 (w/ Repeated Aug) | 87 | 81.3 | 94.9 | |
| VideoConvMAE-B | 200 | 100 | 86 | 80.1 | 94.3 | Soon |
| VideoConvMAE-B | 800 | 100 | 86 | 81.7 | 95.1 | Soon |
| VideoConvMAE-B-MSD | 800 | 100 | 86 | 82.7 | 95.5 | Soon |
Video classification on Something-Something V2:

| Models | Pretrain Epochs | Finetune Epochs | #Params(M) | Top1 | Top5 | logs/weights |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMAE-B | 200 | 40 | 87 | 66.1 | | |
| VideoMAE-B | 800 | 40 | 87 | 69.3 | | |
| VideoMAE-B | 2400 | 40 | 87 | 70.3 | | |
| VideoConvMAE-B | 200 | 40 | 86 | 67.7 | 91.2 | Soon |
| VideoConvMAE-B | 800 | 40 | 86 | 69.9 | 92.4 | Soon |
| VideoConvMAE-B-MSD | 800 | 40 | 86 | 70.7 | 93.0 | Soon |
The pretraining and finetuning code of this project is based on DeiT and MAE. The object detection and semantic segmentation parts are based on MIMDet and MMSegmentation, respectively. Thanks for their wonderful work.
ConvMAE is released under the MIT License.
```
@article{gao2022convmae,
  title={ConvMAE: Masked Convolution Meets Masked Autoencoders},
  author={Gao, Peng and Ma, Teli and Li, Hongsheng and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.03892},
  year={2022}
}
```