baaivision / EVA

EVA Series: Visual Representation Fantasies from BAAI
MIT License

Performance with MAE style pretraining #12

Closed vateye closed 1 year ago

vateye commented 1 year ago

Hi, I noticed that during pretraining for EVA there are two settings: MAE style and BEiT style. I am wondering about the performance of the MAE style: is there any comparison between these two styles of pretraining?

Yuxin-CV commented 1 year ago

Hi @vateye, thanks for your interest in EVA and your valuable question.

In our initial trials, we found MAE is fragile to scale up with PyTorch AMP (fp16) training. So we chose the BEiT-style pretraining paradigm.
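To make the fragility concrete, here is a minimal pure-Python sketch of the dynamic loss scaling that PyTorch AMP's `GradScaler` performs under fp16. This is illustrative only, not EVA's training code; the class name and defaults mirror `torch.cuda.amp.GradScaler`'s documented behavior.

```python
class DynamicLossScaler:
    """Sketch of AMP-style dynamic loss scaling (illustrative, not EVA's code).

    fp16 has a narrow dynamic range (max finite value ~65504), so the loss is
    multiplied by `scale` before backward to keep small gradients from
    underflowing. When scaled gradients overflow to inf/NaN, the optimizer
    step is skipped and the scale is cut; after `growth_interval` clean steps
    it is raised again. At very large model scale these overflow/backoff
    cycles can become frequent -- the fragility described above.
    """

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_inf: bool) -> None:
        if found_inf:                       # this step's update is discarded
            self.scale *= self.backoff_factor
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self.good_steps = 0


scaler = DynamicLossScaler(growth_interval=3)
for overflow in [False, False, False, True]:
    scaler.update(overflow)
print(scaler.scale)  # grew once (x2) after 3 clean steps, then backed off (x0.5)
```

bf16 avoids most of this because its exponent range matches fp32, which is why MAE trains fine under bf16 (see below).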

Recently, we used the 1.1B EVA-CLIP as the MIM target and trained a ViT-Large with MAE-style pre-training using the same hyper-parameters as MAE. We find our large model can reach 89.2 top-1 acc. on IN-1K. We will release our large model very soon; a ViT-Base with MAE-style pre-training is also on the way, please stay tuned :)
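The setup above, masked image modeling that regresses frozen CLIP features on masked patches, can be sketched as follows. All names, shapes, and the cosine-style loss here are illustrative assumptions, not EVA's actual objective or code:

```python
import numpy as np

rng = np.random.default_rng(0)

def mim_feature_loss(student_feats, teacher_feats, mask_ratio=0.4):
    """Toy MIM objective in the spirit of EVA: regress normalized features
    from a frozen teacher (e.g. a CLIP vision tower) on masked patches only.
    Shapes: (num_patches, dim). Names and loss form are illustrative.
    """
    n, _ = student_feats.shape
    masked_idx = rng.choice(n, size=int(n * mask_ratio), replace=False)

    # Cosine-style loss: normalize both sides, then 1 - dot product,
    # averaged over the masked patches only (visible patches are ignored).
    s = student_feats[masked_idx]
    t = teacher_feats[masked_idx]
    s = s / np.linalg.norm(s, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

student = rng.normal(size=(196, 768))   # 14x14 patches, ViT-Base width
teacher = rng.normal(size=(196, 768))   # stand-in for frozen CLIP features
print(mim_feature_loss(student, teacher))   # ~1.0 for unrelated random features
print(mim_feature_loss(teacher, teacher))   # ~0.0 when prediction is perfect
```

The key design point is that the regression target is a learned semantic feature space rather than raw pixels, which is what distinguishes this from vanilla MAE.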

vateye commented 1 year ago

So MAE-style pre-training fails with fp16? How about bf16 training for scaling up the MAE style? Thanks.

Yuxin-CV commented 1 year ago

bf16 works fine with MAE, but bf16 consumes more run-time GPU memory than fp16, and more importantly it is not supported on some GPU platforms (e.g., V100), limiting accessibility. Fine-tuning a bf16 pre-trained model in fp16 also has potential issues.
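The range difference behind this trade-off can be shown with the standard library alone. Here fp16 is the real IEEE half-precision format (via `struct`'s `'e'` code), and bf16 is simulated by truncating fp32 to its top 16 bits; real hardware rounds to nearest, so this is a simplified sketch:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision (5 exponent, 10 mantissa bits)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_bf16(x: float) -> float:
    """Simulate bfloat16 by truncating an fp32 value to its top 16 bits
    (8 exponent bits: same dynamic range as fp32, only ~3 decimal digits).
    Real bf16 hardware rounds to nearest; truncation keeps the sketch simple."""
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

print(to_fp16(60000.0))   # 60000.0 -- still representable in fp16
try:
    to_fp16(1e5)          # past fp16's max finite value of 65504
except (OverflowError, struct.error):
    print('1e5 overflows fp16')
print(to_bf16(1e5))       # in range for bf16, at much coarser precision
```

So bf16 trades mantissa precision for fp32-like range, which is why it sidesteps fp16's overflow problem but is no free lunch in memory or hardware support.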

Overall, since "you only pre-train once", the acceleration from MAE-style pre-training is not very attractive to us, and MAE can take more training steps to converge. So we chose BEiT-style pre-training with fp16.

vateye commented 1 year ago

Thanks.

Yuxin-CV commented 1 year ago

Weights and logs of EVA-L have been released at https://github.com/baaivision/EVA/tree/master/eva#eva-l-learning-better-mim-representations-from-eva-clip