Our final ConvMAE with the multi-scale decoder is pretrained for 1600 epochs with a batch size of 2048, and we did not observe NaN loss.
Thank you for your suggestion! Due to compute limitations, I can only afford 2-4 GPUs for training, so I think this may be the reason for the NaN loss.
We are going to release Fast ConvMAE, which can significantly accelerate the pretraining of ConvMAE, in a few days: https://github.com/Alpha-VL/FastConvMAE
That sounds great! Thank you for the announcement!
Please check the fast version of ConvMAE pretraining.
ConvMAE can hit NaN loss on long training schedules such as 1600 epochs (usually around epochs 1000-1300) or when scaling to huge models. We address the problem in one of three ways (see the sketch after the list):
- restart from the nearest checkpoint and do nothing else.
- restart from the nearest checkpoint and turn off fp16 in the decoder.
- restart from the nearest checkpoint and turn off fp16 in both the encoder and the decoder.
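For concreteness, here is a minimal sketch of options 2 and 3, assuming the `forward_encoder`/`forward_decoder`/`forward_loss` split of the public MAE codebase (ConvMAE's actual method names may differ):

```python
import torch

def forward_with_fp32_decoder(model, samples, mask_ratio=0.75):
    # Option 2: the encoder still runs under fp16 autocast; for option 3,
    # set enabled=False here as well.
    with torch.cuda.amp.autocast(enabled=True):
        latent, mask, ids_restore = model.forward_encoder(samples, mask_ratio)
    # The decoder and loss run in fp32, where overflow-induced NaNs are rarer.
    with torch.cuda.amp.autocast(enabled=False):
        pred = model.forward_decoder(latent.float(), ids_restore)
        loss = model.forward_loss(samples, pred, mask)
    return loss
```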
I will send the unpolished pretraining code to your email (tienhaophung@gmail.com) to spare you from trial and error.
Thanks so much for your interest in our paper.
Could you please also share the unpolished pretraining code with me? Thanks a lot.
@zhangxinyu-xyz The pretraining code of ConvMAE has already been released. We recommend trying our FastConvMAE repo:
https://github.com/Alpha-VL/FastConvMAE
If you want to turn off fp16 optimization, delete the `with torch.cuda.amp.autocast():` context (and dedent its body).
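As a concrete sketch of that toggle, here is a minimal MAE-style training step with fp16 controlled by a flag; the `loss, _, _ = model(...)` return signature follows the public MAE repo, and ConvMAE's engine code may differ:

```python
import torch

def train_step(model, samples, optimizer, scaler, use_fp16=True):
    optimizer.zero_grad()
    # With use_fp16=False the context is a no-op and the forward pass runs
    # in full fp32 -- equivalent to deleting the autocast line entirely.
    with torch.cuda.amp.autocast(enabled=use_fp16):
        loss, _, _ = model(samples, mask_ratio=0.75)
    if use_fp16:
        # GradScaler guards fp16 gradients against underflow.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    else:
        loss.backward()
        optimizer.step()
    return loss.item()
```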
Thanks
I have implemented the pretraining code based on the MAE repo, but I am wondering about one thing: in the decoder phase, do you (1) sum the features of all 3 stages and then normalize the sum, or (2) normalize the feature of the last stage and then sum it with the two previous ones? I ask because I got NaN loss after 270 epochs with approach (1). By the way, have you ever met NaN loss during training?
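For concreteness, the two fusion orders in question could be sketched as follows, with hypothetical shapes and a hypothetical `norm` layer (this is not code from either repo):

```python
import torch
import torch.nn as nn

# Illustrative multi-scale stage features, already projected to the
# decoder width; batch size, token count, and dim are made up.
dim = 512
norm = nn.LayerNorm(dim)
s1, s2, s3 = (torch.randn(2, 196, dim) for _ in range(3))

# (1) sum the features of all 3 stages, then normalize the sum
fused_v1 = norm(s1 + s2 + s3)

# (2) normalize only the last stage, then add the two earlier stages
fused_v2 = norm(s3) + s1 + s2
```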