Our final ConvMAE with the multi-scale decoder is pretrained for 1600 epochs with a batch size of 2048, and we did not observe NaN loss.
Thank you for your suggestion! Due to compute limitations, I can only afford 2-4 GPUs for training, so I think this may be the reason for the NaN loss.
We are going to release Fast ConvMAE, which can significantly accelerate the pretraining of ConvMAE, in a few days: https://github.com/Alpha-VL/FastConvMAE
That sounds great! Thank you for the announcement!
Please check the fast version of ConvMAE pretraining.
ConvMAE can hit NaN loss on long training schedules such as 1600 epochs (usually around epochs 1000-1300) or when scaling to huge models. We address the problem in one of three ways (see the sketch after the list):
- restart from the nearest checkpoint and do nothing else.
- restart from the nearest checkpoint and turn off fp16 in the decoder.
- restart from the nearest checkpoint and turn off fp16 in both the encoder and the decoder.
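For concreteness, here is a minimal sketch of options 2 and 3, assuming the `forward_encoder`/`forward_decoder`/`forward_loss` split of the public MAE codebase (ConvMAE's actual method names may differ):

```python
import torch

def forward_with_fp32_decoder(model, samples, mask_ratio=0.75):
    # Option 2: the encoder still runs under fp16 autocast; for option 3,
    # set enabled=False here as well.
    with torch.cuda.amp.autocast(enabled=True):
        latent, mask, ids_restore = model.forward_encoder(samples, mask_ratio)
    # The decoder and loss run in fp32, where overflow-induced NaNs are rarer.
    with torch.cuda.amp.autocast(enabled=False):
        pred = model.forward_decoder(latent.float(), ids_restore)
        loss = model.forward_loss(samples, pred, mask)
    return loss
```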
I will send the unpolished pretraining code to your email (tienhaophung@gmail.com) to spare you from trial and error.
Thanks so much for your interest in our paper.
Could you please also share the unpolished pretraining code with me? Thanks a lot.
@zhangxinyu-xyz The pretraining code of ConvMAE has already been released. We recommend trying our FastConvMAE repo:
https://github.com/Alpha-VL/FastConvMAE
If you want to turn off fp16 optimization, delete the `with torch.cuda.amp.autocast():` context (and dedent its body).
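As a concrete sketch of that toggle, here is a minimal MAE-style training step with fp16 controlled by a flag; the `loss, _, _ = model(...)` return signature follows the public MAE repo, and ConvMAE's engine code may differ:

```python
import torch

def train_step(model, samples, optimizer, scaler, use_fp16=True):
    optimizer.zero_grad()
    # With use_fp16=False the context is a no-op and the forward pass runs
    # in full fp32 -- equivalent to deleting the autocast line entirely.
    with torch.cuda.amp.autocast(enabled=use_fp16):
        loss, _, _ = model(samples, mask_ratio=0.75)
    if use_fp16:
        # GradScaler guards fp16 gradients against underflow.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    else:
        loss.backward()
        optimizer.step()
    return loss.item()
```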
Thanks
I have implemented the pretraining code based on the MAE repo, but I am wondering about one thing: in the decoder phase, do you (1) sum the features of all 3 stages and then normalize the sum, or (2) normalize the feature of the last stage and then sum it with the two previous ones? I ask because I got NaN loss after 270 epochs with approach (1). By the way, have you ever met NaN loss during training?
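For concreteness, the two fusion orders in question could be sketched as follows, with hypothetical shapes and a hypothetical `norm` layer (this is not code from either repo):

```python
import torch
import torch.nn as nn

# Illustrative multi-scale stage features, already projected to the
# decoder width; batch size, token count, and dim are made up.
dim = 512
norm = nn.LayerNorm(dim)
s1, s2, s3 = (torch.randn(2, 196, dim) for _ in range(3))

# (1) sum the features of all 3 stages, then normalize the sum
fused_v1 = norm(s1 + s2 + s3)

# (2) normalize only the last stage, then add the two earlier stages
fused_v2 = norm(s3) + s1 + s2
```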