It seems you are working with MoE and running into training instabilities. Could you share more details about your implementation of the MoE layer and the modified DDP?
Yes, I am adding an MoE layer with 4 experts (same dimension as the FFN), using the tutel library, and I skip the pre-training stage and use it directly in fine-tuning. At the start, I load the checkpoint from a base model, copy the FFN weights into the MoE experts, and add some random noise to them (the gate is randomly initialized). Specifically, I am using (FFN(x) + MoE(x)) to replace the original FFN(x) as the FFN layer output. In other words, the modified network branches at the FFN layer and then recombines.
You can check my fork to see the source code in models/ofa/unified_transformer_layer.py.
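Roughly, the idea looks like this (a minimal hand-rolled sketch rather than the real tutel call; the module and attribute names here are placeholders, see the fork for the actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """Minimal MoE stand-in: a randomly initialized gate softly routing
    over per-expert FFNs (tutel does sparse top-k dispatch instead)."""
    def __init__(self, embed_dim, ffn_dim, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(embed_dim, num_experts)  # gate with random init
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(embed_dim, ffn_dim),
                nn.ReLU(),
                nn.Linear(ffn_dim, embed_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):
        probs = F.softmax(self.gate(x), dim=-1)                    # (..., n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (..., dim, n_experts)
        return torch.einsum('...dn,...n->...d', outs, probs)

def init_experts_from_ffn(moe, fc1, fc2, noise_std=1e-2):
    """Copy the pretrained FFN weights (fc1, fc2) into every expert,
    plus small random noise so the experts can diverge."""
    with torch.no_grad():
        for expert in moe.experts:
            expert[0].weight.copy_(fc1.weight + noise_std * torch.randn_like(fc1.weight))
            expert[0].bias.copy_(fc1.bias)
            expert[2].weight.copy_(fc2.weight + noise_std * torch.randn_like(fc2.weight))
            expert[2].bias.copy_(fc2.bias)

# Inside the transformer layer's forward, the FFN output becomes:
#   y = ffn(x) + moe(x)   # the two branches diverge at the FFN and recombine
```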
Have you tried it on a single GPU? If you run it on multiple GPUs, you should think about your implementation of DDP (mainly the all-reduce) and also gradient clipping (specifically, how the norm is computed). Also, start with fewer MoE layers and fewer experts (like 2) so you can keep a relatively large batch size (a smaller batch size may cause instabilities due to the ResNet).
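For the multi-GPU case, if your experts are sharded across ranks (as in tutel's data+expert parallelism), the usual pattern is to flag the expert parameters, skip them in the all-reduce (each rank holds different experts, so averaging their gradients would be wrong), and fold their rank-local norm back into the global gradient norm when clipping. A rough sketch, not OFA code; the 'experts' name filter and the helper names are just assumptions about your setup:

```python
import torch
import torch.distributed as dist

def mark_expert_params(model):
    # Tag the parameters that belong to MoE experts so that DDP and the
    # clipper can treat them differently. The name filter is an assumption
    # about how the expert modules are named in your fork.
    for name, p in model.named_parameters():
        p.expert = 'experts' in name

def all_reduce_shared_grads(model, world_size):
    # Only the shared (non-expert) gradients should be averaged across ranks;
    # expert gradients stay local, since each rank holds different experts.
    for p in model.parameters():
        if p.grad is not None and not getattr(p, 'expert', False):
            dist.all_reduce(p.grad)
            p.grad.div_(world_size)

def global_grad_norm(model, device='cuda'):
    # Shared parameters contribute the same squared norm on every rank;
    # expert parameters contribute rank-local terms that must be summed
    # across ranks before taking the square root.
    shared_sq = torch.zeros((), device=device)
    expert_sq = torch.zeros((), device=device)
    for p in model.parameters():
        if p.grad is None:
            continue
        sq = p.grad.norm() ** 2
        if getattr(p, 'expert', False):
            expert_sq += sq
        else:
            shared_sq += sq
    dist.all_reduce(expert_sq)  # sum the expert contributions over all ranks
    return (shared_sq + expert_sq).sqrt()
```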
Thanks! I will give it a try.
So here comes a confusing problem. I tried what you suggested (n_experts=2, single GPU), but during training the GPU memory keeps growing until I run out of it. Do you have any clue why this happens and how I can fix it? The log looks like this:
2022-03-22 12:24:37 - trainer.py[line:704] - INFO: begin training epoch 1
2022-03-22 12:24:37 - train.py[line:295] - INFO: Start iterating over samples
2022-03-22 12:25:21 - progress_bar.py[line:272] - INFO: epoch 001: 10 / 33096 loss=0.828, loss_v1=0, loss_v2=0, nll_loss=0.828, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.77, wps=8.5, ups=0.27, wpb=32, bsz=16, num_updates=10, lr=1.6787e-08, gnorm=15.767, clip=100, loss_scale=128, train_wall=43, gb_free=15.4, wall=65
2022-03-22 12:25:52 - progress_bar.py[line:272] - INFO: epoch 001: 20 / 33096 loss=0.777, loss_v1=0, loss_v2=0, nll_loss=0.777, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.71, wps=10.4, ups=0.32, wpb=32, bsz=16, num_updates=20, lr=3.35739e-08, gnorm=18.08, clip=100, loss_scale=128, train_wall=30, gb_free=13.6, wall=95
2022-03-22 12:26:20 - progress_bar.py[line:272] - INFO: epoch 001: 30 / 33096 loss=0.744, loss_v1=0, loss_v2=0, nll_loss=0.744, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.67, wps=11.4, ups=0.36, wpb=32, bsz=16, num_updates=30, lr=5.03609e-08, gnorm=17.33, clip=100, loss_scale=128, train_wall=27, gb_free=11.9, wall=124
2022-03-22 12:26:49 - progress_bar.py[line:272] - INFO: epoch 001: 40 / 33096 loss=0.725, loss_v1=0, loss_v2=0, nll_loss=0.725, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.65, wps=10.9, ups=0.34, wpb=32, bsz=16, num_updates=40, lr=6.71479e-08, gnorm=19.299, clip=100, loss_scale=128, train_wall=29, gb_free=10.1, wall=153
2022-03-22 12:27:18 - progress_bar.py[line:272] - INFO: epoch 001: 50 / 33096 loss=0.782, loss_v1=0, loss_v2=0, nll_loss=0.782, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.72, wps=11.1, ups=0.35, wpb=32, bsz=16, num_updates=50, lr=8.39349e-08, gnorm=18.066, clip=100, loss_scale=128, train_wall=28, gb_free=8.3, wall=182
2022-03-22 12:27:47 - progress_bar.py[line:272] - INFO: epoch 001: 60 / 33096 loss=0.775, loss_v1=0, loss_v2=0, nll_loss=0.775, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.71, wps=10.9, ups=0.34, wpb=32, bsz=16, num_updates=60, lr=1.00722e-07, gnorm=16.49, clip=100, loss_scale=128, train_wall=29, gb_free=6.6, wall=211
2022-03-22 12:28:16 - progress_bar.py[line:272] - INFO: epoch 001: 70 / 33096 loss=0.747, loss_v1=0, loss_v2=0, nll_loss=0.747, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.68, wps=11.2, ups=0.35, wpb=32, bsz=16, num_updates=70, lr=1.17509e-07, gnorm=16.72, clip=100, loss_scale=128, train_wall=28, gb_free=4.8, wall=240
2022-03-22 12:28:45 - progress_bar.py[line:272] - INFO: epoch 001: 80 / 33096 loss=0.805, loss_v1=0, loss_v2=0, nll_loss=0.805, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.75, wps=11.2, ups=0.35, wpb=32, bsz=16, num_updates=80, lr=1.34296e-07, gnorm=17.591, clip=100, loss_scale=128, train_wall=28, gb_free=3, wall=268
2022-03-22 12:28:48 - trainer.py[line:1304] - WARNING: OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 88.00 MiB (GPU 0; 23.70 GiB total capacity; 20.81 GiB already allocated; 86.56 MiB free; 21.89 GiB reserved in total by PyTorch)
It seems you got rid of the training instabilities but are now stuck on memory consumption. I never ran into this when I was training MoE models, so I am not sure whether it is caused by the MoE layers. Are you sure it is continuous growth rather than a spike? I guess there might be some other reason. Try a smaller model and record the memory consumption to better examine your implementation.
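For example, logging something like the following every few steps will show whether allocation grows monotonically (a leak, e.g. a loss or metric tensor accumulated across steps without being detached, which keeps the computation graph alive) or only spikes on unusually large batches:

```python
import torch

def log_cuda_memory(step, device=0):
    # allocated = memory held by live tensors; reserved = what the caching
    # allocator has claimed from the driver; peak = high-water mark since
    # the last reset
    alloc = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    peak = torch.cuda.max_memory_allocated(device) / 2**30
    print(f'step {step}: allocated={alloc:.2f} GiB, '
          f'reserved={reserved:.2f} GiB, peak={peak:.2f} GiB')
    torch.cuda.reset_peak_memory_stats(device)

# In the training loop:
#   if step % 10 == 0:
#       log_cuda_memory(step)
```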
OK, thanks. BTW, was the MoE module you used implemented by yourselves, or from an open-source library?
We implemented it ourselves in some of our previous work in TensorFlow. Later we'll look into implementing MoE for OFA, and maybe tutel is a choice :)
Cool! I am doing research on MoE in multimodal pre-trained models and really admire your work. Feel free to drop me an email if you have plans and interest in collaboration or intern opportunities!
Hi, I added another MoE layer to the model, but a problem occurred after several steps.
Then the training broke down. How can I fix this problem? Hyperparameter tuning? Or is there something else I need to pay attention to? I would really appreciate it if you could help me!