OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Apache License 2.0

How can I handle this in a modified model? #52

Closed dannyxiaocn closed 2 years ago

dannyxiaocn commented 2 years ago

Hi, I added another layer to the model, but a problem occurs after several training steps.

2022-03-21 23:16:50 - progress_bar.py[line:272] - INFO: epoch 001:     41 / 24544 loss=1.825, loss_v1=0, loss_v2=0, nll_loss=1.825, ntokens=16, nsentences=16, sample_size=16, sample_size_v1=0, sample_size_v2=0, ppl=3.54, wps=11.3, ups=0.7, wpb=16, bsz=16, num_updates=41, lr=5.56838e-07, gnorm=32.218, clip=100, loss_scale=16, train_wall=1, gb_free=14.5, wall=67
2022-03-21 23:16:51 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
2022-03-21 23:16:53 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 4.0
2022-03-21 23:16:54 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0
2022-03-21 23:16:55 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0
2022-03-21 23:16:56 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
2022-03-21 23:16:57 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
2022-03-21 23:16:58 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.125
2022-03-21 23:16:59 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0625
2022-03-21 23:17:01 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.03125
2022-03-21 23:17:02 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.015625
2022-03-21 23:17:02 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0078125
2022-03-21 23:17:03 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.00390625
2022-03-21 23:17:04 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.001953125
2022-03-21 23:17:05 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0009765625
2022-03-21 23:17:06 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.00048828125
2022-03-21 23:17:07 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.000244140625
2022-03-21 23:17:08 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0001220703125
/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:787: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
  warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "
/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:752: UserWarning: Using non-full backward hooks on a Module that does not return a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_output. Please use register_full_backward_hook to get the documented behavior.
  warnings.warn("Using non-full backward hooks on a Module that does not return a "
/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:762: UserWarning: Using non-full backward hooks on a Module that does not take as input a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_input. Please use register_full_backward_hook to get the documented behavior.
  warnings.warn("Using non-full backward hooks on a Module that does not take as input a "
/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:777: UserWarning: Using a non-full backward hook when outputs are generated by different autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_output. Please use register_full_backward_hook to get the documented behavior.
  warnings.warn("Using a non-full backward hook when outputs are generated by different autograd Nodes "
(each of the warnings above is repeated several times in the log; repeats omitted)
2022-03-21 23:17:09 - nan_detector.py[line:89] - WARNING: NaN detected in output of encoder.layers.2.moe.moe_layer, shape: torch.Size([60, 1, 768]), forward input max: 3.67578125, input min: -7.75
Traceback (most recent call last):
  File "/workspace/OFA/trainer.py", line 871, in train_step
    grad_norm = self.clip_grad_norm(self.cfg.optimization.clip_norm)
  File "/workspace/OFA/trainer.py", line 1208, in clip_grad_norm
    return self.optimizer.clip_grad_norm(
  File "/workspace/OFA/fairseq/fairseq/optim/fp16_optimizer.py", line 200, in clip_grad_norm
    self.scaler.check_overflow(grad_norm)
  File "/workspace/OFA/fairseq/fairseq/optim/dynamic_loss_scaler.py", line 61, in check_overflow
    raise FloatingPointError(
FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping or increasing the batch size.

Then the training broke down. So how can I fix this problem? Hyperparameter tuning, or something else I need to pay attention to? I would really appreciate it if you could help me!

JustinLin610 commented 2 years ago

It seems that you are working with MoE and facing training instabilities. Can you share more information about your implementation of the MoE layer and the modified DDP?

dannyxiaocn commented 2 years ago

Yes, I am adding an MoE layer with 4 experts (same dimension as the FFN), using the tutel library. I skip the pre-training stage and use it directly in fine-tuning. At the start, I load the checkpoint of a base model, copy the FFN weights into the MoE experts, and add some random noise (the gate is randomly initialized). Specifically, I use (FFN(x) + MoE(x)) to replace the original FFN(x) as the FFN layer output. In other words, the modified network branches at the FFN position and then merges again.
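Roughly, the structure looks like this (a simplified plain-PyTorch sketch, not the actual tutel-based code in my fork; the dense softmax gate, module names, and noise scale are only for illustration):

```python
import copy
import torch
import torch.nn as nn


class ResidualMoEFFN(nn.Module):
    """Sketch of output = FFN(x) + MoE(x), with experts initialized from the pretrained FFN."""

    def __init__(self, ffn: nn.Module, embed_dim: int, num_experts: int = 4, noise_std: float = 1e-3):
        super().__init__()
        self.ffn = ffn  # the original pretrained FFN block
        # each expert starts as a noisy copy of the pretrained FFN weights
        self.experts = nn.ModuleList()
        for _ in range(num_experts):
            expert = copy.deepcopy(ffn)
            with torch.no_grad():
                for p in expert.parameters():
                    p.add_(torch.randn_like(p) * noise_std)
            self.experts.append(expert)
        self.gate = nn.Linear(embed_dim, num_experts)  # randomly initialized gate

    def forward(self, x):
        # x: (seq_len, batch, embed_dim), as in fairseq transformer layers
        gate_probs = torch.softmax(self.gate(x), dim=-1)                 # (T, B, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (T, B, D, E)
        moe_out = (expert_out * gate_probs.unsqueeze(-2)).sum(dim=-1)    # (T, B, D)
        return self.ffn(x) + moe_out
```

The dense gate above is just to keep the sketch short; tutel uses sparse top-k routing, but the FFN(x) + MoE(x) composition and the noisy-copy initialization are the parts relevant to this issue.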

dannyxiaocn commented 2 years ago

You can check my fork to see the source code in models/ofa/unified_transformer_layer.py.

JustinLin610 commented 2 years ago

Have you tried it on a single GPU? If you run it on multiple GPUs, you should consider your implementation of DDP (mainly the all-reduce) and also gradient clipping (specifically the norm computation). Also, start with fewer MoE layers and fewer experts (like 2) first, so that you can keep a relatively large batch size (a smaller batch size may cause instabilities because of the ResNet).
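For debugging, you could also log per-parameter gradient norms right after backward and before clipping, to see where the explosion starts; something like this (the threshold and call site are up to you):

```python
import torch


def log_grad_norms(model, threshold=1e3):
    """Print parameters whose gradient norm is non-finite or exceeds a threshold."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        norm = p.grad.detach().float().norm()
        if not torch.isfinite(norm) or norm > threshold:
            print(f"suspicious grad: {name} norm={norm.item():.3e}")
```

If the suspicious norms always show up in the MoE gate or experts, also double-check that the expert parameters are not all-reduced in a way that mismatches across workers.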

dannyxiaocn commented 2 years ago

thx! I will give it a try.

dannyxiaocn commented 2 years ago

So here comes a confusing problem. I tried what you suggested (n_experts=2, single GPU), but during training the GPU memory keeps growing until I run out of it. Do you have any clue why this happens and how I can fix it? The log goes like:

2022-03-22 12:24:37 - trainer.py[line:704] - INFO: begin training epoch 1
2022-03-22 12:24:37 - train.py[line:295] - INFO: Start iterating over samples
2022-03-22 12:25:21 - progress_bar.py[line:272] - INFO: epoch 001:     10 / 33096 loss=0.828, loss_v1=0, loss_v2=0, nll_loss=0.828, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.77, wps=8.5, ups=0.27, wpb=32, bsz=16, num_updates=10, lr=1.6787e-08, gnorm=15.767, clip=100, loss_scale=128, train_wall=43, gb_free=15.4, wall=65
2022-03-22 12:25:52 - progress_bar.py[line:272] - INFO: epoch 001:     20 / 33096 loss=0.777, loss_v1=0, loss_v2=0, nll_loss=0.777, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.71, wps=10.4, ups=0.32, wpb=32, bsz=16, num_updates=20, lr=3.35739e-08, gnorm=18.08, clip=100, loss_scale=128, train_wall=30, gb_free=13.6, wall=95
2022-03-22 12:26:20 - progress_bar.py[line:272] - INFO: epoch 001:     30 / 33096 loss=0.744, loss_v1=0, loss_v2=0, nll_loss=0.744, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.67, wps=11.4, ups=0.36, wpb=32, bsz=16, num_updates=30, lr=5.03609e-08, gnorm=17.33, clip=100, loss_scale=128, train_wall=27, gb_free=11.9, wall=124
2022-03-22 12:26:49 - progress_bar.py[line:272] - INFO: epoch 001:     40 / 33096 loss=0.725, loss_v1=0, loss_v2=0, nll_loss=0.725, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.65, wps=10.9, ups=0.34, wpb=32, bsz=16, num_updates=40, lr=6.71479e-08, gnorm=19.299, clip=100, loss_scale=128, train_wall=29, gb_free=10.1, wall=153
2022-03-22 12:27:18 - progress_bar.py[line:272] - INFO: epoch 001:     50 / 33096 loss=0.782, loss_v1=0, loss_v2=0, nll_loss=0.782, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.72, wps=11.1, ups=0.35, wpb=32, bsz=16, num_updates=50, lr=8.39349e-08, gnorm=18.066, clip=100, loss_scale=128, train_wall=28, gb_free=8.3, wall=182
2022-03-22 12:27:47 - progress_bar.py[line:272] - INFO: epoch 001:     60 / 33096 loss=0.775, loss_v1=0, loss_v2=0, nll_loss=0.775, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.71, wps=10.9, ups=0.34, wpb=32, bsz=16, num_updates=60, lr=1.00722e-07, gnorm=16.49, clip=100, loss_scale=128, train_wall=29, gb_free=6.6, wall=211
2022-03-22 12:28:16 - progress_bar.py[line:272] - INFO: epoch 001:     70 / 33096 loss=0.747, loss_v1=0, loss_v2=0, nll_loss=0.747, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.68, wps=11.2, ups=0.35, wpb=32, bsz=16, num_updates=70, lr=1.17509e-07, gnorm=16.72, clip=100, loss_scale=128, train_wall=28, gb_free=4.8, wall=240
2022-03-22 12:28:45 - progress_bar.py[line:272] - INFO: epoch 001:     80 / 33096 loss=0.805, loss_v1=0, loss_v2=0, nll_loss=0.805, ntokens=32, nsentences=16, sample_size=32, sample_size_v1=0, sample_size_v2=0, ppl=1.75, wps=11.2, ups=0.35, wpb=32, bsz=16, num_updates=80, lr=1.34296e-07, gnorm=17.591, clip=100, loss_scale=128, train_wall=28, gb_free=3, wall=268
2022-03-22 12:28:48 - trainer.py[line:1304] - WARNING: OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 88.00 MiB (GPU 0; 23.70 GiB total capacity; 20.81 GiB already allocated; 86.56 MiB free; 21.89 GiB reserved in total by PyTorch)

JustinLin610 commented 2 years ago

Seems you got rid of the training instabilities but got stuck on memory consumption. I have never met this when training MoE models, so I am not sure whether it is caused by the MoE layers. Are you sure it is continuous growth rather than a spike? I guess there might be some other reason. Try a smaller model and record the memory consumption to better examine your implementation.
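For example, you could log PyTorch's allocator counters every N updates to tell steady growth from a spike (where you call this in the training loop is up to you):

```python
import torch


def log_gpu_memory(step, device=0):
    """Log current and peak GPU memory (GiB) to distinguish steady growth from spikes."""
    alloc = torch.cuda.memory_allocated(device) / 2**30
    peak = torch.cuda.max_memory_allocated(device) / 2**30
    print(f"step {step}: allocated={alloc:.2f} GiB, peak={peak:.2f} GiB")
    torch.cuda.reset_peak_memory_stats(device)  # reset so each interval reports its own peak
```

If the allocated memory itself climbs monotonically, you are probably keeping references to tensors across steps (for example, accumulating losses or gate statistics without .detach()), which is a common cause of this pattern.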

dannyxiaocn commented 2 years ago

Ok, thx. BTW, did you implement the MoE module yourselves or use open-source libraries?

JustinLin610 commented 2 years ago

We implemented it ourselves in some of our previous work in TensorFlow. Later we will look into implementing MoE for OFA, and maybe tutel is a choice :)

dannyxiaocn commented 2 years ago

Cool! I am doing research on MoE in multimodal pre-trained models and really admire your work. Feel free to drop me an email if you have plans and interest in collaboration or intern opportunities!