facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Gradient overflow when running the example of Adaptive Input Representations #4293

Open ghrua opened 2 years ago

ghrua commented 2 years ago

❓ Questions and Help

What is your question?

I get an inf loss and gradient overflow when running the code example for adaptive input representations with --fp16. I am trying to reproduce the results of Baevski and Auli (2018), and the code example provided by fairseq works fine with fp32. However, the model does not train well when I use fp16 to reduce training time, as done by Baevski and Auli (2018). Are there any tips for preventing the loss from going to inf?

Code

Almost the same as the code in this link except for the fp16 argument:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py \
    --task language_modeling \
    $DEST/data-bin/wikitext-103 \
    --save-dir $DEST/results/debug \
    --arch transformer_lm_wiki103 \
    --max-update 286000 --lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \
    --warmup-updates 16000 --warmup-init-lr 1e-07 --stop-min-lr 1e-09 --optimizer nag --min-lr 0.0001 --clip-norm 0.1 \
    --criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 \
    --sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=legacy_ddp \
    --fp16

Results:

...
2022-03-21 14:38:24 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
2022-03-21 14:38:36 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
epoch 001:   0%|                                                                                        | 1/1401 [00:14<5:43:19, 14.71s/it]
2022-03-21 14:38:42 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
epoch 001:   0%|▏                                                                                       | 2/1401 [00:21<3:48:24,  9.80s/it]
2022-03-21 14:38:49 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 16.0
epoch 001:   4%|███▎                                                                                     | 53/1401 [01:06<14:50,  1.51it/s]
2022-03-21 14:39:29 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
epoch 001:  13%|▏| 185/1401 [02:34<13:29,  1.50it/s, loss=inf, ppl=inf, wps=108973, ups=1.48, wpb=73714.1, bsz=24, num_updates=100, lr=0.00
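
For reference, the shrinking loss scale in the log is fairseq's dynamic loss scaling: on every detected overflow the scale is halved, and training aborts with a "minimum loss scale reached" error if it keeps falling below --min-loss-scale. These are the related flags that could be appended to the command above (the values below are only illustrative, not the ones used in this run):

    --fp16-init-scale 128 \
    --fp16-scale-window 256 \
    --fp16-scale-tolerance 0.0 \
    --min-loss-scale 1e-4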

What's your environment?

freddy5566 commented 2 years ago

Hi @ghrua, have you found a solution? I am facing the same problem...

ghrua commented 2 years ago

Hi @jamfly

Yes, I think there are actually two solutions:

  1. Detect the overflow operation step-by-step and address it.
  2. Pretrain the ADP model for 3 epochs using FP32, and then reload the parameters when training with FP16. Please reset the optimizer when you load the FP32 model under the FP16 setting.

The second one is somewhat hacky... but it works for me.
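
A minimal sketch of option 2 using fairseq's standard checkpoint flags (the FP32 checkpoint path and the new save directory are placeholders, and all remaining hyper-parameters are the same as in the original command):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py \
    --task language_modeling \
    $DEST/data-bin/wikitext-103 \
    --save-dir $DEST/results/fp16_run \
    --restore-file $DEST/results/fp32_pretrain/checkpoint3.pt \
    --reset-optimizer --reset-lr-scheduler --reset-meters --reset-dataloader \
    --arch transformer_lm_wiki103 \
    --fp16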

freddy5566 commented 2 years ago

Hi @ghrua, thank you for your suggestions; I will at least try the second one. I have two questions about the training:

  1. Have you tried using fp16 from scratch? Does the inf loss eventually return to a normal scale?
  2. I noticed that you set --update-freq to 3, but according to the paper they use tokens-per-sample 4096 with 8 GPUs. I know they said they switched to 3072 for better performance, but should update-freq always be set to 3?

Thank you in advance.

ghrua commented 2 years ago

Yes, I have tried FP16 from scratch with many hyper-parameter settings, e.g. different values for warmup updates and clip norm, but none of them worked for me.

In the Batching section, the authors say "This gives an effective batch size of 65K tokens for WIKITEXT-103.", and 65,000 / 8 / 3072 is around 2.6. I think that's why they set update-freq to 3.


freddy5566 commented 2 years ago

Hi @ghrua, I got it: they are using 4096 (tokens-per-sample) × 8 (GPUs) × 2 (update-freq) ≈ 65K tokens. Anyway, thank you for your kind help and suggestions, I really appreciate it.

Psycoy commented 2 years ago

Can you replicate the results in the paper? I ran the same recipe as yours and got a test ppl of 29.14, but the result reported in the paper is 18.7.

freddy5566 commented 2 years ago

Hi @Psycoy, sorry for the late reply. Mine was 19.7; it is close to 18.7, but there is still a gap. Can you reproduce their results?

Psycoy commented 2 years ago

Yes, I can, as long as the update frequency is set correctly according to the number of GPUs and the batch size.

freddy5566 commented 2 years ago

Which script did you use to evaluate your model?
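
For reference, the wikitext-103 language model example in the fairseq docs evaluates checkpoints with fairseq-eval-lm, roughly as follows (the paths are placeholders):

fairseq-eval-lm $DEST/data-bin/wikitext-103 \
    --path $DEST/results/fp16_run/checkpoint_best.pt \
    --batch-size 2 \
    --tokens-per-sample 512 \
    --context-window 400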