facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Gradient overflow detected problem (pretraining new model) #3684

Open sanakhamekhem opened 3 years ago

sanakhamekhem commented 3 years ago

I'm using fairseq to pretrain a wav2vec self-supervised model on 11000 samples using one GPU (CUDA 8.0). I get a 'gradient overflow detected' warning and the loss is around 3.7. I would be grateful if you could tell me whether this is normal and whether my model is learning well. Thank you in advance.

Learning rate = 0.00005, batch size = 8

I'm running the following command:

CUDA_VISIBLE_DEVICES=0 python train.py --distributed-world-size 1 --max-epoch 500 --batch-size 8 --distributed-port -1 "${manifest_path}" \
  --save-dir $dirPretrained --fp16 --fp16-scale-tolerance 0.25 --num-workers 4 --task audio_pretraining --criterion wav2vec --arch wav2vec2 \
  --log-keys '["prob_perplexity","code_perplexity","temp"]' --quantize-targets --extractor-mode default \
  --conv-feature-layers '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2' --final-dim 256 --latent-vars 320 \
  --latent-groups 2 --latent-temp '(2,0.5,0.999995)' --infonce --optimizer adam \
  --adam-betas '(0.9,0.98)' --adam-eps 1e-06 --lr-scheduler polynomial_decay --total-num-update 400000 \
  --lr 0.00005 --warmup-updates 32000 --mask-length 10 --mask-prob 0.65 --mask-selection static --mask-other 0 \
  --encoder-layerdrop 0.05 --dropout-input 0.1 --dropout-features 0.1 --feature-grad-mult 0.1 \
  --loss-weights '[0.1, 10]' --conv-pos 128 --conv-pos-groups 16 --num-negatives 100 --cross-sample-negatives 0 \
  --max-sample-size 250000 --min-sample-size 32 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
  --max-tokens 1400000 --max-update 400000 --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d
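For reference, the --conv-feature-layers value is a Python-style expression of (dim, kernel, stride) tuples describing the convolutional feature extractor. A quick sketch in plain Python (independent of fairseq internals) shows the stack it describes and the overall stride:

```python
# The --conv-feature-layers string is a Python-style expression of
# (dim, kernel, stride) tuples for the wav2vec 2.0 feature extractor.
spec = "[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2)] * 2"
layers = eval(spec)  # safe here: the expression is a literal we wrote ourselves

print(len(layers))  # 7 convolutional layers

# Overall downsampling: the product of the per-layer strides.
total_stride = 1
for _dim, _kernel, stride in layers:
    total_stride *= stride
print(total_stride)  # 320 -> one feature frame per 320 raw samples (20 ms at 16 kHz)
```

This makes it easy to sanity-check edits to the layer spec before launching a long pretraining run.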

The output of the training stage is :

epoch 079:   5%| | 67/1403 [00:44<13:57, 1.59it/s, loss=3.697, ntokens=1466.24, nsentences=7.95, prob_perplexity=36.109, code_perplexity=36.078, temp=1.18, loss_0=3.54]
2021-07-03 17:45:58 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 128
epoch 079:   5%| | 70/1403 [00:45<13:28, 1.65it/s, loss=3.697, ntokens=1466.24, nsentences=7.95, prob_perplexity=36.109, code_perplexity=36.078, temp=1.18, loss_0=3.54]
2021-07-03 17:46:00 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 128
...
epoch 079:   6%| | 86/1403 [00:55<13:15, 1.66it/s, loss=3.697, ntokens=1466.24, nsentences=7.95, prob_perplexity=36.109, code_perplexity=36.078, temp=1.18, loss_0=3.54]
2021-07-03 17:46:10 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 128

(The same "gradient overflow detected" message repeats for almost every update, always at loss scale 128.)
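For context, the repeated NOTE line is dynamic loss scaling at work: under --fp16, when any gradient comes back inf/NaN, the optimizer step is skipped and the loss scale is reduced. Occasional overflows are normal; overflows on nearly every step, as in the log above, mean almost no parameter updates are being applied. A minimal sketch of the mechanism (an illustration with made-up defaults, not fairseq's actual FP16 trainer code):

```python
import math

class DynamicLossScaler:
    """Toy dynamic loss scaler: skip steps on overflow, shrink the scale,
    and cautiously grow it again after a run of clean steps."""

    def __init__(self, init_scale=128.0, scale_factor=2.0, scale_window=2000):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self._good_steps = 0

    def step(self, grads):
        """Return True if the update was applied, False if it was skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            # Overflow: drop this update and halve the scale.
            self.scale = max(self.scale / self.scale_factor, 1.0)
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps % self.scale_window == 0:
            self.scale *= self.scale_factor  # try a larger scale again
        return True
```

When almost every call returns False, as the log suggests, the scale keeps collapsing and training effectively stalls, which is why reducing the learning rate, raising --fp16-scale-tolerance, or dropping --fp16 are the usual remedies.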

EricLina commented 2 years ago

Not using --fp16 may help.
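A quick illustration (generic NumPy, nothing fairseq-specific) of why fp16 overflows so easily: float16 saturates just above 6.5e4, so any loss, activation, or scaled gradient beyond that becomes inf and triggers the overflow skip:

```python
import numpy as np

# float16 has a tiny dynamic range compared to float32.
print(np.finfo(np.float16).max)   # 65504.0
print(np.finfo(np.float32).max)   # ~3.4e38

g = np.float16(70000.0)  # representable in fp32, but past the fp16 limit
print(np.isinf(g))       # True: this is the "gradient overflow" case
```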

Macsim2 commented 1 year ago

> Not use --fp16 may help

@EricLina That could also be an answer, but training in fp32 precision needs about twice as much memory and is much slower. Do you have any other advice?