LiyuanLucasLiu / Transformer-Clinic

Understanding the Difficulty of Training Transformers
https://arxiv.org/abs/2004.08249
Apache License 2.0

wmt_en_de admin: Function 'SoftmaxBackward' returned nan values in its 0th output. #14

Closed sshleifer closed 3 years ago

sshleifer commented 3 years ago

I was wondering if you ever encountered NaN gradients during Admin training. I'm on torch 1.6 / CUDA 10.1 with no modifications to the code:

Command

export dd=data-bin/wmt14_en_de_joined_dict
GPUS=0,1,2,3
GPUID=1
TOKEN_NUMBER=8192
UPDATE_FREQUENCE=1
for lnum in 18
do
  CUDA_VISIBLE_DEVICES=$GPUID fairseq-train \
    $dd -s en -t de \
    --arch transformer_wmt_en_de --share-all-embeddings \
    --optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --max-update 500000 \
    --warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09  \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1 \
    --max-tokens $TOKEN_NUMBER --update-freq $UPDATE_FREQUENCE \
    --save-dir wmt14ende/wmt-admin-${lnum}l --restore-file x.pt --seed 1111 \
    --user-dir ../radam_fairseq --log-format simple --log-interval 500 \
    --init-type adaptive-profiling --fp16 --fp16-scale-window 256 \
    --encoder-layers $lnum --decoder-layers $lnum \
    --threshold-loss-scale 0.03125 

  CUDA_VISIBLE_DEVICES=$GPUS fairseq-train \
    $dd -s en -t de \
    --arch transformer_wmt_en_de --share-all-embeddings \
    --optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --max-update 500000 \
    --warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09  \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1 \
    --max-tokens $TOKEN_NUMBER --update-freq $UPDATE_FREQUENCE \
    --save-dir wmt14ende/wmt-admin-${lnum}l --restore-file x.pt --seed 1111 \
    --user-dir ../radam_fairseq --log-format simple --log-interval 500 \
    --init-type adaptive --fp16 --fp16-scale-window 256 \
    --encoder-layers $lnum --decoder-layers $lnum \
    --threshold-loss-scale 0.03125 | tee ./wmt14ende/log/loss_admin-${lnum}l.log

  bash eval_wmt_en-de.sh wmt14ende/wmt-admin-${lnum}l $GPUID 
done

The profiling command works fine, but the second command raises:

Traceback

| WARNING: overflow detected, setting loss scale to: 32.0
| epoch 002 | loss 4.937 | nll_loss 3.371 | ppl 10.34 | wps 24011 | ups 1 | wpb 28913.466 | bsz 942.984 | num_updates 9352 | lr 0.000924896 | gnorm 0.368 | clip 0.000 | oom 0.000 | loss_scale 32.000 | wall 228 | train_wall 226
Traceback (most recent call last):
  File "/private/home/sshleifer/.conda/envs/clinic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq_cli/train.py", line 307, in distributed_main
    main(args, init_distributed=True)
  File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq_cli/train.py", line 90, in main
    train(args, trainer, task, epoch_itr)
  File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq_cli/train.py", line 139, in train
    log_output = trainer.train_step(samples)
  File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq/trainer.py", line 349, in train_step
    raise e
  File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq/trainer.py", line 311, in train_step
    loss, sample_size, logging_output = self.task.train_step(
  File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq/tasks/fairseq_task.py", line 264, in train_step
    optimizer.backward(loss)
  File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq/optim/fp16_optimizer.py", line 103, in backward
    loss.backward()
  File "/private/home/sshleifer/.conda/envs/clinic/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/private/home/sshleifer/.conda/envs/clinic/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function 'SoftmaxBackward' returned nan values in its 0th output.

contents of profile_ratio.init: https://gist.github.com/sshleifer/b615558499b9b10bd5bee8ddf2db030a

Data directory:

[screenshot of the data directory contents]

LiyuanLucasLiu commented 3 years ago

This is weird, I haven't encountered this problem before... I will try to do some debugging on this next week. BTW, I'm wondering whether you can successfully train other Admin models in your environment (e.g., the IWSLT models). Also, I believe the log is saved at ./wmt14ende/log/loss_admin-${lnum}l.log; it would be helpful if you could share it here.
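For example, something like the following should show the last logged updates before the crash (a minimal sketch; the path is just the tee target from your loop, here assuming lnum=18):

# Hypothetical filename, taken from the tee target in the command above with lnum=18; adjust as needed.
tail -n 100 ./wmt14ende/log/loss_admin-18l.log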

sshleifer commented 3 years ago

I've trained four models now (all with adaptive):

wmt14ende/wmt-admin-6l:

wmt14ende/wmt-admin-12l:

wmt14ende/wmt-admin-18l:

iwsltende/admin-6l:

Full Logs for all four runs are here.

Your table for reference: [screenshot of the results table]

LiyuanLucasLiu commented 3 years ago

This is weird: all Admin models on wmt14ende failed in your setting. I compared your logs with mine, and the development ppls are almost the same. It seems the training just shut down for some unknown reason...

One random guess is half-precision training (since it should detect NaN gradients and adjust the scaling accordingly). Maybe you can load the last checkpoint and see whether full-precision training avoids the error?

Another random guess is distributed training. Maybe you can load the last checkpoint and see whether a one-GPU setting avoids the error (with UPDATE_FREQUENCE=4)?

Also, it could be caused by OOM. Maybe you can load the last checkpoint and see whether halving the batch size and setting UPDATE_FREQUENCE=2 can avoid the error?
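To make these three checks concrete, a rough sketch of the resume run might look like the following (not a verified command; it reuses the flags from your 'adaptive' run above and assumes the usual fairseq behavior that --restore-file is looked up inside --save-dir):

# Variant (a): full precision; same run resumed from the last checkpoint, with the
#              fp16 flags (--fp16 --fp16-scale-window --threshold-loss-scale) dropped.
# Variant (b): single GPU; use CUDA_VISIBLE_DEVICES=$GPUID and --update-freq 4.
# Variant (c): possible OOM; halve the batch with --max-tokens 4096 --update-freq 2.
CUDA_VISIBLE_DEVICES=$GPUS fairseq-train \
  $dd -s en -t de \
  --arch transformer_wmt_en_de --share-all-embeddings \
  --optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --max-update 500000 \
  --warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1 \
  --max-tokens $TOKEN_NUMBER --update-freq $UPDATE_FREQUENCE \
  --save-dir wmt14ende/wmt-admin-18l --restore-file checkpoint_last.pt --seed 1111 \
  --user-dir ../radam_fairseq --log-format simple --log-interval 500 \
  --init-type adaptive \
  --encoder-layers 18 --decoder-layers 18      # variant (a) shown: fp16 flags removed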

(sorry, I'm fully occupied with job applications this week and cannot run experiments with torch 1.6 / CUDA 10.1 right now...)

LiyuanLucasLiu commented 3 years ago

@sshleifer any updates?

I did some preliminary experiments with torch 1.6 and CUDA 10.1 (the job applications took much more time than I expected), and I cannot reproduce the NaN-values error you met (at least within the first 5 epochs of the 12-12 model).

The devices I'm using are two Quadro RTX 8000s (maximum memory utilization is ~25 GB on each). For better reproducibility, I used the Docker image pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel.

All the commands I used are listed below:

docker run -it --gpus all -v /data1/ll2/data:/ll2/data-bin -v /data1/ll2/fairseq:/ll2/fairseq -v /data1/ll2/radam_fairseq:/ll2/radam_fairseq -v /data1/ll2/cps:/ll2/cps --privileged=true --name tmp  pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel bash

cd /ll2/fairseq
pip install --editable .

cd /ll2/
export MKL_THREADING_LAYER=GNU

CUDA_VISIBLE_DEVICES=3 fairseq-train \
./data-bin/wmt14_en_de_joined_dict/ -s en -t de \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --max-update 500000 \
--warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09  \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1 \
--max-tokens 8192 --update-freq 1 \
--save-dir cps/wmt-admin-12-12 --restore-file x.pt --seed 1111 \
--user-dir ./radam_fairseq --log-format simple --log-interval 500 \
--init-type adaptive-profiling --fp16 --fp16-scale-window 256 \
--encoder-layers 12 --decoder-layers 12 \
--threshold-loss-scale 0.03125 

CUDA_VISIBLE_DEVICES=4,3 fairseq-train \
./data-bin/wmt14_en_de_joined_dict/ -s en -t de \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --max-update 500000 \
--warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09  \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1 \
--max-tokens 8192 --update-freq 2 \
--save-dir cps/wmt-admin-12-12 --restore-file x.pt --seed 1111 \
--user-dir ./radam_fairseq --log-format simple --log-interval 500 \
--init-type adaptive --fp16 --fp16-scale-window 256 \
--encoder-layers 12 --decoder-layers 12 \
--threshold-loss-scale 0.03125 | tee loss_admin-12-12.log
sshleifer commented 3 years ago

Thanks! I don't have access to Docker, but by changing to Python 3.6 I managed to avoid the NaN gradients. With fp16 my loss scale still gets very high, but nothing crashes. Do you recall the BLEU / valid loss scores for the various checkpoint_best.pt models without ensembling?

So far I haven't gotten Admin init to outperform Pre-LN, but I might still have an environment issue. Do these training statistics/runtimes look reasonable?

[screenshot of training statistics]

Training curves: [screenshot]

LiyuanLucasLiu commented 3 years ago

Glad you fixed the NaN gradient issue! (What was your previous Python version? I'm wondering whether you would suggest adding a Python version requirement to the README.)

Based on my understanding of fp16, a larger scale is better than a smaller one (overflow can be detected, but underflow cannot); that said, too large a scale may also cause problems...
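For what it's worth, the scale's trajectory can be eyeballed directly from the training log (a small sketch, assuming the 'simple' log format used above, where lines contain a "loss_scale" field; adjust the log path to your current run):

# Pull the logged loss-scale values and show the most recent ones.
grep -o 'loss_scale [0-9.]*' ./wmt14ende/log/loss_admin-18l.log | tail -n 20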

As for the performance, it is weird... I haven't encountered this problem; Pre-LN should be an easy baseline to beat (it is more stable, but at the cost of performance).

As a reference, I just uploaded my log to link (the log for 12l was interrupted several times to switch to faster/idle GPUs; small lab, I have to do this manually).

I can't find checkpoint_best.pt, but I did find checkpoints 91-100 of Admin-6l. Their BLEU scores are as below:

checkpoint    BLEU
91            27.79
92            27.72
93            27.85
94            27.82
95            27.90
96            27.59
97            27.63
98            27.67
99            27.77
100           27.77
avg@91-100    28.08
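The avg@91-100 row is the usual checkpoint averaging. As a rough sketch, assuming the stock fairseq scripts/average_checkpoints.py is present in this fork and that the epoch checkpoints live in a single directory (both assumptions on my side), it looks like:

# Hypothetical path; point CKPT_DIR at wherever checkpoint91.pt ... checkpoint100.pt live.
CKPT_DIR=cps/wmt-admin-6l
python scripts/average_checkpoints.py \
  --inputs $CKPT_DIR \
  --num-epoch-checkpoints 10 --checkpoint-upper-bound 100 \
  --output $CKPT_DIR/checkpoint_avg91-100.pt

That matches the small but consistent gain of the avg@91-100 row over the individual checkpoints.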
sshleifer commented 3 years ago

I would add Python 3.6 and torch 1.5 or torch 1.6 to the README. I think with those versions, plus some guidance that training takes a really long time, it will make sense. The logs were really helpful; I think I am getting similar results now.
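For anyone landing here later, a minimal environment along those lines might look like this (a sketch based on the versions suggested above; the env name and CUDA wheel choice are up to you):

# Python 3.6 + torch 1.6 environment (torch 1.5 should also work per the discussion above).
conda create -n clinic python=3.6 -y
conda activate clinic
pip install torch==1.6.0        # pick the wheel matching your CUDA version (e.g. 10.1)
cd Transformer-Clinic/fairseq
pip install --editable .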

LiyuanLucasLiu commented 3 years ago

Thanks for the suggestion, and I've edited the readme accordingly.