This is weird, I haven't encountered this problem before...
I will try to do some debugging on this issue in the next week.
BTW, I'm wondering whether you can successfully train other Admin models in your environment (e.g., IWSLT models).
Also, I believe the log is saved at /wmt14ende/log/loss_admin-${lnum}l.log; it would be helpful if you could share it here.
I've trained four models now (all with `adaptive`):

`wmt14ende/wmt-admin-6l`:
| epoch 008: 3000 / 4691 loss=4.513, nll_loss=2.896, ppl=7.44
`checkpoint_best.pt`: BLEU 25.24

`wmt14ende/wmt-admin-12l`:
epoch 002 | loss 5.005 | nll_loss 3.444 | ppl 10.88

`wmt14ende/wmt-admin-18l`:
epoch 002 | loss 4.937 | nll_loss 3.371 | ppl 10.34

`iwsltende/admin-6l`:
`checkpoint_best.pt`: BLEU4 = 35.12

Full logs for all four runs are here.
Your table for reference:
This is weird; all the Admin models on wmt14ende failed in your setting. I compared your logs and my logs, and their development ppls are almost the same. It seems the training just shut down for some unknown reason...
One random guess is half-precision training (since it should detect NaN gradients and adjust the scaling accordingly). Maybe you can load the last checkpoint and see whether full-precision training avoids the error? (A resume sketch follows these suggestions.)
Another random guess is distributed training. Maybe you can load the last checkpoint and see whether a one-GPU setting avoids the error (with UPDATE_FREQUENCE=4)?
Also, it could be caused by OOM. Maybe you can load the last checkpoint and see whether halving the batch size and setting UPDATE_FREQUENCE=2 avoids the error?
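For the first guess, a minimal sketch of what a full-precision resume might look like (the flags mirror the wmt-admin commands later in this thread; the GPU index, the `--update-freq 4` value, and the paths are assumptions):

# Sketch: resume from the last checkpoint without --fp16 (full precision), one GPU.
CUDA_VISIBLE_DEVICES=0 fairseq-train \
./data-bin/wmt14_en_de_joined_dict/ -s en -t de \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --max-update 500000 \
--warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1 \
--max-tokens 8192 --update-freq 4 \
--save-dir cps/wmt-admin-12-12 --restore-file checkpoint_last.pt \
--user-dir ./radam_fairseq --init-type adaptive \
--encoder-layers 12 --decoder-layers 12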
(Sorry, I'm fully occupied with job applications this week and cannot run experiments with the torch 1.6 / CUDA 10.1 versions...)
@sshleifer any updates?
I did some preliminary experiments with 1.6 and 10.1 (the job applications took much more time than I expected), and I cannot reproduce the NaN-values error you met (at least within the first 5 epochs of the 12-12 model).
The device I'm using is Quadro RTX 8000 x 2 (maximum memory utilization is ~25 GB on each). For better reproducibility, I used the docker image `pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel`.
All the commands I used are below:
docker run -it --gpus all -v /data1/ll2/data:/ll2/data-bin -v /data1/ll2/fairseq:/ll2/fairseq -v /data1/ll2/radam_fairseq:/ll2/radam_fairseq -v /data1/ll2/cps:/ll2/cps --privileged=true --name tmp pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel bash
cd /ll2/fairseq
pip install --editable .
cd /ll2/
export MKL_THREADING_LAYER=GNU
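# Run 1 (single GPU): adaptive-profiling init. This pass writes profile_ratio.init
# (the file referenced at the end of this thread); --restore-file points at a
# nonexistent x.pt, presumably so training starts from scratch.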
CUDA_VISIBLE_DEVICES=3 fairseq-train \
./data-bin/wmt14_en_de_joined_dict/ -s en -t de \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --max-update 500000 \
--warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1 \
--max-tokens 8192 --update-freq 1 \
--save-dir cps/wmt-admin-12-12 --restore-file x.pt --seed 1111 \
--user-dir ./radam_fairseq --log-format simple --log-interval 500 \
--init-type adaptive-profiling --fp16 --fp16-scale-window 256 \
--encoder-layers 12 --decoder-layers 12 \
--threshold-loss-scale 0.03125
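# Run 2 (two GPUs): the actual training run with --init-type adaptive, which
# reads the profiled ratios, logging to loss_admin-12-12.log.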
CUDA_VISIBLE_DEVICES=4,3 fairseq-train \
./data-bin/wmt14_en_de_joined_dict/ -s en -t de \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --max-update 500000 \
--warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1 \
--max-tokens 8192 --update-freq 2 \
--save-dir cps/wmt-admin-12-12 --restore-file x.pt --seed 1111 \
--user-dir ./radam_fairseq --log-format simple --log-interval 500 \
--init-type adaptive --fp16 --fp16-scale-window 256 \
--encoder-layers 12 --decoder-layers 12 \
--threshold-loss-scale 0.03125 | tee loss_admin-12-12.log
Thanks! I don't have access to Docker, but by changing to Python 3.6 I managed to avoid NaN gradients. With fp16 my loss scale still gets very high, but nothing crashes.
Do you recall the BLEU/valid loss scores for the various `checkpoint_best.pt` models without ensembling?
So far I haven't gotten Admin init to outperform Pre-LN, but I might still have an environment issue. Do these training statistics/runtimes look reasonable?
Training Curves:
Glad you fixed the NaN gradient issue! (What was your previous Python version? I'm wondering whether I should add a Python version requirement to the README.)
Based on my understanding of fp16, a larger scale is better than a smaller one (overflow can be detected, but underflow cannot); still, too large a scale may also cause problems...
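(For reference, a sketch of the standard fairseq flags that govern the dynamic loss scale; the values below are illustrative, and `...` stands for the data/model arguments used elsewhere in this thread:)

fairseq-train ... --fp16 \
--fp16-init-scale 128 \
--fp16-scale-window 256 \
--fp16-scale-tolerance 0.25 \
--min-loss-scale 0.0001
# --fp16-init-scale: starting loss scale (fairseq default 128)
# --fp16-scale-window: raise the scale after this many overflow-free updates
# --fp16-scale-tolerance: fraction of overflowing updates tolerated before lowering the scale
# --min-loss-scale: fairseq stops training if the scale ever falls below this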
As to the performance, it is weird... I haven't encountered this problem; Pre-LN should be an easy baseline to beat (it is more stable, at the cost of performance).
As a reference, I just uploaded my log to link (the log for 12l was interrupted several times to switch to faster/idle GPUs; small lab, I have to do this manually).
I can't find `checkpoint_best.pt`, but I found checkpoints 91-100 of Admin-6l. Their BLEU scores are below:
| checkpoint | BLEU |
|---|---|
| 91 | 27.79 |
| 92 | 27.72 |
| 93 | 27.85 |
| 94 | 27.82 |
| 95 | 27.90 |
| 96 | 27.59 |
| 97 | 27.63 |
| 98 | 27.67 |
| 99 | 27.77 |
| 100 | 27.77 |
| avg@91-100 | 28.08 |
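(The avg@91-100 row was presumably produced with fairseq's checkpoint-averaging script; a sketch with illustrative paths:)

# Average epoch checkpoints 91-100 into a single model; the script ships with fairseq.
python scripts/average_checkpoints.py \
--inputs cps/admin-6l \
--num-epoch-checkpoints 10 \
--checkpoint-upper-bound 100 \
--output cps/admin-6l/checkpoint_avg91-100.pt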
I would add `python 3.6, torch 1.5 or torch 1.6` to the README.
I think with those versions, plus some guidance that training takes a really long time, it will make sense. The logs were really helpful; I think I am getting similar results now.
Thanks for the suggestion; I've edited the README accordingly.
I was wondering if you ever encountered NaN gradients during Admin training. I'm on torch 1.6/CUDA 10.1 with no modifications to the code:
Command
The profiling command works fine, but the second command raises:
Traceback
contents of profile_ratio.init: https://gist.github.com/sshleifer/b615558499b9b10bd5bee8ddf2db030a
Data directory: