Can you post a full repro and logs from your run? Do you use the default dataset?
@jbaczek Yes, sure! I have checked the code and found that the problem occurs in line 392 of train.py: sacrebleu_score = sacrebleu.corpus_bleu(predictions, refs, lowercase=args.ignore_case). I found that len(predictions) = 2799 but len(refs) = 1. This is why the error happened. Do you know how I can fix it? Thanks!
Yes, I am using the default dataset, WMT2014 and the default pre-processing code.
nohup python -m torch.distributed.launch --nproc_per_node 4 /workspace/examples/transformer/train.py /workspace/data-bin/wmt14_en_de_joined_dict \
  --arch transformer_wmt_en_de_big_t2t \
  --share-all-embeddings \
  --optimizer adam \
  --adam-betas '(0.9, 0.997)' \
  --adam-eps "1e-9" \
  --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt \
  --warmup-init-lr 0.0 \
  --warmup-updates 4000 \
  --lr 0.0006 \
  --min-lr 0.0 \
  --dropout 0.1 \
  --weight-decay 0.0 \
  --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 \
  --max-tokens 5120 \
  --seed 1 \
  --target-bleu 28.3 \
  --ignore-case \
  --fp16 \
  --save-dir /workspace/checkpoints \
  --distributed-init-method env:// &
I have printed predictions and refs out and found that predictions is a list (len 2997) with each element being a sentence, whereas refs[0] is a list with 3003 sentences. So their lengths do not match.
refs should be a list with one element. That is how sacrebleu handles arguments. I ran this code on DGX-1 16G and everything seems fine (I didn't use nohup though). What platform do you use? Have you tried to run training without nohup?
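For reference, a minimal sketch of how I'd expect the call to look (toy sentences, not the actual WMT14 data):

```python
import sacrebleu

# Toy example data, not from the dataset.
predictions = ["the cat sat on the mat", "hello world"]
references = ["the cat sat on the mat", "hello there world"]

# corpus_bleu takes the system outputs plus a list of reference *streams*,
# so a single reference set is wrapped in an outer list: [references].
# Every reference stream must have the same number of lines as predictions.
bleu = sacrebleu.corpus_bleu(predictions, [references], lowercase=True)
print(bleu.score)

# If the lengths differ (e.g. 2799 predictions vs. 3003 references),
# sacrebleu raises: EOFError: Source and reference streams have different lengths!
```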
I am using a DGX server with 8 V100s (using 4 of them), Ubuntu 16.04 and CUDA driver 384.111. I will think more about it, but do you know why predictions has 2799 sentences while refs has 3003? Will it cause problems if the numbers do not match? On your machine, do predictions and refs have the same number of sentences?
BTW, I just pulled the image nvcr.io/nvidia/pytorch:19.05-py3, built a container directly from the image and used the code in /workspace/examples/transformer. Is the code the same as the one on GitHub, https://github.com/NVIDIA/DeepLearningExamples.git ?
It is a known issue that on configurations other than 8xV100 this part of the code can misbehave due to memory limitations (this will be addressed in the next release). But this error is new to me; it doesn't appear on my machine. Try running training on the whole DGX. Yes, the code inside the 19.05 container is the same as the one on GitHub, but if you use the code from the container you still have to install all dependencies.
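If the mismatch comes from each worker scoring only its own shard of the test set (which would explain 2799 < 3003; this is just a guess on my part), one possible workaround, assuming a recent PyTorch with all_gather_object, is to collect predictions from all ranks before calling sacrebleu:

```python
import torch.distributed as dist

def gather_predictions(local_predictions):
    """Collect per-rank prediction lists onto every rank before scoring.

    Assumes torch.distributed is initialized and PyTorch >= 1.8
    (all_gather_object); older containers would need a tensor-based gather.
    """
    if not dist.is_available() or not dist.is_initialized():
        return local_predictions
    shards = [None] * dist.get_world_size()
    dist.all_gather_object(shards, local_predictions)
    # Flatten the rank-ordered shards into one list of sentences.
    return [sentence for shard in shards for sentence in shard]
```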
Thanks for your suggestions! I have four more small questions. It would be very helpful if you could answer them:
--max-tokens is the option to set the batch size. If you don't use the --fp16 option then computation is performed in regular 32-bit floating point format. When set, --fp16 enables mixed precision training, meaning that nearly all computation is performed in half precision and only numerically vulnerable operations are computed in regular precision. For more info see the NVIDIA guidelines linked in the readme. To keep the effective batch size constant when you scale the per-GPU batch down, use the --update-freq option with a value equal to the reciprocal of the scaling factor. Training in fp32 mode takes nearly twice as much memory, so you need to divide the batch size by 2 and use --update-freq 2 to simulate the same batch size. You also need to scale the number of warmup updates by the same amount. For example, if you want to train on 4 GPUs in fp16, use --update-freq 2 --warmup-updates 8000. Also, if you encounter problems with evaluation, you can disable online evaluation and test the model after training with the generate.py script.
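To make the bookkeeping concrete, a quick back-of-the-envelope sketch (my own numbers, assuming effective batch = GPUs x max-tokens x update-freq):

```python
# Tokens seen per optimizer update: n_gpus * max_tokens * update_freq
reference = 8 * 5120 * 1   # 8 GPUs, fp16, --max-tokens 5120              -> 40960
fp32_8gpu = 8 * 2560 * 2   # fp32: halve --max-tokens, use --update-freq 2 -> 40960
fp16_4gpu = 4 * 5120 * 2   # 4 GPUs, fp16, --update-freq 2                 -> 40960
assert reference == fp32_8gpu == fp16_4gpu
# Warmup is scaled by the same factor, e.g. --warmup-updates 4000 -> 8000.
```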
Thanks a lot! Now I understand fp32 and fp16. For the batch size, I think it is 5120 tokens per GPU per step, so the more GPUs, the larger the batch size. When using fp32 on 8 GPUs we need to set --update-freq 2 because fp32 takes double the memory. However, when using fp16 on 4 GPUs the number of steps per epoch doubles, but do we still need to set --update-freq 2, given that each GPU still takes up to 5120 tokens per step and I think the GPUs may not need to split their batch?
If you use 4 GPUs, the global batch size is 4x5120, which means it is half the size of the original one. --update-freq 2 virtually doubles it.
I was training the Transformer model when an error occurred. The training process for the 1st epoch went very well, but the validation raised an error: "EOFError: Source and reference streams have different lengths!". By the way, I ran "sacrebleu -t wmt14/full -l de-en --echo src > $DATASET_DIR/sacrebleu_reference.de" to generate the reference. Does anyone know how to fix it?
| epoch 001 | valid on 'valid' subset | valid_loss 4.55658 | valid_nll_loss 2.8718 | valid_ppl 7.32 | num_updates 7867
| /workspace/data-bin/wmt14_en_de_joined_dict test 3003 examples
| Sentences are being padded to multiples of: 1
generated batches in 0.0007243156433105469 s
Traceback (most recent call last):
  File "/workspace/examples/transformer/train.py", line 525, in <module>
    distributed_main(args)
  File "/workspace/examples/transformer/distributed_train.py", line 57, in main
    single_process_main(args)
  File "/workspace/examples/transformer/train.py", line 128, in main
    current_bleu, current_sc_bleu = score(args, trainer, task, epoch_itr, args.gen_subset)
  File "/workspace/examples/transformer/train.py", line 392, in score
    sacrebleu_score = sacrebleu.corpus_bleu(predictions, refs, lowercase=args.ignore_case)
  File "/opt/conda/lib/python3.6/site-packages/sacrebleu.py", line 1031, in corpus_bleu
    raise EOFError("Source and reference streams have different lengths!")
EOFError: Source and reference streams have different lengths!