Can you post a full repro and logs from your run? Do you use the default dataset?
@jbaczek Yes, sure! I have checked the code and found that the problem occurs in line 392 of train.py: sacrebleu_score = sacrebleu.corpus_bleu(predictions, refs, lowercase=args.ignore_case). I found that len(predictions) = 2799 but len(refs) = 1. This is why the error happened. Do you know how I can fix it? Thanks!
Yes, I am using the default dataset, WMT2014 and the default pre-processing code.
nohup python -m torch.distributed.launch --nproc_per_node 4 /workspace/examples/transformer/train.py /workspace/data-bin/wmt14_en_de_joined_dict \
  --arch transformer_wmt_en_de_big_t2t \
  --share-all-embeddings \
  --optimizer adam \
  --adam-betas '(0.9, 0.997)' \
  --adam-eps "1e-9" \
  --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt \
  --warmup-init-lr 0.0 \
  --warmup-updates 4000 \
  --lr 0.0006 \
  --min-lr 0.0 \
  --dropout 0.1 \
  --weight-decay 0.0 \
  --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 \
  --max-tokens 5120 \
  --seed 1 \
  --target-bleu 28.3 \
  --ignore-case \
  --fp16 \
  --save-dir /workspace/checkpoints \
  --distributed-init-method env:// &
I have printed predictions and refs out and found that predictions is a list (len 2997) with each element being a sentence, whereas refs[0] is a list with 3003 sentences. So their lengths do not match.
refs should be a list with one element. That is how sacrebleu handles arguments. I ran this code on DGX-1 16G and everything seems fine (I didn't use nohup though). What platform do you use? Have you tried to run training without nohup?
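For reference, a minimal sketch of how I'd expect the call to look (toy sentences, not the actual WMT14 data):

```python
import sacrebleu

# Toy example data, not from the dataset.
predictions = ["the cat sat on the mat", "hello world"]
references = ["the cat sat on the mat", "hello there world"]

# corpus_bleu takes the system outputs plus a list of reference *streams*,
# so a single reference set is wrapped in an outer list: [references].
# Every reference stream must have the same number of lines as predictions.
bleu = sacrebleu.corpus_bleu(predictions, [references], lowercase=True)
print(bleu.score)

# If the lengths differ (e.g. 2799 predictions vs. 3003 references),
# sacrebleu raises: EOFError: Source and reference streams have different lengths!
```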
I am using a DGX server with 8 V100s (using 4 of them), Ubuntu 16.04 and CUDA driver 384.111. I will think more about it, but do you know why predictions has 2799 sentences while refs has 3003? Will it cause problems if the numbers do not match? On your machine, do predictions and refs have the same number of sentences?
BTW, I just pulled the image nvcr.io/nvidia/pytorch:19.05-py3, built a container directly from the image and used the code in /workspace/examples/transformer. Is the code the same as the one on GitHub, https://github.com/NVIDIA/DeepLearningExamples.git ?
It is a known issue that on configurations other than 8xV100 this part of the code can misbehave due to memory limitations (this will be addressed in the next release). But this error is new to me; it doesn't appear on my machine. Try running training on the whole DGX. Yes, the code inside the 19.05 container is the same as the one on GitHub, but if you use the code from the container you still have to install all dependencies.
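If the mismatch comes from each worker scoring only its own shard of the test set (which would explain 2799 < 3003; this is just a guess on my part), one possible workaround, assuming a recent PyTorch with all_gather_object, is to collect predictions from all ranks before calling sacrebleu:

```python
import torch.distributed as dist

def gather_predictions(local_predictions):
    """Collect per-rank prediction lists onto every rank before scoring.

    Assumes torch.distributed is initialized and PyTorch >= 1.8
    (all_gather_object); older containers would need a tensor-based gather.
    """
    if not dist.is_available() or not dist.is_initialized():
        return local_predictions
    shards = [None] * dist.get_world_size()
    dist.all_gather_object(shards, local_predictions)
    # Flatten the rank-ordered shards into one list of sentences.
    return [sentence for shard in shards for sentence in shard]
```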
Thanks for your suggestions! I have four more small questions. It would be very helpful if you could answer them:
--max-tokens is the option to set the batch size. If you don't use the --fp16 option then computation is performed in regular 32-bit floating point format. When set, --fp16 enables mixed precision training, meaning that nearly all computation is performed in half precision and only numerically vulnerable operations are computed in regular precision. For more info see the NVIDIA guidelines linked in the readme. To keep the effective batch size constant when you scale the per-GPU batch down, use the --update-freq option with a value equal to the reciprocal of the scaling factor. Training in fp32 mode takes nearly twice as much memory, so you need to divide the batch size by 2 and use --update-freq 2 to simulate the same batch size. You also need to scale the number of warmup updates by the same amount. For example, if you want to train on 4 GPUs in fp16, use --update-freq 2 --warmup-updates 8000. Also, if you encounter problems with evaluation, you can disable online evaluation and test the model after training with the generate.py script.
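To make the bookkeeping concrete, a quick back-of-the-envelope sketch (my own numbers, assuming effective batch = GPUs x max-tokens x update-freq):

```python
# Tokens seen per optimizer update: n_gpus * max_tokens * update_freq
reference = 8 * 5120 * 1   # 8 GPUs, fp16, --max-tokens 5120              -> 40960
fp32_8gpu = 8 * 2560 * 2   # fp32: halve --max-tokens, use --update-freq 2 -> 40960
fp16_4gpu = 4 * 5120 * 2   # 4 GPUs, fp16, --update-freq 2                 -> 40960
assert reference == fp32_8gpu == fp16_4gpu
# Warmup is scaled by the same factor, e.g. --warmup-updates 4000 -> 8000.
```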
Thanks a lot! Now I understand fp32 and fp16. For the batch size, I think it is 5120 tokens per GPU per step, so the more GPUs, the larger the batch size. When using fp32 on 8 GPUs we need to set --update-freq 2 because fp32 takes double the memory. However, when using fp16 on 4 GPUs the number of steps per epoch doubles, but do we still need to set --update-freq 2, given that each GPU still takes up to 5120 tokens per step and I think the GPUs may not need to split their batch?
If you use 4 GPUs, the global batch size is 4x5120, which means it is half the size of the original one. --update-freq 2 virtually doubles it.
I was training the Transformer model when an error occurred. The training process for the 1st epoch went very well, but the validation raised an error: "EOFError: Source and reference streams have different lengths!". By the way, I ran "sacrebleu -t wmt14/full -l de-en --echo src > $DATASET_DIR/sacrebleu_reference.de" to generate the reference. Does anyone know how to fix it?
| epoch 001 | valid on 'valid' subset | valid_loss 4.55658 | valid_nll_loss 2.8718 | valid_ppl 7.32 | num_updates 7867
| /workspace/data-bin/wmt14_en_de_joined_dict test 3003 examples
| Sentences are being padded to multiples of: 1
generated batches in 0.0007243156433105469 s
Traceback (most recent call last):
  File "/workspace/examples/transformer/train.py", line 525, in <module>
    distributed_main(args)
  File "/workspace/examples/transformer/distributed_train.py", line 57, in main
    single_process_main(args)
  File "/workspace/examples/transformer/train.py", line 128, in main
    current_bleu, current_sc_bleu = score(args, trainer, task, epoch_itr, args.gen_subset)
  File "/workspace/examples/transformer/train.py", line 392, in score
    sacrebleu_score = sacrebleu.corpus_bleu(predictions, refs, lowercase=args.ignore_case)
  File "/opt/conda/lib/python3.6/site-packages/sacrebleu.py", line 1031, in corpus_bleu
    raise EOFError("Source and reference streams have different lengths!")
EOFError: Source and reference streams have different lengths!