Closed by Liao-YiHsiu 5 years ago
Hi @Liao-YiHsiu
This is just a model trained on the bitext, correct? There is no back translation yet?
Can you please attach the training log? Also, if possible, can you please evaluate with SacreBLEU? I'll try to get these numbers on our side to avoid any potential discrepancies in tokenization.
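To make the tokenization concern concrete, here is a minimal sketch of how the same hypothesis/reference pair can score differently depending on how the text is split into tokens. This is not SacreBLEU's implementation; the two tokenizers, the `ngram_precision` helper, and the example sentences are all made up for illustration, and only a clipped n-gram precision (one ingredient of BLEU) is computed.

```python
import re
from collections import Counter


def whitespace_tokenize(text):
    # Split on whitespace only; punctuation stays attached to words.
    return text.split()


def punct_tokenize(text):
    # Split words and punctuation marks into separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def ngram_precision(hyp_tokens, ref_tokens, n):
    # Clipped n-gram precision: fraction of hypothesis n-grams found in the reference.
    hyp = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


# Hypothetical example: the hypothesis is still tokenized, the reference is raw text.
reference = "Das ist ein Test."
hypothesis = "Das ist ein Test ."

p1_whitespace = ngram_precision(
    whitespace_tokenize(hypothesis), whitespace_tokenize(reference), 1
)
p1_punct = ngram_precision(
    punct_tokenize(hypothesis), punct_tokenize(reference), 1
)

print("unigram precision, whitespace tokens:", p1_whitespace)
print("unigram precision, punctuation-split tokens:", p1_punct)
```

The translation is identical in both cases; only the token boundaries differ. Scoring detokenized output with one shared tool removes that source of disagreement between the two sides.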
Hi @edunov,
Thank you for your reply. This is trained on bitext only; there is no back translation yet.
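For anyone reading the log below, here is a small sanity-check sketch that recomputes a few logged values from the hyperparameters in the `Namespace(...)` line. The formulas are assumptions, not taken from the log itself: linear warmup followed by inverse-square-root decay for `lr`, `ppl = 2 ** nll_loss`, and `distributed_world_size * max_tokens * update_freq` as an upper bound on tokens per update. The function names are hypothetical; the numbers are copied from the log.

```python
import math

# Hyperparameters copied from the Namespace(...) line of the log.
LR = 0.001
WARMUP_INIT_LR = 1e-07
WARMUP_UPDATES = 4000
WORLD_SIZE = 64
MAX_TOKENS = 5120
UPDATE_FREQ = 2


def inverse_sqrt_lr(num_updates):
    # Assumed schedule: linear warmup, then decay proportional to 1/sqrt(num_updates).
    if num_updates < WARMUP_UPDATES:
        step = (LR - WARMUP_INIT_LR) / WARMUP_UPDATES
        return WARMUP_INIT_LR + num_updates * step
    return LR * math.sqrt(WARMUP_UPDATES / num_updates)


def perplexity(nll_loss):
    # Assumed relation: nll_loss is measured in bits, so ppl = 2 ** nll_loss.
    return 2.0 ** nll_loss


def max_tokens_per_update():
    # Upper bound on tokens per optimizer step across all GPUs.
    return WORLD_SIZE * MAX_TOKENS * UPDATE_FREQ


# (num_updates, logged lr) pairs taken from the log.
LOGGED_LR = [
    (247, 6.18438e-05),
    (502, 0.000125587),
    (4061, 0.000992461),
    (12960, 0.000555556),
]

# (nll_loss, logged ppl) pairs taken from the log.
LOGGED_PPL = [
    (11.757, 3462.00),   # epoch 001, train
    (10.5178, 1466.10),  # epoch 001, valid
    (1.815, 3.52),       # epoch 083, train
]

for num_updates, logged in LOGGED_LR:
    print(f"updates={num_updates} logged_lr={logged:.6g} computed_lr={inverse_sqrt_lr(num_updates):.6g}")

for nll_loss, logged in LOGGED_PPL:
    print(f"nll_loss={nll_loss} logged_ppl={logged} computed_ppl={perplexity(nll_loss):.2f}")

# The log reports wpb of about 606,000 tokens per update.
print("token budget per update:", max_tokens_per_update())
```

Under these assumptions the recomputed values agree with the logged ones to within rounding, and the logged `wpb` of roughly 606k sits below the 655,360-token budget, which suggests the run used the configuration it reports.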
Here's my training log:
Namespace(activation_function='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_vaswani_wmt_en_de_big', attention_dropout=0.0, avg_attention=False, avg_attention_hidden_dim_factor=4, avg_attention_no_gate=False, bottleneck=False, bottleneck_dim=256, bucket_cap_mb=150, clip_norm=0.0, criterion='label_smoothed_cross_entropy', data=['/root/data'], ddp_backend='no_c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, distributed_backend='nccl', distributed_init_method='tcp://XXX.XXX.XXX.XXX:XXXX', distributed_port=-1, distributed_rank=0, distributed_world_size=64, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fp16=True, fp16_init_scale=128, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.001], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=100, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=5120, max_update=30000, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=True, no_save=False, no_token_positional_embeddings=False, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.0, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='/root/model', save_interval=1, save_interval_updates=0, seed=1234567, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, share_encoder_decoder_input_embeddings=False, simp_avg_attention=4, 
skip_invalid_size_inputs_valid_test=False, source_lang='src', target_lang='tgt', task='translation', train_subset='train', update_freq=[2], upsample_primary=1, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0, weight_range=0.0) | [src] dictionary: 35688 types | [tgt] dictionary: 35688 types | /root/data train 5224656 examples | /root/data valid 3003 examples | model transformer_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion | num. model params: 212901888 | training on 64 GPUs | max tokens per GPU = 5120 and max sentences per GPU = None | WARNING: overflow detected, setting loss scale to: 64.0 | WARNING: overflow detected, setting loss scale to: 32.0 | WARNING: overflow detected, setting loss scale to: 16.0 | WARNING: overflow detected, setting loss scale to: 8.0 | WARNING: overflow detected, setting loss scale to: 4.0 | WARNING: overflow detected, setting loss scale to: 2.0 | WARNING: overflow detected, setting loss scale to: 1.0 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 001 | loss 12.221 | nll_loss 11.757 | ppl 3462.00 | wps 389412 | ups 0.6 | wpb 606038 | bsz 20503 | num_updates 247 | lr 6.18438e-05 | gnorm 1.589 | clip 0% | oom 0 | loss_scale 0.500 | wall 384 | train_wall 348 | epoch 001 | valid on 'valid' subset | valid_loss 11.2103 | valid_nll_loss 10.5178 | valid_ppl 1466.10 | num_updates 247 | epoch 002 | loss 9.835 | nll_loss 8.972 | ppl 502.17 | wps 437726 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 502 | lr 0.000125587 | gnorm 1.356 | clip 0% | oom 0 | loss_scale 1.000 | wall 744 | train_wall 695 | epoch 002 | valid on 'valid' subset | valid_loss 9.58073 | valid_nll_loss 8.60952 | valid_ppl 390.59 | num_updates 502 | best 9.58073 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 003 | loss 8.411 | nll_loss 7.310 | ppl 158.71 | wps 435886 | ups 0.7 | wpb 606096 | bsz 20492 | num_updates 756 | lr 0.000189081 | gnorm 1.220 | clip 0% | oom 0 | 
loss_scale 1.000 | wall 1106 | train_wall 1042 | epoch 003 | valid on 'valid' subset | valid_loss 8.29867 | valid_nll_loss 7.0577 | valid_ppl 133.22 | num_updates 756 | best 8.29867 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 004 | loss 7.151 | nll_loss 5.851 | ppl 57.73 | wps 434875 | ups 0.7 | wpb 606089 | bsz 20476 | num_updates 1010 | lr 0.000252575 | gnorm 1.026 | clip 0% | oom 0 | loss_scale 1.000 | wall 1469 | train_wall 1390 | epoch 004 | valid on 'valid' subset | valid_loss 7.15717 | valid_nll_loss 5.69159 | valid_ppl 51.68 | num_updates 1010 | best 7.15717 | epoch 005 | loss 6.096 | nll_loss 4.638 | ppl 24.89 | wps 434448 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 1265 | lr 0.000316318 | gnorm 0.759 | clip 0% | oom 0 | loss_scale 2.000 | wall 1833 | train_wall 1740 | epoch 005 | valid on 'valid' subset | valid_loss 6.25841 | valid_nll_loss 4.59486 | valid_ppl 24.17 | num_updates 1265 | best 6.25841 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 006 | loss 5.420 | nll_loss 3.875 | ppl 14.67 | wps 433310 | ups 0.7 | wpb 606076 | bsz 20487 | num_updates 1519 | lr 0.000379812 | gnorm 0.545 | clip 0% | oom 0 | loss_scale 1.000 | wall 2197 | train_wall 2089 | epoch 006 | valid on 'valid' subset | valid_loss 5.73377 | valid_nll_loss 4.02555 | valid_ppl 16.29 | num_updates 1519 | best 5.73377 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 007 | loss 5.015 | nll_loss 3.429 | ppl 10.77 | wps 431062 | ups 0.7 | wpb 606106 | bsz 20493 | num_updates 1773 | lr 0.000443306 | gnorm 0.425 | clip 0% | oom 0 | loss_scale 1.000 | wall 2563 | train_wall 2440 | epoch 007 | valid on 'valid' subset | valid_loss 5.34227 | valid_nll_loss 3.60416 | valid_ppl 12.16 | num_updates 1773 | best 5.34227 | epoch 008 | loss 4.752 | nll_loss 3.142 | ppl 8.83 | wps 433636 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 2028 | lr 0.000507049 | gnorm 0.367 | clip 0% | oom 0 | loss_scale 2.000 | wall 2933 | train_wall 2791 | epoch 
008 | valid on 'valid' subset | valid_loss 5.12701 | valid_nll_loss 3.37626 | valid_ppl 10.38 | num_updates 2028 | best 5.12701 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 009 | loss 4.576 | nll_loss 2.952 | ppl 7.74 | wps 419789 | ups 0.7 | wpb 606104 | bsz 20484 | num_updates 2282 | lr 0.000570543 | gnorm 0.328 | clip 0% | oom 0 | loss_scale 1.000 | wall 3308 | train_wall 3151 | epoch 009 | valid on 'valid' subset | valid_loss 4.90342 | valid_nll_loss 3.17232 | valid_ppl 9.01 | num_updates 2282 | best 4.90342 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 010 | loss 4.440 | nll_loss 2.805 | ppl 6.99 | wps 402816 | ups 0.6 | wpb 606112 | bsz 20493 | num_updates 2536 | lr 0.000634037 | gnorm 0.313 | clip 0% | oom 0 | loss_scale 1.000 | wall 3699 | train_wall 3528 | epoch 010 | valid on 'valid' subset | valid_loss 4.82417 | valid_nll_loss 3.07594 | valid_ppl 8.43 | num_updates 2536 | best 4.82417 | epoch 011 | loss 4.329 | nll_loss 2.686 | ppl 6.43 | wps 406190 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 2791 | lr 0.00069778 | gnorm 0.292 | clip 0% | oom 0 | loss_scale 2.000 | wall 4088 | train_wall 3902 | epoch 011 | valid on 'valid' subset | valid_loss 4.70826 | valid_nll_loss 2.96732 | valid_ppl 7.82 | num_updates 2791 | best 4.70826 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 012 | loss 4.251 | nll_loss 2.602 | ppl 6.07 | wps 405320 | ups 0.6 | wpb 606101 | bsz 20494 | num_updates 3045 | lr 0.000761274 | gnorm 0.303 | clip 0% | oom 0 | loss_scale 1.000 | wall 4487 | train_wall 4276 | epoch 012 | valid on 'valid' subset | valid_loss 4.60751 | valid_nll_loss 2.87176 | valid_ppl 7.32 | num_updates 3045 | best 4.60751 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 013 | loss 4.184 | nll_loss 2.530 | ppl 5.78 | wps 405663 | ups 0.7 | wpb 606122 | bsz 20494 | num_updates 3299 | lr 0.000824768 | gnorm 0.298 | clip 0% | oom 0 | loss_scale 1.000 | wall 4875 | train_wall 4649 | epoch 013 | 
valid on 'valid' subset | valid_loss 4.54552 | valid_nll_loss 2.81338 | valid_ppl 7.03 | num_updates 3299 | best 4.54552 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 014 | loss 4.131 | nll_loss 2.473 | ppl 5.55 | wps 405048 | ups 0.7 | wpb 606119 | bsz 20493 | num_updates 3553 | lr 0.000888261 | gnorm 0.305 | clip 0% | oom 0 | loss_scale 1.000 | wall 5266 | train_wall 5023 | epoch 014 | valid on 'valid' subset | valid_loss 4.51079 | valid_nll_loss 2.78499 | valid_ppl 6.89 | num_updates 3553 | best 4.51079 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 015 | loss 4.094 | nll_loss 2.432 | ppl 5.40 | wps 404413 | ups 0.7 | wpb 606105 | bsz 20481 | num_updates 3807 | lr 0.000951755 | gnorm 0.312 | clip 0% | oom 0 | loss_scale 1.000 | wall 5655 | train_wall 5398 | epoch 015 | valid on 'valid' subset | valid_loss 4.46537 | valid_nll_loss 2.72507 | valid_ppl 6.61 | num_updates 3807 | best 4.46537 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 016 | loss 4.060 | nll_loss 2.395 | ppl 5.26 | wps 404019 | ups 0.7 | wpb 606098 | bsz 20489 | num_updates 4061 | lr 0.000992461 | gnorm 0.323 | clip 0% | oom 0 | loss_scale 1.000 | wall 6045 | train_wall 5773 | epoch 016 | valid on 'valid' subset | valid_loss 4.44079 | valid_nll_loss 2.70326 | valid_ppl 6.51 | num_updates 4061 | best 4.44079 | epoch 017 | loss 4.024 | nll_loss 2.357 | ppl 5.12 | wps 407840 | ups 0.6 | wpb 606100 | bsz 20489 | num_updates 4316 | lr 0.000962696 | gnorm 0.318 | clip 0% | oom 0 | loss_scale 2.000 | wall 6439 | train_wall 6146 | epoch 017 | valid on 'valid' subset | valid_loss 4.38388 | valid_nll_loss 2.6586 | valid_ppl 6.31 | num_updates 4316 | best 4.38388 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 018 | loss 3.988 | nll_loss 2.318 | ppl 4.99 | wps 406171 | ups 0.6 | wpb 606106 | bsz 20491 | num_updates 4570 | lr 0.000935561 | gnorm 0.312 | clip 0% | oom 0 | loss_scale 1.000 | wall 6835 | train_wall 6519 | epoch 018 | valid 
on 'valid' subset | valid_loss 4.35784 | valid_nll_loss 2.62567 | valid_ppl 6.17 | num_updates 4570 | best 4.35784 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 019 | loss 3.957 | nll_loss 2.284 | ppl 4.87 | wps 402550 | ups 0.6 | wpb 606103 | bsz 20488 | num_updates 4824 | lr 0.000910597 | gnorm 0.307 | clip 0% | oom 0 | loss_scale 1.000 | wall 7234 | train_wall 6895 | epoch 019 | valid on 'valid' subset | valid_loss 4.32972 | valid_nll_loss 2.59739 | valid_ppl 6.05 | num_updates 4824 | best 4.32972 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 020 | loss 3.929 | nll_loss 2.254 | ppl 4.77 | wps 405057 | ups 0.7 | wpb 606123 | bsz 20490 | num_updates 5078 | lr 0.000887531 | gnorm 0.311 | clip 0% | oom 0 | loss_scale 1.000 | wall 7622 | train_wall 7270 | epoch 020 | valid on 'valid' subset | valid_loss 4.31297 | valid_nll_loss 2.58742 | valid_ppl 6.01 | num_updates 5078 | best 4.31297 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 021 | loss 3.903 | nll_loss 2.227 | ppl 4.68 | wps 403944 | ups 0.7 | wpb 606112 | bsz 20488 | num_updates 5332 | lr 0.000866134 | gnorm 0.306 | clip 0% | oom 0 | loss_scale 1.000 | wall 8012 | train_wall 7645 | epoch 021 | valid on 'valid' subset | valid_loss 4.29806 | valid_nll_loss 2.58026 | valid_ppl 5.98 | num_updates 5332 | best 4.29806 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 022 | loss 3.882 | nll_loss 2.204 | ppl 4.61 | wps 402818 | ups 0.6 | wpb 606079 | bsz 20493 | num_updates 5586 | lr 0.000846213 | gnorm 0.309 | clip 0% | oom 0 | loss_scale 1.000 | wall 8409 | train_wall 8021 | epoch 022 | valid on 'valid' subset | valid_loss 4.28149 | valid_nll_loss 2.54953 | valid_ppl 5.85 | num_updates 5586 | best 4.28149 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 023 | loss 3.863 | nll_loss 2.183 | ppl 4.54 | wps 400871 | ups 0.6 | wpb 606113 | bsz 20494 | num_updates 5840 | lr 0.000827606 | gnorm 0.312 | clip 0% | oom 0 | loss_scale 1.000 
| wall 8803 | train_wall 8399 | epoch 023 | valid on 'valid' subset | valid_loss 4.25514 | valid_nll_loss 2.5294 | valid_ppl 5.77 | num_updates 5840 | best 4.25514 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 024 | loss 3.845 | nll_loss 2.163 | ppl 4.48 | wps 367033 | ups 0.6 | wpb 606115 | bsz 20480 | num_updates 6094 | lr 0.000810175 | gnorm 0.310 | clip 0% | oom 0 | loss_scale 1.000 | wall 9231 | train_wall 8812 | epoch 024 | valid on 'valid' subset | valid_loss 4.237 | valid_nll_loss 2.50681 | valid_ppl 5.68 | num_updates 6094 | best 4.237 | WARNING: overflow detected, setting loss scale to: 1.0 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 025 | loss 3.828 | nll_loss 2.145 | ppl 4.42 | wps 401702 | ups 0.6 | wpb 606061 | bsz 20480 | num_updates 6347 | lr 0.000793863 | gnorm 0.307 | clip 0% | oom 0 | loss_scale 0.500 | wall 9622 | train_wall 9188 | epoch 025 | valid on 'valid' subset | valid_loss 4.22679 | valid_nll_loss 2.50461 | valid_ppl 5.67 | num_updates 6347 | best 4.22679 | epoch 026 | loss 3.813 | nll_loss 2.129 | ppl 4.37 | wps 404188 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 6602 | lr 0.000778381 | gnorm 0.311 | clip 0% | oom 0 | loss_scale 1.000 | wall 10013 | train_wall 9564 | epoch 026 | valid on 'valid' subset | valid_loss 4.24526 | valid_nll_loss 2.51925 | valid_ppl 5.73 | num_updates 6602 | best 4.22679 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 027 | loss 3.799 | nll_loss 2.114 | ppl 4.33 | wps 404867 | ups 0.7 | wpb 606083 | bsz 20492 | num_updates 6856 | lr 0.000763826 | gnorm 0.306 | clip 0% | oom 0 | loss_scale 1.000 | wall 10399 | train_wall 9938 | epoch 027 | valid on 'valid' subset | valid_loss 4.21843 | valid_nll_loss 2.4851 | valid_ppl 5.60 | num_updates 6856 | best 4.21843 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 028 | loss 3.786 | nll_loss 2.100 | ppl 4.29 | wps 402139 | ups 0.6 | wpb 606101 | bsz 20488 | num_updates 7110 | lr 0.000750059 | 
gnorm 0.305 | clip 0% | oom 0 | loss_scale 0.500 | wall 10790 | train_wall 10315 | epoch 028 | valid on 'valid' subset | valid_loss 4.20653 | valid_nll_loss 2.49173 | valid_ppl 5.62 | num_updates 7110 | best 4.20653 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 029 | loss 3.774 | nll_loss 2.086 | ppl 4.25 | wps 403379 | ups 0.7 | wpb 606111 | bsz 20485 | num_updates 7364 | lr 0.00073701 | gnorm 0.302 | clip 0% | oom 0 | loss_scale 0.500 | wall 11181 | train_wall 10691 | epoch 029 | valid on 'valid' subset | valid_loss 4.19726 | valid_nll_loss 2.47443 | valid_ppl 5.56 | num_updates 7364 | best 4.19726 | epoch 030 | loss 3.763 | nll_loss 2.074 | ppl 4.21 | wps 405790 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 7619 | lr 0.000724571 | gnorm 0.308 | clip 0% | oom 0 | loss_scale 1.000 | wall 11570 | train_wall 11065 | epoch 030 | valid on 'valid' subset | valid_loss 4.18927 | valid_nll_loss 2.46933 | valid_ppl 5.54 | num_updates 7619 | best 4.18927 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 031 | loss 3.753 | nll_loss 2.063 | ppl 4.18 | wps 402874 | ups 0.6 | wpb 606093 | bsz 20486 | num_updates 7873 | lr 0.000712787 | gnorm 0.299 | clip 0% | oom 0 | loss_scale 0.500 | wall 11964 | train_wall 11441 | epoch 031 | valid on 'valid' subset | valid_loss 4.19064 | valid_nll_loss 2.46808 | valid_ppl 5.53 | num_updates 7873 | best 4.18927 | epoch 032 | loss 3.743 | nll_loss 2.052 | ppl 4.15 | wps 406652 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 8128 | lr 0.000701517 | gnorm 0.306 | clip 0% | oom 0 | loss_scale 1.000 | wall 12350 | train_wall 11816 | epoch 032 | valid on 'valid' subset | valid_loss 4.20155 | valid_nll_loss 2.46975 | valid_ppl 5.54 | num_updates 8128 | best 4.18927 | WARNING: overflow detected, setting loss scale to: 1.0 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 033 | loss 3.733 | nll_loss 2.041 | ppl 4.12 | wps 404132 | ups 0.7 | wpb 606052 | bsz 20489 | num_updates 8381 | lr 0.000690847 
| gnorm 0.301 | clip 0% | oom 0 | loss_scale 0.500 | wall 12735 | train_wall 12189 | epoch 033 | valid on 'valid' subset | valid_loss 4.16803 | valid_nll_loss 2.45352 | valid_ppl 5.48 | num_updates 8381 | best 4.16803 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 034 | loss 3.724 | nll_loss 2.032 | ppl 4.09 | wps 403531 | ups 0.7 | wpb 606089 | bsz 20489 | num_updates 8635 | lr 0.000680611 | gnorm 0.298 | clip 0% | oom 0 | loss_scale 0.500 | wall 13125 | train_wall 12564 | epoch 034 | valid on 'valid' subset | valid_loss 4.17312 | valid_nll_loss 2.44827 | valid_ppl 5.46 | num_updates 8635 | best 4.16803 | epoch 035 | loss 3.716 | nll_loss 2.023 | ppl 4.06 | wps 407235 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 8890 | lr 0.000670778 | gnorm 0.300 | clip 0% | oom 0 | loss_scale 1.000 | wall 13510 | train_wall 12938 | epoch 035 | valid on 'valid' subset | valid_loss 4.1455 | valid_nll_loss 2.43207 | valid_ppl 5.40 | num_updates 8890 | best 4.1455 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 036 | loss 3.708 | nll_loss 2.014 | ppl 4.04 | wps 403051 | ups 0.7 | wpb 606088 | bsz 20489 | num_updates 9144 | lr 0.000661396 | gnorm 0.297 | clip 0% | oom 0 | loss_scale 0.500 | wall 13901 | train_wall 13314 | epoch 036 | valid on 'valid' subset | valid_loss 4.1671 | valid_nll_loss 2.44066 | valid_ppl 5.43 | num_updates 9144 | best 4.1455 | epoch 037 | loss 3.700 | nll_loss 2.006 | ppl 4.02 | wps 404907 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 9399 | lr 0.000652363 | gnorm 0.291 | clip 0% | oom 0 | loss_scale 1.000 | wall 14288 | train_wall 13689 | epoch 037 | valid on 'valid' subset | valid_loss 4.15302 | valid_nll_loss 2.4325 | valid_ppl 5.40 | num_updates 9399 | best 4.1455 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 038 | loss 3.694 | nll_loss 1.998 | ppl 3.99 | wps 404799 | ups 0.7 | wpb 606086 | bsz 20487 | num_updates 9653 | lr 0.000643723 | gnorm 0.300 | clip 0% | oom 0 | loss_scale 0.500 | wall 
14674 | train_wall 14064 | epoch 038 | valid on 'valid' subset | valid_loss 4.15593 | valid_nll_loss 2.43347 | valid_ppl 5.40 | num_updates 9653 | best 4.1455 | epoch 039 | loss 3.686 | nll_loss 1.990 | ppl 3.97 | wps 405052 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 9908 | lr 0.000635385 | gnorm 0.294 | clip 0% | oom 0 | loss_scale 1.000 | wall 15061 | train_wall 14440 | epoch 039 | valid on 'valid' subset | valid_loss 4.15409 | valid_nll_loss 2.43318 | valid_ppl 5.40 | num_updates 9908 | best 4.1455 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 040 | loss 3.680 | nll_loss 1.983 | ppl 3.95 | wps 402112 | ups 0.7 | wpb 606092 | bsz 20488 | num_updates 10162 | lr 0.000627394 | gnorm 0.294 | clip 0% | oom 0 | loss_scale 0.500 | wall 15449 | train_wall 14816 | epoch 040 | valid on 'valid' subset | valid_loss 4.14832 | valid_nll_loss 2.43019 | valid_ppl 5.39 | num_updates 10162 | best 4.1455 | epoch 041 | loss 3.673 | nll_loss 1.976 | ppl 3.93 | wps 406289 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 10417 | lr 0.000619667 | gnorm 0.289 | clip 0% | oom 0 | loss_scale 1.000 | wall 15835 | train_wall 15191 | epoch 041 | valid on 'valid' subset | valid_loss 4.14953 | valid_nll_loss 2.42941 | valid_ppl 5.39 | num_updates 10417 | best 4.1455 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 042 | loss 3.667 | nll_loss 1.969 | ppl 3.92 | wps 397438 | ups 0.7 | wpb 606108 | bsz 20487 | num_updates 10671 | lr 0.000612248 | gnorm 0.293 | clip 0% | oom 0 | loss_scale 0.500 | wall 16223 | train_wall 15567 | epoch 042 | valid on 'valid' subset | valid_loss 4.15328 | valid_nll_loss 2.43172 | valid_ppl 5.40 | num_updates 10671 | best 4.1455 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 043 | loss 3.661 | nll_loss 1.963 | ppl 3.90 | wps 405294 | ups 0.7 | wpb 606108 | bsz 20486 | num_updates 10925 | lr 0.000605089 | gnorm 0.289 | clip 0% | oom 0 | loss_scale 0.500 | wall 16608 | train_wall 15941 | epoch 043 | valid on 
'valid' subset | valid_loss 4.13367 | valid_nll_loss 2.41361 | valid_ppl 5.33 | num_updates 10925 | best 4.13367 | epoch 044 | loss 3.656 | nll_loss 1.957 | ppl 3.88 | wps 406794 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 11180 | lr 0.000598149 | gnorm 0.287 | clip 0% | oom 0 | loss_scale 1.000 | wall 16997 | train_wall 16315 | epoch 044 | valid on 'valid' subset | valid_loss 4.12971 | valid_nll_loss 2.40841 | valid_ppl 5.31 | num_updates 11180 | best 4.12971 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 045 | loss 3.650 | nll_loss 1.951 | ppl 3.87 | wps 402617 | ups 0.6 | wpb 606088 | bsz 20487 | num_updates 11434 | lr 0.000591468 | gnorm 0.282 | clip 0% | oom 0 | loss_scale 0.500 | wall 17388 | train_wall 16691 | epoch 045 | valid on 'valid' subset | valid_loss 4.1229 | valid_nll_loss 2.407 | valid_ppl 5.30 | num_updates 11434 | best 4.1229 | epoch 046 | loss 3.645 | nll_loss 1.945 | ppl 3.85 | wps 405582 | ups 0.6 | wpb 606100 | bsz 20489 | num_updates 11689 | lr 0.00058498 | gnorm 0.287 | clip 0% | oom 0 | loss_scale 1.000 | wall 17784 | train_wall 17066 | epoch 046 | valid on 'valid' subset | valid_loss 4.13637 | valid_nll_loss 2.41384 | valid_ppl 5.33 | num_updates 11689 | best 4.1229 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 047 | loss 3.640 | nll_loss 1.939 | ppl 3.84 | wps 402807 | ups 0.7 | wpb 606086 | bsz 20487 | num_updates 11943 | lr 0.000578726 | gnorm 0.282 | clip 0% | oom 0 | loss_scale 0.500 | wall 18171 | train_wall 17442 | epoch 047 | valid on 'valid' subset | valid_loss 4.1332 | valid_nll_loss 2.41974 | valid_ppl 5.35 | num_updates 11943 | best 4.1229 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 048 | loss 3.636 | nll_loss 1.934 | ppl 3.82 | wps 404634 | ups 0.7 | wpb 606079 | bsz 20490 | num_updates 12197 | lr 0.000572669 | gnorm 0.283 | clip 0% | oom 0 | loss_scale 0.500 | wall 18557 | train_wall 17817 | epoch 048 | valid on 'valid' subset | valid_loss 4.13304 | 
valid_nll_loss 2.41033 | valid_ppl 5.32 | num_updates 12197 | best 4.1229 | epoch 049 | loss 3.631 | nll_loss 1.929 | ppl 3.81 | wps 405587 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 12452 | lr 0.000566775 | gnorm 0.277 | clip 0% | oom 0 | loss_scale 1.000 | wall 18944 | train_wall 18192 | epoch 049 | valid on 'valid' subset | valid_loss 4.13519 | valid_nll_loss 2.41345 | valid_ppl 5.33 | num_updates 12452 | best 4.1229 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 050 | loss 3.626 | nll_loss 1.923 | ppl 3.79 | wps 402649 | ups 0.7 | wpb 606132 | bsz 20490 | num_updates 12706 | lr 0.000561081 | gnorm 0.279 | clip 0% | oom 0 | loss_scale 1.000 | wall 19332 | train_wall 18568 | epoch 050 | valid on 'valid' subset | valid_loss 4.12566 | valid_nll_loss 2.40819 | valid_ppl 5.31 | num_updates 12706 | best 4.1229 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 051 | loss 3.622 | nll_loss 1.919 | ppl 3.78 | wps 397237 | ups 0.7 | wpb 606082 | bsz 20487 | num_updates 12960 | lr 0.000555556 | gnorm 0.279 | clip 0% | oom 0 | loss_scale 0.500 | wall 19720 | train_wall 18944 | epoch 051 | valid on 'valid' subset | valid_loss 4.12552 | valid_nll_loss 2.40853 | valid_ppl 5.31 | num_updates 12960 | best 4.1229 | epoch 052 | loss 3.617 | nll_loss 1.914 | ppl 3.77 | wps 403744 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 13215 | lr 0.000550169 | gnorm 0.270 | clip 0% | oom 0 | loss_scale 1.000 | wall 20108 | train_wall 19321 | epoch 052 | valid on 'valid' subset | valid_loss 4.11779 | valid_nll_loss 2.4055 | valid_ppl 5.30 | num_updates 13215 | best 4.11779 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 053 | loss 3.614 | nll_loss 1.910 | ppl 3.76 | wps 403961 | ups 0.7 | wpb 606100 | bsz 20482 | num_updates 13469 | lr 0.000544957 | gnorm 0.273 | clip 0% | oom 0 | loss_scale 1.000 | wall 20498 | train_wall 19696 | epoch 053 | valid on 'valid' subset | valid_loss 4.12068 | valid_nll_loss 2.40177 | valid_ppl 5.28 | 
num_updates 13469 | best 4.11779 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 054 | loss 3.610 | nll_loss 1.905 | ppl 3.75 | wps 396807 | ups 0.7 | wpb 606088 | bsz 20481 | num_updates 13723 | lr 0.00053989 | gnorm 0.272 | clip 0% | oom 0 | loss_scale 0.500 | wall 20886 | train_wall 20072 | epoch 054 | valid on 'valid' subset | valid_loss 4.11751 | valid_nll_loss 2.40379 | valid_ppl 5.29 | num_updates 13723 | best 4.11751 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 055 | loss 3.606 | nll_loss 1.901 | ppl 3.73 | wps 404431 | ups 0.6 | wpb 606097 | bsz 20483 | num_updates 13977 | lr 0.000534962 | gnorm 0.277 | clip 0% | oom 0 | loss_scale 0.500 | wall 21281 | train_wall 20447 | epoch 055 | valid on 'valid' subset | valid_loss 4.1238 | valid_nll_loss 2.40867 | valid_ppl 5.31 | num_updates 13977 | best 4.11751 | epoch 056 | loss 3.602 | nll_loss 1.897 | ppl 3.72 | wps 407260 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 14232 | lr 0.000530148 | gnorm 0.268 | clip 0% | oom 0 | loss_scale 1.000 | wall 21666 | train_wall 20821 | epoch 056 | valid on 'valid' subset | valid_loss 4.11862 | valid_nll_loss 2.40323 | valid_ppl 5.29 | num_updates 14232 | best 4.11751 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 057 | loss 3.598 | nll_loss 1.892 | ppl 3.71 | wps 405549 | ups 0.7 | wpb 606108 | bsz 20489 | num_updates 14486 | lr 0.000525479 | gnorm 0.266 | clip 0% | oom 0 | loss_scale 1.000 | wall 22051 | train_wall 21194 | epoch 057 | valid on 'valid' subset | valid_loss 4.12106 | valid_nll_loss 2.39969 | valid_ppl 5.28 | num_updates 14486 | best 4.11751 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 058 | loss 3.594 | nll_loss 1.889 | ppl 3.70 | wps 407947 | ups 0.7 | wpb 606097 | bsz 20490 | num_updates 14740 | lr 0.000520932 | gnorm 0.265 | clip 0% | oom 0 | loss_scale 1.000 | wall 22434 | train_wall 21566 | epoch 058 | valid on 'valid' subset | valid_loss 4.11306 | valid_nll_loss 2.39861 | 
valid_ppl 5.27 | num_updates 14740 | best 4.11306 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 059 | loss 3.591 | nll_loss 1.885 | ppl 3.69 | wps 395661 | ups 0.7 | wpb 606098 | bsz 20484 | num_updates 14994 | lr 0.000516501 | gnorm 0.264 | clip 0% | oom 0 | loss_scale 0.500 | wall 22823 | train_wall 21934 | epoch 059 | valid on 'valid' subset | valid_loss 4.11986 | valid_nll_loss 2.40823 | valid_ppl 5.31 | num_updates 14994 | best 4.11306 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 060 | loss 3.588 | nll_loss 1.881 | ppl 3.68 | wps 411638 | ups 0.7 | wpb 606076 | bsz 20487 | num_updates 15248 | lr 0.000512181 | gnorm 0.266 | clip 0% | oom 0 | loss_scale 0.500 | wall 23203 | train_wall 22301 | epoch 060 | valid on 'valid' subset | valid_loss 4.11128 | valid_nll_loss 2.40526 | valid_ppl 5.30 | num_updates 15248 | best 4.11128 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 061 | loss 3.584 | nll_loss 1.877 | ppl 3.67 | wps 410493 | ups 0.7 | wpb 606088 | bsz 20483 | num_updates 15502 | lr 0.000507968 | gnorm 0.260 | clip 0% | oom 0 | loss_scale 0.500 | wall 23586 | train_wall 22670 | epoch 061 | valid on 'valid' subset | valid_loss 4.12279 | valid_nll_loss 2.40674 | valid_ppl 5.30 | num_updates 15502 | best 4.11128 | epoch 062 | loss 3.581 | nll_loss 1.874 | ppl 3.67 | wps 412961 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 15757 | lr 0.000503841 | gnorm 0.256 | clip 0% | oom 0 | loss_scale 1.000 | wall 23966 | train_wall 23039 | epoch 062 | valid on 'valid' subset | valid_loss 4.12896 | valid_nll_loss 2.41013 | valid_ppl 5.32 | num_updates 15757 | best 4.11128 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 063 | loss 3.577 | nll_loss 1.870 | ppl 3.65 | wps 413195 | ups 0.7 | wpb 606088 | bsz 20490 | num_updates 16011 | lr 0.000499828 | gnorm 0.262 | clip 0% | oom 0 | loss_scale 0.500 | wall 24344 | train_wall 23405 | epoch 063 | valid on 'valid' subset | valid_loss 4.11309 | 
valid_nll_loss 2.39933 | valid_ppl 5.28 | num_updates 16011 | best 4.11128 | epoch 064 | loss 3.575 | nll_loss 1.867 | ppl 3.65 | wps 414002 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 16266 | lr 0.000495895 | gnorm 0.252 | clip 0% | oom 0 | loss_scale 1.000 | wall 24723 | train_wall 23773 | epoch 064 | valid on 'valid' subset | valid_loss 4.12299 | valid_nll_loss 2.40721 | valid_ppl 5.30 | num_updates 16266 | best 4.11128 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 065 | loss 3.572 | nll_loss 1.864 | ppl 3.64 | wps 412019 | ups 0.7 | wpb 606096 | bsz 20493 | num_updates 16520 | lr 0.000492068 | gnorm 0.260 | clip 0% | oom 0 | loss_scale 1.000 | wall 25102 | train_wall 24140 | epoch 065 | valid on 'valid' subset | valid_loss 4.13079 | valid_nll_loss 2.41223 | valid_ppl 5.32 | num_updates 16520 | best 4.11128 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 066 | loss 3.569 | nll_loss 1.860 | ppl 3.63 | wps 413142 | ups 0.7 | wpb 606099 | bsz 20498 | num_updates 16774 | lr 0.000488328 | gnorm 0.251 | clip 0% | oom 0 | loss_scale 1.000 | wall 25480 | train_wall 24507 | epoch 066 | valid on 'valid' subset | valid_loss 4.11249 | valid_nll_loss 2.39975 | valid_ppl 5.28 | num_updates 16774 | best 4.11128 | WARNING: overflow detected, setting loss scale to: 1.0 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 067 | loss 3.565 | nll_loss 1.856 | ppl 3.62 | wps 409957 | ups 0.7 | wpb 606077 | bsz 20489 | num_updates 17027 | lr 0.000484687 | gnorm 0.249 | clip 0% | oom 0 | loss_scale 0.500 | wall 25859 | train_wall 24875 | epoch 067 | valid on 'valid' subset | valid_loss 4.10913 | valid_nll_loss 2.3947 | valid_ppl 5.26 | num_updates 17027 | best 4.10913 | epoch 068 | loss 3.563 | nll_loss 1.854 | ppl 3.62 | wps 412876 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 17282 | lr 0.000481097 | gnorm 0.251 | clip 0% | oom 0 | loss_scale 1.000 | wall 26243 | train_wall 25244 | epoch 068 | valid on 'valid' subset | 
valid_loss 4.11036 | valid_nll_loss 2.39503 | valid_ppl 5.26 | num_updates 17282 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 069 | loss 3.561 | nll_loss 1.851 | ppl 3.61 | wps 414163 | ups 0.7 | wpb 606089 | bsz 20486 | num_updates 17536 | lr 0.0004776 | gnorm 0.248 | clip 0% | oom 0 | loss_scale 1.000 | wall 26620 | train_wall 25610 | epoch 069 | valid on 'valid' subset | valid_loss 4.11465 | valid_nll_loss 2.39934 | valid_ppl 5.28 | num_updates 17536 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 070 | loss 3.558 | nll_loss 1.848 | ppl 3.60 | wps 410446 | ups 0.7 | wpb 606094 | bsz 20478 | num_updates 17790 | lr 0.000474179 | gnorm 0.253 | clip 0% | oom 0 | loss_scale 1.000 | wall 27000 | train_wall 25979 | epoch 070 | valid on 'valid' subset | valid_loss 4.11325 | valid_nll_loss 2.39843 | valid_ppl 5.27 | num_updates 17790 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 071 | loss 3.556 | nll_loss 1.845 | ppl 3.59 | wps 411936 | ups 0.7 | wpb 606073 | bsz 20484 | num_updates 18044 | lr 0.000470829 | gnorm 0.248 | clip 0% | oom 0 | loss_scale 1.000 | wall 27380 | train_wall 26347 | epoch 071 | valid on 'valid' subset | valid_loss 4.1299 | valid_nll_loss 2.41658 | valid_ppl 5.34 | num_updates 18044 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 072 | loss 3.552 | nll_loss 1.842 | ppl 3.58 | wps 413323 | ups 0.7 | wpb 606079 | bsz 20492 | num_updates 18298 | lr 0.00046755 | gnorm 0.241 | clip 0% | oom 0 | loss_scale 1.000 | wall 27758 | train_wall 26713 | epoch 072 | valid on 'valid' subset | valid_loss 4.11753 | valid_nll_loss 2.39991 | valid_ppl 5.28 | num_updates 18298 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 073 | loss 3.550 | nll_loss 1.840 | ppl 3.58 | wps 411341 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 18552 | lr 0.000464338 | gnorm 0.242 | clip 0% | oom 0 | loss_scale 1.000 | 
wall 28137 | train_wall 27081 | epoch 073 | valid on 'valid' subset | valid_loss 4.11183 | valid_nll_loss 2.39994 | valid_ppl 5.28 | num_updates 18552 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 074 | loss 3.548 | nll_loss 1.837 | ppl 3.57 | wps 412358 | ups 0.7 | wpb 606120 | bsz 20470 | num_updates 18806 | lr 0.000461192 | gnorm 0.234 | clip 0% | oom 0 | loss_scale 1.000 | wall 28516 | train_wall 27448 | epoch 074 | valid on 'valid' subset | valid_loss 4.11562 | valid_nll_loss 2.40093 | valid_ppl 5.28 | num_updates 18806 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 075 | loss 3.546 | nll_loss 1.834 | ppl 3.57 | wps 415480 | ups 0.7 | wpb 606094 | bsz 20486 | num_updates 19060 | lr 0.000458109 | gnorm 0.242 | clip 0% | oom 0 | loss_scale 0.500 | wall 28897 | train_wall 27812 | epoch 075 | valid on 'valid' subset | valid_loss 4.12204 | valid_nll_loss 2.4051 | valid_ppl 5.30 | num_updates 19060 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 076 | loss 3.543 | nll_loss 1.831 | ppl 3.56 | wps 420688 | ups 0.7 | wpb 606088 | bsz 20491 | num_updates 19314 | lr 0.000455086 | gnorm 0.239 | clip 0% | oom 0 | loss_scale 0.500 | wall 29268 | train_wall 28172 | epoch 076 | valid on 'valid' subset | valid_loss 4.12509 | valid_nll_loss 2.41195 | valid_ppl 5.32 | num_updates 19314 | best 4.10913 | epoch 077 | loss 3.541 | nll_loss 1.829 | ppl 3.55 | wps 419437 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 19569 | lr 0.000452112 | gnorm 0.239 | clip 0% | oom 0 | loss_scale 1.000 | wall 29642 | train_wall 28535 | epoch 077 | valid on 'valid' subset | valid_loss 4.12446 | valid_nll_loss 2.4112 | valid_ppl 5.32 | num_updates 19569 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 078 | loss 3.538 | nll_loss 1.826 | ppl 3.55 | wps 416912 | ups 0.7 | wpb 606064 | bsz 20476 | num_updates 19823 | lr 0.000449206 | gnorm 0.238 | clip 0% | oom 0 | 
loss_scale 1.000 | wall 30017 | train_wall 28898 | epoch 078 | valid on 'valid' subset | valid_loss 4.11633 | valid_nll_loss 2.40426 | valid_ppl 5.29 | num_updates 19823 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 079 | loss 3.536 | nll_loss 1.824 | ppl 3.54 | wps 418885 | ups 0.7 | wpb 606086 | bsz 20493 | num_updates 20077 | lr 0.000446355 | gnorm 0.234 | clip 0% | oom 0 | loss_scale 1.000 | wall 30390 | train_wall 29260 | epoch 079 | valid on 'valid' subset | valid_loss 4.12115 | valid_nll_loss 2.4071 | valid_ppl 5.30 | num_updates 20077 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 080 | loss 3.534 | nll_loss 1.821 | ppl 3.53 | wps 417653 | ups 0.7 | wpb 606079 | bsz 20495 | num_updates 20331 | lr 0.000443558 | gnorm 0.234 | clip 0% | oom 0 | loss_scale 1.000 | wall 30764 | train_wall 29622 | epoch 080 | valid on 'valid' subset | valid_loss 4.11404 | valid_nll_loss 2.40123 | valid_ppl 5.28 | num_updates 20331 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 081 | loss 3.532 | nll_loss 1.819 | ppl 3.53 | wps 417246 | ups 0.7 | wpb 606074 | bsz 20484 | num_updates 20585 | lr 0.000440813 | gnorm 0.234 | clip 0% | oom 0 | loss_scale 1.000 | wall 31138 | train_wall 29985 | epoch 081 | valid on 'valid' subset | valid_loss 4.1126 | valid_nll_loss 2.40073 | valid_ppl 5.28 | num_updates 20585 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 082 | loss 3.530 | nll_loss 1.816 | ppl 3.52 | wps 416037 | ups 0.7 | wpb 606106 | bsz 20497 | num_updates 20839 | lr 0.000438118 | gnorm 0.238 | clip 0% | oom 0 | loss_scale 0.500 | wall 31514 | train_wall 30349 | epoch 082 | valid on 'valid' subset | valid_loss 4.13176 | valid_nll_loss 2.41634 | valid_ppl 5.34 | num_updates 20839 | best 4.10913 | epoch 083 | loss 3.528 | nll_loss 1.815 | ppl 3.52 | wps 419487 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 21094 | lr 0.000435462 | gnorm 0.231 | clip 
0% | oom 0 | loss_scale 1.000 | wall 31888 | train_wall 30711 | epoch 083 | valid on 'valid' subset | valid_loss 4.11252 | valid_nll_loss 2.39742 | valid_ppl 5.27 | num_updates 21094 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 1.0 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 084 | loss 3.525 | nll_loss 1.812 | ppl 3.51 | wps 414776 | ups 0.7 | wpb 606122 | bsz 20501 | num_updates 21347 | lr 0.000432874 | gnorm 0.235 | clip 0% | oom 0 | loss_scale 0.500 | wall 32263 | train_wall 31075 | epoch 084 | valid on 'valid' subset | valid_loss 4.11151 | valid_nll_loss 2.4061 | valid_ppl 5.30 | num_updates 21347 | best 4.10913 | epoch 085 | loss 3.524 | nll_loss 1.810 | ppl 3.51 | wps 417739 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 21602 | lr 0.000430312 | gnorm 0.230 | clip 0% | oom 0 | loss_scale 1.000 | wall 32638 | train_wall 31439 | epoch 085 | valid on 'valid' subset | valid_loss 4.11544 | valid_nll_loss 2.40064 | valid_ppl 5.28 | num_updates 21602 | best 4.10913 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 086 | loss 3.522 | nll_loss 1.808 | ppl 3.50 | wps 418419 | ups 0.7 | wpb 606134 | bsz 20490 | num_updates 21856 | lr 0.000427804 | gnorm 0.228 | clip 0% | oom 0 | loss_scale 0.500 | wall 33012 | train_wall 31801 | epoch 086 | valid on 'valid' subset | valid_loss 4.11642 | valid_nll_loss 2.40221 | valid_ppl 5.29 | num_updates 21856 | best 4.10913 | epoch 087 | loss 3.520 | nll_loss 1.806 | ppl 3.50 | wps 416812 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 22111 | lr 0.00042533 | gnorm 0.231 | clip 0% | oom 0 | loss_scale 1.000 | wall 33388 | train_wall 32165 | epoch 087 | valid on 'valid' subset | valid_loss 4.10402 | valid_nll_loss 2.39342 | valid_ppl 5.25 | num_updates 22111 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 088 | loss 3.518 | nll_loss 1.803 | ppl 3.49 | wps 405982 | ups 0.7 | wpb 606103 | bsz 20492 | num_updates 22365 | lr 0.000422908 | gnorm 
0.230 | clip 0% | oom 0 | loss_scale 0.500 | wall 33767 | train_wall 32530 | epoch 088 | valid on 'valid' subset | valid_loss 4.12316 | valid_nll_loss 2.40788 | valid_ppl 5.31 | num_updates 22365 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 089 | loss 3.516 | nll_loss 1.801 | ppl 3.48 | wps 417705 | ups 0.7 | wpb 606079 | bsz 20481 | num_updates 22619 | lr 0.000420526 | gnorm 0.228 | clip 0% | oom 0 | loss_scale 0.500 | wall 34142 | train_wall 32892 | epoch 089 | valid on 'valid' subset | valid_loss 4.12686 | valid_nll_loss 2.41216 | valid_ppl 5.32 | num_updates 22619 | best 4.10402 | epoch 090 | loss 3.514 | nll_loss 1.799 | ppl 3.48 | wps 418590 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 22874 | lr 0.000418176 | gnorm 0.226 | clip 0% | oom 0 | loss_scale 1.000 | wall 34516 | train_wall 33255 | epoch 090 | valid on 'valid' subset | valid_loss 4.11654 | valid_nll_loss 2.40041 | valid_ppl 5.28 | num_updates 22874 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 091 | loss 3.512 | nll_loss 1.797 | ppl 3.48 | wps 417598 | ups 0.7 | wpb 606097 | bsz 20492 | num_updates 23128 | lr 0.000415873 | gnorm 0.229 | clip 0% | oom 0 | loss_scale 1.000 | wall 34891 | train_wall 33618 | epoch 091 | valid on 'valid' subset | valid_loss 4.11103 | valid_nll_loss 2.40214 | valid_ppl 5.29 | num_updates 23128 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 092 | loss 3.511 | nll_loss 1.795 | ppl 3.47 | wps 417188 | ups 0.7 | wpb 606096 | bsz 20491 | num_updates 23382 | lr 0.000413608 | gnorm 0.223 | clip 0% | oom 0 | loss_scale 1.000 | wall 35265 | train_wall 33981 | epoch 092 | valid on 'valid' subset | valid_loss 4.11728 | valid_nll_loss 2.4035 | valid_ppl 5.29 | num_updates 23382 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 093 | loss 3.509 | nll_loss 1.793 | ppl 3.47 | wps 416782 | ups 0.7 | wpb 606086 | bsz 20497 | num_updates 23636 | lr 
0.00041138 | gnorm 0.223 | clip 0% | oom 0 | loss_scale 1.000 | wall 35640 | train_wall 34345 | epoch 093 | valid on 'valid' subset | valid_loss 4.11036 | valid_nll_loss 2.40381 | valid_ppl 5.29 | num_updates 23636 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 094 | loss 3.507 | nll_loss 1.792 | ppl 3.46 | wps 417133 | ups 0.7 | wpb 606054 | bsz 20496 | num_updates 23890 | lr 0.000409187 | gnorm 0.222 | clip 0% | oom 0 | loss_scale 1.000 | wall 36015 | train_wall 34708 | epoch 094 | valid on 'valid' subset | valid_loss 4.11424 | valid_nll_loss 2.40188 | valid_ppl 5.28 | num_updates 23890 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 095 | loss 3.505 | nll_loss 1.789 | ppl 3.46 | wps 415008 | ups 0.7 | wpb 606135 | bsz 20488 | num_updates 24144 | lr 0.000407029 | gnorm 0.224 | clip 0% | oom 0 | loss_scale 1.000 | wall 36391 | train_wall 35073 | epoch 095 | valid on 'valid' subset | valid_loss 4.11066 | valid_nll_loss 2.40221 | valid_ppl 5.29 | num_updates 24144 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 096 | loss 3.504 | nll_loss 1.787 | ppl 3.45 | wps 416824 | ups 0.7 | wpb 606076 | bsz 20500 | num_updates 24398 | lr 0.000404905 | gnorm 0.223 | clip 0% | oom 0 | loss_scale 1.000 | wall 36766 | train_wall 35436 | epoch 096 | valid on 'valid' subset | valid_loss 4.11719 | valid_nll_loss 2.41252 | valid_ppl 5.32 | num_updates 24398 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 097 | loss 3.502 | nll_loss 1.786 | ppl 3.45 | wps 416670 | ups 0.7 | wpb 606092 | bsz 20483 | num_updates 24652 | lr 0.000402813 | gnorm 0.224 | clip 0% | oom 0 | loss_scale 1.000 | wall 37142 | train_wall 35800 | epoch 097 | valid on 'valid' subset | valid_loss 4.13188 | valid_nll_loss 2.41883 | valid_ppl 5.35 | num_updates 24652 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 098 | loss 3.501 | nll_loss 1.784 | ppl 3.44 | 
wps 415951 | ups 0.7 | wpb 606041 | bsz 20485 | num_updates 24906 | lr 0.000400754 | gnorm 0.220 | clip 0% | oom 0 | loss_scale 1.000 | wall 37517 | train_wall 36164 | epoch 098 | valid on 'valid' subset | valid_loss 4.12521 | valid_nll_loss 2.41424 | valid_ppl 5.33 | num_updates 24906 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 099 | loss 3.499 | nll_loss 1.782 | ppl 3.44 | wps 409216 | ups 0.7 | wpb 606073 | bsz 20488 | num_updates 25160 | lr 0.000398726 | gnorm 0.224 | clip 0% | oom 0 | loss_scale 0.500 | wall 37899 | train_wall 36534 | epoch 099 | valid on 'valid' subset | valid_loss 4.12852 | valid_nll_loss 2.42151 | valid_ppl 5.36 | num_updates 25160 | best 4.10402 | epoch 100 | loss 3.497 | nll_loss 1.780 | ppl 3.44 | wps 408100 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 25415 | lr 0.000396721 | gnorm 0.223 | clip 0% | oom 0 | loss_scale 1.000 | wall 38283 | train_wall 36907 | epoch 100 | valid on 'valid' subset | valid_loss 4.12674 | valid_nll_loss 2.41782 | valid_ppl 5.34 | num_updates 25415 | best 4.10402 | epoch 101 | loss 3.496 | nll_loss 1.779 | ppl 3.43 | wps 436073 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 25670 | lr 0.000394745 | gnorm 0.220 | clip 0% | oom 0 | loss_scale 1.000 | wall 357 | train_wall 37259 | epoch 101 | valid on 'valid' subset | valid_loss 4.10908 | valid_nll_loss 2.40401 | valid_ppl 5.29 | num_updates 25670 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 102 | loss 3.494 | nll_loss 1.777 | ppl 3.43 | wps 433598 | ups 0.7 | wpb 606118 | bsz 20489 | num_updates 25923 | lr 0.000392814 | gnorm 0.218 | clip 0% | oom 0 | loss_scale 0.500 | wall 718 | train_wall 37606 | epoch 102 | valid on 'valid' subset | valid_loss 4.11085 | valid_nll_loss 2.40357 | valid_ppl 5.29 | num_updates 25923 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 103 | loss 3.493 | nll_loss 1.775 | 
ppl 3.42 | wps 433563 | ups 0.7 | wpb 606060 | bsz 20496 | num_updates 26177 | lr 0.000390904 | gnorm 0.221 | clip 0% | oom 0 | loss_scale 0.500 | wall 1078 | train_wall 37955 | epoch 103 | valid on 'valid' subset | valid_loss 4.12682 | valid_nll_loss 2.41511 | valid_ppl 5.33 | num_updates 26177 | best 4.10402 | epoch 104 | loss 3.491 | nll_loss 1.774 | ppl 3.42 | wps 432747 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 26432 | lr 0.000389014 | gnorm 0.217 | clip 0% | oom 0 | loss_scale 1.000 | wall 1442 | train_wall 38306 | epoch 104 | valid on 'valid' subset | valid_loss 4.12065 | valid_nll_loss 2.411 | valid_ppl 5.32 | num_updates 26432 | best 4.10402 | epoch 105 | loss 3.490 | nll_loss 1.772 | ppl 3.42 | wps 434173 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 26687 | lr 0.000387151 | gnorm 0.218 | clip 0% | oom 0 | loss_scale 2.000 | wall 1804 | train_wall 38656 | epoch 105 | valid on 'valid' subset | valid_loss 4.11338 | valid_nll_loss 2.40723 | valid_ppl 5.30 | num_updates 26687 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 106 | loss 3.489 | nll_loss 1.770 | ppl 3.41 | wps 429243 | ups 0.7 | wpb 606072 | bsz 20480 | num_updates 26941 | lr 0.000385321 | gnorm 0.211 | clip 0% | oom 0 | loss_scale 1.000 | wall 2168 | train_wall 39009 | epoch 106 | valid on 'valid' subset | valid_loss 4.1265 | valid_nll_loss 2.41423 | valid_ppl 5.33 | num_updates 26941 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 107 | loss 3.487 | nll_loss 1.769 | ppl 3.41 | wps 428200 | ups 0.7 | wpb 606093 | bsz 20485 | num_updates 27195 | lr 0.000383518 | gnorm 0.216 | clip 0% | oom 0 | loss_scale 1.000 | wall 2533 | train_wall 39362 | epoch 107 | valid on 'valid' subset | valid_loss 4.11955 | valid_nll_loss 2.40804 | valid_ppl 5.31 | num_updates 27195 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 108 | loss 3.485 | nll_loss 1.767 | ppl 3.40 | wps 429234 | ups 0.7 | wpb 606081 | bsz 
20492 | num_updates 27449 | lr 0.000381739 | gnorm 0.212 | clip 0% | oom 0 | loss_scale 1.000 | wall 2897 | train_wall 39715 | epoch 108 | valid on 'valid' subset | valid_loss 4.11841 | valid_nll_loss 2.4091 | valid_ppl 5.31 | num_updates 27449 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 109 | loss 3.484 | nll_loss 1.765 | ppl 3.40 | wps 426984 | ups 0.7 | wpb 606124 | bsz 20484 | num_updates 27702 | lr 0.000379992 | gnorm 0.213 | clip 0% | oom 0 | loss_scale 0.500 | wall 3262 | train_wall 40068 | epoch 109 | valid on 'valid' subset | valid_loss 4.11931 | valid_nll_loss 2.40665 | valid_ppl 5.30 | num_updates 27702 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 110 | loss 3.483 | nll_loss 1.764 | ppl 3.40 | wps 428958 | ups 0.7 | wpb 606117 | bsz 20491 | num_updates 27956 | lr 0.000378262 | gnorm 0.220 | clip 0% | oom 0 | loss_scale 0.500 | wall 3631 | train_wall 40421 | epoch 110 | valid on 'valid' subset | valid_loss 4.12171 | valid_nll_loss 2.41519 | valid_ppl 5.33 | num_updates 27956 | best 4.10402 | epoch 111 | loss 3.481 | nll_loss 1.762 | ppl 3.39 | wps 431554 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 28211 | lr 0.000376548 | gnorm 0.220 | clip 0% | oom 0 | loss_scale 1.000 | wall 3994 | train_wall 40774 | epoch 111 | valid on 'valid' subset | valid_loss 4.11622 | valid_nll_loss 2.41447 | valid_ppl 5.33 | num_updates 28211 | best 4.10402 | epoch 112 | loss 3.480 | nll_loss 1.761 | ppl 3.39 | wps 431402 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 28466 | lr 0.000374858 | gnorm 0.216 | clip 0% | oom 0 | loss_scale 2.000 | wall 4358 | train_wall 41126 | epoch 112 | valid on 'valid' subset | valid_loss 4.13289 | valid_nll_loss 2.42718 | valid_ppl 5.38 | num_updates 28466 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 113 | loss 3.479 | 
nll_loss 1.760 | ppl 3.39 | wps 422126 | ups 0.7 | wpb 606078 | bsz 20473 | num_updates 28719 | lr 0.000373203 | gnorm 0.214 | clip 0% | oom 0 | loss_scale 0.500 | wall 4721 | train_wall 41477 | epoch 113 | valid on 'valid' subset | valid_loss 4.12228 | valid_nll_loss 2.41173 | valid_ppl 5.32 | num_updates 28719 | best 4.10402 | epoch 114 | loss 3.478 | nll_loss 1.758 | ppl 3.38 | wps 431317 | ups 0.7 | wpb 606100 | bsz 20489 | num_updates 28974 | lr 0.000371557 | gnorm 0.211 | clip 0% | oom 0 | loss_scale 1.000 | wall 5085 | train_wall 41830 | epoch 114 | valid on 'valid' subset | valid_loss 4.12547 | valid_nll_loss 2.41713 | valid_ppl 5.34 | num_updates 28974 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 115 | loss 3.476 | nll_loss 1.757 | ppl 3.38 | wps 428816 | ups 0.7 | wpb 606056 | bsz 20491 | num_updates 29228 | lr 0.000369939 | gnorm 0.219 | clip 0% | oom 0 | loss_scale 1.000 | wall 5450 | train_wall 42183 | epoch 115 | valid on 'valid' subset | valid_loss 4.12171 | valid_nll_loss 2.41477 | valid_ppl 5.33 | num_updates 29228 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 116 | loss 3.475 | nll_loss 1.755 | ppl 3.38 | wps 427869 | ups 0.7 | wpb 606056 | bsz 20494 | num_updates 29482 | lr 0.000368342 | gnorm 0.214 | clip 0% | oom 0 | loss_scale 0.500 | wall 5815 | train_wall 42536 | epoch 116 | valid on 'valid' subset | valid_loss 4.11343 | valid_nll_loss 2.40536 | valid_ppl 5.30 | num_updates 29482 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 117 | loss 3.474 | nll_loss 1.754 | ppl 3.37 | wps 430485 | ups 0.7 | wpb 606101 | bsz 20472 | num_updates 29736 | lr 0.000366766 | gnorm 0.217 | clip 0% | oom 0 | loss_scale 0.500 | wall 6178 | train_wall 42888 | epoch 117 | valid on 'valid' subset | valid_loss 4.12915 | valid_nll_loss 2.41987 | valid_ppl 5.35 | num_updates 29736 | best 4.10402 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 118 | loss 
3.473 | nll_loss 1.753 | ppl 3.37 | wps 429630 | ups 0.7 | wpb 606082 | bsz 20478 | num_updates 29990 | lr 0.000365209 | gnorm 0.212 | clip 0% | oom 0 | loss_scale 0.500 | wall 6542 | train_wall 43240 | epoch 118 | valid on 'valid' subset | valid_loss 4.1187 | valid_nll_loss 2.41071 | valid_ppl 5.32 | num_updates 29990 | best 4.10402 | epoch 119 | loss 3.464 | nll_loss 1.742 | ppl 3.34 | wps 430499 | ups 0.5 | wpb 608365 | bsz 20567 | num_updates 30000 | lr 0.000365148 | gnorm 0.263 | clip 0% | oom 0 | loss_scale 0.500 | wall 6561 | train_wall 43254 | epoch 119 | valid on 'valid' subset | valid_loss 4.11931 | valid_nll_loss 2.41024 | valid_ppl 5.32 | num_updates 30000 | best 4.10402
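A quick way to sanity-check a log like this one is to confirm that the reported `ppl` matches `nll_loss`. The sketch below assumes the relation ppl = 2 ** nll_loss (so nll_loss is in bits), which is consistent with the numbers above. The helper name `ppl_from_nll` is made up for illustration, and the values are copied from the log.

```python
# Assumption: ppl = 2 ** nll_loss, i.e. nll_loss is reported in base 2.
# The (nll_loss, ppl) pairs below are copied from the log above.
logged_pairs = [
    (1.867, 3.65),    # epoch 064, train
    (2.40721, 5.30),  # epoch 064, valid
    (1.742, 3.34),    # epoch 119, train
    (2.41024, 5.32),  # epoch 119, valid
]


def ppl_from_nll(nll_bits: float) -> float:
    """Hypothetical helper: perplexity implied by a base-2 NLL."""
    return 2.0 ** nll_bits


# Largest gap between the recomputed and the logged perplexity.
max_abs_diff = max(abs(ppl_from_nll(nll) - ppl) for nll, ppl in logged_pairs)
print(f"max |recomputed - logged| ppl: {max_abs_diff:.4f}")
```

The remaining gap is only rounding, since the log prints `nll_loss` and `ppl` to a few decimals.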
And here're my sacreBLEU scores:

| nt13 | nt14 | nt15 | nt16 | nt17 |
| --- | --- | --- | --- | --- |
| 27.3 | 28.7 | 30.9 | 34.3 | 28.9 |

Cool! Your sacrebleu is still a bit lower than what we've got. The biggest difference in our setups is the batch size: you have --max-tokens 5120 while we had --max-tokens 3584.
Also, did you average checkpoints or did you use checkpoint_best.pt for generation?
Here are the sacrebleu scores on our side:
| test set | sacrebleu |
| --- | --- |
| newstest13 | 27.53 |
| newstest14 | 29.03 |
| newstest15 | 31.05 |
| newstest16 | 34.83 |
| newstest17 | 28.85 |

It'd make sense to compare sacrebleu scores in the future to avoid any potential differences in tokenization.
Thanks for your reply! I took the last 10 epoch checkpoints and averaged them to get the final model. I did notice that the valid_loss of the last few epochs is slightly higher than the best valid_loss (overfitting, maybe?), and I am a little concerned about blindly averaging the last 10 epochs.
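For readers unfamiliar with checkpoint averaging, here is a minimal sketch of the idea: take the element-wise mean of each parameter across the selected checkpoints. It is not the script used in this thread. Plain Python lists stand in for tensors, and the function name `average_checkpoints` and the toy parameter name `w` are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Params]) -> Params:
    """Element-wise mean of each parameter across all given checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = set(checkpoints[0])
    for ckpt in checkpoints[1:]:
        if set(ckpt) != names:
            raise ValueError("checkpoints must share the same parameter names")
    count = len(checkpoints)
    averaged: Params = {}
    for name in checkpoints[0]:
        # zip(*...) groups the i-th element of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged


# Toy example with two "checkpoints" holding a single 2-element parameter.
toy = [{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}]
print(average_checkpoints(toy))
```

Because every checkpoint gets equal weight, a late epoch with a slightly worse valid_loss counts as much as the best one, which is the concern raised above.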
BTW, would you mind sharing your training log? Also, are there any empirical rules for choosing '--max-tokens'? When you set max-tokens = 3584 with 128 GPUs, does that mean 'wpb' equals 458,752?
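The arithmetic in this question can be checked directly. The sketch below treats max_tokens × GPUs × update_freq as an upper bound on tokens per update and compares it with the `wpb` value (417527) printed in the 128-GPU log posted in reply. The function name is hypothetical, and reading the product as an upper bound rather than an exact value is an assumption that happens to agree with that log.

```python
def max_tokens_per_update(max_tokens: int, num_gpus: int, update_freq: int = 1) -> int:
    """Hypothetical helper: upper bound on tokens contributing to one update."""
    return max_tokens * num_gpus * update_freq


# Values taken from the 128-GPU run in this thread
# (max_tokens=3584, distributed_world_size=128, update_freq=[1.0]).
upper_bound = max_tokens_per_update(3584, 128)
observed_wpb = 417527  # `wpb` as printed in that run's log

fill_ratio = observed_wpb / upper_bound
print(f"upper bound: {upper_bound}, observed wpb: {observed_wpb}, fill: {fill_ratio:.3f}")
```

The observed `wpb` comes out at roughly 91% of the bound, presumably because batches are built from whole sentences and rarely fill the token budget exactly.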
Sure, here is the training log. We used a different validation set during training, so the valid loss differs from what you see, but the training losses and everything else should be comparable. WPB was ~417k, quite a bit lower than what you have.
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, arch='transformer_wmt_en_de_big', attention_dropout=0.1, clip_norm=0.0, criterion='label_smoothed_cross_entropy', data='/private/home/edunov/wmt18_en_de_bpej32k/processed_jd', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_ffn_embed_dim=4096, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, device_id=0, distributed_backend='nccl', distributed_init_method='tcp://learnfair0250:12597', distributed_port=12597, distributed_rank=0, distributed_world_size=128, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fp16=True, label_smoothing=0.1, log_format=None, log_interval=1000, lr=[0.001], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=3584, max_update=30000, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, optimizer='adam', relu_dropout=0.0, restore_file='checkpoint_last.pt', sample_without_replacement=0, save_dir='/checkpoint/edunov/wmt18en2de32k.128/', save_interval=1, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, train_subset='train', update_freq=[1.0], valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0) | [en] dictionary: 35662 types | [de] dictionary: 35662 types | /private/home/edunov/wmt18_en_de_bpej32k/processed_jd train 5186259 examples | /private/home/edunov/wmt18_en_de_bpej32k/processed_jd valid 52385 examples | model transformer_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion | num. 
model params: 212875264 | training on 128 GPUs | max tokens per GPU = 3584 and max sentences per GPU = None | WARNING: overflow detected, setting loss scale to: 64.0 | WARNING: overflow detected, setting loss scale to: 32.0 | WARNING: overflow detected, setting loss scale to: 16.0 | epoch 001 | loss 11.640 | nll_loss 11.079 | ppl 2163.18 | wps 1.25011e+06 | ups 2.9 | wpb 417527 | bsz 14129 | num_updates 364 | lr 9.10909e-05 | gnorm 1.511 | clip 100% | oom 0 | loss_scale 16.000 | wall 153 | epoch 001 | valid on 'valid' subset | valid_loss 9.71483 | valid_nll_loss 8.79178 | valid_ppl 443.19 | epoch 002 | loss 8.941 | nll_loss 7.926 | ppl 243.28 | wps 1.26106e+06 | ups 3.0 | wpb 417527 | bsz 14131 | num_updates 731 | lr 0.000182832 | gnorm 1.385 | clip 100% | oom 0 | loss_scale 16.000 | wall 297 | epoch 002 | valid on 'valid' subset | valid_loss 7.68072 | valid_nll_loss 6.37039 | valid_ppl 82.73 | epoch 003 | loss 7.185 | nll_loss 5.891 | ppl 59.34 | wps 1.25327e+06 | ups 3.0 | wpb 417527 | bsz 14131 | num_updates 1098 | lr 0.000274573 | gnorm 1.259 | clip 100% | oom 0 | loss_scale 16.000 | wall 442 | epoch 003 | valid on 'valid' subset | valid_loss 6.20804 | valid_nll_loss 4.57014 | valid_ppl 23.75 | epoch 004 | loss 5.894 | nll_loss 4.406 | ppl 21.20 | wps 1.19568e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 1465 | lr 0.000366313 | gnorm 1.119 | clip 100% | oom 0 | loss_scale 16.000 | wall 592 | epoch 004 | valid on 'valid' subset | valid_loss 5.29065 | valid_nll_loss 3.50494 | valid_ppl 11.35 | epoch 005 | loss 5.172 | nll_loss 3.599 | ppl 12.11 | wps 1.23791e+06 | ups 3.0 | wpb 417527 | bsz 14131 | num_updates 1832 | lr 0.000458054 | gnorm 0.990 | clip 100% | oom 0 | loss_scale 16.000 | wall 739 | epoch 005 | valid on 'valid' subset | valid_loss 4.83572 | valid_nll_loss 3.03028 | valid_ppl 8.17 | epoch 006 | loss 4.792 | nll_loss 3.185 | ppl 9.09 | wps 1.23232e+06 | ups 3.0 | wpb 417527 | bsz 14131 | num_updates 2199 | lr 0.000549795 | gnorm 0.886 | clip 
100% | oom 0 | loss_scale 32.000 | wall 886 | epoch 006 | valid on 'valid' subset | valid_loss 4.53614 | valid_nll_loss 2.72333 | valid_ppl 6.60 | epoch 007 | loss 4.549 | nll_loss 2.922 | ppl 7.58 | wps 1.19250e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 2566 | lr 0.000641536 | gnorm 0.807 | clip 100% | oom 0 | loss_scale 32.000 | wall 1036 | epoch 007 | valid on 'valid' subset | valid_loss 4.36993 | valid_nll_loss 2.55868 | valid_ppl 5.89 | epoch 008 | loss 4.387 | nll_loss 2.746 | ppl 6.71 | wps 1.23403e+06 | ups 3.0 | wpb 417527 | bsz 14131 | num_updates 2933 | lr 0.000733277 | gnorm 0.746 | clip 100% | oom 0 | loss_scale 32.000 | wall 1183 | epoch 008 | valid on 'valid' subset | valid_loss 4.24912 | valid_nll_loss 2.44602 | valid_ppl 5.45 | WARNING: overflow detected, setting loss scale to: 16.0 | epoch 009 | loss 4.284 | nll_loss 2.637 | ppl 6.22 | wps 1.22207e+06 | ups 2.9 | wpb 417514 | bsz 14125 | num_updates 3299 | lr 0.000824768 | gnorm 0.699 | clip 100% | oom 0 | loss_scale 16.000 | wall 1330 | epoch 009 | valid on 'valid' subset | valid_loss 4.1818 | valid_nll_loss 2.3791 | valid_ppl 5.20 | WARNING: overflow detected, setting loss scale to: 8.0 | epoch 010 | loss 4.204 | nll_loss 2.551 | ppl 5.86 | wps 1.19654e+06 | ups 2.9 | wpb 417536 | bsz 14131 | num_updates 3665 | lr 0.000916258 | gnorm 0.663 | clip 100% | oom 0 | loss_scale 8.000 | wall 1480 | epoch 010 | valid on 'valid' subset | valid_loss 4.08806 | valid_nll_loss 2.30713 | valid_ppl 4.95 | epoch 011 | loss 4.152 | nll_loss 2.495 | ppl 5.64 | wps 1.23957e+06 | ups 3.0 | wpb 417527 | bsz 14131 | num_updates 4032 | lr 0.000996024 | gnorm 0.635 | clip 100% | oom 0 | loss_scale 8.000 | wall 1626 | epoch 011 | valid on 'valid' subset | valid_loss 4.06357 | valid_nll_loss 2.2798 | valid_ppl 4.86 | epoch 012 | loss 4.102 | nll_loss 2.441 | ppl 5.43 | wps 1.21411e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 4399 | lr 0.000953571 | gnorm 0.612 | clip 100% | oom 0 | loss_scale 8.000 | 
wall 1775 | epoch 012 | valid on 'valid' subset | valid_loss 4.0131 | valid_nll_loss 2.22744 | valid_ppl 4.68 | epoch 013 | loss 4.051 | nll_loss 2.386 | ppl 5.23 | wps 1.19611e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 4766 | lr 0.000916121 | gnorm 0.592 | clip 100% | oom 0 | loss_scale 8.000 | wall 1925 | epoch 013 | valid on 'valid' subset | valid_loss 3.98663 | valid_nll_loss 2.19515 | valid_ppl 4.58 | epoch 014 | loss 4.010 | nll_loss 2.341 | ppl 5.07 | wps 1.23297e+06 | ups 3.0 | wpb 417527 | bsz 14131 | num_updates 5133 | lr 0.000882763 | gnorm 0.574 | clip 100% | oom 0 | loss_scale 8.000 | wall 2072 | epoch 014 | valid on 'valid' subset | valid_loss 3.94049 | valid_nll_loss 2.16237 | valid_ppl 4.48 | epoch 015 | loss 3.974 | nll_loss 2.303 | ppl 4.93 | wps 1.20442e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 5500 | lr 0.000852803 | gnorm 0.559 | clip 100% | oom 0 | loss_scale 8.000 | wall 2222 | epoch 015 | valid on 'valid' subset | valid_loss 3.90732 | valid_nll_loss 2.1337 | valid_ppl 4.39 | WARNING: overflow detected, setting loss scale to: 8.0 | epoch 016 | loss 3.944 | nll_loss 2.270 | ppl 4.82 | wps 1.17817e+06 | ups 2.8 | wpb 417532 | bsz 14132 | num_updates 5866 | lr 0.00082577 | gnorm 0.545 | clip 100% | oom 0 | loss_scale 8.000 | wall 2374 | epoch 016 | valid on 'valid' subset | valid_loss 3.89654 | valid_nll_loss 2.11581 | valid_ppl 4.33 | epoch 017 | loss 3.917 | nll_loss 2.242 | ppl 4.73 | wps 1.22846e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 6233 | lr 0.00080109 | gnorm 0.533 | clip 100% | oom 0 | loss_scale 8.000 | wall 2521 | epoch 017 | valid on 'valid' subset | valid_loss 3.86626 | valid_nll_loss 2.09566 | valid_ppl 4.27 | epoch 018 | loss 3.894 | nll_loss 2.217 | ppl 4.65 | wps 1.20855e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 6600 | lr 0.000778499 | gnorm 0.521 | clip 100% | oom 0 | loss_scale 8.000 | wall 2670 | epoch 018 | valid on 'valid' subset | valid_loss 3.86693 | valid_nll_loss 2.08613 | 
valid_ppl 4.25 | epoch 019 | loss 3.874 | nll_loss 2.195 | ppl 4.58 | wps 1.19836e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 6967 | lr 0.000757717 | gnorm 0.511 | clip 100% | oom 0 | loss_scale 8.000 | wall 2813 | epoch 019 | valid on 'valid' subset | valid_loss 3.844 | valid_nll_loss 2.06793 | valid_ppl 4.19 | epoch 020 | loss 3.855 | nll_loss 2.174 | ppl 4.51 | wps 1.22883e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 7334 | lr 0.000738515 | gnorm 0.501 | clip 100% | oom 0 | loss_scale 8.000 | wall 2960 | epoch 020 | valid on 'valid' subset | valid_loss 3.83733 | valid_nll_loss 2.05955 | valid_ppl 4.17 | WARNING: overflow detected, setting loss scale to: 4.0 | epoch 021 | loss 3.838 | nll_loss 2.156 | ppl 4.46 | wps 1.18095e+06 | ups 2.8 | wpb 417535 | bsz 14130 | num_updates 7700 | lr 0.00072075 | gnorm 0.492 | clip 100% | oom 0 | loss_scale 4.000 | wall 3111 | epoch 021 | valid on 'valid' subset | valid_loss 3.82871 | valid_nll_loss 2.05135 | valid_ppl 4.14 | WARNING: overflow detected, setting loss scale to: 2.0 | epoch 022 | loss 3.824 | nll_loss 2.140 | ppl 4.41 | wps 1.2099e+06 | ups 2.9 | wpb 417541 | bsz 14129 | num_updates 8066 | lr 0.000704208 | gnorm 0.484 | clip 100% | oom 0 | loss_scale 2.000 | wall 3260 | epoch 022 | valid on 'valid' subset | valid_loss 3.80924 | valid_nll_loss 2.03893 | valid_ppl 4.11 | epoch 023 | loss 3.810 | nll_loss 2.125 | ppl 4.36 | wps 1.22117e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 8433 | lr 0.000688714 | gnorm 0.476 | clip 100% | oom 0 | loss_scale 2.000 | wall 3408 | epoch 023 | valid on 'valid' subset | valid_loss 3.79757 | valid_nll_loss 2.03064 | valid_ppl 4.09 | epoch 024 | loss 3.797 | nll_loss 2.111 | ppl 4.32 | wps 1.17552e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 8800 | lr 0.0006742 | gnorm 0.469 | clip 100% | oom 0 | loss_scale 2.000 | wall 3561 | epoch 024 | valid on 'valid' subset | valid_loss 3.79332 | valid_nll_loss 2.02798 | valid_ppl 4.08 | epoch 025 | loss 3.785 | 
nll_loss 2.098 | ppl 4.28 | wps 1.2025e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 9167 | lr 0.000660566 | gnorm 0.463 | clip 100% | oom 0 | loss_scale 2.000 | wall 3711 | epoch 025 | valid on 'valid' subset | valid_loss 3.79102 | valid_nll_loss 2.01968 | valid_ppl 4.05 | epoch 026 | loss 3.774 | nll_loss 2.086 | ppl 4.25 | wps 1.20569e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 9534 | lr 0.000647728 | gnorm 0.456 | clip 100% | oom 0 | loss_scale 2.000 | wall 3860 | epoch 026 | valid on 'valid' subset | valid_loss 3.7795 | valid_nll_loss 2.00929 | valid_ppl 4.03 | epoch 027 | loss 3.764 | nll_loss 2.075 | ppl 4.21 | wps 1.17741e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 9901 | lr 0.00063561 | gnorm 0.450 | clip 100% | oom 0 | loss_scale 2.000 | wall 4012 | epoch 027 | valid on 'valid' subset | valid_loss 3.7687 | valid_nll_loss 2.00443 | valid_ppl 4.01 | WARNING: overflow detected, setting loss scale to: 2.0 | epoch 028 | loss 3.754 | nll_loss 2.064 | ppl 4.18 | wps 1.19294e+06 | ups 2.9 | wpb 417530 | bsz 14130 | num_updates 10267 | lr 0.000624178 | gnorm 0.444 | clip 100% | oom 0 | loss_scale 2.000 | wall 4162 | epoch 028 | valid on 'valid' subset | valid_loss 3.76825 | valid_nll_loss 2.00362 | valid_ppl 4.01 | epoch 029 | loss 3.745 | nll_loss 2.054 | ppl 4.15 | wps 1.2047e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 10634 | lr 0.000613312 | gnorm 0.439 | clip 100% | oom 0 | loss_scale 2.000 | wall 4312 | epoch 029 | valid on 'valid' subset | valid_loss 3.75727 | valid_nll_loss 1.99507 | valid_ppl 3.99 | epoch 030 | loss 3.736 | nll_loss 2.045 | ppl 4.13 | wps 1.18305e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 11001 | lr 0.000602995 | gnorm 0.434 | clip 100% | oom 0 | loss_scale 2.000 | wall 4464 | epoch 030 | valid on 'valid' subset | valid_loss 3.75939 | valid_nll_loss 1.99226 | valid_ppl 3.98 | epoch 031 | loss 3.728 | nll_loss 2.036 | ppl 4.10 | wps 1.21293e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 
11368 | lr 0.000593182 | gnorm 0.429 | clip 100% | oom 0 | loss_scale 2.000 | wall 4606 | epoch 031 | valid on 'valid' subset | valid_loss 3.74884 | valid_nll_loss 1.9865 | valid_ppl 3.96 | epoch 032 | loss 3.720 | nll_loss 2.027 | ppl 4.08 | wps 1.2014e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 11735 | lr 0.000583833 | gnorm 0.424 | clip 100% | oom 0 | loss_scale 2.000 | wall 4756 | epoch 032 | valid on 'valid' subset | valid_loss 3.74646 | valid_nll_loss 1.97965 | valid_ppl 3.94 | epoch 033 | loss 3.713 | nll_loss 2.019 | ppl 4.05 | wps 1.18342e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 12102 | lr 0.000574912 | gnorm 0.419 | clip 100% | oom 0 | loss_scale 2.000 | wall 4908 | epoch 033 | valid on 'valid' subset | valid_loss 3.75254 | valid_nll_loss 1.98196 | valid_ppl 3.95 | epoch 034 | loss 3.706 | nll_loss 2.012 | ppl 4.03 | wps 1.21022e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 12469 | lr 0.000566388 | gnorm 0.414 | clip 100% | oom 0 | loss_scale 4.000 | wall 5049 | epoch 034 | valid on 'valid' subset | valid_loss 3.74087 | valid_nll_loss 1.97679 | valid_ppl 3.94 | epoch 035 | loss 3.699 | nll_loss 2.004 | ppl 4.01 | wps 1.19165e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 12836 | lr 0.000558233 | gnorm 0.410 | clip 100% | oom 0 | loss_scale 4.000 | wall 5200 | epoch 035 | valid on 'valid' subset | valid_loss 3.73453 | valid_nll_loss 1.97166 | valid_ppl 3.92 | epoch 036 | loss 3.693 | nll_loss 1.997 | ppl 3.99 | wps 1.18019e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 13203 | lr 0.000550419 | gnorm 0.406 | clip 100% | oom 0 | loss_scale 4.000 | wall 5352 | epoch 036 | valid on 'valid' subset | valid_loss 3.73513 | valid_nll_loss 1.97161 | valid_ppl 3.92 | epoch 037 | loss 3.687 | nll_loss 1.991 | ppl 3.97 | wps 1.20795e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 13570 | lr 0.000542925 | gnorm 0.402 | clip 100% | oom 0 | loss_scale 4.000 | wall 5493 | epoch 037 | valid on 'valid' subset | valid_loss 3.73009 
| valid_nll_loss 1.96275 | valid_ppl 3.90 | epoch 038 | loss 3.681 | nll_loss 1.984 | ppl 3.96 | wps 1.18936e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 13937 | lr 0.000535729 | gnorm 0.398 | clip 100% | oom 0 | loss_scale 4.000 | wall 5644 | epoch 038 | valid on 'valid' subset | valid_loss 3.72677 | valid_nll_loss 1.962 | valid_ppl 3.90 | epoch 039 | loss 3.676 | nll_loss 1.978 | ppl 3.94 | wps 1.19059e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 14304 | lr 0.000528812 | gnorm 0.395 | clip 100% | oom 0 | loss_scale 8.000 | wall 5796 | epoch 039 | valid on 'valid' subset | valid_loss 3.73099 | valid_nll_loss 1.96415 | valid_ppl 3.90 | WARNING: overflow detected, setting loss scale to: 4.0 | WARNING: overflow detected, setting loss scale to: 2.0 | epoch 040 | loss 3.670 | nll_loss 1.972 | ppl 3.92 | wps 1.19271e+06 | ups 2.9 | wpb 417549 | bsz 14136 | num_updates 14669 | lr 0.000522191 | gnorm 0.391 | clip 100% | oom 0 | loss_scale 2.000 | wall 5938 | epoch 040 | valid on 'valid' subset | valid_loss 3.72095 | valid_nll_loss 1.95941 | valid_ppl 3.89 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 041 | loss 3.665 | nll_loss 1.967 | ppl 3.91 | wps 1.18796e+06 | ups 2.8 | wpb 417531 | bsz 14132 | num_updates 15035 | lr 0.000515796 | gnorm 0.388 | clip 100% | oom 0 | loss_scale 1.000 | wall 6089 | epoch 041 | valid on 'valid' subset | valid_loss 3.71671 | valid_nll_loss 1.95491 | valid_ppl 3.88 | epoch 042 | loss 3.660 | nll_loss 1.961 | ppl 3.89 | wps 1.18676e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 15402 | lr 0.000509614 | gnorm 0.384 | clip 100% | oom 0 | loss_scale 1.000 | wall 6240 | epoch 042 | valid on 'valid' subset | valid_loss 3.72246 | valid_nll_loss 1.95729 | valid_ppl 3.88 | epoch 043 | loss 3.655 | nll_loss 1.956 | ppl 3.88 | wps 1.19254e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 15769 | lr 0.000503649 | gnorm 0.381 | clip 100% | oom 0 | loss_scale 1.000 | wall 6384 | epoch 043 | valid on 'valid' 
subset | valid_loss 3.71311 | valid_nll_loss 1.95362 | valid_ppl 3.87 | epoch 044 | loss 3.651 | nll_loss 1.951 | ppl 3.87 | wps 1.18713e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 16136 | lr 0.000497888 | gnorm 0.378 | clip 100% | oom 0 | loss_scale 1.000 | wall 6536 | epoch 044 | valid on 'valid' subset | valid_loss 3.71501 | valid_nll_loss 1.95122 | valid_ppl 3.87 | epoch 045 | loss 3.646 | nll_loss 1.946 | ppl 3.85 | wps 1.19732e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 16503 | lr 0.000492321 | gnorm 0.375 | clip 100% | oom 0 | loss_scale 1.000 | wall 6679 | epoch 045 | valid on 'valid' subset | valid_loss 3.71263 | valid_nll_loss 1.94888 | valid_ppl 3.86 | epoch 046 | loss 3.642 | nll_loss 1.941 | ppl 3.84 | wps 1.18602e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 16870 | lr 0.000486937 | gnorm 0.372 | clip 100% | oom 0 | loss_scale 1.000 | wall 6831 | epoch 046 | valid on 'valid' subset | valid_loss 3.71175 | valid_nll_loss 1.95164 | valid_ppl 3.87 | epoch 047 | loss 3.638 | nll_loss 1.936 | ppl 3.83 | wps 1.19057e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 17237 | lr 0.000481725 | gnorm 0.369 | clip 100% | oom 0 | loss_scale 2.000 | wall 6982 | epoch 047 | valid on 'valid' subset | valid_loss 3.71237 | valid_nll_loss 1.94909 | valid_ppl 3.86 | epoch 048 | loss 3.633 | nll_loss 1.932 | ppl 3.82 | wps 1.19919e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 17604 | lr 0.000476677 | gnorm 0.366 | clip 100% | oom 0 | loss_scale 2.000 | wall 7125 | epoch 048 | valid on 'valid' subset | valid_loss 3.70898 | valid_nll_loss 1.94564 | valid_ppl 3.85 | epoch 049 | loss 3.630 | nll_loss 1.928 | ppl 3.80 | wps 1.19642e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 17971 | lr 0.000471785 | gnorm 0.363 | clip 100% | oom 0 | loss_scale 2.000 | wall 7275 | epoch 049 | valid on 'valid' subset | valid_loss 3.70973 | valid_nll_loss 1.94521 | valid_ppl 3.85 | epoch 050 | loss 3.626 | nll_loss 1.923 | ppl 3.79 | wps 1.19462e+06 | 
ups 2.9 | wpb 417527 | bsz 14131 | num_updates 18338 | lr 0.00046704 | gnorm 0.361 | clip 100% | oom 0 | loss_scale 2.000 | wall 7418 | epoch 050 | valid on 'valid' subset | valid_loss 3.70251 | valid_nll_loss 1.94036 | valid_ppl 3.84 | epoch 051 | loss 3.622 | nll_loss 1.919 | ppl 3.78 | wps 1.20464e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 18705 | lr 0.000462435 | gnorm 0.358 | clip 100% | oom 0 | loss_scale 2.000 | wall 7567 | epoch 051 | valid on 'valid' subset | valid_loss 3.70612 | valid_nll_loss 1.94026 | valid_ppl 3.84 | epoch 052 | loss 3.619 | nll_loss 1.915 | ppl 3.77 | wps 1.19926e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 19072 | lr 0.000457965 | gnorm 0.356 | clip 100% | oom 0 | loss_scale 4.000 | wall 7710 | epoch 052 | valid on 'valid' subset | valid_loss 3.69991 | valid_nll_loss 1.94119 | valid_ppl 3.84 | epoch 053 | loss 3.615 | nll_loss 1.911 | ppl 3.76 | wps 1.18598e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 19439 | lr 0.000453621 | gnorm 0.354 | clip 100% | oom 0 | loss_scale 4.000 | wall 7861 | epoch 053 | valid on 'valid' subset | valid_loss 3.70723 | valid_nll_loss 1.94351 | valid_ppl 3.85 | epoch 054 | loss 3.612 | nll_loss 1.908 | ppl 3.75 | wps 1.20073e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 19806 | lr 0.000449398 | gnorm 0.351 | clip 100% | oom 0 | loss_scale 4.000 | wall 8004 | epoch 054 | valid on 'valid' subset | valid_loss 3.69568 | valid_nll_loss 1.9391 | valid_ppl 3.83 | WARNING: overflow detected, setting loss scale to: 2.0 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 055 | loss 3.608 | nll_loss 1.904 | ppl 3.74 | wps 1.1882e+06 | ups 2.8 | wpb 417519 | bsz 14139 | num_updates 20171 | lr 0.000445314 | gnorm 0.349 | clip 100% | oom 0 | loss_scale 1.000 | wall 8154 | epoch 055 | valid on 'valid' subset | valid_loss 3.69511 | valid_nll_loss 1.93727 | valid_ppl 3.83 | epoch 056 | loss 3.605 | nll_loss 1.901 | ppl 3.73 | wps 1.18361e+06 | ups 2.8 | wpb 417527 | bsz 14131 | 
num_updates 20538 | lr 0.000441317 | gnorm 0.347 | clip 100% | oom 0 | loss_scale 1.000 | wall 8306 | epoch 056 | valid on 'valid' subset | valid_loss 3.69092 | valid_nll_loss 1.93069 | valid_ppl 3.81 | epoch 057 | loss 3.602 | nll_loss 1.897 | ppl 3.72 | wps 1.19562e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 20905 | lr 0.000437426 | gnorm 0.345 | clip 100% | oom 0 | loss_scale 1.000 | wall 8456 | epoch 057 | valid on 'valid' subset | valid_loss 3.69559 | valid_nll_loss 1.93252 | valid_ppl 3.82 | epoch 058 | loss 3.599 | nll_loss 1.894 | ppl 3.72 | wps 1.19241e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 21272 | lr 0.000433637 | gnorm 0.343 | clip 100% | oom 0 | loss_scale 1.000 | wall 8600 | epoch 058 | valid on 'valid' subset | valid_loss 3.68833 | valid_nll_loss 1.933 | valid_ppl 3.82 | epoch 059 | loss 3.596 | nll_loss 1.891 | ppl 3.71 | wps 1.19256e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 21639 | lr 0.000429944 | gnorm 0.341 | clip 100% | oom 0 | loss_scale 1.000 | wall 8751 | epoch 059 | valid on 'valid' subset | valid_loss 3.68828 | valid_nll_loss 1.93201 | valid_ppl 3.82 | epoch 060 | loss 3.593 | nll_loss 1.887 | ppl 3.70 | wps 1.19842e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 22006 | lr 0.000426343 | gnorm 0.339 | clip 100% | oom 0 | loss_scale 1.000 | wall 8902 | epoch 060 | valid on 'valid' subset | valid_loss 3.68802 | valid_nll_loss 1.93372 | valid_ppl 3.82 | epoch 061 | loss 3.590 | nll_loss 1.884 | ppl 3.69 | wps 1.18828e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 22373 | lr 0.000422832 | gnorm 0.337 | clip 100% | oom 0 | loss_scale 2.000 | wall 9053 | epoch 061 | valid on 'valid' subset | valid_loss 3.68782 | valid_nll_loss 1.93286 | valid_ppl 3.82 | epoch 062 | loss 3.588 | nll_loss 1.881 | ppl 3.68 | wps 1.19117e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 22740 | lr 0.000419406 | gnorm 0.335 | clip 100% | oom 0 | loss_scale 2.000 | wall 9203 | epoch 062 | valid on 'valid' subset | 
valid_loss 3.6891 | valid_nll_loss 1.93548 | valid_ppl 3.83 | epoch 063 | loss 3.585 | nll_loss 1.878 | ppl 3.68 | wps 1.20138e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 23107 | lr 0.000416062 | gnorm 0.333 | clip 100% | oom 0 | loss_scale 2.000 | wall 9346 | epoch 063 | valid on 'valid' subset | valid_loss 3.68706 | valid_nll_loss 1.92768 | valid_ppl 3.80 | epoch 064 | loss 3.582 | nll_loss 1.875 | ppl 3.67 | wps 1.18878e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 23474 | lr 0.000412797 | gnorm 0.331 | clip 100% | oom 0 | loss_scale 2.000 | wall 9496 | epoch 064 | valid on 'valid' subset | valid_loss 3.68834 | valid_nll_loss 1.9309 | valid_ppl 3.81 | epoch 065 | loss 3.580 | nll_loss 1.873 | ppl 3.66 | wps 1.19358e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 23841 | lr 0.000409607 | gnorm 0.329 | clip 100% | oom 0 | loss_scale 2.000 | wall 9640 | epoch 065 | valid on 'valid' subset | valid_loss 3.68952 | valid_nll_loss 1.93163 | valid_ppl 3.81 | WARNING: overflow detected, setting loss scale to: 1.0 | epoch 066 | loss 3.577 | nll_loss 1.869 | ppl 3.65 | wps 1.19893e+06 | ups 2.9 | wpb 417527 | bsz 14128 | num_updates 24207 | lr 0.000406499 | gnorm 0.328 | clip 100% | oom 0 | loss_scale 1.000 | wall 9782 | epoch 066 | valid on 'valid' subset | valid_loss 3.68436 | valid_nll_loss 1.92702 | valid_ppl 3.80 | epoch 067 | loss 3.575 | nll_loss 1.867 | ppl 3.65 | wps 1.18887e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 24574 | lr 0.000403452 | gnorm 0.326 | clip 100% | oom 0 | loss_scale 1.000 | wall 9933 | epoch 067 | valid on 'valid' subset | valid_loss 3.68274 | valid_nll_loss 1.92629 | valid_ppl 3.80 | WARNING: overflow detected, setting loss scale to: 0.5 | epoch 068 | loss 3.573 | nll_loss 1.864 | ppl 3.64 | wps 1.19112e+06 | ups 2.9 | wpb 417537 | bsz 14129 | num_updates 24940 | lr 0.000400481 | gnorm 0.324 | clip 100% | oom 0 | loss_scale 0.500 | wall 10083 | epoch 068 | valid on 'valid' subset | valid_loss 3.68672 | 
valid_nll_loss 1.9272 | valid_ppl 3.80 | epoch 069 | loss 3.570 | nll_loss 1.862 | ppl 3.63 | wps 1.19853e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 25307 | lr 0.000397566 | gnorm 0.323 | clip 100% | oom 0 | loss_scale 0.500 | wall 10226 | epoch 069 | valid on 'valid' subset | valid_loss 3.68245 | valid_nll_loss 1.9233 | valid_ppl 3.79 | epoch 070 | loss 3.568 | nll_loss 1.859 | ppl 3.63 | wps 1.19216e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 25674 | lr 0.000394715 | gnorm 0.321 | clip 100% | oom 0 | loss_scale 0.500 | wall 10376 | epoch 070 | valid on 'valid' subset | valid_loss 3.6806 | valid_nll_loss 1.9243 | valid_ppl 3.80 | epoch 071 | loss 3.566 | nll_loss 1.857 | ppl 3.62 | wps 1.19508e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 26041 | lr 0.000391923 | gnorm 0.320 | clip 100% | oom 0 | loss_scale 0.500 | wall 10527 | epoch 071 | valid on 'valid' subset | valid_loss 3.679 | valid_nll_loss 1.92612 | valid_ppl 3.80 | epoch 072 | loss 3.563 | nll_loss 1.854 | ppl 3.62 | wps 1.20711e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 26408 | lr 0.000389191 | gnorm 0.318 | clip 100% | oom 0 | loss_scale 0.500 | wall 10676 | epoch 072 | valid on 'valid' subset | valid_loss 3.68233 | valid_nll_loss 1.92642 | valid_ppl 3.80 | epoch 073 | loss 3.561 | nll_loss 1.852 | ppl 3.61 | wps 1.18421e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 26775 | lr 0.000386514 | gnorm 0.317 | clip 100% | oom 0 | loss_scale 0.500 | wall 10820 | epoch 073 | valid on 'valid' subset | valid_loss 3.68365 | valid_nll_loss 1.92648 | valid_ppl 3.80 | epoch 074 | loss 3.559 | nll_loss 1.849 | ppl 3.60 | wps 1.20372e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 27142 | lr 0.000383892 | gnorm 0.315 | clip 100% | oom 0 | loss_scale 1.000 | wall 10962 | epoch 074 | valid on 'valid' subset | valid_loss 3.67809 | valid_nll_loss 1.92431 | valid_ppl 3.80 | epoch 075 | loss 3.557 | nll_loss 1.847 | ppl 3.60 | wps 1.1912e+06 | ups 2.9 | wpb 417527 | bsz 14131 
| num_updates 27509 | lr 0.000381323 | gnorm 0.314 | clip 100% | oom 0 | loss_scale 1.000 | wall 11113 | epoch 075 | valid on 'valid' subset | valid_loss 3.6789 | valid_nll_loss 1.92091 | valid_ppl 3.79 | epoch 076 | loss 3.555 | nll_loss 1.845 | ppl 3.59 | wps 1.18644e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 27876 | lr 0.000378804 | gnorm 0.313 | clip 100% | oom 0 | loss_scale 1.000 | wall 11257 | epoch 076 | valid on 'valid' subset | valid_loss 3.67806 | valid_nll_loss 1.92732 | valid_ppl 3.80 | epoch 077 | loss 3.553 | nll_loss 1.843 | ppl 3.59 | wps 1.20484e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 28243 | lr 0.000376335 | gnorm 0.311 | clip 100% | oom 0 | loss_scale 1.000 | wall 11406 | epoch 077 | valid on 'valid' subset | valid_loss 3.67593 | valid_nll_loss 1.92596 | valid_ppl 3.80 | epoch 078 | loss 3.551 | nll_loss 1.840 | ppl 3.58 | wps 1.19576e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 28610 | lr 0.000373913 | gnorm 0.310 | clip 100% | oom 0 | loss_scale 1.000 | wall 11557 | epoch 078 | valid on 'valid' subset | valid_loss 3.67519 | valid_nll_loss 1.92406 | valid_ppl 3.79 | epoch 079 | loss 3.549 | nll_loss 1.838 | ppl 3.58 | wps 1.19275e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 28977 | lr 0.000371538 | gnorm 0.309 | clip 100% | oom 0 | loss_scale 2.000 | wall 11708 | epoch 079 | valid on 'valid' subset | valid_loss 3.67164 | valid_nll_loss 1.92446 | valid_ppl 3.80 | epoch 080 | loss 3.547 | nll_loss 1.836 | ppl 3.57 | wps 1.20161e+06 | ups 2.9 | wpb 417527 | bsz 14131 | num_updates 29344 | lr 0.000369207 | gnorm 0.308 | clip 100% | oom 0 | loss_scale 2.000 | wall 11858 | epoch 080 | valid on 'valid' subset | valid_loss 3.6826 | valid_nll_loss 1.926 | valid_ppl 3.80 | epoch 081 | loss 3.546 | nll_loss 1.834 | ppl 3.57 | wps 1.18998e+06 | ups 2.8 | wpb 417527 | bsz 14131 | num_updates 29711 | lr 0.00036692 | gnorm 0.306 | clip 100% | oom 0 | loss_scale 2.000 | wall 12003 | epoch 081 | valid on 'valid' subset | 
valid_loss 3.67319 | valid_nll_loss 1.92212 | valid_ppl 3.79 | epoch 082 | loss 3.539 | nll_loss 1.826 | ppl 3.55 | wps 1.18456e+06 | ups 2.8 | wpb 418263 | bsz 14169 | num_updates 30000 | lr 0.000365148 | gnorm 0.305 | clip 100% | oom 0 | loss_scale 2.000 | wall 12120 | epoch 082 | valid on 'valid' subset | valid_loss 3.67481 | valid_nll_loss 1.92092 | valid_ppl 3.79 | done training in 12132.2 seconds
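A quick sanity check on the log above: the reported `ppl` values look consistent with perplexity computed as 2 raised to `nll_loss`, i.e. with the NLL expressed in bits. The sketch below only illustrates that relationship. The helper name `perplexity_from_nll` is hypothetical, and the base-2 assumption is inferred from the logged numbers rather than confirmed anywhere in this thread.

```python
def perplexity_from_nll(nll_loss_bits: float) -> float:
    """Perplexity for a per-token NLL given in bits (base 2 is an assumption)."""
    return 2.0 ** nll_loss_bits


# (nll_loss, logged ppl) pairs copied from the training log above.
logged = [
    (2.027, 4.08),    # epoch 032, train
    (1.826, 3.55),    # epoch 082, train
    (1.92092, 3.79),  # epoch 082, valid
]

for nll, ppl in logged:
    print(f"nll_loss={nll} -> ppl={perplexity_from_nll(nll):.2f} (logged {ppl})")
```

If the recomputed values match the logged ones to two decimals, the loss and perplexity columns can be read interchangeably when comparing runs.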
Thanks a million! I noticed that you're using 'transformer_wmt_en_de_big', which is also different from my setup. I am re-running the experiment and hope to get the same numbers as yours.
Hi, I re-ran the setup with 'transformer_wmt_en_de_big', and here are my latest results:
| test set | sacrebleu |
| --- | --- |
| newstest13 | 27.5 |
| newstest14 | 28.9 |
| newstest15 | 31.4 |
| newstest16 | 35.0 |
| newstest17 | 29.1 |
Surprisingly, changing the model to 'transformer_wmt_en_de_big' seems to close the gap, and the results are more comparable now. Again, thank you very much for your help.
Here's my latest training options, in case someone needs it. Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_wmt_en_de_big', attention_dropout=0.1, bucket_cap_mb=150, clip_norm=0.0, criterion='label_smoothed_cross_entropy', data=['/root/data'], ddp_backend='no_c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, distributed_backend='nccl', distributed_init_method='XXX.XXX.XXX.XXX:XXX', distributed_port=-1, distributed_rank=0, distributed_world_size=64, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fp16=True, fp16_init_scale=128, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.001], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=200, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=7168, max_update=30000, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=True, no_save=False, no_token_positional_embeddings=False, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.0, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='/root/model', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, share_encoder_decoder_input_embeddings=False, simp_avg_attention=4, skip_invalid_size_inputs_valid_test=False, source_lang='src', target_lang='tgt', task='translation', train_subset='train', update_freq=[1], 
upsample_primary=1, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0, weight_range=0.0)
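For anyone comparing logs against these options: the `lr` values logged earlier in this thread appear to follow a linear warmup followed by inverse square-root decay, using `lr=0.001`, `warmup_updates=4000` and `warmup_init_lr=1e-07`. The sketch below reflects my understanding of that schedule and is not copied from the fairseq source. The function name `inverse_sqrt_lr` is hypothetical.

```python
import math


def inverse_sqrt_lr(num_updates: int,
                    peak_lr: float = 1e-3,
                    warmup_updates: int = 4000,
                    warmup_init_lr: float = 1e-7) -> float:
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(num_updates)."""
    if num_updates < warmup_updates:
        step = (peak_lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + num_updates * step
    return peak_lr * math.sqrt(warmup_updates / num_updates)


# (num_updates, logged lr) pairs copied from the training log in this thread.
logged = [(11735, 0.000583833), (20171, 0.000445314), (30000, 0.000365148)]

for updates, lr in logged:
    print(f"updates={updates}: computed={inverse_sqrt_lr(updates):.6g}, logged={lr}")
```

Agreement between the computed and logged values suggests the scheduler settings were applied as specified, so the learning rate is unlikely to explain the BLEU difference between runs.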
@edunov one last question: which newstest2014 version are you using? full (3003)?
Cheers, Stephan
@stephanpeitz yes I'm using wmt14/full
Hi FAIR Team,
I am interested in reproducing the En-De results from the paper "Understanding Back-Translation at Scale" that you published last year. Following the description in the paper, I used all available bitext (WMT2018) except ParaCrawl and preprocessed the data similarly to https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-wmt14en2de.sh, which applies NORM_PUNC -> REM_NON_PRINT_CHAR -> TOKENIZER to the training and validation data and only TOKENIZER to the test data.
Then I trained the model with fairseq v0.6.0 on 64 GPUs (max updates = 30k) using the following parameters:
python pytorch/train.py ~/data --source-lang src --target-lang tgt --distributed-world-size 64 --distributed-backend 'nccl' --distributed-init-method 'tcp://XXX.XXX.XXX.XXX:XXXX' --distributed-rank 0 --device-id 0 --ddp-backend no_c10d --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.001 --min-lr 1e-09 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --save-dir ~/model --no-progress-bar --log-interval 1000 --dropout 0.3 --max-update 30000 --max-epoch 150 --max-tokens 5120 --update-freq 2 --fp16 --seed 1234567
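As a rough aid for comparing setups, the flags above imply an upper bound on the number of tokens per update. The sketch assumes that `--max-tokens` is a per-GPU cap and that `--update-freq` accumulates that many batches before each update. The helper name `max_tokens_per_update` is hypothetical, and the observed `wpb` value is taken from the training log in this thread.

```python
def max_tokens_per_update(max_tokens: int, world_size: int, update_freq: int) -> int:
    """Upper bound on tokens per update (assumes max_tokens is a per-GPU limit)."""
    return max_tokens * world_size * update_freq


# Flags from the command above: --max-tokens 5120, 64 GPUs, --update-freq 2.
first_run = max_tokens_per_update(5120, 64, 2)

# Options from the later re-run in this thread: max_tokens=7168, update_freq=[1].
rerun = max_tokens_per_update(7168, 64, 1)

observed_wpb = 417527  # "wpb" reported in the training log
print(f"first run upper bound: {first_run}, fill ratio: {observed_wpb / first_run:.2f}")
print(f"re-run upper bound:    {rerun}")
```

The logged `wpb` sits below the bound because batches are rarely packed completely, so the bound is only a ceiling and not the actual batch size.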
However, all of my results are worse than the paper's, especially on newstest2014, where I am about 1.5 BLEU lower.
| source | newstest2013 | newstest2014 | newstest2015 | newstest2016 | newstest2017 |
| --- | --- | --- | --- | --- | --- |
| paper | 27.84 | 30.88 | 31.82 | 34.98 | 29.46 |
| my | 27.64 | 29.34 | 31.16 | 34.36 | 29.09 |
I am wondering if you could share some insights and perhaps give me some advice on how to move forward. Thank you very much in advance!
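For reference, the per-test-set gaps between the paper's numbers and mine can be computed directly from the scores reported in this post. This is just the arithmetic behind the "about 1.5 BLEU" figure; the variable names are illustrative.

```python
# BLEU scores copied from the comparison in this post.
paper = {"newstest2013": 27.84, "newstest2014": 30.88, "newstest2015": 31.82,
         "newstest2016": 34.98, "newstest2017": 29.46}
mine = {"newstest2013": 27.64, "newstest2014": 29.34, "newstest2015": 31.16,
        "newstest2016": 34.36, "newstest2017": 29.09}

# Positive gap means the paper's score is higher.
gaps = {name: round(paper[name] - mine[name], 2) for name in paper}
largest = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} BLEU")
print(f"largest gap: {largest}")
```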