facebookresearch / DisCo

DisCo Transformer for Non-autoregressive MT

need suggestions for ende training #1

Closed alphadl closed 4 years ago

alphadl commented 4 years ago

Hi, Kasai ~ Great work!

For En-De, I get ~27.3 BLEU with your pretrained model, but when I train from scratch on your provided distilled data, I only reach ~24 BLEU, and the loss curve looks weird. Could you give me some advice on how to properly reproduce your results?

I trained on 8 V100 GPUs with the following script:

python train.py ./wmt16.en-de.disco.dist/ --arch disco_transformer \
--criterion label_smoothed_length_cross_entropy \
--label-smoothing 0.1 \
--lr 5e-4 --warmup-init-lr 1e-7 --min-lr 1e-9 --lr-scheduler inverse_sqrt \
--warmup-updates 10000 \
--optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 \
--task translation_self \
--max-tokens 16000 \
--weight-decay 0.01 \
--dropout 0.2 \
--encoder-layers 6 --encoder-embed-dim 512 --decoder-layers 6 --decoder-embed-dim 512 \
--max-source-positions 10000  --max-target-positions 10000 \
--max-update 100000 --seed 1 \
--save-dir checkpoints/wmt16.en-de.nat_authordata \
--dynamic-masking --ignore-eos-loss --share-all-embeddings \
--keep-last-epochs 20  \
--no-progress-bar --log-format simple --log-interval 100 --save-interval-updates 2000 \
--fp16 --ddp-backend=c10d --update-freq 4
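
To make the optimization settings in the command above easier to reason about, here is a minimal sketch of the learning-rate schedule and the effective batch size they imply. It assumes that `--lr-scheduler inverse_sqrt` performs a linear warmup from `--warmup-init-lr` to `--lr` over `--warmup-updates` steps and then decays in proportion to the inverse square root of the update number. The helper names `inverse_sqrt_lr` and `tokens_per_update` are hypothetical and are not part of fairseq or DisCo.

```python
import math


def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_init_lr=1e-7, warmup_updates=10000):
    """Learning rate at `step`, assuming linear warmup then inverse-sqrt decay."""
    if step < warmup_updates:
        # Linear interpolation from warmup_init_lr up to peak_lr.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup, decay proportionally to 1 / sqrt(step).
    return peak_lr * math.sqrt(warmup_updates / step)


def tokens_per_update(max_tokens=16000, num_gpus=8, update_freq=4):
    """Upper bound on tokens per optimizer step (--max-tokens is a per-GPU cap)."""
    return max_tokens * num_gpus * update_freq


if __name__ == "__main__":
    for step in (1000, 10000, 13133, 100000):
        print(f"update {step:>6}: lr = {inverse_sqrt_lr(step):.2e}")
    print("max tokens per update:", tokens_per_update())
```

Under these assumptions the learning rate peaks at 5e-4 at update 10000, and each optimizer step sees up to 512,000 tokens (8 GPUs, 16000 tokens each, 4 accumulated batches).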

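The validation loss during training looks weird. It decreases steadily to 5.954 by epoch 45 (12575 updates), then jumps to 12.413 at epoch 47 (13133 updates) and never returns to its earlier best: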

Namespace(adam_betas='(0.9, 0.999)', adam_eps=1e-06, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='disco_transformer', at_only=False, at_rm=False, attention_dropout=0.0, best_checkpoint_metric='loss', bilm_add_bos=False, bilm_attention_dropout=0.0, bilm_mask_last_state=False, bilm_model_dropout=0.1, bilm_relu_dropout=0.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_length_cross_entropy', curriculum=0, data=['wmt16.en-de.disco.dist'], dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_embed_scale=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:12097', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, dropout=0.2, dynamic_length=False, dynamic_masking=True, embedding_only=False, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_embed_scale=None, encoder_ffn_embed_dim=2048, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, find_unused_parameters=False, fix_batches_to_gpus=False, fp16=True, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, full_masking=False, ignore_eos_loss=True, keep_interval_updates=-1, keep_last_epochs=20, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format='simple', log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_range=False, maskp=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=10000, max_target_positions=10000, max_tokens=16000, max_tokens_valid=16000, max_update=300000, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, mix_masking=False, 
no_dec_token_positional_embeddings=False, no_enc_token_positional_embeddings=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=True, no_save=False, no_save_optimizer_state=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', perm_only=False, raw_text=False, relu_dropout=0.0, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/wmt16.en-de.nat_authordata', save_interval=1, save_interval_updates=2000, seed=1, self_target=False, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, share_layers=False, skip_eos=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation_self', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[4], upsample_primary=1, use_bmuf=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.01)
| $path/wmt16.en-de.disco.dist/ valid 3000 examples
| epoch 001 | valid on 'valid' subset | loss 13.232 | nll_loss 12.570 | ppl 6079.24 | num_updates 275 | length_loss 11.3722
| epoch 002 | valid on 'valid' subset | loss 12.160 | nll_loss 11.408 | ppl 2717.66 | num_updates 555 | best_loss 12.1599 | length_loss 8.86187
| epoch 003 | valid on 'valid' subset | loss 11.816 | nll_loss 11.056 | ppl 2129.57 | num_updates 834 | best_loss 11.8161 | length_loss 8.1737
| epoch 004 | valid on 'valid' subset | loss 11.327 | nll_loss 10.515 | ppl 1463.34 | num_updates 1113 | best_loss 11.3275 | length_loss 7.40185
| epoch 005 | valid on 'valid' subset | loss 10.940 | nll_loss 10.061 | ppl 1068.39 | num_updates 1393 | best_loss 10.9401 | length_loss 6.81688
| epoch 006 | valid on 'valid' subset | loss 10.546 | nll_loss 9.588 | ppl 769.47 | num_updates 1671 | best_loss 10.5464 | length_loss 6.52779
| epoch 007 | valid on 'valid' subset | loss 10.192 | nll_loss 9.159 | ppl 571.58 | num_updates 1951 | best_loss 10.1923 | length_loss 6.09935
| epoch 008 | valid on 'valid' subset | loss 10.177 | nll_loss 9.126 | ppl 558.89 | num_updates 2000 | best_loss 10.177 | length_loss 6.11036
| epoch 008 | valid on 'valid' subset | loss 9.937 | nll_loss 8.814 | ppl 450.07 | num_updates 2231 | best_loss 9.93728 | length_loss 5.94912
| epoch 009 | valid on 'valid' subset | loss 9.726 | nll_loss 8.553 | ppl 375.56 | num_updates 2510 | best_loss 9.72613 | length_loss 5.94019
| epoch 010 | valid on 'valid' subset | loss 9.608 | nll_loss 8.381 | ppl 333.39 | num_updates 2790 | best_loss 9.60826 | length_loss 5.63923
| epoch 011 | valid on 'valid' subset | loss 9.316 | nll_loss 8.035 | ppl 262.27 | num_updates 3069 | best_loss 9.31551 | length_loss 5.4725
| epoch 012 | valid on 'valid' subset | loss 8.915 | nll_loss 7.551 | ppl 187.55 | num_updates 3349 | best_loss 8.91503 | length_loss 5.32048
| epoch 013 | valid on 'valid' subset | loss 8.565 | nll_loss 7.135 | ppl 140.53 | num_updates 3629 | best_loss 8.56478 | length_loss 5.32925
| epoch 014 | valid on 'valid' subset | loss 8.234 | nll_loss 6.740 | ppl 106.93 | num_updates 3909 | best_loss 8.23377 | length_loss 5.26823
| epoch 015 | valid on 'valid' subset | loss 8.124 | nll_loss 6.609 | ppl 97.63 | num_updates 4000 | best_loss 8.1238 | length_loss 5.40804
| epoch 015 | valid on 'valid' subset | loss 7.963 | nll_loss 6.422 | ppl 85.73 | num_updates 4187 | best_loss 7.96343 | length_loss 5.74557
| epoch 016 | valid on 'valid' subset | loss 7.689 | nll_loss 6.106 | ppl 68.87 | num_updates 4467 | best_loss 7.68925 | length_loss 5.02578
| epoch 017 | valid on 'valid' subset | loss 7.498 | nll_loss 5.862 | ppl 58.16 | num_updates 4747 | best_loss 7.49752 | length_loss 5.48479
| epoch 018 | valid on 'valid' subset | loss 7.295 | nll_loss 5.646 | ppl 50.07 | num_updates 5027 | best_loss 7.2946 | length_loss 5.25694
| epoch 019 | valid on 'valid' subset | loss 7.158 | nll_loss 5.503 | ppl 45.35 | num_updates 5307 | best_loss 7.1578 | length_loss 5.05596
| epoch 020 | valid on 'valid' subset | loss 7.089 | nll_loss 5.404 | ppl 42.33 | num_updates 5586 | best_loss 7.08919 | length_loss 5.22389
| epoch 021 | valid on 'valid' subset | loss 7.011 | nll_loss 5.330 | ppl 40.23 | num_updates 5866 | best_loss 7.01102 | length_loss 5.30964
| epoch 022 | valid on 'valid' subset | loss 6.965 | nll_loss 5.272 | ppl 38.65 | num_updates 6000 | best_loss 6.96542 | length_loss 5.11045
| epoch 022 | valid on 'valid' subset | loss 6.877 | nll_loss 5.184 | ppl 36.34 | num_updates 6144 | best_loss 6.87745 | length_loss 5.0194
| epoch 023 | valid on 'valid' subset | loss 6.811 | nll_loss 5.097 | ppl 34.22 | num_updates 6424 | best_loss 6.81102 | length_loss 5.7039
| epoch 024 | valid on 'valid' subset | loss 6.763 | nll_loss 5.062 | ppl 33.41 | num_updates 6704 | best_loss 6.76268 | length_loss 5.10046
| epoch 025 | valid on 'valid' subset | loss 6.677 | nll_loss 4.964 | ppl 31.22 | num_updates 6983 | best_loss 6.67652 | length_loss 5.17359
| epoch 026 | valid on 'valid' subset | loss 6.647 | nll_loss 4.923 | ppl 30.34 | num_updates 7263 | best_loss 6.64697 | length_loss 5.66601
| epoch 027 | valid on 'valid' subset | loss 6.578 | nll_loss 4.859 | ppl 29.01 | num_updates 7542 | best_loss 6.57783 | length_loss 5.3032
| epoch 028 | valid on 'valid' subset | loss 6.451 | nll_loss 4.706 | ppl 26.10 | num_updates 7821 | best_loss 6.45056 | length_loss 5.37859
| epoch 029 | valid on 'valid' subset | loss 6.489 | nll_loss 4.750 | ppl 26.92 | num_updates 8000 | best_loss 6.45056 | length_loss 5.34189
| epoch 029 | valid on 'valid' subset | loss 6.447 | nll_loss 4.718 | ppl 26.32 | num_updates 8101 | best_loss 6.44725 | length_loss 5.08036
| epoch 030 | valid on 'valid' subset | loss 6.417 | nll_loss 4.684 | ppl 25.71 | num_updates 8381 | best_loss 6.41685 | length_loss 4.99973
| epoch 031 | valid on 'valid' subset | loss 6.345 | nll_loss 4.602 | ppl 24.28 | num_updates 8659 | best_loss 6.34513 | length_loss 5.10561
| epoch 032 | valid on 'valid' subset | loss 6.365 | nll_loss 4.612 | ppl 24.45 | num_updates 8939 | best_loss 6.34513 | length_loss 5.76396
| epoch 033 | valid on 'valid' subset | loss 6.295 | nll_loss 4.563 | ppl 23.63 | num_updates 9219 | best_loss 6.29542 | length_loss 4.95261
| epoch 034 | valid on 'valid' subset | loss 6.218 | nll_loss 4.461 | ppl 22.03 | num_updates 9499 | best_loss 6.21752 | length_loss 5.3258
| epoch 035 | valid on 'valid' subset | loss 6.199 | nll_loss 4.415 | ppl 21.33 | num_updates 9779 | best_loss 6.19946 | length_loss 6.13781
| epoch 036 | valid on 'valid' subset | loss 6.169 | nll_loss 4.426 | ppl 21.49 | num_updates 10000 | best_loss 6.16876 | length_loss 5.08362
| epoch 036 | valid on 'valid' subset | loss 6.192 | nll_loss 4.444 | ppl 21.77 | num_updates 10059 | best_loss 6.16876 | length_loss 4.96137
| epoch 037 | valid on 'valid' subset | loss 6.142 | nll_loss 4.395 | ppl 21.05 | num_updates 10338 | best_loss 6.14181 | length_loss 5.11
| epoch 038 | valid on 'valid' subset | loss 6.100 | nll_loss 4.360 | ppl 20.53 | num_updates 10617 | best_loss 6.10049 | length_loss 5.06722
| epoch 039 | valid on 'valid' subset | loss 6.117 | nll_loss 4.377 | ppl 20.77 | num_updates 10897 | best_loss 6.10049 | length_loss 5.19192
| epoch 040 | valid on 'valid' subset | loss 6.057 | nll_loss 4.315 | ppl 19.90 | num_updates 11177 | best_loss 6.05656 | length_loss 5.11511
| epoch 041 | valid on 'valid' subset | loss 6.034 | nll_loss 4.293 | ppl 19.61 | num_updates 11456 | best_loss 6.03448 | length_loss 4.86498
| epoch 042 | valid on 'valid' subset | loss 5.988 | nll_loss 4.254 | ppl 19.08 | num_updates 11736 | best_loss 5.98752 | length_loss 4.73533
| epoch 043 | valid on 'valid' subset | loss 5.989 | nll_loss 4.234 | ppl 18.82 | num_updates 12000 | best_loss 5.98752 | length_loss 5.12521
| epoch 043 | valid on 'valid' subset | loss 6.040 | nll_loss 4.292 | ppl 19.59 | num_updates 12016 | best_loss 5.98752 | length_loss 4.97918
| epoch 044 | valid on 'valid' subset | loss 6.018 | nll_loss 4.270 | ppl 19.30 | num_updates 12295 | best_loss 5.98752 | length_loss 4.90497
| epoch 045 | valid on 'valid' subset | loss 5.954 | nll_loss 4.196 | ppl 18.33 | num_updates 12575 | best_loss 5.95385 | length_loss 5.30288
| epoch 046 | valid on 'valid' subset | loss 5.990 | nll_loss 4.250 | ppl 19.03 | num_updates 12854 | best_loss 5.95385 | length_loss 4.72608
| epoch 047 | valid on 'valid' subset | loss 12.413 | nll_loss 11.636 | ppl 3183.69 | num_updates 13133 | best_loss 5.95385 | length_loss 11.388
| epoch 048 | valid on 'valid' subset | loss 11.669 | nll_loss 10.782 | ppl 1761.11 | num_updates 13413 | best_loss 5.95385 | length_loss 9.24891
| epoch 049 | valid on 'valid' subset | loss 9.240 | nll_loss 7.688 | ppl 206.19 | num_updates 13693 | best_loss 5.95385 | length_loss 11.8834
| epoch 050 | valid on 'valid' subset | loss 8.942 | nll_loss 7.535 | ppl 185.45 | num_updates 13973 | best_loss 5.95385 | length_loss 8.7949
| epoch 051 | valid on 'valid' subset | loss 8.918 | nll_loss 7.521 | ppl 183.67 | num_updates 14000 | best_loss 5.95385 | length_loss 8.71842
| epoch 051 | valid on 'valid' subset | loss 8.832 | nll_loss 7.443 | ppl 174.05 | num_updates 14253 | best_loss 5.95385 | length_loss 8.87974
| epoch 052 | valid on 'valid' subset | loss 8.751 | nll_loss 7.362 | ppl 164.50 | num_updates 14533 | best_loss 5.95385 | length_loss 8.94963
| epoch 053 | valid on 'valid' subset | loss 11.789 | nll_loss 11.018 | ppl 2074.01 | num_updates 14813 | best_loss 5.95385 | length_loss 9.54743
| epoch 054 | valid on 'valid' subset | loss 11.798 | nll_loss 11.021 | ppl 2078.72 | num_updates 15093 | best_loss 5.95385 | length_loss 9.2394
| epoch 055 | valid on 'valid' subset | loss 11.751 | nll_loss 10.963 | ppl 1995.98 | num_updates 15373 | best_loss 5.95385 | length_loss 9.27329
| epoch 056 | valid on 'valid' subset | loss 11.731 | nll_loss 10.935 | ppl 1957.32 | num_updates 15653 | best_loss 5.95385 | length_loss 9.23337
| epoch 057 | valid on 'valid' subset | loss 11.747 | nll_loss 10.945 | ppl 1971.58 | num_updates 15931 | best_loss 5.95385 | length_loss 9.15427
| epoch 058 | valid on 'valid' subset | loss 11.756 | nll_loss 10.948 | ppl 1975.56 | num_updates 16000 | best_loss 5.95385 | length_loss 9.1642
| epoch 058 | valid on 'valid' subset | loss 11.666 | nll_loss 10.847 | ppl 1841.53 | num_updates 16211 | best_loss 5.95385 | length_loss 9.10518
| epoch 059 | valid on 'valid' subset | loss 11.636 | nll_loss 10.811 | ppl 1796.58 | num_updates 16490 | best_loss 5.95385 | length_loss 9.11344
| epoch 060 | valid on 'valid' subset | loss 11.525 | nll_loss 10.682 | ppl 1643.20 | num_updates 16770 | best_loss 5.95385 | length_loss 9.14741
| epoch 061 | valid on 'valid' subset | loss 11.321 | nll_loss 10.452 | ppl 1400.71 | num_updates 17049 | best_loss 5.95385 | length_loss 9.16941
| epoch 062 | valid on 'valid' subset | loss 10.823 | nll_loss 9.900 | ppl 955.17 | num_updates 17328 | best_loss 5.95385 | length_loss 8.71178
| epoch 063 | valid on 'valid' subset | loss 10.626 | nll_loss 9.679 | ppl 819.49 | num_updates 17608 | best_loss 5.95385 | length_loss 8.95637
| epoch 064 | valid on 'valid' subset | loss 10.565 | nll_loss 9.603 | ppl 777.90 | num_updates 17888 | best_loss 5.95385 | length_loss 9.092
| epoch 065 | valid on 'valid' subset | loss 10.546 | nll_loss 9.586 | ppl 768.39 | num_updates 18000 | best_loss 5.95385 | length_loss 8.9947
| epoch 065 | valid on 'valid' subset | loss 10.622 | nll_loss 9.670 | ppl 814.79 | num_updates 18168 | best_loss 5.95385 | length_loss 9.0684
| epoch 066 | valid on 'valid' subset | loss 10.550 | nll_loss 9.624 | ppl 788.86 | num_updates 18447 | best_loss 5.95385 | length_loss 9.13353
| epoch 067 | valid on 'valid' subset | loss 10.585 | nll_loss 9.679 | ppl 819.54 | num_updates 18726 | best_loss 5.95385 | length_loss 8.87808
| epoch 068 | valid on 'valid' subset | loss 10.725 | nll_loss 9.841 | ppl 916.99 | num_updates 19006 | best_loss 5.95385 | length_loss 9.05298
| epoch 069 | valid on 'valid' subset | loss 10.739 | nll_loss 9.871 | ppl 936.56 | num_updates 19286 | best_loss 5.95385 | length_loss 8.99715
| epoch 070 | valid on 'valid' subset | loss 10.854 | nll_loss 10.007 | ppl 1029.08 | num_updates 19566 | best_loss 5.95385 | length_loss 9.01016
| epoch 071 | valid on 'valid' subset | loss 10.864 | nll_loss 10.032 | ppl 1046.96 | num_updates 19845 | best_loss 5.95385 | length_loss 8.86554
| epoch 072 | valid on 'valid' subset | loss 8.418 | nll_loss 7.003 | ppl 128.30 | num_updates 20000 | best_loss 5.95385 | length_loss 8.79381
| epoch 072 | valid on 'valid' subset | loss 8.412 | nll_loss 6.999 | ppl 127.91 | num_updates 20123 | best_loss 5.95385 | length_loss 8.84561
| epoch 073 | valid on 'valid' subset | loss 10.533 | nll_loss 9.672 | ppl 815.50 | num_updates 20403 | best_loss 5.95385 | length_loss 9.18969
| epoch 074 | valid on 'valid' subset | loss 10.665 | nll_loss 9.823 | ppl 905.86 | num_updates 20683 | best_loss 5.95385 | length_loss 9.22083
| epoch 075 | valid on 'valid' subset | loss 10.880 | nll_loss 10.068 | ppl 1073.60 | num_updates 20963 | best_loss 5.95385 | length_loss 9.157
| epoch 076 | valid on 'valid' subset | loss 11.071 | nll_loss 10.286 | ppl 1248.62 | num_updates 21243 | best_loss 5.95385 | length_loss 8.96544
| epoch 077 | valid on 'valid' subset | loss 11.182 | nll_loss 10.408 | ppl 1359.12 | num_updates 21523 | best_loss 5.95385 | length_loss 9.07412
| epoch 078 | valid on 'valid' subset | loss 11.236 | nll_loss 10.477 | ppl 1425.37 | num_updates 21803 | best_loss 5.95385 | length_loss 8.85913
| epoch 079 | valid on 'valid' subset | loss 10.301 | nll_loss 9.398 | ppl 674.81 | num_updates 22000 | best_loss 5.95385 | length_loss 8.96468
| epoch 079 | valid on 'valid' subset | loss 9.806 | nll_loss 8.855 | ppl 462.96 | num_updates 22082 | best_loss 5.95385 | length_loss 8.93285
| epoch 080 | valid on 'valid' subset | loss 10.276 | nll_loss 9.401 | ppl 676.18 | num_updates 22362 | best_loss 5.95385 | length_loss 9.12087
| epoch 081 | valid on 'valid' subset | loss 10.474 | nll_loss 9.630 | ppl 792.14 | num_updates 22641 | best_loss 5.95385 | length_loss 9.18552
| epoch 082 | valid on 'valid' subset | loss 10.584 | nll_loss 9.755 | ppl 863.88 | num_updates 22921 | best_loss 5.95385 | length_loss 8.87157
| epoch 083 | valid on 'valid' subset | loss 10.820 | nll_loss 10.015 | ppl 1034.79 | num_updates 23200 | best_loss 5.95385 | length_loss 8.92633
| epoch 084 | valid on 'valid' subset | loss 11.004 | nll_loss 10.226 | ppl 1198.01 | num_updates 23480 | best_loss 5.95385 | length_loss 9.01677
| epoch 085 | valid on 'valid' subset | loss 11.027 | nll_loss 10.249 | ppl 1217.05 | num_updates 23758 | best_loss 5.95385 | length_loss 8.90608
| epoch 086 | valid on 'valid' subset | loss 11.099 | nll_loss 10.324 | ppl 1282.21 | num_updates 24000 | best_loss 5.95385 | length_loss 9.01892
| epoch 086 | valid on 'valid' subset | loss 11.159 | nll_loss 10.400 | ppl 1351.02 | num_updates 24037 | best_loss 5.95385 | length_loss 8.86138
| epoch 087 | valid on 'valid' subset | loss 10.857 | nll_loss 10.056 | ppl 1064.75 | num_updates 24317 | best_loss 5.95385 | length_loss 8.76016
| epoch 088 | valid on 'valid' subset | loss 11.008 | nll_loss 10.226 | ppl 1197.49 | num_updates 24597 | best_loss 5.95385 | length_loss 8.96673
| epoch 089 | valid on 'valid' subset | loss 11.241 | nll_loss 10.479 | ppl 1426.87 | num_updates 24877 | best_loss 5.95385 | length_loss 8.93496
| epoch 090 | valid on 'valid' subset | loss 11.131 | nll_loss 10.350 | ppl 1305.47 | num_updates 25157 | best_loss 5.95385 | length_loss 9.13093
| epoch 091 | valid on 'valid' subset | loss 11.027 | nll_loss 10.242 | ppl 1211.21 | num_updates 25437 | best_loss 5.95385 | length_loss 8.79867
| epoch 092 | valid on 'valid' subset | loss 11.170 | nll_loss 10.394 | ppl 1346.02 | num_updates 25716 | best_loss 5.95385 | length_loss 8.89253
| epoch 093 | valid on 'valid' subset | loss 11.423 | nll_loss 10.670 | ppl 1629.57 | num_updates 25996 | best_loss 5.95385 | length_loss 8.90915
| epoch 094 | valid on 'valid' subset | loss 11.373 | nll_loss 10.618 | ppl 1571.63 | num_updates 26000 | best_loss 5.95385 | length_loss 8.93119
| epoch 094 | valid on 'valid' subset | loss 11.429 | nll_loss 10.679 | ppl 1639.33 | num_updates 26274 | best_loss 5.95385 | length_loss 8.89474
| epoch 095 | valid on 'valid' subset | loss 11.142 | nll_loss 10.341 | ppl 1296.98 | num_updates 26554 | best_loss 5.95385 | length_loss 8.87344
| epoch 096 | valid on 'valid' subset | loss 11.532 | nll_loss 10.778 | ppl 1756.30 | num_updates 26834 | best_loss 5.95385 | length_loss 8.97212
| epoch 097 | valid on 'valid' subset | loss 11.188 | nll_loss 10.393 | ppl 1344.82 | num_updates 27114 | best_loss 5.95385 | length_loss 8.93483
| epoch 098 | valid on 'valid' subset | loss 10.440 | nll_loss 9.561 | ppl 755.15 | num_updates 27393 | best_loss 5.95385 | length_loss 9.11795
| epoch 099 | valid on 'valid' subset | loss 8.302 | nll_loss 6.880 | ppl 117.75 | num_updates 27672 | best_loss 5.95385 | length_loss 8.83351
| epoch 100 | valid on 'valid' subset | loss 9.792 | nll_loss 8.838 | ppl 457.54 | num_updates 27952 | best_loss 5.95385 | length_loss 8.92019
| epoch 101 | valid on 'valid' subset | loss 9.602 | nll_loss 8.621 | ppl 393.71 | num_updates 28000 | best_loss 5.95385 | length_loss 8.90594
| epoch 101 | valid on 'valid' subset | loss 8.294 | nll_loss 6.871 | ppl 117.07 | num_updates 28229 | best_loss 5.95385 | length_loss 8.7386
| epoch 102 | valid on 'valid' subset | loss 8.288 | nll_loss 6.862 | ppl 116.29 | num_updates 28509 | best_loss 5.95385 | length_loss 8.86053
| epoch 103 | valid on 'valid' subset | loss 8.282 | nll_loss 6.858 | ppl 116.03 | num_updates 28789 | best_loss 5.95385 | length_loss 8.84397
| epoch 104 | valid on 'valid' subset | loss 8.285 | nll_loss 6.862 | ppl 116.30 | num_updates 29069 | best_loss 5.95385 | length_loss 8.85538
| epoch 105 | valid on 'valid' subset | loss 8.285 | nll_loss 6.858 | ppl 116.02 | num_updates 29349 | best_loss 5.95385 | length_loss 8.87207
| epoch 106 | valid on 'valid' subset | loss 8.287 | nll_loss 6.858 | ppl 115.97 | num_updates 29629 | best_loss 5.95385 | length_loss 8.8873
| epoch 107 | valid on 'valid' subset | loss 8.287 | nll_loss 6.858 | ppl 116.03 | num_updates 29909 | best_loss 5.95385 | length_loss 8.86653
| epoch 108 | valid on 'valid' subset | loss 8.278 | nll_loss 6.853 | ppl 115.63 | num_updates 30000 | best_loss 5.95385 | length_loss 8.87427
| epoch 108 | valid on 'valid' subset | loss 8.263 | nll_loss 6.846 | ppl 115.03 | num_updates 30189 | best_loss 5.95385 | length_loss 8.84151
| epoch 109 | valid on 'valid' subset | loss 8.265 | nll_loss 6.843 | ppl 114.83 | num_updates 30468 | best_loss 5.95385 | length_loss 8.85339
| epoch 110 | valid on 'valid' subset | loss 8.272 | nll_loss 6.845 | ppl 114.99 | num_updates 30747 | best_loss 5.95385 | length_loss 8.89397
| epoch 111 | valid on 'valid' subset | loss 8.263 | nll_loss 6.839 | ppl 114.51 | num_updates 31026 | best_loss 5.95385 | length_loss 8.84809
| epoch 112 | valid on 'valid' subset | loss 8.263 | nll_loss 6.838 | ppl 114.44 | num_updates 31306 | best_loss 5.95385 | length_loss 8.83498
| epoch 113 | valid on 'valid' subset | loss 8.268 | nll_loss 6.843 | ppl 114.76 | num_updates 31586 | best_loss 5.95385 | length_loss 8.85753
| epoch 114 | valid on 'valid' subset | loss 8.258 | nll_loss 6.833 | ppl 114.01 | num_updates 31866 | best_loss 5.95385 | length_loss 8.81941
| epoch 115 | valid on 'valid' subset | loss 8.274 | nll_loss 6.843 | ppl 114.83 | num_updates 32000 | best_loss 5.95385 | length_loss 8.78144
| epoch 115 | valid on 'valid' subset | loss 8.266 | nll_loss 6.839 | ppl 114.49 | num_updates 32146 | best_loss 5.95385 | length_loss 8.88354
| epoch 116 | valid on 'valid' subset | loss 8.265 | nll_loss 6.838 | ppl 114.38 | num_updates 32426 | best_loss 5.95385 | length_loss 8.84121
| epoch 117 | valid on 'valid' subset | loss 8.259 | nll_loss 6.831 | ppl 113.89 | num_updates 32705 | best_loss 5.95385 | length_loss 8.84998
| epoch 118 | valid on 'valid' subset | loss 8.255 | nll_loss 6.832 | ppl 113.95 | num_updates 32985 | best_loss 5.95385 | length_loss 8.82119
| epoch 119 | valid on 'valid' subset | loss 8.267 | nll_loss 6.836 | ppl 114.27 | num_updates 33264 | best_loss 5.95385 | length_loss 8.83913
| epoch 120 | valid on 'valid' subset | loss 8.254 | nll_loss 6.826 | ppl 113.43 | num_updates 33544 | best_loss 5.95385 | length_loss 8.86863
| epoch 121 | valid on 'valid' subset | loss 8.256 | nll_loss 6.826 | ppl 113.48 | num_updates 33823 | best_loss 5.95385 | length_loss 8.84371
| epoch 122 | valid on 'valid' subset | loss 8.248 | nll_loss 6.817 | ppl 112.79 | num_updates 34000 | best_loss 5.95385 | length_loss 8.85724
| epoch 122 | valid on 'valid' subset | loss 8.261 | nll_loss 6.831 | ppl 113.83 | num_updates 34103 | best_loss 5.95385 | length_loss 8.83933
| epoch 123 | valid on 'valid' subset | loss 8.251 | nll_loss 6.822 | ppl 113.17 | num_updates 34382 | best_loss 5.95385 | length_loss 8.82543
| epoch 124 | valid on 'valid' subset | loss 8.251 | nll_loss 6.823 | ppl 113.23 | num_updates 34661 | best_loss 5.95385 | length_loss 8.83107
| epoch 125 | valid on 'valid' subset | loss 8.239 | nll_loss 6.814 | ppl 112.52 | num_updates 34940 | best_loss 5.95385 | length_loss 8.84941
| epoch 126 | valid on 'valid' subset | loss 8.243 | nll_loss 6.812 | ppl 112.37 | num_updates 35219 | best_loss 5.95385 | length_loss 8.83323
| epoch 127 | valid on 'valid' subset | loss 8.240 | nll_loss 6.815 | ppl 112.57 | num_updates 35499 | best_loss 5.95385 | length_loss 8.81804
| epoch 128 | valid on 'valid' subset | loss 8.240 | nll_loss 6.811 | ppl 112.25 | num_updates 35779 | best_loss 5.95385 | length_loss 8.83056
| epoch 129 | valid on 'valid' subset | loss 8.248 | nll_loss 6.816 | ppl 112.70 | num_updates 36000 | best_loss 5.95385 | length_loss 8.8526
| epoch 129 | valid on 'valid' subset | loss 8.244 | nll_loss 6.816 | ppl 112.69 | num_updates 36059 | best_loss 5.95385 | length_loss 8.81925
| epoch 130 | valid on 'valid' subset | loss 8.232 | nll_loss 6.807 | ppl 111.98 | num_updates 36339 | best_loss 5.95385 | length_loss 8.83944
| epoch 131 | valid on 'valid' subset | loss 8.239 | nll_loss 6.809 | ppl 112.13 | num_updates 36619 | best_loss 5.95385 | length_loss 8.87711
| epoch 132 | valid on 'valid' subset | loss 8.241 | nll_loss 6.810 | ppl 112.24 | num_updates 36899 | best_loss 5.95385 | length_loss 8.87148
| epoch 133 | valid on 'valid' subset | loss 8.238 | nll_loss 6.808 | ppl 112.09 | num_updates 37177 | best_loss 5.95385 | length_loss 8.87225
| epoch 134 | valid on 'valid' subset | loss 8.240 | nll_loss 6.810 | ppl 112.22 | num_updates 37457 | best_loss 5.95385 | length_loss 8.84351
| epoch 135 | valid on 'valid' subset | loss 8.244 | nll_loss 6.813 | ppl 112.42 | num_updates 37737 | best_loss 5.95385 | length_loss 8.84976
| epoch 136 | valid on 'valid' subset | loss 8.236 | nll_loss 6.806 | ppl 111.92 | num_updates 38000 | best_loss 5.95385 | length_loss 8.85678
| epoch 136 | valid on 'valid' subset | loss 8.235 | nll_loss 6.804 | ppl 111.73 | num_updates 38016 | best_loss 5.95385 | length_loss 8.86879
| epoch 137 | valid on 'valid' subset | loss 8.237 | nll_loss 6.807 | ppl 111.95 | num_updates 38296 | best_loss 5.95385 | length_loss 8.85905
| epoch 138 | valid on 'valid' subset | loss 8.234 | nll_loss 6.808 | ppl 112.03 | num_updates 38575 | best_loss 5.95385 | length_loss 8.88407
| epoch 139 | valid on 'valid' subset | loss 8.231 | nll_loss 6.800 | ppl 111.43 | num_updates 38855 | best_loss 5.95385 | length_loss 8.86341
| epoch 140 | valid on 'valid' subset | loss 8.220 | nll_loss 6.794 | ppl 111.01 | num_updates 39135 | best_loss 5.95385 | length_loss 8.84081
| epoch 141 | valid on 'valid' subset | loss 8.236 | nll_loss 6.804 | ppl 111.73 | num_updates 39414 | best_loss 5.95385 | length_loss 8.82266
| epoch 142 | valid on 'valid' subset | loss 8.222 | nll_loss 6.795 | ppl 111.04 | num_updates 39694 | best_loss 5.95385 | length_loss 8.84459
| epoch 143 | valid on 'valid' subset | loss 8.226 | nll_loss 6.797 | ppl 111.16 | num_updates 39974 | best_loss 5.95385 | length_loss 8.88616
| epoch 144 | valid on 'valid' subset | loss 8.218 | nll_loss 6.792 | ppl 110.79 | num_updates 40000 | best_loss 5.95385 | length_loss 8.88367
| epoch 144 | valid on 'valid' subset | loss 8.230 | nll_loss 6.800 | ppl 111.44 | num_updates 40253 | best_loss 5.95385 | length_loss 8.8395
| epoch 145 | valid on 'valid' subset | loss 8.247 | nll_loss 6.809 | ppl 112.09 | num_updates 40533 | best_loss 5.95385 | length_loss 8.86908
| epoch 146 | valid on 'valid' subset | loss 8.219 | nll_loss 6.790 | ppl 110.67 | num_updates 40813 | best_loss 5.95385 | length_loss 8.86673
| epoch 147 | valid on 'valid' subset | loss 8.226 | nll_loss 6.790 | ppl 110.68 | num_updates 41093 | best_loss 5.95385 | length_loss 8.83123
| epoch 148 | valid on 'valid' subset | loss 8.222 | nll_loss 6.794 | ppl 110.98 | num_updates 41372 | best_loss 5.95385 | length_loss 8.85367
| epoch 149 | valid on 'valid' subset | loss 8.221 | nll_loss 6.789 | ppl 110.55 | num_updates 41652 | best_loss 5.95385 | length_loss 8.84598
| epoch 150 | valid on 'valid' subset | loss 8.226 | nll_loss 6.794 | ppl 110.99 | num_updates 41932 | best_loss 5.95385 | length_loss 8.835
| epoch 151 | valid on 'valid' subset | loss 8.218 | nll_loss 6.784 | ppl 110.21 | num_updates 42000 | best_loss 5.95385 | length_loss 8.82967
| epoch 151 | valid on 'valid' subset | loss 8.215 | nll_loss 6.783 | ppl 110.12 | num_updates 42211 | best_loss 5.95385 | length_loss 8.83456
| epoch 152 | valid on 'valid' subset | loss 9.292 | nll_loss 8.233 | ppl 300.93 | num_updates 42491 | best_loss 5.95385 | length_loss 8.91676
| epoch 153 | valid on 'valid' subset | loss 9.855 | nll_loss 8.861 | ppl 465.13 | num_updates 42770 | best_loss 5.95385 | length_loss 8.94109
| epoch 154 | valid on 'valid' subset | loss 10.298 | nll_loss 9.339 | ppl 647.77 | num_updates 43049 | best_loss 5.95385 | length_loss 8.89363
| epoch 155 | valid on 'valid' subset | loss 10.358 | nll_loss 9.405 | ppl 677.85 | num_updates 43329 | best_loss 5.95385 | length_loss 8.80955
| epoch 156 | valid on 'valid' subset | loss 10.896 | nll_loss 9.991 | ppl 1017.96 | num_updates 43609 | best_loss 5.95385 | length_loss 9.08136
| epoch 157 | valid on 'valid' subset | loss 10.711 | nll_loss 9.779 | ppl 878.42 | num_updates 43889 | best_loss 5.95385 | length_loss 8.95622
| epoch 158 | valid on 'valid' subset | loss 10.923 | nll_loss 9.994 | ppl 1019.84 | num_updates 44000 | best_loss 5.95385 | length_loss 8.91165
| epoch 158 | valid on 'valid' subset | loss 10.706 | nll_loss 9.774 | ppl 875.50 | num_updates 44169 | best_loss 5.95385 | length_loss 8.89074
| epoch 159 | valid on 'valid' subset | loss 10.983 | nll_loss 10.055 | ppl 1063.43 | num_updates 44448 | best_loss 5.95385 | length_loss 8.99308
| epoch 160 | valid on 'valid' subset | loss 11.061 | nll_loss 10.145 | ppl 1132.42 | num_updates 44728 | best_loss 5.95385 | length_loss 8.83523
| epoch 161 | valid on 'valid' subset | loss 11.176 | nll_loss 10.264 | ppl 1229.85 | num_updates 45007 | best_loss 5.95385 | length_loss 8.98247
| epoch 162 | valid on 'valid' subset | loss 11.354 | nll_loss 10.453 | ppl 1401.46 | num_updates 45287 | best_loss 5.95385 | length_loss 9.01076
| epoch 163 | valid on 'valid' subset | loss 11.283 | nll_loss 10.376 | ppl 1328.46 | num_updates 45567 | best_loss 5.95385 | length_loss 9.13525
| epoch 164 | valid on 'valid' subset | loss 11.396 | nll_loss 10.496 | ppl 1444.11 | num_updates 45847 | best_loss 5.95385 | length_loss 8.98389
| epoch 165 | valid on 'valid' subset | loss 11.258 | nll_loss 10.361 | ppl 1315.40 | num_updates 46000 | best_loss 5.95385 | length_loss 8.96401
| epoch 165 | valid on 'valid' subset | loss 11.380 | nll_loss 10.480 | ppl 1428.03 | num_updates 46125 | best_loss 5.95385 | length_loss 9.03602
| epoch 166 | valid on 'valid' subset | loss 11.378 | nll_loss 10.473 | ppl 1421.48 | num_updates 46405 | best_loss 5.95385 | length_loss 8.99594
| epoch 167 | valid on 'valid' subset | loss 11.364 | nll_loss 10.456 | ppl 1404.80 | num_updates 46684 | best_loss 5.95385 | length_loss 8.91066
| epoch 168 | valid on 'valid' subset | loss 11.408 | nll_loss 10.515 | ppl 1463.64 | num_updates 46964 | best_loss 5.95385 | length_loss 8.9694
| epoch 169 | valid on 'valid' subset | loss 11.339 | nll_loss 10.450 | ppl 1399.20 | num_updates 47243 | best_loss 5.95385 | length_loss 8.95181
| epoch 170 | valid on 'valid' subset | loss 11.419 | nll_loss 10.521 | ppl 1469.72 | num_updates 47523 | best_loss 5.95385 | length_loss 8.93635
| epoch 171 | valid on 'valid' subset | loss 11.667 | nll_loss 10.783 | ppl 1761.77 | num_updates 47803 | best_loss 5.95385 | length_loss 8.98119
| epoch 172 | valid on 'valid' subset | loss 11.565 | nll_loss 10.672 | ppl 1631.77 | num_updates 48000 | best_loss 5.95385 | length_loss 9.00285
| epoch 172 | valid on 'valid' subset | loss 11.728 | nll_loss 10.840 | ppl 1832.38 | num_updates 48083 | best_loss 5.95385 | length_loss 8.91423
| epoch 173 | valid on 'valid' subset | loss 11.612 | nll_loss 10.728 | ppl 1695.63 | num_updates 48363 | best_loss 5.95385 | length_loss 8.96803
| epoch 174 | valid on 'valid' subset | loss 11.691 | nll_loss 10.800 | ppl 1782.33 | num_updates 48641 | best_loss 5.95385 | length_loss 8.94515
| epoch 175 | valid on 'valid' subset | loss 11.562 | nll_loss 10.678 | ppl 1638.23 | num_updates 48921 | best_loss 5.95385 | length_loss 9.14294
| epoch 176 | valid on 'valid' subset | loss 11.579 | nll_loss 10.696 | ppl 1658.43 | num_updates 49200 | best_loss 5.95385 | length_loss 8.99614
| epoch 177 | valid on 'valid' subset | loss 8.192 | nll_loss 6.761 | ppl 108.43 | num_updates 49480 | best_loss 5.95385 | length_loss 8.85377
| epoch 178 | valid on 'valid' subset | loss 11.600 | nll_loss 10.707 | ppl 1671.80 | num_updates 49760 | best_loss 5.95385 | length_loss 8.9394
| epoch 179 | valid on 'valid' subset | loss 11.534 | nll_loss 10.642 | ppl 1597.87 | num_updates 50000 | best_loss 5.95385 | length_loss 8.74317
| epoch 179 | valid on 'valid' subset | loss 8.202 | nll_loss 6.770 | ppl 109.13 | num_updates 50039 | best_loss 5.95385 | length_loss 8.85002
| epoch 180 | valid on 'valid' subset | loss 11.481 | nll_loss 10.586 | ppl 1536.68 | num_updates 50319 | best_loss 5.95385 | length_loss 8.89714
| epoch 181 | valid on 'valid' subset | loss 8.193 | nll_loss 6.759 | ppl 108.33 | num_updates 50599 | best_loss 5.95385 | length_loss 8.85359
| epoch 182 | valid on 'valid' subset | loss 11.492 | nll_loss 10.597 | ppl 1548.90 | num_updates 50878 | best_loss 5.95385 | length_loss 8.95791
| epoch 183 | valid on 'valid' subset | loss 11.591 | nll_loss 10.708 | ppl 1672.66 | num_updates 51158 | best_loss 5.95385 | length_loss 9.00685
| epoch 184 | valid on 'valid' subset | loss 8.188 | nll_loss 6.755 | ppl 108.01 | num_updates 51437 | best_loss 5.95385 | length_loss 8.85051
| epoch 185 | valid on 'valid' subset | loss 8.201 | nll_loss 6.766 | ppl 108.81 | num_updates 51717 | best_loss 5.95385 | length_loss 8.85741
| epoch 186 | valid on 'valid' subset | loss 11.735 | nll_loss 10.846 | ppl 1841.29 | num_updates 51996 | best_loss 5.95385 | length_loss 8.93106
| epoch 187 | valid on 'valid' subset | loss 11.512 | nll_loss 10.628 | ppl 1582.03 | num_updates 52000 | best_loss 5.95385 | length_loss 8.8988
| epoch 187 | valid on 'valid' subset | loss 11.522 | nll_loss 10.627 | ppl 1581.80 | num_updates 52276 | best_loss 5.95385 | length_loss 8.9317
alphadl commented 4 years ago

The validation perplexity (PPL) fluctuates cyclically and looks very unstable. @jungokasai
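
To make the fluctuation easier to inspect than by reading the raw log, here is a minimal sketch that parses validation lines in the format shown above and lists the points where the validation loss jumps. The helper names (`parse_valid_lines`, `flag_jumps`) and the jump threshold of 1.0 are assumptions made for illustration; they are not part of fairseq or DisCo.

```python
import re

# Matches the validation lines printed above (as produced with --log-format simple).
VALID_LINE = re.compile(
    r"\| epoch (?P<epoch>\d+) \| valid on '(?P<subset>[^']+)' subset "
    r"\| loss (?P<loss>[\d.]+) \| nll_loss (?P<nll_loss>[\d.]+) "
    r"\| ppl (?P<ppl>[\d.]+) \| num_updates (?P<num_updates>\d+)"
)


def parse_valid_lines(lines):
    """Return one dict per validation line; lines that do not match are skipped."""
    records = []
    for line in lines:
        m = VALID_LINE.search(line)
        if m is None:
            continue
        records.append({
            "epoch": int(m["epoch"]),
            "num_updates": int(m["num_updates"]),
            "loss": float(m["loss"]),
            "nll_loss": float(m["nll_loss"]),
            "ppl": float(m["ppl"]),
        })
    return records


def flag_jumps(records, threshold=1.0):
    """Hypothetical helper: consecutive validations whose loss differs by more than `threshold`."""
    return [
        (prev["num_updates"], cur["num_updates"], round(cur["loss"] - prev["loss"], 3))
        for prev, cur in zip(records, records[1:])
        if abs(cur["loss"] - prev["loss"]) > threshold
    ]


# Three lines copied from the log above.
sample = [
    "| epoch 176 | valid on 'valid' subset | loss 11.579 | nll_loss 10.696 | ppl 1658.43 | num_updates 49200 | best_loss 5.95385 | length_loss 8.99614",
    "| epoch 177 | valid on 'valid' subset | loss 8.192 | nll_loss 6.761 | ppl 108.43 | num_updates 49480 | best_loss 5.95385 | length_loss 8.85377",
    "| epoch 178 | valid on 'valid' subset | loss 11.600 | nll_loss 10.707 | ppl 1671.80 | num_updates 49760 | best_loss 5.95385 | length_loss 8.9394",
]
records = parse_valid_lines(sample)
jumps = flag_jumps(records)
for start, end, delta in jumps:
    print(f"updates {start} -> {end}: loss changed by {delta:+.3f}")
```

Running this over the whole log file (one line per list element) gives a compact view of where the loss alternates between the two levels.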

jungokasai commented 4 years ago

The validation loss diverging after 100+ epochs does look strange. The only differences I can see from my setup are:

  1. --max-tokens 16000 vs. --max-tokens 8192
  2. --distributed-world-size 8 vs. --distributed-world-size 16
  3. --update-freq 4 vs. --update-freq 1

I would guess 3. might have the biggest impact. Could you try setting --update-freq 1 instead? For reference, the following is the exact command I used to produce the results.

python train.py <PATH_TO_DATA> --arch disco_transformer --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 --lr 5e-4 --warmup-init-lr 1e-7 \
--min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 10000 --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 \
--task translation_self --max-tokens 8192 --weight-decay 0.01 --dropout 0.2 \
--encoder-layers 6 --encoder-embed-dim 512 --decoder-layers 6 --decoder-embed-dim 512 \
--fp16 --max-source-positions 10000 --max-target-positions 10000 --max-update 300000 --seed 1 \
--save-dir <SAVE_DIR> --dynamic-masking  --ignore-eos-loss \
--share-all-embeddings \
--distributed-world-size 16 --distributed-port 54100
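
The three differences listed above combine into one quantity: how many tokens contribute to each optimizer step. The sketch below is a rough illustration, assuming `--max-tokens` is a per-GPU cap and that `--update-freq` accumulates gradients over that many batches before each update. The function name `tokens_per_update` is made up for this example, and the numbers are upper bounds, since real batches are often smaller than the cap.

```python
def tokens_per_update(max_tokens, world_size, update_freq):
    """Upper bound on the number of tokens that contribute to one optimizer step."""
    return max_tokens * world_size * update_freq


configs = {
    "reported run (8 GPUs, --max-tokens 16000, --update-freq 4)": tokens_per_update(16000, 8, 4),
    "command above (16 GPUs, --max-tokens 8192, --update-freq 1)": tokens_per_update(8192, 16, 1),
    "reported run with --update-freq 1": tokens_per_update(16000, 8, 1),
}
for name, tokens in configs.items():
    print(f"{name}: up to {tokens} tokens per update")
```

Under these assumptions the reported run uses up to 512000 tokens per update, about four times the 131072 of the command above, while switching to `--update-freq 1` brings it to 128000, close to the original setting.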
jungokasai commented 4 years ago

Alternatively, the optimization hyperparameters used for CMLMs may simply be less robust to different configurations. Training worked fine with their exact setting, which I followed for DisCo as well, but several people have found that it doesn't work well with the transformer large configuration. If the problem persists, could you try something like this?

python train.py <PATH_TO_DATA> --arch disco_transformer --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 --lr 5e-4 --warmup-init-lr 1e-7 \
--min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 4000 --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 \
--task translation_self --max-tokens <Your_Batch_Size> --weight-decay 0.01 --dropout 0.2 \
--encoder-layers 6 --encoder-embed-dim 512 --decoder-layers 6 --decoder-embed-dim 512 \
--fp16 --max-source-positions 10000 --max-target-positions 10000 --max-update 300000 --seed 1 \
--save-dir <SAVE_DIR> --dynamic-masking  --ignore-eos-loss \
--share-all-embeddings \
--distributed-world-size <# GPUS you are using> --distributed-port 54100
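
Compared with the first command, this one changes `--warmup-updates` from 10000 to 4000 and the second Adam beta from 0.999 to 0.98. To illustrate the warmup change, here is a minimal sketch of an inverse square root schedule with linear warmup, written from the usual definition (linear increase from `--warmup-init-lr` to `--lr`, then decay proportional to the inverse square root of the update number). It is an approximation for illustration, not fairseq's own code, and it does not model `--min-lr`.

```python
def inverse_sqrt_lr(num_updates, lr=5e-4, warmup_init_lr=1e-7, warmup_updates=10000):
    """Learning rate after `num_updates` optimizer steps."""
    if num_updates < warmup_updates:
        # Linear warmup from warmup_init_lr to lr.
        step = (lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + num_updates * step
    # After warmup, decay proportionally to 1 / sqrt(num_updates).
    return lr * (warmup_updates / num_updates) ** 0.5


for n in (2000, 4000, 10000, 40000, 100000):
    long_warmup = inverse_sqrt_lr(n, warmup_updates=10000)
    short_warmup = inverse_sqrt_lr(n, warmup_updates=4000)
    print(f"update {n:>6}: warmup 10000 -> {long_warmup:.2e}, warmup 4000 -> {short_warmup:.2e}")
```

With the shorter warmup the peak learning rate of 5e-4 is reached earlier, and from update 10000 onward the rate stays lower than with the longer warmup.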

I hope this helps!

alphadl commented 4 years ago

Thanks for your kind suggestions! I am trying them and will report the results later. By the way, that issue was opened by me as well 😆

alphadl commented 4 years ago

I only changed --update-freq 4 to --update-freq 1 and got a reasonable loss curve:

Namespace(adam_betas='(0.9, 0.999)', adam_eps=1e-06, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='disco_transformer', at_only=False, at_rm=False, attention_dropout=0.0, best_checkpoint_metric='loss', bilm_add_bos=False, bilm_attention_dropout=0.0, bilm_mask_last_state=False, bilm_model_dropout=0.1, bilm_relu_dropout=0.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_length_cross_entropy', curriculum=0, data=['wmt16.en-de.disco.dist'], dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_embed_scale=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:19951', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, dropout=0.2, dynamic_length=False, dynamic_masking=True, embedding_only=False, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_embed_scale=None, encoder_ffn_embed_dim=2048, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, find_unused_parameters=False, fix_batches_to_gpus=False, fp16=True, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, full_masking=False, ignore_eos_loss=True, keep_interval_updates=-1, keep_last_epochs=20, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format='simple', log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_range=False, maskp=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=10000, max_target_positions=10000, max_tokens=16000, max_tokens_valid=16000, max_update=300000, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, mix_masking=False, 
no_dec_token_positional_embeddings=False, no_enc_token_positional_embeddings=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=True, no_save=False, no_save_optimizer_state=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', perm_only=False, raw_text=False, relu_dropout=0.0, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/wmt16.en-de.nat_authordata_v1', save_interval=1, save_interval_updates=2000, seed=1, self_target=False, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, share_layers=False, skip_eos=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation_self', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, use_bmuf=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.01)
| $path/wmt16.en-de.disco.dist/ valid 3000 examples
| epoch 001 | valid on 'valid' subset | loss 11.538 | nll_loss 10.732 | ppl 1700.59 | num_updates 1112 | length_loss 8.07768
| epoch 002 | valid on 'valid' subset | loss 10.608 | nll_loss 9.655 | ppl 806.45 | num_updates 2000 | best_loss 10.6081 | length_loss 6.71878
| epoch 002 | valid on 'valid' subset | loss 10.329 | nll_loss 9.306 | ppl 632.98 | num_updates 2228 | best_loss 10.3293 | length_loss 6.59959
| epoch 003 | valid on 'valid' subset | loss 9.444 | nll_loss 8.200 | ppl 294.10 | num_updates 3344 | best_loss 9.44363 | length_loss 5.74804
| epoch 004 | valid on 'valid' subset | loss 8.778 | nll_loss 7.399 | ppl 168.77 | num_updates 4000 | best_loss 8.77766 | length_loss 5.54225
| epoch 004 | valid on 'valid' subset | loss 8.404 | nll_loss 6.919 | ppl 120.99 | num_updates 4461 | best_loss 8.40365 | length_loss 5.7412
| epoch 005 | valid on 'valid' subset | loss 7.576 | nll_loss 5.950 | ppl 61.84 | num_updates 5577 | best_loss 7.57616 | length_loss 5.39451
| epoch 006 | valid on 'valid' subset | loss 7.394 | nll_loss 5.753 | ppl 53.91 | num_updates 6000 | best_loss 7.39413 | length_loss 5.2428
| epoch 006 | valid on 'valid' subset | loss 7.149 | nll_loss 5.480 | ppl 44.64 | num_updates 6694 | best_loss 7.14901 | length_loss 5.12927
| epoch 007 | valid on 'valid' subset | loss 6.837 | nll_loss 5.118 | ppl 34.72 | num_updates 7811 | best_loss 6.8366 | length_loss 5.45324
| epoch 008 | valid on 'valid' subset | loss 6.780 | nll_loss 5.074 | ppl 33.68 | num_updates 8000 | best_loss 6.77959 | length_loss 5.10814
| epoch 008 | valid on 'valid' subset | loss 6.580 | nll_loss 4.844 | ppl 28.71 | num_updates 8926 | best_loss 6.57975 | length_loss 5.67847
| epoch 009 | valid on 'valid' subset | loss 6.451 | nll_loss 4.712 | ppl 26.22 | num_updates 10000 | best_loss 6.45135 | length_loss 5.04276
| epoch 009 | valid on 'valid' subset | loss 6.432 | nll_loss 4.681 | ppl 25.65 | num_updates 10043 | best_loss 6.43218 | length_loss 5.35626
| epoch 010 | valid on 'valid' subset | loss 6.283 | nll_loss 4.521 | ppl 22.95 | num_updates 11160 | best_loss 6.28313 | length_loss 5.3799
| epoch 011 | valid on 'valid' subset | loss 6.195 | nll_loss 4.458 | ppl 21.97 | num_updates 12000 | best_loss 6.19524 | length_loss 4.68387
| epoch 011 | valid on 'valid' subset | loss 6.160 | nll_loss 4.421 | ppl 21.42 | num_updates 12276 | best_loss 6.15985 | length_loss 4.99431
| epoch 012 | valid on 'valid' subset | loss 6.094 | nll_loss 4.353 | ppl 20.44 | num_updates 13390 | best_loss 6.09379 | length_loss 4.97237
| epoch 013 | valid on 'valid' subset | loss 6.081 | nll_loss 4.332 | ppl 20.14 | num_updates 14000 | best_loss 6.08146 | length_loss 4.8785
| epoch 013 | valid on 'valid' subset | loss 6.068 | nll_loss 4.328 | ppl 20.09 | num_updates 14507 | best_loss 6.06823 | length_loss 5.03105
| epoch 014 | valid on 'valid' subset | loss 5.967 | nll_loss 4.215 | ppl 18.57 | num_updates 15624 | best_loss 5.96733 | length_loss 4.86164
| epoch 015 | valid on 'valid' subset | loss 5.969 | nll_loss 4.221 | ppl 18.65 | num_updates 16000 | best_loss 5.96733 | length_loss 5.09694
| epoch 015 | valid on 'valid' subset | loss 5.918 | nll_loss 4.161 | ppl 17.89 | num_updates 16740 | best_loss 5.91825 | length_loss 4.96521
| epoch 016 | valid on 'valid' subset | loss 5.903 | nll_loss 4.164 | ppl 17.93 | num_updates 17857 | best_loss 5.90326 | length_loss 4.70368
| epoch 017 | valid on 'valid' subset | loss 5.920 | nll_loss 4.176 | ppl 18.08 | num_updates 18000 | best_loss 5.90326 | length_loss 5.12186
| epoch 017 | valid on 'valid' subset | loss 5.877 | nll_loss 4.125 | ppl 17.45 | num_updates 18974 | best_loss 5.87721 | length_loss 4.87007
| epoch 018 | valid on 'valid' subset | loss 5.850 | nll_loss 4.101 | ppl 17.16 | num_updates 20000 | best_loss 5.8498 | length_loss 5.18816
| epoch 018 | valid on 'valid' subset | loss 5.878 | nll_loss 4.127 | ppl 17.47 | num_updates 20091 | best_loss 5.8498 | length_loss 5.15794
| epoch 019 | valid on 'valid' subset | loss 5.836 | nll_loss 4.087 | ppl 16.99 | num_updates 21208 | best_loss 5.83566 | length_loss 4.97006
| epoch 020 | valid on 'valid' subset | loss 5.799 | nll_loss 4.040 | ppl 16.45 | num_updates 22000 | best_loss 5.79851 | length_loss 4.93875
| epoch 020 | valid on 'valid' subset | loss 5.829 | nll_loss 4.082 | ppl 16.93 | num_updates 22325 | best_loss 5.79851 | length_loss 4.89598
| epoch 021 | valid on 'valid' subset | loss 5.807 | nll_loss 4.035 | ppl 16.39 | num_updates 23441 | best_loss 5.79851 | length_loss 5.26515
| epoch 022 | valid on 'valid' subset | loss 5.761 | nll_loss 4.002 | ppl 16.02 | num_updates 24000 | best_loss 5.76138 | length_loss 5.15905
| epoch 022 | valid on 'valid' subset | loss 5.770 | nll_loss 4.020 | ppl 16.22 | num_updates 24558 | best_loss 5.76138 | length_loss 4.63934
| epoch 023 | valid on 'valid' subset | loss 5.775 | nll_loss 4.031 | ppl 16.35 | num_updates 25675 | best_loss 5.76138 | length_loss 4.96205
| epoch 024 | valid on 'valid' subset | loss 5.756 | nll_loss 3.999 | ppl 15.99 | num_updates 26000 | best_loss 5.75582 | length_loss 4.90414
| epoch 024 | valid on 'valid' subset | loss 5.788 | nll_loss 4.037 | ppl 16.41 | num_updates 26792 | best_loss 5.75582 | length_loss 5.04501
| epoch 025 | valid on 'valid' subset | loss 5.757 | nll_loss 3.996 | ppl 15.96 | num_updates 27906 | best_loss 5.75582 | length_loss 5.22324
| epoch 026 | valid on 'valid' subset | loss 5.804 | nll_loss 4.062 | ppl 16.70 | num_updates 28000 | best_loss 5.75582 | length_loss 4.77015
| epoch 026 | valid on 'valid' subset | loss 5.778 | nll_loss 4.010 | ppl 16.12 | num_updates 29023 | best_loss 5.75582 | length_loss 5.45128
| epoch 027 | valid on 'valid' subset | loss 5.738 | nll_loss 3.987 | ppl 15.86 | num_updates 30000 | best_loss 5.73791 | length_loss 4.79396
| epoch 027 | valid on 'valid' subset | loss 5.752 | nll_loss 3.995 | ppl 15.94 | num_updates 30140 | best_loss 5.73791 | length_loss 5.28362
| epoch 028 | valid on 'valid' subset | loss 5.731 | nll_loss 3.987 | ppl 15.85 | num_updates 31257 | best_loss 5.73146 | length_loss 4.81235
| epoch 029 | valid on 'valid' subset | loss 5.718 | nll_loss 3.956 | ppl 15.52 | num_updates 32000 | best_loss 5.71833 | length_loss 5.37431
| epoch 029 | valid on 'valid' subset | loss 5.724 | nll_loss 3.963 | ppl 15.59 | num_updates 32374 | best_loss 5.71833 | length_loss 5.08599
| epoch 030 | valid on 'valid' subset | loss 5.720 | nll_loss 3.967 | ppl 15.64 | num_updates 33491 | best_loss 5.71833 | length_loss 4.86158
| epoch 031 | valid on 'valid' subset | loss 5.727 | nll_loss 3.981 | ppl 15.79 | num_updates 34000 | best_loss 5.71833 | length_loss 4.80137
| epoch 031 | valid on 'valid' subset | loss 5.686 | nll_loss 3.926 | ppl 15.20 | num_updates 34607 | best_loss 5.68584 | length_loss 4.95641
| epoch 032 | valid on 'valid' subset | loss 5.727 | nll_loss 3.973 | ppl 15.71 | num_updates 35724 | best_loss 5.68584 | length_loss 4.9809
| epoch 033 | valid on 'valid' subset | loss 5.720 | nll_loss 3.966 | ppl 15.63 | num_updates 36000 | best_loss 5.68584 | length_loss 4.79346
| epoch 033 | valid on 'valid' subset | loss 5.714 | nll_loss 3.974 | ppl 15.72 | num_updates 36840 | best_loss 5.68584 | length_loss 4.6497
| epoch 034 | valid on 'valid' subset | loss 5.685 | nll_loss 3.916 | ppl 15.10 | num_updates 37957 | best_loss 5.6846 | length_loss 5.19303
| epoch 035 | valid on 'valid' subset | loss 5.686 | nll_loss 3.931 | ppl 15.25 | num_updates 38000 | best_loss 5.6846 | length_loss 5.17674
| epoch 035 | valid on 'valid' subset | loss 5.698 | nll_loss 3.937 | ppl 15.32 | num_updates 39073 | best_loss 5.6846 | length_loss 5.2187
| epoch 036 | valid on 'valid' subset | loss 5.682 | nll_loss 3.922 | ppl 15.16 | num_updates 40000 | best_loss 5.68205 | length_loss 5.09201
| epoch 036 | valid on 'valid' subset | loss 5.685 | nll_loss 3.937 | ppl 15.32 | num_updates 40190 | best_loss 5.68205 | length_loss 4.92951
| epoch 037 | valid on 'valid' subset | loss 5.686 | nll_loss 3.933 | ppl 15.27 | num_updates 41306 | best_loss 5.68205 | length_loss 5.09471
| epoch 038 | valid on 'valid' subset | loss 5.680 | nll_loss 3.925 | ppl 15.19 | num_updates 42000 | best_loss 5.68032 | length_loss 4.81772
| epoch 038 | valid on 'valid' subset | loss 5.649 | nll_loss 3.898 | ppl 14.91 | num_updates 42422 | best_loss 5.64865 | length_loss 4.99904
| epoch 039 | valid on 'valid' subset | loss 5.654 | nll_loss 3.891 | ppl 14.84 | num_updates 43539 | best_loss 5.64865 | length_loss 5.22766
| epoch 040 | valid on 'valid' subset | loss 5.681 | nll_loss 3.936 | ppl 15.31 | num_updates 44000 | best_loss 5.64865 | length_loss 4.73627
| epoch 040 | valid on 'valid' subset | loss 5.672 | nll_loss 3.919 | ppl 15.12 | num_updates 44656 | best_loss 5.64865 | length_loss 5.06906
| epoch 041 | valid on 'valid' subset | loss 5.653 | nll_loss 3.906 | ppl 14.99 | num_updates 45773 | best_loss 5.64865 | length_loss 4.81397
| epoch 042 | valid on 'valid' subset | loss 5.681 | nll_loss 3.926 | ppl 15.20 | num_updates 46000 | best_loss 5.64865 | length_loss 4.96613
| epoch 042 | valid on 'valid' subset | loss 5.671 | nll_loss 3.922 | ppl 15.16 | num_updates 46888 | best_loss 5.64865 | length_loss 4.78331
| epoch 043 | valid on 'valid' subset | loss 5.671 | nll_loss 3.905 | ppl 14.98 | num_updates 48000 | best_loss 5.64865 | length_loss 4.99218
| epoch 043 | valid on 'valid' subset | loss 5.690 | nll_loss 3.935 | ppl 15.30 | num_updates 48005 | best_loss 5.64865 | length_loss 4.85481
| epoch 044 | valid on 'valid' subset | loss 5.665 | nll_loss 3.905 | ppl 14.98 | num_updates 49121 | best_loss 5.64865 | length_loss 5.11471
| epoch 045 | valid on 'valid' subset | loss 5.668 | nll_loss 3.907 | ppl 15.00 | num_updates 50000 | best_loss 5.64865 | length_loss 5.12693
| epoch 045 | valid on 'valid' subset | loss 5.660 | nll_loss 3.905 | ppl 14.98 | num_updates 50238 | best_loss 5.64865 | length_loss 5.36505
| epoch 046 | valid on 'valid' subset | loss 5.659 | nll_loss 3.912 | ppl 15.06 | num_updates 51355 | best_loss 5.64865 | length_loss 4.92082
| epoch 047 | valid on 'valid' subset | loss 5.642 | nll_loss 3.883 | ppl 14.76 | num_updates 52000 | best_loss 5.64205 | length_loss 4.84096
| epoch 047 | valid on 'valid' subset | loss 5.634 | nll_loss 3.884 | ppl 14.76 | num_updates 52472 | best_loss 5.63406 | length_loss 5.12886
| epoch 048 | valid on 'valid' subset | loss 5.657 | nll_loss 3.910 | ppl 15.03 | num_updates 53589 | best_loss 5.63406 | length_loss 5.36747
| epoch 049 | valid on 'valid' subset | loss 5.656 | nll_loss 3.892 | ppl 14.85 | num_updates 54000 | best_loss 5.63406 | length_loss 5.42012
| epoch 049 | valid on 'valid' subset | loss 5.625 | nll_loss 3.859 | ppl 14.51 | num_updates 54705 | best_loss 5.62457 | length_loss 5.29776
| epoch 050 | valid on 'valid' subset | loss 5.681 | nll_loss 3.922 | ppl 15.16 | num_updates 55822 | best_loss 5.62457 | length_loss 5.3185
| epoch 051 | valid on 'valid' subset | loss 5.656 | nll_loss 3.894 | ppl 14.87 | num_updates 56000 | best_loss 5.62457 | length_loss 5.01675
| epoch 051 | valid on 'valid' subset | loss 5.620 | nll_loss 3.880 | ppl 14.72 | num_updates 56938 | best_loss 5.61993 | length_loss 4.89193
| epoch 052 | valid on 'valid' subset | loss 5.650 | nll_loss 3.886 | ppl 14.79 | num_updates 58000 | best_loss 5.61993 | length_loss 5.07345
| epoch 052 | valid on 'valid' subset | loss 5.642 | nll_loss 3.878 | ppl 14.71 | num_updates 58055 | best_loss 5.61993 | length_loss 5.29816
| epoch 053 | valid on 'valid' subset | loss 5.613 | nll_loss 3.861 | ppl 14.53 | num_updates 59171 | best_loss 5.61335 | length_loss 4.83874
| epoch 054 | valid on 'valid' subset | loss 5.618 | nll_loss 3.862 | ppl 14.54 | num_updates 60000 | best_loss 5.61335 | length_loss 5.05428
| epoch 054 | valid on 'valid' subset | loss 5.606 | nll_loss 3.839 | ppl 14.31 | num_updates 60288 | best_loss 5.60614 | length_loss 4.96276
| epoch 055 | valid on 'valid' subset | loss 5.606 | nll_loss 3.846 | ppl 14.38 | num_updates 61404 | best_loss 5.60614 | length_loss 5.17324
| epoch 056 | valid on 'valid' subset | loss 5.618 | nll_loss 3.863 | ppl 14.55 | num_updates 62000 | best_loss 5.60614 | length_loss 5.49125
| epoch 056 | valid on 'valid' subset | loss 5.632 | nll_loss 3.878 | ppl 14.71 | num_updates 62521 | best_loss 5.60614 | length_loss 4.88908
| epoch 057 | valid on 'valid' subset | loss 5.596 | nll_loss 3.839 | ppl 14.31 | num_updates 63638 | best_loss 5.59556 | length_loss 5.28005
| epoch 058 | valid on 'valid' subset | loss 5.658 | nll_loss 3.912 | ppl 15.05 | num_updates 64000 | best_loss 5.59556 | length_loss 4.7559
| epoch 058 | valid on 'valid' subset | loss 5.640 | nll_loss 3.876 | ppl 14.69 | num_updates 64754 | best_loss 5.59556 | length_loss 5.24229
| epoch 059 | valid on 'valid' subset | loss 5.608 | nll_loss 3.849 | ppl 14.41 | num_updates 65871 | best_loss 5.59556 | length_loss 5.104
| epoch 060 | valid on 'valid' subset | loss 5.623 | nll_loss 3.870 | ppl 14.62 | num_updates 66000 | best_loss 5.59556 | length_loss 5.01172
| epoch 060 | valid on 'valid' subset | loss 5.614 | nll_loss 3.853 | ppl 14.45 | num_updates 66987 | best_loss 5.59556 | length_loss 5.13998
| epoch 061 | valid on 'valid' subset | loss 5.615 | nll_loss 3.847 | ppl 14.39 | num_updates 68000 | best_loss 5.59556 | length_loss 5.66169
| epoch 061 | valid on 'valid' subset | loss 5.624 | nll_loss 3.873 | ppl 14.65 | num_updates 68104 | best_loss 5.59556 | length_loss 4.83914
| epoch 062 | valid on 'valid' subset | loss 5.639 | nll_loss 3.870 | ppl 14.62 | num_updates 69220 | best_loss 5.59556 | length_loss 5.43168
| epoch 063 | valid on 'valid' subset | loss 5.634 | nll_loss 3.871 | ppl 14.63 | num_updates 70000 | best_loss 5.59556 | length_loss 5.35281
| epoch 063 | valid on 'valid' subset | loss 5.599 | nll_loss 3.844 | ppl 14.36 | num_updates 70337 | best_loss 5.59556 | length_loss 5.57717
| epoch 064 | valid on 'valid' subset | loss 5.587 | nll_loss 3.828 | ppl 14.20 | num_updates 71454 | best_loss 5.58679 | length_loss 5.23344
| epoch 065 | valid on 'valid' subset | loss 5.638 | nll_loss 3.892 | ppl 14.85 | num_updates 72000 | best_loss 5.58679 | length_loss 4.88949
| epoch 065 | valid on 'valid' subset | loss 5.616 | nll_loss 3.861 | ppl 14.53 | num_updates 72571 | best_loss 5.58679 | length_loss 4.98517
| epoch 066 | valid on 'valid' subset | loss 5.611 | nll_loss 3.859 | ppl 14.51 | num_updates 73687 | best_loss 5.58679 | length_loss 5.14025
| epoch 067 | valid on 'valid' subset | loss 5.602 | nll_loss 3.846 | ppl 14.38 | num_updates 74000 | best_loss 5.58679 | length_loss 5.30373
| epoch 067 | valid on 'valid' subset | loss 5.621 | nll_loss 3.871 | ppl 14.63 | num_updates 74803 | best_loss 5.58679 | length_loss 5.21696
| epoch 068 | valid on 'valid' subset | loss 5.628 | nll_loss 3.871 | ppl 14.63 | num_updates 75920 | best_loss 5.58679 | length_loss 5.32986
| epoch 069 | valid on 'valid' subset | loss 5.595 | nll_loss 3.837 | ppl 14.29 | num_updates 76000 | best_loss 5.58679 | length_loss 5.02404
| epoch 069 | valid on 'valid' subset | loss 5.639 | nll_loss 3.878 | ppl 14.71 | num_updates 77037 | best_loss 5.58679 | length_loss 5.30288
| epoch 070 | valid on 'valid' subset | loss 5.614 | nll_loss 3.860 | ppl 14.52 | num_updates 78000 | best_loss 5.58679 | length_loss 5.00955
| epoch 070 | valid on 'valid' subset | loss 5.590 | nll_loss 3.831 | ppl 14.23 | num_updates 78153 | best_loss 5.58679 | length_loss 5.18245
| epoch 071 | valid on 'valid' subset | loss 5.627 | nll_loss 3.862 | ppl 14.54 | num_updates 79270 | best_loss 5.58679 | length_loss 5.14994
| epoch 072 | valid on 'valid' subset | loss 5.602 | nll_loss 3.847 | ppl 14.39 | num_updates 80000 | best_loss 5.58679 | length_loss 5.17858
| epoch 072 | valid on 'valid' subset | loss 5.592 | nll_loss 3.824 | ppl 14.16 | num_updates 80387 | best_loss 5.58679 | length_loss 5.48149
| epoch 073 | valid on 'valid' subset | loss 5.590 | nll_loss 3.828 | ppl 14.21 | num_updates 81503 | best_loss 5.58679 | length_loss 5.49635
| epoch 074 | valid on 'valid' subset | loss 5.596 | nll_loss 3.826 | ppl 14.18 | num_updates 82000 | best_loss 5.58679 | length_loss 5.73306
| epoch 074 | valid on 'valid' subset | loss 5.604 | nll_loss 3.846 | ppl 14.38 | num_updates 82620 | best_loss 5.58679 | length_loss 4.95396
| epoch 075 | valid on 'valid' subset | loss 5.590 | nll_loss 3.827 | ppl 14.19 | num_updates 83737 | best_loss 5.58679 | length_loss 5.23502
| epoch 076 | valid on 'valid' subset | loss 5.601 | nll_loss 3.835 | ppl 14.27 | num_updates 84000 | best_loss 5.58679 | length_loss 5.25717
| epoch 076 | valid on 'valid' subset | loss 5.600 | nll_loss 3.839 | ppl 14.31 | num_updates 84854 | best_loss 5.58679 | length_loss 5.06934
| epoch 077 | valid on 'valid' subset | loss 5.597 | nll_loss 3.832 | ppl 14.24 | num_updates 85970 | best_loss 5.58679 | length_loss 5.49544
| epoch 078 | valid on 'valid' subset | loss 5.631 | nll_loss 3.867 | ppl 14.59 | num_updates 86000 | best_loss 5.58679 | length_loss 5.52312
| epoch 078 | valid on 'valid' subset | loss 5.596 | nll_loss 3.826 | ppl 14.18 | num_updates 87086 | best_loss 5.58679 | length_loss 5.32114
| epoch 079 | valid on 'valid' subset | loss 5.586 | nll_loss 3.828 | ppl 14.20 | num_updates 88000 | best_loss 5.58557 | length_loss 5.28118
| epoch 079 | valid on 'valid' subset | loss 5.587 | nll_loss 3.825 | ppl 14.17 | num_updates 88203 | best_loss 5.58557 | length_loss 5.03314
| epoch 080 | valid on 'valid' subset | loss 5.579 | nll_loss 3.818 | ppl 14.11 | num_updates 89319 | best_loss 5.57853 | length_loss 5.49761
| epoch 081 | valid on 'valid' subset | loss 5.600 | nll_loss 3.837 | ppl 14.29 | num_updates 90000 | best_loss 5.57853 | length_loss 5.09228
| epoch 081 | valid on 'valid' subset | loss 5.623 | nll_loss 3.873 | ppl 14.65 | num_updates 90436 | best_loss 5.57853 | length_loss 5.15329
| epoch 082 | valid on 'valid' subset | loss 5.594 | nll_loss 3.835 | ppl 14.27 | num_updates 91552 | best_loss 5.57853 | length_loss 5.34417
| epoch 083 | valid on 'valid' subset | loss 5.626 | nll_loss 3.874 | ppl 14.66 | num_updates 92000 | best_loss 5.57853 | length_loss 4.9694
| epoch 083 | valid on 'valid' subset | loss 5.585 | nll_loss 3.821 | ppl 14.14 | num_updates 92669 | best_loss 5.57853 | length_loss 5.36829
| epoch 084 | valid on 'valid' subset | loss 5.610 | nll_loss 3.866 | ppl 14.58 | num_updates 93786 | best_loss 5.57853 | length_loss 4.95562
| epoch 085 | valid on 'valid' subset | loss 5.616 | nll_loss 3.864 | ppl 14.56 | num_updates 94000 | best_loss 5.57853 | length_loss 5.2255
| epoch 085 | valid on 'valid' subset | loss 5.580 | nll_loss 3.826 | ppl 14.18 | num_updates 94902 | best_loss 5.57853 | length_loss 5.10636
| epoch 086 | valid on 'valid' subset | loss 5.576 | nll_loss 3.814 | ppl 14.07 | num_updates 96000 | best_loss 5.57596 | length_loss 4.9704
| epoch 086 | valid on 'valid' subset | loss 5.603 | nll_loss 3.847 | ppl 14.39 | num_updates 96019 | best_loss 5.57596 | length_loss 5.2242
| epoch 087 | valid on 'valid' subset | loss 5.581 | nll_loss 3.812 | ppl 14.05 | num_updates 97136 | best_loss 5.57596 | length_loss 5.17751
| epoch 088 | valid on 'valid' subset | loss 5.609 | nll_loss 3.846 | ppl 14.38 | num_updates 98000 | best_loss 5.57596 | length_loss 5.35939
| epoch 088 | valid on 'valid' subset | loss 5.585 | nll_loss 3.818 | ppl 14.11 | num_updates 98253 | best_loss 5.57596 | length_loss 5.38032
| epoch 089 | valid on 'valid' subset | loss 5.592 | nll_loss 3.830 | ppl 14.22 | num_updates 99369 | best_loss 5.57596 | length_loss 5.25736
| epoch 090 | valid on 'valid' subset | loss 5.571 | nll_loss 3.812 | ppl 14.04 | num_updates 100000 | best_loss 5.57142 | length_loss 5.41195
| epoch 090 | valid on 'valid' subset | loss 5.576 | nll_loss 3.813 | ppl 14.05 | num_updates 100485 | best_loss 5.57142 | length_loss 5.32919
| epoch 091 | valid on 'valid' subset | loss 5.571 | nll_loss 3.800 | ppl 13.93 | num_updates 101602 | best_loss 5.5709 | length_loss 5.55755
| epoch 092 | valid on 'valid' subset | loss 5.586 | nll_loss 3.833 | ppl 14.25 | num_updates 102000 | best_loss 5.5709 | length_loss 5.11072
| epoch 092 | valid on 'valid' subset | loss 5.588 | nll_loss 3.825 | ppl 14.17 | num_updates 102719 | best_loss 5.5709 | length_loss 5.38931
| epoch 093 | valid on 'valid' subset | loss 5.593 | nll_loss 3.834 | ppl 14.26 | num_updates 103835 | best_loss 5.5709 | length_loss 5.4362
| epoch 094 | valid on 'valid' subset | loss 5.605 | nll_loss 3.844 | ppl 14.36 | num_updates 104000 | best_loss 5.5709 | length_loss 5.26631
| epoch 094 | valid on 'valid' subset | loss 5.605 | nll_loss 3.857 | ppl 14.49 | num_updates 104952 | best_loss 5.5709 | length_loss 5.1748
| epoch 095 | valid on 'valid' subset | loss 5.620 | nll_loss 3.863 | ppl 14.55 | num_updates 106000 | best_loss 5.5709 | length_loss 5.45649
| epoch 095 | valid on 'valid' subset | loss 5.588 | nll_loss 3.826 | ppl 14.18 | num_updates 106069 | best_loss 5.5709 | length_loss 5.26936
| epoch 096 | valid on 'valid' subset | loss 5.571 | nll_loss 3.809 | ppl 14.02 | num_updates 107186 | best_loss 5.5709 | length_loss 5.28208
| epoch 097 | valid on 'valid' subset | loss 5.636 | nll_loss 3.871 | ppl 14.63 | num_updates 108000 | best_loss 5.5709 | length_loss 5.47391
| epoch 097 | valid on 'valid' subset | loss 5.573 | nll_loss 3.820 | ppl 14.12 | num_updates 108301 | best_loss 5.5709 | length_loss 5.25959
| epoch 098 | valid on 'valid' subset | loss 5.593 | nll_loss 3.838 | ppl 14.30 | num_updates 109418 | best_loss 5.5709 | length_loss 5.22102
| epoch 099 | valid on 'valid' subset | loss 5.576 | nll_loss 3.825 | ppl 14.17 | num_updates 110000 | best_loss 5.5709 | length_loss 5.34992
| epoch 099 | valid on 'valid' subset | loss 5.590 | nll_loss 3.832 | ppl 14.24 | num_updates 110534 | best_loss 5.5709 | length_loss 5.30928
| epoch 100 | valid on 'valid' subset | loss 5.563 | nll_loss 3.814 | ppl 14.06 | num_updates 111651 | best_loss 5.56259 | length_loss 4.94879
| epoch 101 | valid on 'valid' subset | loss 5.604 | nll_loss 3.848 | ppl 14.40 | num_updates 112000 | best_loss 5.56259 | length_loss 5.26776
| epoch 101 | valid on 'valid' subset | loss 5.592 | nll_loss 3.845 | ppl 14.37 | num_updates 112767 | best_loss 5.56259 | length_loss 5.17652
| epoch 102 | valid on 'valid' subset | loss 5.576 | nll_loss 3.816 | ppl 14.09 | num_updates 113884 | best_loss 5.56259 | length_loss 5.29073
| epoch 103 | valid on 'valid' subset | loss 5.615 | nll_loss 3.859 | ppl 14.51 | num_updates 114000 | best_loss 5.56259 | length_loss 5.39229
| epoch 103 | valid on 'valid' subset | loss 5.586 | nll_loss 3.840 | ppl 14.32 | num_updates 115000 | best_loss 5.56259 | length_loss 5.18116
| epoch 104 | valid on 'valid' subset | loss 5.581 | nll_loss 3.830 | ppl 14.23 | num_updates 116000 | best_loss 5.56259 | length_loss 5.18728
| epoch 104 | valid on 'valid' subset | loss 5.579 | nll_loss 3.823 | ppl 14.16 | num_updates 116117 | best_loss 5.56259 | length_loss 5.35772
| epoch 105 | valid on 'valid' subset | loss 5.612 | nll_loss 3.854 | ppl 14.46 | num_updates 117234 | best_loss 5.56259 | length_loss 5.33082
| epoch 106 | valid on 'valid' subset | loss 5.599 | nll_loss 3.839 | ppl 14.31 | num_updates 118000 | best_loss 5.56259 | length_loss 5.42866
| epoch 106 | valid on 'valid' subset | loss 5.581 | nll_loss 3.829 | ppl 14.21 | num_updates 118350 | best_loss 5.56259 | length_loss 5.2323
| epoch 107 | valid on 'valid' subset | loss 5.568 | nll_loss 3.814 | ppl 14.06 | num_updates 119467 | best_loss 5.56259 | length_loss 5.21312
| epoch 108 | valid on 'valid' subset | loss 5.566 | nll_loss 3.807 | ppl 13.99 | num_updates 120000 | best_loss 5.56259 | length_loss 5.44057
| epoch 108 | valid on 'valid' subset | loss 5.584 | nll_loss 3.809 | ppl 14.02 | num_updates 120583 | best_loss 5.56259 | length_loss 5.75644
| epoch 109 | valid on 'valid' subset | loss 5.561 | nll_loss 3.800 | ppl 13.93 | num_updates 121700 | best_loss 5.56063 | length_loss 5.1544
| epoch 110 | valid on 'valid' subset | loss 5.631 | nll_loss 3.865 | ppl 14.57 | num_updates 122000 | best_loss 5.56063 | length_loss 5.7186
| epoch 110 | valid on 'valid' subset | loss 5.561 | nll_loss 3.804 | ppl 13.97 | num_updates 122817 | best_loss 5.56063 | length_loss 5.30358
| epoch 111 | valid on 'valid' subset | loss 5.586 | nll_loss 3.824 | ppl 14.16 | num_updates 123933 | best_loss 5.56063 | length_loss 5.35073
| epoch 112 | valid on 'valid' subset | loss 5.584 | nll_loss 3.824 | ppl 14.16 | num_updates 124000 | best_loss 5.56063 | length_loss 5.15542
| epoch 112 | valid on 'valid' subset | loss 5.587 | nll_loss 3.817 | ppl 14.10 | num_updates 125050 | best_loss 5.56063 | length_loss 5.61057
| epoch 113 | valid on 'valid' subset | loss 5.583 | nll_loss 3.825 | ppl 14.18 | num_updates 126000 | best_loss 5.56063 | length_loss 5.05128
| epoch 113 | valid on 'valid' subset | loss 5.608 | nll_loss 3.859 | ppl 14.51 | num_updates 126166 | best_loss 5.56063 | length_loss 5.15769
| epoch 114 | valid on 'valid' subset | loss 5.607 | nll_loss 3.853 | ppl 14.45 | num_updates 127283 | best_loss 5.56063 | length_loss 5.09543
| epoch 115 | valid on 'valid' subset | loss 5.560 | nll_loss 3.808 | ppl 14.00 | num_updates 128000 | best_loss 5.55957 | length_loss 5.00717
| epoch 115 | valid on 'valid' subset | loss 5.609 | nll_loss 3.847 | ppl 14.39 | num_updates 128400 | best_loss 5.55957 | length_loss 5.17661
| epoch 116 | valid on 'valid' subset | loss 5.554 | nll_loss 3.800 | ppl 13.93 | num_updates 129517 | best_loss 5.55379 | length_loss 5.1794
| epoch 117 | valid on 'valid' subset | loss 5.608 | nll_loss 3.856 | ppl 14.48 | num_updates 130000 | best_loss 5.55379 | length_loss 5.50267
| epoch 117 | valid on 'valid' subset | loss 5.566 | nll_loss 3.816 | ppl 14.09 | num_updates 130633 | best_loss 5.55379 | length_loss 5.10008
| epoch 118 | valid on 'valid' subset | loss 5.597 | nll_loss 3.849 | ppl 14.41 | num_updates 131749 | best_loss 5.55379 | length_loss 5.26575
| epoch 119 | valid on 'valid' subset | loss 5.590 | nll_loss 3.827 | ppl 14.19 | num_updates 132000 | best_loss 5.55379 | length_loss 5.60259
| epoch 119 | valid on 'valid' subset | loss 5.594 | nll_loss 3.838 | ppl 14.30 | num_updates 132866 | best_loss 5.55379 | length_loss 5.10372
| epoch 120 | valid on 'valid' subset | loss 5.568 | nll_loss 3.817 | ppl 14.09 | num_updates 133983 | best_loss 5.55379 | length_loss 5.09884
| epoch 121 | valid on 'valid' subset | loss 5.575 | nll_loss 3.814 | ppl 14.07 | num_updates 134000 | best_loss 5.55379 | length_loss 5.60332
| epoch 121 | valid on 'valid' subset | loss 5.576 | nll_loss 3.818 | ppl 14.10 | num_updates 135099 | best_loss 5.55379 | length_loss 5.19808
| epoch 122 | valid on 'valid' subset | loss 5.559 | nll_loss 3.802 | ppl 13.95 | num_updates 136000 | best_loss 5.55379 | length_loss 5.27703
| epoch 122 | valid on 'valid' subset | loss 5.570 | nll_loss 3.806 | ppl 13.98 | num_updates 136216 | best_loss 5.55379 | length_loss 5.33263
| epoch 123 | valid on 'valid' subset | loss 5.581 | nll_loss 3.830 | ppl 14.22 | num_updates 137333 | best_loss 5.55379 | length_loss 5.15172
| epoch 124 | valid on 'valid' subset | loss 5.584 | nll_loss 3.829 | ppl 14.21 | num_updates 138000 | best_loss 5.55379 | length_loss 5.29541
| epoch 124 | valid on 'valid' subset | loss 5.586 | nll_loss 3.835 | ppl 14.28 | num_updates 138449 | best_loss 5.55379 | length_loss 5.17188
| epoch 125 | valid on 'valid' subset | loss 5.574 | nll_loss 3.810 | ppl 14.02 | num_updates 139565 | best_loss 5.55379 | length_loss 5.61704
| epoch 126 | valid on 'valid' subset | loss 5.571 | nll_loss 3.803 | ppl 13.96 | num_updates 140000 | best_loss 5.55379 | length_loss 5.74625
| epoch 126 | valid on 'valid' subset | loss 5.573 | nll_loss 3.815 | ppl 14.07 | num_updates 140682 | best_loss 5.55379 | length_loss 5.47755
| epoch 127 | valid on 'valid' subset | loss 5.588 | nll_loss 3.823 | ppl 14.15 | num_updates 141799 | best_loss 5.55379 | length_loss 5.45464
| epoch 128 | valid on 'valid' subset | loss 5.611 | nll_loss 3.847 | ppl 14.39 | num_updates 142000 | best_loss 5.55379 | length_loss 5.65676
| epoch 128 | valid on 'valid' subset | loss 5.589 | nll_loss 3.831 | ppl 14.23 | num_updates 142916 | best_loss 5.55379 | length_loss 5.35568
| epoch 129 | valid on 'valid' subset | loss 5.572 | nll_loss 3.817 | ppl 14.10 | num_updates 144000 | best_loss 5.55379 | length_loss 5.08733
| epoch 129 | valid on 'valid' subset | loss 5.579 | nll_loss 3.828 | ppl 14.20 | num_updates 144032 | best_loss 5.55379 | length_loss 5.20082
| epoch 130 | valid on 'valid' subset | loss 5.582 | nll_loss 3.828 | ppl 14.20 | num_updates 145149 | best_loss 5.55379 | length_loss 5.31697
| epoch 131 | valid on 'valid' subset | loss 5.574 | nll_loss 3.814 | ppl 14.07 | num_updates 146000 | best_loss 5.55379 | length_loss 5.58609
| epoch 131 | valid on 'valid' subset | loss 5.557 | nll_loss 3.799 | ppl 13.92 | num_updates 146265 | best_loss 5.55379 | length_loss 5.25248
| epoch 132 | valid on 'valid' subset | loss 5.570 | nll_loss 3.805 | ppl 13.97 | num_updates 147382 | best_loss 5.55379 | length_loss 5.62696
| epoch 133 | valid on 'valid' subset | loss 5.565 | nll_loss 3.812 | ppl 14.04 | num_updates 148000 | best_loss 5.55379 | length_loss 5.23231
| epoch 133 | valid on 'valid' subset | loss 5.543 | nll_loss 3.782 | ppl 13.76 | num_updates 148498 | best_loss 5.54307 | length_loss 5.483
| epoch 134 | valid on 'valid' subset | loss 5.581 | nll_loss 3.816 | ppl 14.08 | num_updates 149615 | best_loss 5.54307 | length_loss 5.53838
| epoch 135 | valid on 'valid' subset | loss 5.571 | nll_loss 3.822 | ppl 14.14 | num_updates 150000 | best_loss 5.54307 | length_loss 5.07699
| epoch 135 | valid on 'valid' subset | loss 5.563 | nll_loss 3.812 | ppl 14.05 | num_updates 150731 | best_loss 5.54307 | length_loss 5.20817
| epoch 136 | valid on 'valid' subset | loss 5.555 | nll_loss 3.792 | ppl 13.85 | num_updates 151848 | best_loss 5.54307 | length_loss 5.36711
| epoch 137 | valid on 'valid' subset | loss 5.586 | nll_loss 3.829 | ppl 14.22 | num_updates 152000 | best_loss 5.54307 | length_loss 5.23819
| epoch 137 | valid on 'valid' subset | loss 5.559 | nll_loss 3.793 | ppl 13.86 | num_updates 152964 | best_loss 5.54307 | length_loss 5.49783
| epoch 138 | valid on 'valid' subset | loss 5.585 | nll_loss 3.824 | ppl 14.16 | num_updates 154000 | best_loss 5.54307 | length_loss 5.2778
| epoch 138 | valid on 'valid' subset | loss 5.581 | nll_loss 3.826 | ppl 14.18 | num_updates 154081 | best_loss 5.54307 | length_loss 5.51582
| epoch 139 | valid on 'valid' subset | loss 5.541 | nll_loss 3.780 | ppl 13.74 | num_updates 155197 | best_loss 5.54051 | length_loss 5.34901
| epoch 140 | valid on 'valid' subset | loss 5.599 | nll_loss 3.841 | ppl 14.33 | num_updates 156000 | best_loss 5.54051 | length_loss 5.25924
| epoch 140 | valid on 'valid' subset | loss 5.561 | nll_loss 3.808 | ppl 14.01 | num_updates 156314 | best_loss 5.54051 | length_loss 5.46697
| epoch 141 | valid on 'valid' subset | loss 5.585 | nll_loss 3.826 | ppl 14.18 | num_updates 157431 | best_loss 5.54051 | length_loss 5.43181
| epoch 142 | valid on 'valid' subset | loss 5.570 | nll_loss 3.817 | ppl 14.09 | num_updates 158000 | best_loss 5.54051 | length_loss 5.17516
| epoch 142 | valid on 'valid' subset | loss 5.564 | nll_loss 3.792 | ppl 13.85 | num_updates 158547 | best_loss 5.54051 | length_loss 5.67563
| epoch 143 | valid on 'valid' subset | loss 5.553 | nll_loss 3.799 | ppl 13.92 | num_updates 159664 | best_loss 5.54051 | length_loss 5.26242
| epoch 144 | valid on 'valid' subset | loss 5.563 | nll_loss 3.806 | ppl 13.99 | num_updates 160000 | best_loss 5.54051 | length_loss 5.31434
| epoch 144 | valid on 'valid' subset | loss 5.571 | nll_loss 3.816 | ppl 14.08 | num_updates 160781 | best_loss 5.54051 | length_loss 5.52447
| epoch 145 | valid on 'valid' subset | loss 5.553 | nll_loss 3.794 | ppl 13.87 | num_updates 161897 | best_loss 5.54051 | length_loss 5.19175
| epoch 146 | valid on 'valid' subset | loss 5.573 | nll_loss 3.821 | ppl 14.13 | num_updates 162000 | best_loss 5.54051 | length_loss 5.47153
| epoch 146 | valid on 'valid' subset | loss 5.567 | nll_loss 3.806 | ppl 13.99 | num_updates 163014 | best_loss 5.54051 | length_loss 5.28773
| epoch 147 | valid on 'valid' subset | loss 5.560 | nll_loss 3.798 | ppl 13.91 | num_updates 164000 | best_loss 5.54051 | length_loss 5.77107
| epoch 147 | valid on 'valid' subset | loss 5.563 | nll_loss 3.803 | ppl 13.96 | num_updates 164131 | best_loss 5.54051 | length_loss 5.38077
| epoch 148 | valid on 'valid' subset | loss 5.549 | nll_loss 3.777 | ppl 13.70 | num_updates 165247 | best_loss 5.54051 | length_loss 6.14763
| epoch 149 | valid on 'valid' subset | loss 5.576 | nll_loss 3.821 | ppl 14.13 | num_updates 166000 | best_loss 5.54051 | length_loss 5.17569
| epoch 149 | valid on 'valid' subset | loss 5.549 | nll_loss 3.796 | ppl 13.89 | num_updates 166363 | best_loss 5.54051 | length_loss 5.11687
| epoch 150 | valid on 'valid' subset | loss 5.567 | nll_loss 3.813 | ppl 14.05 | num_updates 167480 | best_loss 5.54051 | length_loss 5.09933
| epoch 151 | valid on 'valid' subset | loss 5.557 | nll_loss 3.804 | ppl 13.97 | num_updates 168000 | best_loss 5.54051 | length_loss 5.26535
| epoch 151 | valid on 'valid' subset | loss 5.567 | nll_loss 3.816 | ppl 14.08 | num_updates 168597 | best_loss 5.54051 | length_loss 5.31387
| epoch 152 | valid on 'valid' subset | loss 5.568 | nll_loss 3.809 | ppl 14.02 | num_updates 169714 | best_loss 5.54051 | length_loss 5.39712
| epoch 153 | valid on 'valid' subset | loss 5.606 | nll_loss 3.845 | ppl 14.37 | num_updates 170000 | best_loss 5.54051 | length_loss 5.62737
| epoch 153 | valid on 'valid' subset | loss 5.573 | nll_loss 3.814 | ppl 14.07 | num_updates 170831 | best_loss 5.54051 | length_loss 5.19096
| epoch 154 | valid on 'valid' subset | loss 5.565 | nll_loss 3.806 | ppl 13.99 | num_updates 171947 | best_loss 5.54051 | length_loss 5.34469
| epoch 155 | valid on 'valid' subset | loss 5.598 | nll_loss 3.841 | ppl 14.33 | num_updates 172000 | best_loss 5.54051 | length_loss 5.44766
| epoch 155 | valid on 'valid' subset | loss 5.555 | nll_loss 3.795 | ppl 13.88 | num_updates 173064 | best_loss 5.54051 | length_loss 5.15003
| epoch 156 | valid on 'valid' subset | loss 5.595 | nll_loss 3.843 | ppl 14.35 | num_updates 174000 | best_loss 5.54051 | length_loss 5.46855
| epoch 156 | valid on 'valid' subset | loss 5.567 | nll_loss 3.813 | ppl 14.05 | num_updates 174181 | best_loss 5.54051 | length_loss 5.4195
| epoch 157 | valid on 'valid' subset | loss 5.568 | nll_loss 3.814 | ppl 14.07 | num_updates 175297 | best_loss 5.54051 | length_loss 5.37113
| epoch 158 | valid on 'valid' subset | loss 5.595 | nll_loss 3.844 | ppl 14.36 | num_updates 176000 | best_loss 5.54051 | length_loss 5.27404
| epoch 158 | valid on 'valid' subset | loss 5.566 | nll_loss 3.808 | ppl 14.00 | num_updates 176414 | best_loss 5.54051 | length_loss 5.30657
| epoch 159 | valid on 'valid' subset | loss 5.570 | nll_loss 3.799 | ppl 13.92 | num_updates 177530 | best_loss 5.54051 | length_loss 5.51189
| epoch 160 | valid on 'valid' subset | loss 5.567 | nll_loss 3.814 | ppl 14.06 | num_updates 178000 | best_loss 5.54051 | length_loss 5.24147
| epoch 160 | valid on 'valid' subset | loss 5.555 | nll_loss 3.797 | ppl 13.90 | num_updates 178647 | best_loss 5.54051 | length_loss 5.17881
| epoch 161 | valid on 'valid' subset | loss 5.584 | nll_loss 3.818 | ppl 14.10 | num_updates 179764 | best_loss 5.54051 | length_loss 5.69451
| epoch 162 | valid on 'valid' subset | loss 5.580 | nll_loss 3.827 | ppl 14.20 | num_updates 180000 | best_loss 5.54051 | length_loss 5.3949
| epoch 162 | valid on 'valid' subset | loss 5.568 | nll_loss 3.811 | ppl 14.03 | num_updates 180880 | best_loss 5.54051 | length_loss 5.4461
| epoch 163 | valid on 'valid' subset | loss 5.550 | nll_loss 3.789 | ppl 13.82 | num_updates 181997 | best_loss 5.54051 | length_loss 5.37923
| epoch 164 | valid on 'valid' subset | loss 5.595 | nll_loss 3.838 | ppl 14.30 | num_updates 182000 | best_loss 5.54051 | length_loss 5.31286
| epoch 164 | valid on 'valid' subset | loss 5.570 | nll_loss 3.814 | ppl 14.07 | num_updates 183113 | best_loss 5.54051 | length_loss 5.35665
| epoch 165 | valid on 'valid' subset | loss 5.561 | nll_loss 3.797 | ppl 13.90 | num_updates 184000 | best_loss 5.54051 | length_loss 5.50869
| epoch 165 | valid on 'valid' subset | loss 5.565 | nll_loss 3.806 | ppl 13.98 | num_updates 184230 | best_loss 5.54051 | length_loss 5.54872
| epoch 166 | valid on 'valid' subset | loss 5.598 | nll_loss 3.844 | ppl 14.36 | num_updates 185346 | best_loss 5.54051 | length_loss 5.27286
| epoch 167 | valid on 'valid' subset | loss 5.597 | nll_loss 3.829 | ppl 14.21 | num_updates 186000 | best_loss 5.54051 | length_loss 5.60072
| epoch 167 | valid on 'valid' subset | loss 5.577 | nll_loss 3.818 | ppl 14.10 | num_updates 186463 | best_loss 5.54051 | length_loss 5.34891
| epoch 168 | valid on 'valid' subset | loss 5.551 | nll_loss 3.783 | ppl 13.77 | num_updates 187579 | best_loss 5.54051 | length_loss 5.42081
| epoch 169 | valid on 'valid' subset | loss 5.573 | nll_loss 3.811 | ppl 14.04 | num_updates 188000 | best_loss 5.54051 | length_loss 5.46544
| epoch 169 | valid on 'valid' subset | loss 5.562 | nll_loss 3.801 | ppl 13.94 | num_updates 188696 | best_loss 5.54051 | length_loss 5.46789
| epoch 170 | valid on 'valid' subset | loss 5.564 | nll_loss 3.799 | ppl 13.92 | num_updates 189812 | best_loss 5.54051 | length_loss 5.81281
| epoch 171 | valid on 'valid' subset | loss 5.569 | nll_loss 3.818 | ppl 14.10 | num_updates 190000 | best_loss 5.54051 | length_loss 5.47333
| epoch 171 | valid on 'valid' subset | loss 5.561 | nll_loss 3.806 | ppl 13.99 | num_updates 190929 | best_loss 5.54051 | length_loss 5.39761
| epoch 172 | valid on 'valid' subset | loss 5.571 | nll_loss 3.813 | ppl 14.05 | num_updates 192000 | best_loss 5.54051 | length_loss 5.6447
| epoch 172 | valid on 'valid' subset | loss 5.556 | nll_loss 3.801 | ppl 13.94 | num_updates 192046 | best_loss 5.54051 | length_loss 5.54948
| epoch 173 | valid on 'valid' subset | loss 5.579 | nll_loss 3.825 | ppl 14.17 | num_updates 193161 | best_loss 5.54051 | length_loss 5.43502
| epoch 174 | valid on 'valid' subset | loss 5.530 | nll_loss 3.762 | ppl 13.57 | num_updates 194000 | best_loss 5.53033 | length_loss 5.45161
| epoch 174 | valid on 'valid' subset | loss 5.567 | nll_loss 3.811 | ppl 14.03 | num_updates 194278 | best_loss 5.53033 | length_loss 5.56833
| epoch 175 | valid on 'valid' subset | loss 5.557 | nll_loss 3.797 | ppl 13.90 | num_updates 195395 | best_loss 5.53033 | length_loss 5.41813
| epoch 176 | valid on 'valid' subset | loss 5.570 | nll_loss 3.794 | ppl 13.87 | num_updates 196000 | best_loss 5.53033 | length_loss 5.71809
| epoch 176 | valid on 'valid' subset | loss 5.540 | nll_loss 3.786 | ppl 13.80 | num_updates 196512 | best_loss 5.53033 | length_loss 5.37814
| epoch 177 | valid on 'valid' subset | loss 5.566 | nll_loss 3.802 | ppl 13.95 | num_updates 197629 | best_loss 5.53033 | length_loss 5.41428
| epoch 178 | valid on 'valid' subset | loss 5.572 | nll_loss 3.814 | ppl 14.06 | num_updates 198000 | best_loss 5.53033 | length_loss 5.61653
| epoch 178 | valid on 'valid' subset | loss 5.566 | nll_loss 3.804 | ppl 13.96 | num_updates 198745 | best_loss 5.53033 | length_loss 5.30565
| epoch 179 | valid on 'valid' subset | loss 5.536 | nll_loss 3.782 | ppl 13.75 | num_updates 199862 | best_loss 5.53033 | length_loss 5.24064
| epoch 180 | valid on 'valid' subset | loss 5.568 | nll_loss 3.796 | ppl 13.89 | num_updates 200000 | best_loss 5.53033 | length_loss 5.66345
| epoch 180 | valid on 'valid' subset | loss 5.551 | nll_loss 3.795 | ppl 13.88 | num_updates 200977 | best_loss 5.53033 | length_loss 5.14049
| epoch 181 | valid on 'valid' subset | loss 5.567 | nll_loss 3.805 | ppl 13.98 | num_updates 202000 | best_loss 5.53033 | length_loss 5.57159
| epoch 181 | valid on 'valid' subset | loss 5.544 | nll_loss 3.791 | ppl 13.84 | num_updates 202094 | best_loss 5.53033 | length_loss 5.06756
| epoch 182 | valid on 'valid' subset | loss 5.578 | nll_loss 3.819 | ppl 14.11 | num_updates 203211 | best_loss 5.53033 | length_loss 5.28628
| epoch 183 | valid on 'valid' subset | loss 5.546 | nll_loss 3.800 | ppl 13.93 | num_updates 204000 | best_loss 5.53033 | length_loss 5.08014
| epoch 183 | valid on 'valid' subset | loss 5.572 | nll_loss 3.814 | ppl 14.07 | num_updates 204328 | best_loss 5.53033 | length_loss 5.37596
| epoch 184 | valid on 'valid' subset | loss 5.581 | nll_loss 3.817 | ppl 14.09 | num_updates 205444 | best_loss 5.53033 | length_loss 5.59484
| epoch 185 | valid on 'valid' subset | loss 5.570 | nll_loss 3.808 | ppl 14.01 | num_updates 206000 | best_loss 5.53033 | length_loss 5.45787
| epoch 185 | valid on 'valid' subset | loss 5.562 | nll_loss 3.810 | ppl 14.03 | num_updates 206560 | best_loss 5.53033 | length_loss 5.41936
| epoch 186 | valid on 'valid' subset | loss 5.547 | nll_loss 3.787 | ppl 13.81 | num_updates 207677 | best_loss 5.53033 | length_loss 5.4934
| epoch 187 | valid on 'valid' subset | loss 5.546 | nll_loss 3.785 | ppl 13.79 | num_updates 208000 | best_loss 5.53033 | length_loss 5.46068
| epoch 187 | valid on 'valid' subset | loss 5.552 | nll_loss 3.801 | ppl 13.94 | num_updates 208793 | best_loss 5.53033 | length_loss 5.20748
| epoch 188 | valid on 'valid' subset | loss 5.586 | nll_loss 3.826 | ppl 14.18 | num_updates 209910 | best_loss 5.53033 | length_loss 5.44483
| epoch 189 | valid on 'valid' subset | loss 5.591 | nll_loss 3.821 | ppl 14.13 | num_updates 210000 | best_loss 5.53033 | length_loss 5.74206
| epoch 189 | valid on 'valid' subset | loss 5.568 | nll_loss 3.811 | ppl 14.04 | num_updates 211027 | best_loss 5.53033 | length_loss 5.5314
| epoch 190 | valid on 'valid' subset | loss 5.557 | nll_loss 3.809 | ppl 14.01 | num_updates 212000 | best_loss 5.53033 | length_loss 5.15352
| epoch 190 | valid on 'valid' subset | loss 5.560 | nll_loss 3.792 | ppl 13.85 | num_updates 212144 | best_loss 5.53033 | length_loss 5.76931
| epoch 191 | valid on 'valid' subset | loss 5.553 | nll_loss 3.799 | ppl 13.92 | num_updates 213260 | best_loss 5.53033 | length_loss 5.39174
| epoch 192 | valid on 'valid' subset | loss 5.561 | nll_loss 3.809 | ppl 14.02 | num_updates 214000 | best_loss 5.53033 | length_loss 5.37401
| epoch 192 | valid on 'valid' subset | loss 5.573 | nll_loss 3.818 | ppl 14.10 | num_updates 214377 | best_loss 5.53033 | length_loss 5.42767
| epoch 193 | valid on 'valid' subset | loss 5.575 | nll_loss 3.804 | ppl 13.97 | num_updates 215494 | best_loss 5.53033 | length_loss 5.91402
| epoch 194 | valid on 'valid' subset | loss 5.563 | nll_loss 3.801 | ppl 13.94 | num_updates 216000 | best_loss 5.53033 | length_loss 5.6633
| epoch 194 | valid on 'valid' subset | loss 5.578 | nll_loss 3.824 | ppl 14.16 | num_updates 216610 | best_loss 5.53033 | length_loss 5.41596
| epoch 195 | valid on 'valid' subset | loss 5.565 | nll_loss 3.810 | ppl 14.02 | num_updates 217727 | best_loss 5.53033 | length_loss 5.25721
| epoch 196 | valid on 'valid' subset | loss 5.563 | nll_loss 3.806 | ppl 13.98 | num_updates 218000 | best_loss 5.53033 | length_loss 5.53871
| epoch 196 | valid on 'valid' subset | loss 5.551 | nll_loss 3.795 | ppl 13.88 | num_updates 218844 | best_loss 5.53033 | length_loss 5.42202
| epoch 197 | valid on 'valid' subset | loss 5.585 | nll_loss 3.823 | ppl 14.15 | num_updates 219960 | best_loss 5.53033 | length_loss 5.49439
| epoch 198 | valid on 'valid' subset | loss 5.573 | nll_loss 3.810 | ppl 14.02 | num_updates 220000 | best_loss 5.53033 | length_loss 5.55596
| epoch 198 | valid on 'valid' subset | loss 5.572 | nll_loss 3.807 | ppl 14.00 | num_updates 221077 | best_loss 5.53033 | length_loss 5.66251
| epoch 199 | valid on 'valid' subset | loss 5.550 | nll_loss 3.783 | ppl 13.77 | num_updates 222000 | best_loss 5.53033 | length_loss 5.73069
| epoch 199 | valid on 'valid' subset | loss 5.561 | nll_loss 3.806 | ppl 13.98 | num_updates 222194 | best_loss 5.53033 | length_loss 5.51685
| epoch 200 | valid on 'valid' subset | loss 5.561 | nll_loss 3.802 | ppl 13.95 | num_updates 223311 | best_loss 5.53033 | length_loss 5.36789
| epoch 201 | valid on 'valid' subset | loss 5.584 | nll_loss 3.829 | ppl 14.22 | num_updates 224000 | best_loss 5.53033 | length_loss 5.7705
| epoch 201 | valid on 'valid' subset | loss 5.544 | nll_loss 3.785 | ppl 13.78 | num_updates 224426 | best_loss 5.53033 | length_loss 5.46047
| epoch 202 | valid on 'valid' subset | loss 5.554 | nll_loss 3.803 | ppl 13.96 | num_updates 225543 | best_loss 5.53033 | length_loss 5.27932
| epoch 203 | valid on 'valid' subset | loss 5.568 | nll_loss 3.820 | ppl 14.12 | num_updates 226000 | best_loss 5.53033 | length_loss 5.474
| epoch 203 | valid on 'valid' subset | loss 5.540 | nll_loss 3.776 | ppl 13.70 | num_updates 226659 | best_loss 5.53033 | length_loss 5.5055
| epoch 204 | valid on 'valid' subset | loss 5.579 | nll_loss 3.821 | ppl 14.13 | num_updates 227776 | best_loss 5.53033 | length_loss 5.52887
| epoch 205 | valid on 'valid' subset | loss 5.556 | nll_loss 3.800 | ppl 13.92 | num_updates 228000 | best_loss 5.53033 | length_loss 5.48365
| epoch 205 | valid on 'valid' subset | loss 5.563 | nll_loss 3.810 | ppl 14.02 | num_updates 228893 | best_loss 5.53033 | length_loss 5.45851
| epoch 206 | valid on 'valid' subset | loss 5.568 | nll_loss 3.819 | ppl 14.11 | num_updates 230000 | best_loss 5.53033 | length_loss 5.33926
| epoch 206 | valid on 'valid' subset | loss 5.555 | nll_loss 3.802 | ppl 13.95 | num_updates 230010 | best_loss 5.53033 | length_loss 5.2936
| epoch 207 | valid on 'valid' subset | loss 5.545 | nll_loss 3.787 | ppl 13.80 | num_updates 231127 | best_loss 5.53033 | length_loss 5.35306
| epoch 208 | valid on 'valid' subset | loss 5.546 | nll_loss 3.789 | ppl 13.83 | num_updates 232000 | best_loss 5.53033 | length_loss 5.25107
| epoch 208 | valid on 'valid' subset | loss 5.554 | nll_loss 3.784 | ppl 13.78 | num_updates 232243 | best_loss 5.53033 | length_loss 5.83934
| epoch 209 | valid on 'valid' subset | loss 5.554 | nll_loss 3.790 | ppl 13.83 | num_updates 233360 | best_loss 5.53033 | length_loss 5.65368
| epoch 210 | valid on 'valid' subset | loss 5.568 | nll_loss 3.812 | ppl 14.04 | num_updates 234000 | best_loss 5.53033 | length_loss 5.36371
| epoch 210 | valid on 'valid' subset | loss 5.603 | nll_loss 3.843 | ppl 14.35 | num_updates 234477 | best_loss 5.53033 | length_loss 5.53959
| epoch 211 | valid on 'valid' subset | loss 5.553 | nll_loss 3.791 | ppl 13.84 | num_updates 235593 | best_loss 5.53033 | length_loss 5.80747
| epoch 212 | valid on 'valid' subset | loss 5.545 | nll_loss 3.784 | ppl 13.77 | num_updates 236000 | best_loss 5.53033 | length_loss 5.5965
| epoch 212 | valid on 'valid' subset | loss 5.574 | nll_loss 3.807 | ppl 14.00 | num_updates 236710 | best_loss 5.53033 | length_loss 5.66404
| epoch 213 | valid on 'valid' subset | loss 5.575 | nll_loss 3.815 | ppl 14.08 | num_updates 237827 | best_loss 5.53033 | length_loss 5.90345
| epoch 214 | valid on 'valid' subset | loss 5.583 | nll_loss 3.826 | ppl 14.18 | num_updates 238000 | best_loss 5.53033 | length_loss 5.63592
| epoch 214 | valid on 'valid' subset | loss 5.567 | nll_loss 3.801 | ppl 13.94 | num_updates 238943 | best_loss 5.53033 | length_loss 5.82184
| epoch 215 | valid on 'valid' subset | loss 5.543 | nll_loss 3.791 | ppl 13.84 | num_updates 240000 | best_loss 5.53033 | length_loss 5.10817
| epoch 215 | valid on 'valid' subset | loss 5.515 | nll_loss 3.762 | ppl 13.57 | num_updates 240060 | best_loss 5.51535 | length_loss 5.38116
| epoch 216 | valid on 'valid' subset | loss 5.531 | nll_loss 3.761 | ppl 13.56 | num_updates 241177 | best_loss 5.51535 | length_loss 5.93191
| epoch 217 | valid on 'valid' subset | loss 5.579 | nll_loss 3.815 | ppl 14.08 | num_updates 242000 | best_loss 5.51535 | length_loss 5.58746
| epoch 217 | valid on 'valid' subset | loss 5.551 | nll_loss 3.791 | ppl 13.85 | num_updates 242293 | best_loss 5.51535 | length_loss 5.70344
| epoch 218 | valid on 'valid' subset | loss 5.566 | nll_loss 3.806 | ppl 13.99 | num_updates 243410 | best_loss 5.51535 | length_loss 5.62662
| epoch 219 | valid on 'valid' subset | loss 5.560 | nll_loss 3.803 | ppl 13.96 | num_updates 244000 | best_loss 5.51535 | length_loss 5.2603
| epoch 219 | valid on 'valid' subset | loss 5.570 | nll_loss 3.812 | ppl 14.05 | num_updates 244527 | best_loss 5.51535 | length_loss 5.65762
| epoch 220 | valid on 'valid' subset | loss 5.540 | nll_loss 3.780 | ppl 13.74 | num_updates 245643 | best_loss 5.51535 | length_loss 5.83115
| epoch 221 | valid on 'valid' subset | loss 5.574 | nll_loss 3.814 | ppl 14.06 | num_updates 246000 | best_loss 5.51535 | length_loss 5.54657
| epoch 221 | valid on 'valid' subset | loss 5.548 | nll_loss 3.775 | ppl 13.69 | num_updates 246760 | best_loss 5.51535 | length_loss 5.85829
| epoch 222 | valid on 'valid' subset | loss 5.581 | nll_loss 3.823 | ppl 14.16 | num_updates 247876 | best_loss 5.51535 | length_loss 5.45996
| epoch 223 | valid on 'valid' subset | loss 5.555 | nll_loss 3.800 | ppl 13.93 | num_updates 248000 | best_loss 5.51535 | length_loss 5.22607
| epoch 223 | valid on 'valid' subset | loss 5.562 | nll_loss 3.808 | ppl 14.01 | num_updates 248993 | best_loss 5.51535 | length_loss 5.30127
| epoch 224 | valid on 'valid' subset | loss 5.561 | nll_loss 3.799 | ppl 13.91 | num_updates 250000 | best_loss 5.51535 | length_loss 5.53859
| epoch 224 | valid on 'valid' subset | loss 5.548 | nll_loss 3.800 | ppl 13.93 | num_updates 250109 | best_loss 5.51535 | length_loss 5.0822
| epoch 225 | valid on 'valid' subset | loss 5.559 | nll_loss 3.809 | ppl 14.01 | num_updates 251226 | best_loss 5.51535 | length_loss 5.16153
| epoch 226 | valid on 'valid' subset | loss 5.535 | nll_loss 3.778 | ppl 13.72 | num_updates 252000 | best_loss 5.51535 | length_loss 5.45856
| epoch 226 | valid on 'valid' subset | loss 5.544 | nll_loss 3.782 | ppl 13.76 | num_updates 252342 | best_loss 5.51535 | length_loss 5.49948
| epoch 227 | valid on 'valid' subset | loss 5.555 | nll_loss 3.802 | ppl 13.95 | num_updates 253459 | best_loss 5.51535 | length_loss 5.27929
| epoch 228 | valid on 'valid' subset | loss 5.556 | nll_loss 3.808 | ppl 14.01 | num_updates 254000 | best_loss 5.51535 | length_loss 5.24114
| epoch 228 | valid on 'valid' subset | loss 5.537 | nll_loss 3.776 | ppl 13.69 | num_updates 254576 | best_loss 5.51535 | length_loss 5.73599
| epoch 229 | valid on 'valid' subset | loss 5.557 | nll_loss 3.801 | ppl 13.94 | num_updates 255692 | best_loss 5.51535 | length_loss 5.54465
| epoch 230 | valid on 'valid' subset | loss 5.581 | nll_loss 3.823 | ppl 14.15 | num_updates 256000 | best_loss 5.51535 | length_loss 5.37077
| epoch 230 | valid on 'valid' subset | loss 5.560 | nll_loss 3.802 | ppl 13.94 | num_updates 256808 | best_loss 5.51535 | length_loss 5.352
| epoch 231 | valid on 'valid' subset | loss 5.569 | nll_loss 3.816 | ppl 14.08 | num_updates 257925 | best_loss 5.51535 | length_loss 5.41897
| epoch 232 | valid on 'valid' subset | loss 5.579 | nll_loss 3.827 | ppl 14.19 | num_updates 258000 | best_loss 5.51535 | length_loss 5.303
| epoch 232 | valid on 'valid' subset | loss 5.546 | nll_loss 3.792 | ppl 13.85 | num_updates 259042 | best_loss 5.51535 | length_loss 5.50189
| epoch 233 | valid on 'valid' subset | loss 5.559 | nll_loss 3.802 | ppl 13.94 | num_updates 260000 | best_loss 5.51535 | length_loss 5.48539
| epoch 233 | valid on 'valid' subset | loss 5.568 | nll_loss 3.815 | ppl 14.08 | num_updates 260159 | best_loss 5.51535 | length_loss 5.46542
| epoch 234 | valid on 'valid' subset | loss 5.572 | nll_loss 3.827 | ppl 14.19 | num_updates 261275 | best_loss 5.51535 | length_loss 5.45701
| epoch 235 | valid on 'valid' subset | loss 5.549 | nll_loss 3.787 | ppl 13.80 | num_updates 262000 | best_loss 5.51535 | length_loss 5.63183
| epoch 235 | valid on 'valid' subset | loss 5.535 | nll_loss 3.782 | ppl 13.76 | num_updates 262392 | best_loss 5.51535 | length_loss 5.47303
| epoch 236 | valid on 'valid' subset | loss 5.547 | nll_loss 3.786 | ppl 13.80 | num_updates 263508 | best_loss 5.51535 | length_loss 5.28952
| epoch 237 | valid on 'valid' subset | loss 5.558 | nll_loss 3.804 | ppl 13.97 | num_updates 264000 | best_loss 5.51535 | length_loss 5.38988
| epoch 237 | valid on 'valid' subset | loss 5.570 | nll_loss 3.811 | ppl 14.03 | num_updates 264624 | best_loss 5.51535 | length_loss 5.70354
| epoch 238 | valid on 'valid' subset | loss 5.562 | nll_loss 3.802 | ppl 13.95 | num_updates 265741 | best_loss 5.51535 | length_loss 5.42643
| epoch 239 | valid on 'valid' subset | loss 5.575 | nll_loss 3.815 | ppl 14.07 | num_updates 266000 | best_loss 5.51535 | length_loss 5.5593
| epoch 239 | valid on 'valid' subset | loss 5.564 | nll_loss 3.799 | ppl 13.92 | num_updates 266858 | best_loss 5.51535 | length_loss 5.75153
| epoch 240 | valid on 'valid' subset | loss 5.554 | nll_loss 3.801 | ppl 13.94 | num_updates 267975 | best_loss 5.51535 | length_loss 5.27644
| epoch 241 | valid on 'valid' subset | loss 5.580 | nll_loss 3.829 | ppl 14.21 | num_updates 268000 | best_loss 5.51535 | length_loss 5.52113
| epoch 241 | valid on 'valid' subset | loss 5.548 | nll_loss 3.788 | ppl 13.82 | num_updates 269091 | best_loss 5.51535 | length_loss 5.5829
| epoch 242 | valid on 'valid' subset | loss 5.577 | nll_loss 3.808 | ppl 14.00 | num_updates 270000 | best_loss 5.51535 | length_loss 5.60181
| epoch 242 | valid on 'valid' subset | loss 5.557 | nll_loss 3.802 | ppl 13.95 | num_updates 270208 | best_loss 5.51535 | length_loss 5.33101
| epoch 243 | valid on 'valid' subset | loss 5.542 | nll_loss 3.783 | ppl 13.77 | num_updates 271324 | best_loss 5.51535 | length_loss 5.42662
| epoch 244 | valid on 'valid' subset | loss 5.548 | nll_loss 3.801 | ppl 13.94 | num_updates 272000 | best_loss 5.51535 | length_loss 5.3026
| epoch 244 | valid on 'valid' subset | loss 5.558 | nll_loss 3.799 | ppl 13.92 | num_updates 272441 | best_loss 5.51535 | length_loss 5.47595
| epoch 245 | valid on 'valid' subset | loss 5.542 | nll_loss 3.777 | ppl 13.71 | num_updates 273558 | best_loss 5.51535 | length_loss 5.84733
| epoch 246 | valid on 'valid' subset | loss 5.557 | nll_loss 3.806 | ppl 13.98 | num_updates 274000 | best_loss 5.51535 | length_loss 5.25389
| epoch 246 | valid on 'valid' subset | loss 5.576 | nll_loss 3.811 | ppl 14.03 | num_updates 274674 | best_loss 5.51535 | length_loss 5.73085
| epoch 247 | valid on 'valid' subset | loss 5.572 | nll_loss 3.815 | ppl 14.08 | num_updates 275791 | best_loss 5.51535 | length_loss 5.33448
| epoch 248 | valid on 'valid' subset | loss 5.575 | nll_loss 3.821 | ppl 14.13 | num_updates 276000 | best_loss 5.51535 | length_loss 5.31691
| epoch 248 | valid on 'valid' subset | loss 5.536 | nll_loss 3.780 | ppl 13.74 | num_updates 276907 | best_loss 5.51535 | length_loss 5.04319
| epoch 249 | valid on 'valid' subset | loss 5.562 | nll_loss 3.801 | ppl 13.94 | num_updates 278000 | best_loss 5.51535 | length_loss 5.4161
| epoch 249 | valid on 'valid' subset | loss 5.553 | nll_loss 3.793 | ppl 13.86 | num_updates 278024 | best_loss 5.51535 | length_loss 5.47107
| epoch 250 | valid on 'valid' subset | loss 5.551 | nll_loss 3.796 | ppl 13.89 | num_updates 279139 | best_loss 5.51535 | length_loss 5.29736
| epoch 251 | valid on 'valid' subset | loss 5.580 | nll_loss 3.823 | ppl 14.15 | num_updates 280000 | best_loss 5.51535 | length_loss 5.49814
| epoch 251 | valid on 'valid' subset | loss 5.556 | nll_loss 3.791 | ppl 13.84 | num_updates 280256 | best_loss 5.51535 | length_loss 5.53517
| epoch 252 | valid on 'valid' subset | loss 5.557 | nll_loss 3.798 | ppl 13.91 | num_updates 281373 | best_loss 5.51535 | length_loss 5.41261
| epoch 253 | valid on 'valid' subset | loss 5.554 | nll_loss 3.800 | ppl 13.93 | num_updates 282000 | best_loss 5.51535 | length_loss 5.37349
| epoch 253 | valid on 'valid' subset | loss 5.578 | nll_loss 3.820 | ppl 14.13 | num_updates 282490 | best_loss 5.51535 | length_loss 5.45404
| epoch 254 | valid on 'valid' subset | loss 5.580 | nll_loss 3.808 | ppl 14.01 | num_updates 283606 | best_loss 5.51535 | length_loss 6.01807
| epoch 255 | valid on 'valid' subset | loss 5.584 | nll_loss 3.833 | ppl 14.25 | num_updates 284000 | best_loss 5.51535 | length_loss 5.65133
| epoch 255 | valid on 'valid' subset | loss 5.544 | nll_loss 3.787 | ppl 13.80 | num_updates 284723 | best_loss 5.51535 | length_loss 5.46523
| epoch 256 | valid on 'valid' subset | loss 5.545 | nll_loss 3.796 | ppl 13.89 | num_updates 285839 | best_loss 5.51535 | length_loss 5.35542
| epoch 257 | valid on 'valid' subset | loss 5.540 | nll_loss 3.787 | ppl 13.81 | num_updates 286000 | best_loss 5.51535 | length_loss 5.39352
| epoch 257 | valid on 'valid' subset | loss 5.541 | nll_loss 3.777 | ppl 13.71 | num_updates 286956 | best_loss 5.51535 | length_loss 5.56823
| epoch 258 | valid on 'valid' subset | loss 5.562 | nll_loss 3.802 | ppl 13.94 | num_updates 288000 | best_loss 5.51535 | length_loss 5.55644
| epoch 258 | valid on 'valid' subset | loss 5.564 | nll_loss 3.811 | ppl 14.03 | num_updates 288072 | best_loss 5.51535 | length_loss 5.36939
| epoch 259 | valid on 'valid' subset | loss 5.555 | nll_loss 3.798 | ppl 13.91 | num_updates 289189 | best_loss 5.51535 | length_loss 5.56483
| epoch 260 | valid on 'valid' subset | loss 5.568 | nll_loss 3.806 | ppl 13.99 | num_updates 290000 | best_loss 5.51535 | length_loss 5.45672
| epoch 260 | valid on 'valid' subset | loss 5.557 | nll_loss 3.802 | ppl 13.95 | num_updates 290306 | best_loss 5.51535 | length_loss 5.32805
| epoch 261 | valid on 'valid' subset | loss 5.557 | nll_loss 3.807 | ppl 13.99 | num_updates 291422 | best_loss 5.51535 | length_loss 5.14117
| epoch 262 | valid on 'valid' subset | loss 5.557 | nll_loss 3.801 | ppl 13.94 | num_updates 292000 | best_loss 5.51535 | length_loss 5.38264
| epoch 262 | valid on 'valid' subset | loss 5.543 | nll_loss 3.790 | ppl 13.84 | num_updates 292539 | best_loss 5.51535 | length_loss 5.41206
| epoch 263 | valid on 'valid' subset | loss 5.547 | nll_loss 3.783 | ppl 13.76 | num_updates 293656 | best_loss 5.51535 | length_loss 5.43538
| epoch 264 | valid on 'valid' subset | loss 5.543 | nll_loss 3.778 | ppl 13.72 | num_updates 294000 | best_loss 5.51535 | length_loss 5.71871
| epoch 264 | valid on 'valid' subset | loss 5.555 | nll_loss 3.799 | ppl 13.92 | num_updates 294772 | best_loss 5.51535 | length_loss 5.51499
| epoch 265 | valid on 'valid' subset | loss 5.550 | nll_loss 3.791 | ppl 13.84 | num_updates 295889 | best_loss 5.51535 | length_loss 5.4894
| epoch 266 | valid on 'valid' subset | loss 5.568 | nll_loss 3.813 | ppl 14.05 | num_updates 296000 | best_loss 5.51535 | length_loss 5.53522
| epoch 266 | valid on 'valid' subset | loss 5.584 | nll_loss 3.836 | ppl 14.28 | num_updates 297005 | best_loss 5.51535 | length_loss 5.15856
| epoch 267 | valid on 'valid' subset | loss 5.560 | nll_loss 3.805 | ppl 13.98 | num_updates 298000 | best_loss 5.51535 | length_loss 5.30485
| epoch 267 | valid on 'valid' subset | loss 5.566 | nll_loss 3.810 | ppl 14.03 | num_updates 298122 | best_loss 5.51535 | length_loss 5.3212
| epoch 268 | valid on 'valid' subset | loss 5.578 | nll_loss 3.828 | ppl 14.20 | num_updates 299239 | best_loss 5.51535 | length_loss 5.62171
| epoch 269 | valid on 'valid' subset | loss 5.545 | nll_loss 3.788 | ppl 13.82 | num_updates 300000 | best_loss 5.51535 | length_loss 5.35381
| epoch 269 | valid on 'valid' subset | loss 5.562 | nll_loss 3.807 | ppl 13.99 | num_updates 300000 | best_loss 5.51535 | length_loss 5.35381

I then simply averaged the last 5 checkpoints and got ~27.3 BLEU.
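For anyone unfamiliar with checkpoint averaging, here is a minimal sketch of the arithmetic: an element-wise mean of each parameter over the selected checkpoints. It is not the script used for this run. The name `average_checkpoints` and the toy parameter `w` are hypothetical, and plain Python lists stand in for the tensors a real checkpoint would hold.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state: parameter name -> flat list of values.
State = Dict[str, List[float]]


def average_checkpoints(states: List[State]) -> State:
    """Return the element-wise mean of every parameter across checkpoints.

    Assumes each parameter has the same length in every checkpoint.
    """
    if not states:
        raise ValueError("need at least one checkpoint")
    keys = states[0].keys()
    for state in states[1:]:
        if state.keys() != keys:
            raise ValueError("checkpoints have different parameter names")
    n = len(states)
    averaged: State = {}
    for name in keys:
        # Group the i-th value of this parameter from every checkpoint together.
        columns = zip(*(state[name] for state in states))
        averaged[name] = [sum(column) / n for column in columns]
    return averaged


# Toy example: five "checkpoints" with a single two-element parameter.
toy_states = [{"w": [float(i), 2.0 * i]} for i in range(1, 6)]
avg = average_checkpoints(toy_states)
```

With real checkpoints the same mean would be taken over the tensors in each saved model state, and the averaged state would then be used for decoding.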

Thanks again for your helpful suggestions for reproducing the results. I will close this issue : )

jungokasai commented 4 years ago

Great! The result seems very reasonable. Thank you for the update.