alphadl closed this issue 4 years ago
The perplexity (PPL) seems to fluctuate cyclically and is very unstable. @jungokasai
The divergence of the validation loss after 100+ epochs does look strange. The only differences I can see from my setting are:
1. --max-tokens 16000 vs. --max-tokens 8192
2. --distributed-world-size 8 vs. --distributed-world-size 16
3. --update-freq 4 vs. --update-freq 1
I would guess that 3 has the biggest impact. Could you try setting --update-freq 1 instead? For reference, the following is the exact command I used to produce the results.
python train.py <PATH_TO_DATA> --arch disco_transformer --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 --lr 5e-4 --warmup-init-lr 1e-7 \
--min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 10000 --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 \
--task translation_self --max-tokens 8192 --weight-decay 0.01 --dropout 0.2 \
--encoder-layers 6 --encoder-embed-dim 512 --decoder-layers 6 --decoder-embed-dim 512 \
--fp16 --max-source-positions 10000 --max-target-positions 10000 --max-update 300000 --seed 1 \
--save-dir <SAVE_DIR> --dynamic-masking --ignore-eos-loss \
--share-all-embeddings \
--distributed-world-size 16 --distributed-port 54100
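To see why --update-freq could matter, it helps to compare how many tokens can contribute to a single optimizer step under each setting. The sketch below assumes that --max-tokens is a per-GPU cap and that gradients are accumulated over --update-freq batches on every GPU, so the product is only an upper bound; the function name is made up for illustration.

```python
def max_tokens_per_update(max_tokens: int, world_size: int, update_freq: int) -> int:
    """Upper bound on the tokens that feed one optimizer step.

    Assumes max_tokens is a per-GPU cap and that each GPU accumulates
    gradients over update_freq batches before the parameters are updated.
    """
    return max_tokens * world_size * update_freq


# Setting that diverged: --max-tokens 16000, 8 GPUs, --update-freq 4
diverging = max_tokens_per_update(16000, 8, 4)
# Reference setting: --max-tokens 8192, 16 GPUs, --update-freq 1
reference = max_tokens_per_update(8192, 16, 1)
# Suggested change: keep 16000 tokens and 8 GPUs, set --update-freq 1
suggested = max_tokens_per_update(16000, 8, 1)

print(diverging, reference, suggested)  # 512000 131072 128000
print(diverging / reference)            # 3.90625
```

Under these assumptions the diverging run uses roughly four times as many tokens per update as the reference run at the same learning rate, while --update-freq 1 brings the two within about 2% of each other.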
Alternatively, the optimization hyperparameters used for CMLMs may be less robust to different configurations. Training worked fine with their exact setting, which I simply followed for DisCo as well, but several people have found that it doesn't work well with the transformer large configuration. If the problem persists, could you try something like this?
python train.py <PATH_TO_DATA> --arch disco_transformer --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 --lr 5e-4 --warmup-init-lr 1e-7 \
--min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 4000 --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 \
--task translation_self --max-tokens <Your_Batch_Size> --weight-decay 0.01 --dropout 0.2 \
--encoder-layers 6 --encoder-embed-dim 512 --decoder-layers 6 --decoder-embed-dim 512 \
--fp16 --max-source-positions 10000 --max-target-positions 10000 --max-update 300000 --seed 1 \
--save-dir <SAVE_DIR> --dynamic-masking --ignore-eos-loss \
--share-all-embeddings \
--distributed-world-size <# GPUS you are using> --distributed-port 54100
I hope this helps!
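The second command differs from the first only in --warmup-updates (4000 instead of 10000) and the second Adam beta (0.98 instead of 0.999). As a rough illustration of what the warmup change does, here is a minimal sketch of an inverse square root schedule with linear warmup, following the usual definition of that schedule; the function is written for this example and is not taken from the training code, whose details may differ.

```python
def inverse_sqrt_lr(num_updates: int,
                    lr: float = 5e-4,
                    warmup_init_lr: float = 1e-7,
                    warmup_updates: int = 10000) -> float:
    """Learning rate after num_updates optimizer steps.

    Linear warmup from warmup_init_lr to lr over warmup_updates steps,
    then decay proportional to 1 / sqrt(num_updates).
    """
    if num_updates < warmup_updates:
        step = (lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + num_updates * step
    return lr * (warmup_updates / num_updates) ** 0.5


# Compare the two warmup settings at a few points in training.
for n in (1000, 4000, 10000, 40000, 100000):
    long_warmup = inverse_sqrt_lr(n, warmup_updates=10000)
    short_warmup = inverse_sqrt_lr(n, warmup_updates=4000)
    print(f"{n:>6}  warmup=10000: {long_warmup:.2e}  warmup=4000: {short_warmup:.2e}")
```

With the shorter warmup the peak learning rate is reached earlier and the rate is lower at every later step, which is one plausible reason the second configuration could behave more conservatively over a long run.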
Thanks for your kind suggestions! I am trying it and will report the results later. By the way, that issue was opened by me as well 😆
I merely changed --update-freq 4 to --update-freq 1 and got a reasonable loss curve:
Namespace(adam_betas='(0.9, 0.999)', adam_eps=1e-06, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='disco_transformer', at_only=False, at_rm=False, attention_dropout=0.0, best_checkpoint_metric='loss', bilm_add_bos=False, bilm_attention_dropout=0.0, bilm_mask_last_state=False, bilm_model_dropout=0.1, bilm_relu_dropout=0.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_length_cross_entropy', curriculum=0, data=['wmt16.en-de.disco.dist'], dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_embed_scale=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:19951', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, dropout=0.2, dynamic_length=False, dynamic_masking=True, embedding_only=False, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_embed_scale=None, encoder_ffn_embed_dim=2048, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, find_unused_parameters=False, fix_batches_to_gpus=False, fp16=True, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, full_masking=False, ignore_eos_loss=True, keep_interval_updates=-1, keep_last_epochs=20, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format='simple', log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_range=False, maskp=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=10000, max_target_positions=10000, max_tokens=16000, max_tokens_valid=16000, max_update=300000, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, mix_masking=False, 
no_dec_token_positional_embeddings=False, no_enc_token_positional_embeddings=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=True, no_save=False, no_save_optimizer_state=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', perm_only=False, raw_text=False, relu_dropout=0.0, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/wmt16.en-de.nat_authordata_v1', save_interval=1, save_interval_updates=2000, seed=1, self_target=False, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, share_layers=False, skip_eos=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation_self', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, use_bmuf=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.01)
| $path/wmt16.en-de.disco.dist/ valid 3000 examples
| epoch 001 | valid on 'valid' subset | loss 11.538 | nll_loss 10.732 | ppl 1700.59 | num_updates 1112 | length_loss 8.07768
| epoch 002 | valid on 'valid' subset | loss 10.608 | nll_loss 9.655 | ppl 806.45 | num_updates 2000 | best_loss 10.6081 | length_loss 6.71878
| epoch 002 | valid on 'valid' subset | loss 10.329 | nll_loss 9.306 | ppl 632.98 | num_updates 2228 | best_loss 10.3293 | length_loss 6.59959
| epoch 003 | valid on 'valid' subset | loss 9.444 | nll_loss 8.200 | ppl 294.10 | num_updates 3344 | best_loss 9.44363 | length_loss 5.74804
| epoch 004 | valid on 'valid' subset | loss 8.778 | nll_loss 7.399 | ppl 168.77 | num_updates 4000 | best_loss 8.77766 | length_loss 5.54225
| epoch 004 | valid on 'valid' subset | loss 8.404 | nll_loss 6.919 | ppl 120.99 | num_updates 4461 | best_loss 8.40365 | length_loss 5.7412
| epoch 005 | valid on 'valid' subset | loss 7.576 | nll_loss 5.950 | ppl 61.84 | num_updates 5577 | best_loss 7.57616 | length_loss 5.39451
| epoch 006 | valid on 'valid' subset | loss 7.394 | nll_loss 5.753 | ppl 53.91 | num_updates 6000 | best_loss 7.39413 | length_loss 5.2428
| epoch 006 | valid on 'valid' subset | loss 7.149 | nll_loss 5.480 | ppl 44.64 | num_updates 6694 | best_loss 7.14901 | length_loss 5.12927
| epoch 007 | valid on 'valid' subset | loss 6.837 | nll_loss 5.118 | ppl 34.72 | num_updates 7811 | best_loss 6.8366 | length_loss 5.45324
| epoch 008 | valid on 'valid' subset | loss 6.780 | nll_loss 5.074 | ppl 33.68 | num_updates 8000 | best_loss 6.77959 | length_loss 5.10814
| epoch 008 | valid on 'valid' subset | loss 6.580 | nll_loss 4.844 | ppl 28.71 | num_updates 8926 | best_loss 6.57975 | length_loss 5.67847
| epoch 009 | valid on 'valid' subset | loss 6.451 | nll_loss 4.712 | ppl 26.22 | num_updates 10000 | best_loss 6.45135 | length_loss 5.04276
| epoch 009 | valid on 'valid' subset | loss 6.432 | nll_loss 4.681 | ppl 25.65 | num_updates 10043 | best_loss 6.43218 | length_loss 5.35626
| epoch 010 | valid on 'valid' subset | loss 6.283 | nll_loss 4.521 | ppl 22.95 | num_updates 11160 | best_loss 6.28313 | length_loss 5.3799
| epoch 011 | valid on 'valid' subset | loss 6.195 | nll_loss 4.458 | ppl 21.97 | num_updates 12000 | best_loss 6.19524 | length_loss 4.68387
| epoch 011 | valid on 'valid' subset | loss 6.160 | nll_loss 4.421 | ppl 21.42 | num_updates 12276 | best_loss 6.15985 | length_loss 4.99431
| epoch 012 | valid on 'valid' subset | loss 6.094 | nll_loss 4.353 | ppl 20.44 | num_updates 13390 | best_loss 6.09379 | length_loss 4.97237
| epoch 013 | valid on 'valid' subset | loss 6.081 | nll_loss 4.332 | ppl 20.14 | num_updates 14000 | best_loss 6.08146 | length_loss 4.8785
| epoch 013 | valid on 'valid' subset | loss 6.068 | nll_loss 4.328 | ppl 20.09 | num_updates 14507 | best_loss 6.06823 | length_loss 5.03105
| epoch 014 | valid on 'valid' subset | loss 5.967 | nll_loss 4.215 | ppl 18.57 | num_updates 15624 | best_loss 5.96733 | length_loss 4.86164
| epoch 015 | valid on 'valid' subset | loss 5.969 | nll_loss 4.221 | ppl 18.65 | num_updates 16000 | best_loss 5.96733 | length_loss 5.09694
| epoch 015 | valid on 'valid' subset | loss 5.918 | nll_loss 4.161 | ppl 17.89 | num_updates 16740 | best_loss 5.91825 | length_loss 4.96521
| epoch 016 | valid on 'valid' subset | loss 5.903 | nll_loss 4.164 | ppl 17.93 | num_updates 17857 | best_loss 5.90326 | length_loss 4.70368
| epoch 017 | valid on 'valid' subset | loss 5.920 | nll_loss 4.176 | ppl 18.08 | num_updates 18000 | best_loss 5.90326 | length_loss 5.12186
| epoch 017 | valid on 'valid' subset | loss 5.877 | nll_loss 4.125 | ppl 17.45 | num_updates 18974 | best_loss 5.87721 | length_loss 4.87007
| epoch 018 | valid on 'valid' subset | loss 5.850 | nll_loss 4.101 | ppl 17.16 | num_updates 20000 | best_loss 5.8498 | length_loss 5.18816
| epoch 018 | valid on 'valid' subset | loss 5.878 | nll_loss 4.127 | ppl 17.47 | num_updates 20091 | best_loss 5.8498 | length_loss 5.15794
| epoch 019 | valid on 'valid' subset | loss 5.836 | nll_loss 4.087 | ppl 16.99 | num_updates 21208 | best_loss 5.83566 | length_loss 4.97006
| epoch 020 | valid on 'valid' subset | loss 5.799 | nll_loss 4.040 | ppl 16.45 | num_updates 22000 | best_loss 5.79851 | length_loss 4.93875
| epoch 020 | valid on 'valid' subset | loss 5.829 | nll_loss 4.082 | ppl 16.93 | num_updates 22325 | best_loss 5.79851 | length_loss 4.89598
| epoch 021 | valid on 'valid' subset | loss 5.807 | nll_loss 4.035 | ppl 16.39 | num_updates 23441 | best_loss 5.79851 | length_loss 5.26515
| epoch 022 | valid on 'valid' subset | loss 5.761 | nll_loss 4.002 | ppl 16.02 | num_updates 24000 | best_loss 5.76138 | length_loss 5.15905
| epoch 022 | valid on 'valid' subset | loss 5.770 | nll_loss 4.020 | ppl 16.22 | num_updates 24558 | best_loss 5.76138 | length_loss 4.63934
| epoch 023 | valid on 'valid' subset | loss 5.775 | nll_loss 4.031 | ppl 16.35 | num_updates 25675 | best_loss 5.76138 | length_loss 4.96205
| epoch 024 | valid on 'valid' subset | loss 5.756 | nll_loss 3.999 | ppl 15.99 | num_updates 26000 | best_loss 5.75582 | length_loss 4.90414
| epoch 024 | valid on 'valid' subset | loss 5.788 | nll_loss 4.037 | ppl 16.41 | num_updates 26792 | best_loss 5.75582 | length_loss 5.04501
| epoch 025 | valid on 'valid' subset | loss 5.757 | nll_loss 3.996 | ppl 15.96 | num_updates 27906 | best_loss 5.75582 | length_loss 5.22324
| epoch 026 | valid on 'valid' subset | loss 5.804 | nll_loss 4.062 | ppl 16.70 | num_updates 28000 | best_loss 5.75582 | length_loss 4.77015
| epoch 026 | valid on 'valid' subset | loss 5.778 | nll_loss 4.010 | ppl 16.12 | num_updates 29023 | best_loss 5.75582 | length_loss 5.45128
| epoch 027 | valid on 'valid' subset | loss 5.738 | nll_loss 3.987 | ppl 15.86 | num_updates 30000 | best_loss 5.73791 | length_loss 4.79396
| epoch 027 | valid on 'valid' subset | loss 5.752 | nll_loss 3.995 | ppl 15.94 | num_updates 30140 | best_loss 5.73791 | length_loss 5.28362
| epoch 028 | valid on 'valid' subset | loss 5.731 | nll_loss 3.987 | ppl 15.85 | num_updates 31257 | best_loss 5.73146 | length_loss 4.81235
| epoch 029 | valid on 'valid' subset | loss 5.718 | nll_loss 3.956 | ppl 15.52 | num_updates 32000 | best_loss 5.71833 | length_loss 5.37431
| epoch 029 | valid on 'valid' subset | loss 5.724 | nll_loss 3.963 | ppl 15.59 | num_updates 32374 | best_loss 5.71833 | length_loss 5.08599
| epoch 030 | valid on 'valid' subset | loss 5.720 | nll_loss 3.967 | ppl 15.64 | num_updates 33491 | best_loss 5.71833 | length_loss 4.86158
| epoch 031 | valid on 'valid' subset | loss 5.727 | nll_loss 3.981 | ppl 15.79 | num_updates 34000 | best_loss 5.71833 | length_loss 4.80137
| epoch 031 | valid on 'valid' subset | loss 5.686 | nll_loss 3.926 | ppl 15.20 | num_updates 34607 | best_loss 5.68584 | length_loss 4.95641
| epoch 032 | valid on 'valid' subset | loss 5.727 | nll_loss 3.973 | ppl 15.71 | num_updates 35724 | best_loss 5.68584 | length_loss 4.9809
| epoch 033 | valid on 'valid' subset | loss 5.720 | nll_loss 3.966 | ppl 15.63 | num_updates 36000 | best_loss 5.68584 | length_loss 4.79346
| epoch 033 | valid on 'valid' subset | loss 5.714 | nll_loss 3.974 | ppl 15.72 | num_updates 36840 | best_loss 5.68584 | length_loss 4.6497
| epoch 034 | valid on 'valid' subset | loss 5.685 | nll_loss 3.916 | ppl 15.10 | num_updates 37957 | best_loss 5.6846 | length_loss 5.19303
| epoch 035 | valid on 'valid' subset | loss 5.686 | nll_loss 3.931 | ppl 15.25 | num_updates 38000 | best_loss 5.6846 | length_loss 5.17674
| epoch 035 | valid on 'valid' subset | loss 5.698 | nll_loss 3.937 | ppl 15.32 | num_updates 39073 | best_loss 5.6846 | length_loss 5.2187
| epoch 036 | valid on 'valid' subset | loss 5.682 | nll_loss 3.922 | ppl 15.16 | num_updates 40000 | best_loss 5.68205 | length_loss 5.09201
| epoch 036 | valid on 'valid' subset | loss 5.685 | nll_loss 3.937 | ppl 15.32 | num_updates 40190 | best_loss 5.68205 | length_loss 4.92951
| epoch 037 | valid on 'valid' subset | loss 5.686 | nll_loss 3.933 | ppl 15.27 | num_updates 41306 | best_loss 5.68205 | length_loss 5.09471
| epoch 038 | valid on 'valid' subset | loss 5.680 | nll_loss 3.925 | ppl 15.19 | num_updates 42000 | best_loss 5.68032 | length_loss 4.81772
| epoch 038 | valid on 'valid' subset | loss 5.649 | nll_loss 3.898 | ppl 14.91 | num_updates 42422 | best_loss 5.64865 | length_loss 4.99904
| epoch 039 | valid on 'valid' subset | loss 5.654 | nll_loss 3.891 | ppl 14.84 | num_updates 43539 | best_loss 5.64865 | length_loss 5.22766
| epoch 040 | valid on 'valid' subset | loss 5.681 | nll_loss 3.936 | ppl 15.31 | num_updates 44000 | best_loss 5.64865 | length_loss 4.73627
| epoch 040 | valid on 'valid' subset | loss 5.672 | nll_loss 3.919 | ppl 15.12 | num_updates 44656 | best_loss 5.64865 | length_loss 5.06906
| epoch 041 | valid on 'valid' subset | loss 5.653 | nll_loss 3.906 | ppl 14.99 | num_updates 45773 | best_loss 5.64865 | length_loss 4.81397
| epoch 042 | valid on 'valid' subset | loss 5.681 | nll_loss 3.926 | ppl 15.20 | num_updates 46000 | best_loss 5.64865 | length_loss 4.96613
| epoch 042 | valid on 'valid' subset | loss 5.671 | nll_loss 3.922 | ppl 15.16 | num_updates 46888 | best_loss 5.64865 | length_loss 4.78331
| epoch 043 | valid on 'valid' subset | loss 5.671 | nll_loss 3.905 | ppl 14.98 | num_updates 48000 | best_loss 5.64865 | length_loss 4.99218
| epoch 043 | valid on 'valid' subset | loss 5.690 | nll_loss 3.935 | ppl 15.30 | num_updates 48005 | best_loss 5.64865 | length_loss 4.85481
| epoch 044 | valid on 'valid' subset | loss 5.665 | nll_loss 3.905 | ppl 14.98 | num_updates 49121 | best_loss 5.64865 | length_loss 5.11471
| epoch 045 | valid on 'valid' subset | loss 5.668 | nll_loss 3.907 | ppl 15.00 | num_updates 50000 | best_loss 5.64865 | length_loss 5.12693
| epoch 045 | valid on 'valid' subset | loss 5.660 | nll_loss 3.905 | ppl 14.98 | num_updates 50238 | best_loss 5.64865 | length_loss 5.36505
| epoch 046 | valid on 'valid' subset | loss 5.659 | nll_loss 3.912 | ppl 15.06 | num_updates 51355 | best_loss 5.64865 | length_loss 4.92082
| epoch 047 | valid on 'valid' subset | loss 5.642 | nll_loss 3.883 | ppl 14.76 | num_updates 52000 | best_loss 5.64205 | length_loss 4.84096
| epoch 047 | valid on 'valid' subset | loss 5.634 | nll_loss 3.884 | ppl 14.76 | num_updates 52472 | best_loss 5.63406 | length_loss 5.12886
| epoch 048 | valid on 'valid' subset | loss 5.657 | nll_loss 3.910 | ppl 15.03 | num_updates 53589 | best_loss 5.63406 | length_loss 5.36747
| epoch 049 | valid on 'valid' subset | loss 5.656 | nll_loss 3.892 | ppl 14.85 | num_updates 54000 | best_loss 5.63406 | length_loss 5.42012
| epoch 049 | valid on 'valid' subset | loss 5.625 | nll_loss 3.859 | ppl 14.51 | num_updates 54705 | best_loss 5.62457 | length_loss 5.29776
| epoch 050 | valid on 'valid' subset | loss 5.681 | nll_loss 3.922 | ppl 15.16 | num_updates 55822 | best_loss 5.62457 | length_loss 5.3185
| epoch 051 | valid on 'valid' subset | loss 5.656 | nll_loss 3.894 | ppl 14.87 | num_updates 56000 | best_loss 5.62457 | length_loss 5.01675
| epoch 051 | valid on 'valid' subset | loss 5.620 | nll_loss 3.880 | ppl 14.72 | num_updates 56938 | best_loss 5.61993 | length_loss 4.89193
| epoch 052 | valid on 'valid' subset | loss 5.650 | nll_loss 3.886 | ppl 14.79 | num_updates 58000 | best_loss 5.61993 | length_loss 5.07345
| epoch 052 | valid on 'valid' subset | loss 5.642 | nll_loss 3.878 | ppl 14.71 | num_updates 58055 | best_loss 5.61993 | length_loss 5.29816
| epoch 053 | valid on 'valid' subset | loss 5.613 | nll_loss 3.861 | ppl 14.53 | num_updates 59171 | best_loss 5.61335 | length_loss 4.83874
| epoch 054 | valid on 'valid' subset | loss 5.618 | nll_loss 3.862 | ppl 14.54 | num_updates 60000 | best_loss 5.61335 | length_loss 5.05428
| epoch 054 | valid on 'valid' subset | loss 5.606 | nll_loss 3.839 | ppl 14.31 | num_updates 60288 | best_loss 5.60614 | length_loss 4.96276
| epoch 055 | valid on 'valid' subset | loss 5.606 | nll_loss 3.846 | ppl 14.38 | num_updates 61404 | best_loss 5.60614 | length_loss 5.17324
| epoch 056 | valid on 'valid' subset | loss 5.618 | nll_loss 3.863 | ppl 14.55 | num_updates 62000 | best_loss 5.60614 | length_loss 5.49125
| epoch 056 | valid on 'valid' subset | loss 5.632 | nll_loss 3.878 | ppl 14.71 | num_updates 62521 | best_loss 5.60614 | length_loss 4.88908
| epoch 057 | valid on 'valid' subset | loss 5.596 | nll_loss 3.839 | ppl 14.31 | num_updates 63638 | best_loss 5.59556 | length_loss 5.28005
| epoch 058 | valid on 'valid' subset | loss 5.658 | nll_loss 3.912 | ppl 15.05 | num_updates 64000 | best_loss 5.59556 | length_loss 4.7559
| epoch 058 | valid on 'valid' subset | loss 5.640 | nll_loss 3.876 | ppl 14.69 | num_updates 64754 | best_loss 5.59556 | length_loss 5.24229
| epoch 059 | valid on 'valid' subset | loss 5.608 | nll_loss 3.849 | ppl 14.41 | num_updates 65871 | best_loss 5.59556 | length_loss 5.104
| epoch 060 | valid on 'valid' subset | loss 5.623 | nll_loss 3.870 | ppl 14.62 | num_updates 66000 | best_loss 5.59556 | length_loss 5.01172
| epoch 060 | valid on 'valid' subset | loss 5.614 | nll_loss 3.853 | ppl 14.45 | num_updates 66987 | best_loss 5.59556 | length_loss 5.13998
| epoch 061 | valid on 'valid' subset | loss 5.615 | nll_loss 3.847 | ppl 14.39 | num_updates 68000 | best_loss 5.59556 | length_loss 5.66169
| epoch 061 | valid on 'valid' subset | loss 5.624 | nll_loss 3.873 | ppl 14.65 | num_updates 68104 | best_loss 5.59556 | length_loss 4.83914
| epoch 062 | valid on 'valid' subset | loss 5.639 | nll_loss 3.870 | ppl 14.62 | num_updates 69220 | best_loss 5.59556 | length_loss 5.43168
| epoch 063 | valid on 'valid' subset | loss 5.634 | nll_loss 3.871 | ppl 14.63 | num_updates 70000 | best_loss 5.59556 | length_loss 5.35281
| epoch 063 | valid on 'valid' subset | loss 5.599 | nll_loss 3.844 | ppl 14.36 | num_updates 70337 | best_loss 5.59556 | length_loss 5.57717
| epoch 064 | valid on 'valid' subset | loss 5.587 | nll_loss 3.828 | ppl 14.20 | num_updates 71454 | best_loss 5.58679 | length_loss 5.23344
| epoch 065 | valid on 'valid' subset | loss 5.638 | nll_loss 3.892 | ppl 14.85 | num_updates 72000 | best_loss 5.58679 | length_loss 4.88949
| epoch 065 | valid on 'valid' subset | loss 5.616 | nll_loss 3.861 | ppl 14.53 | num_updates 72571 | best_loss 5.58679 | length_loss 4.98517
| epoch 066 | valid on 'valid' subset | loss 5.611 | nll_loss 3.859 | ppl 14.51 | num_updates 73687 | best_loss 5.58679 | length_loss 5.14025
| epoch 067 | valid on 'valid' subset | loss 5.602 | nll_loss 3.846 | ppl 14.38 | num_updates 74000 | best_loss 5.58679 | length_loss 5.30373
| epoch 067 | valid on 'valid' subset | loss 5.621 | nll_loss 3.871 | ppl 14.63 | num_updates 74803 | best_loss 5.58679 | length_loss 5.21696
| epoch 068 | valid on 'valid' subset | loss 5.628 | nll_loss 3.871 | ppl 14.63 | num_updates 75920 | best_loss 5.58679 | length_loss 5.32986
| epoch 069 | valid on 'valid' subset | loss 5.595 | nll_loss 3.837 | ppl 14.29 | num_updates 76000 | best_loss 5.58679 | length_loss 5.02404
| epoch 069 | valid on 'valid' subset | loss 5.639 | nll_loss 3.878 | ppl 14.71 | num_updates 77037 | best_loss 5.58679 | length_loss 5.30288
| epoch 070 | valid on 'valid' subset | loss 5.614 | nll_loss 3.860 | ppl 14.52 | num_updates 78000 | best_loss 5.58679 | length_loss 5.00955
| epoch 070 | valid on 'valid' subset | loss 5.590 | nll_loss 3.831 | ppl 14.23 | num_updates 78153 | best_loss 5.58679 | length_loss 5.18245
| epoch 071 | valid on 'valid' subset | loss 5.627 | nll_loss 3.862 | ppl 14.54 | num_updates 79270 | best_loss 5.58679 | length_loss 5.14994
| epoch 072 | valid on 'valid' subset | loss 5.602 | nll_loss 3.847 | ppl 14.39 | num_updates 80000 | best_loss 5.58679 | length_loss 5.17858
| epoch 072 | valid on 'valid' subset | loss 5.592 | nll_loss 3.824 | ppl 14.16 | num_updates 80387 | best_loss 5.58679 | length_loss 5.48149
| epoch 073 | valid on 'valid' subset | loss 5.590 | nll_loss 3.828 | ppl 14.21 | num_updates 81503 | best_loss 5.58679 | length_loss 5.49635
| epoch 074 | valid on 'valid' subset | loss 5.596 | nll_loss 3.826 | ppl 14.18 | num_updates 82000 | best_loss 5.58679 | length_loss 5.73306
| epoch 074 | valid on 'valid' subset | loss 5.604 | nll_loss 3.846 | ppl 14.38 | num_updates 82620 | best_loss 5.58679 | length_loss 4.95396
| epoch 075 | valid on 'valid' subset | loss 5.590 | nll_loss 3.827 | ppl 14.19 | num_updates 83737 | best_loss 5.58679 | length_loss 5.23502
| epoch 076 | valid on 'valid' subset | loss 5.601 | nll_loss 3.835 | ppl 14.27 | num_updates 84000 | best_loss 5.58679 | length_loss 5.25717
| epoch 076 | valid on 'valid' subset | loss 5.600 | nll_loss 3.839 | ppl 14.31 | num_updates 84854 | best_loss 5.58679 | length_loss 5.06934
| epoch 077 | valid on 'valid' subset | loss 5.597 | nll_loss 3.832 | ppl 14.24 | num_updates 85970 | best_loss 5.58679 | length_loss 5.49544
| epoch 078 | valid on 'valid' subset | loss 5.631 | nll_loss 3.867 | ppl 14.59 | num_updates 86000 | best_loss 5.58679 | length_loss 5.52312
| epoch 078 | valid on 'valid' subset | loss 5.596 | nll_loss 3.826 | ppl 14.18 | num_updates 87086 | best_loss 5.58679 | length_loss 5.32114
| epoch 079 | valid on 'valid' subset | loss 5.586 | nll_loss 3.828 | ppl 14.20 | num_updates 88000 | best_loss 5.58557 | length_loss 5.28118
| epoch 079 | valid on 'valid' subset | loss 5.587 | nll_loss 3.825 | ppl 14.17 | num_updates 88203 | best_loss 5.58557 | length_loss 5.03314
| epoch 080 | valid on 'valid' subset | loss 5.579 | nll_loss 3.818 | ppl 14.11 | num_updates 89319 | best_loss 5.57853 | length_loss 5.49761
| epoch 081 | valid on 'valid' subset | loss 5.600 | nll_loss 3.837 | ppl 14.29 | num_updates 90000 | best_loss 5.57853 | length_loss 5.09228
| epoch 081 | valid on 'valid' subset | loss 5.623 | nll_loss 3.873 | ppl 14.65 | num_updates 90436 | best_loss 5.57853 | length_loss 5.15329
| epoch 082 | valid on 'valid' subset | loss 5.594 | nll_loss 3.835 | ppl 14.27 | num_updates 91552 | best_loss 5.57853 | length_loss 5.34417
| epoch 083 | valid on 'valid' subset | loss 5.626 | nll_loss 3.874 | ppl 14.66 | num_updates 92000 | best_loss 5.57853 | length_loss 4.9694
| epoch 083 | valid on 'valid' subset | loss 5.585 | nll_loss 3.821 | ppl 14.14 | num_updates 92669 | best_loss 5.57853 | length_loss 5.36829
| epoch 084 | valid on 'valid' subset | loss 5.610 | nll_loss 3.866 | ppl 14.58 | num_updates 93786 | best_loss 5.57853 | length_loss 4.95562
| epoch 085 | valid on 'valid' subset | loss 5.616 | nll_loss 3.864 | ppl 14.56 | num_updates 94000 | best_loss 5.57853 | length_loss 5.2255
| epoch 085 | valid on 'valid' subset | loss 5.580 | nll_loss 3.826 | ppl 14.18 | num_updates 94902 | best_loss 5.57853 | length_loss 5.10636
| epoch 086 | valid on 'valid' subset | loss 5.576 | nll_loss 3.814 | ppl 14.07 | num_updates 96000 | best_loss 5.57596 | length_loss 4.9704
| epoch 086 | valid on 'valid' subset | loss 5.603 | nll_loss 3.847 | ppl 14.39 | num_updates 96019 | best_loss 5.57596 | length_loss 5.2242
| epoch 087 | valid on 'valid' subset | loss 5.581 | nll_loss 3.812 | ppl 14.05 | num_updates 97136 | best_loss 5.57596 | length_loss 5.17751
| epoch 088 | valid on 'valid' subset | loss 5.609 | nll_loss 3.846 | ppl 14.38 | num_updates 98000 | best_loss 5.57596 | length_loss 5.35939
| epoch 088 | valid on 'valid' subset | loss 5.585 | nll_loss 3.818 | ppl 14.11 | num_updates 98253 | best_loss 5.57596 | length_loss 5.38032
| epoch 089 | valid on 'valid' subset | loss 5.592 | nll_loss 3.830 | ppl 14.22 | num_updates 99369 | best_loss 5.57596 | length_loss 5.25736
| epoch 090 | valid on 'valid' subset | loss 5.571 | nll_loss 3.812 | ppl 14.04 | num_updates 100000 | best_loss 5.57142 | length_loss 5.41195
| epoch 090 | valid on 'valid' subset | loss 5.576 | nll_loss 3.813 | ppl 14.05 | num_updates 100485 | best_loss 5.57142 | length_loss 5.32919
| epoch 091 | valid on 'valid' subset | loss 5.571 | nll_loss 3.800 | ppl 13.93 | num_updates 101602 | best_loss 5.5709 | length_loss 5.55755
| epoch 092 | valid on 'valid' subset | loss 5.586 | nll_loss 3.833 | ppl 14.25 | num_updates 102000 | best_loss 5.5709 | length_loss 5.11072
| epoch 092 | valid on 'valid' subset | loss 5.588 | nll_loss 3.825 | ppl 14.17 | num_updates 102719 | best_loss 5.5709 | length_loss 5.38931
| epoch 093 | valid on 'valid' subset | loss 5.593 | nll_loss 3.834 | ppl 14.26 | num_updates 103835 | best_loss 5.5709 | length_loss 5.4362
| epoch 094 | valid on 'valid' subset | loss 5.605 | nll_loss 3.844 | ppl 14.36 | num_updates 104000 | best_loss 5.5709 | length_loss 5.26631
| epoch 094 | valid on 'valid' subset | loss 5.605 | nll_loss 3.857 | ppl 14.49 | num_updates 104952 | best_loss 5.5709 | length_loss 5.1748
| epoch 095 | valid on 'valid' subset | loss 5.620 | nll_loss 3.863 | ppl 14.55 | num_updates 106000 | best_loss 5.5709 | length_loss 5.45649
| epoch 095 | valid on 'valid' subset | loss 5.588 | nll_loss 3.826 | ppl 14.18 | num_updates 106069 | best_loss 5.5709 | length_loss 5.26936
| epoch 096 | valid on 'valid' subset | loss 5.571 | nll_loss 3.809 | ppl 14.02 | num_updates 107186 | best_loss 5.5709 | length_loss 5.28208
| epoch 097 | valid on 'valid' subset | loss 5.636 | nll_loss 3.871 | ppl 14.63 | num_updates 108000 | best_loss 5.5709 | length_loss 5.47391
| epoch 097 | valid on 'valid' subset | loss 5.573 | nll_loss 3.820 | ppl 14.12 | num_updates 108301 | best_loss 5.5709 | length_loss 5.25959
| epoch 098 | valid on 'valid' subset | loss 5.593 | nll_loss 3.838 | ppl 14.30 | num_updates 109418 | best_loss 5.5709 | length_loss 5.22102
| epoch 099 | valid on 'valid' subset | loss 5.576 | nll_loss 3.825 | ppl 14.17 | num_updates 110000 | best_loss 5.5709 | length_loss 5.34992
| epoch 099 | valid on 'valid' subset | loss 5.590 | nll_loss 3.832 | ppl 14.24 | num_updates 110534 | best_loss 5.5709 | length_loss 5.30928
| epoch 100 | valid on 'valid' subset | loss 5.563 | nll_loss 3.814 | ppl 14.06 | num_updates 111651 | best_loss 5.56259 | length_loss 4.94879
| epoch 101 | valid on 'valid' subset | loss 5.604 | nll_loss 3.848 | ppl 14.40 | num_updates 112000 | best_loss 5.56259 | length_loss 5.26776
| epoch 101 | valid on 'valid' subset | loss 5.592 | nll_loss 3.845 | ppl 14.37 | num_updates 112767 | best_loss 5.56259 | length_loss 5.17652
| epoch 102 | valid on 'valid' subset | loss 5.576 | nll_loss 3.816 | ppl 14.09 | num_updates 113884 | best_loss 5.56259 | length_loss 5.29073
| epoch 103 | valid on 'valid' subset | loss 5.615 | nll_loss 3.859 | ppl 14.51 | num_updates 114000 | best_loss 5.56259 | length_loss 5.39229
| epoch 103 | valid on 'valid' subset | loss 5.586 | nll_loss 3.840 | ppl 14.32 | num_updates 115000 | best_loss 5.56259 | length_loss 5.18116
| epoch 104 | valid on 'valid' subset | loss 5.581 | nll_loss 3.830 | ppl 14.23 | num_updates 116000 | best_loss 5.56259 | length_loss 5.18728
| epoch 104 | valid on 'valid' subset | loss 5.579 | nll_loss 3.823 | ppl 14.16 | num_updates 116117 | best_loss 5.56259 | length_loss 5.35772
| epoch 105 | valid on 'valid' subset | loss 5.612 | nll_loss 3.854 | ppl 14.46 | num_updates 117234 | best_loss 5.56259 | length_loss 5.33082
| epoch 106 | valid on 'valid' subset | loss 5.599 | nll_loss 3.839 | ppl 14.31 | num_updates 118000 | best_loss 5.56259 | length_loss 5.42866
| epoch 106 | valid on 'valid' subset | loss 5.581 | nll_loss 3.829 | ppl 14.21 | num_updates 118350 | best_loss 5.56259 | length_loss 5.2323
| epoch 107 | valid on 'valid' subset | loss 5.568 | nll_loss 3.814 | ppl 14.06 | num_updates 119467 | best_loss 5.56259 | length_loss 5.21312
| epoch 108 | valid on 'valid' subset | loss 5.566 | nll_loss 3.807 | ppl 13.99 | num_updates 120000 | best_loss 5.56259 | length_loss 5.44057
| epoch 108 | valid on 'valid' subset | loss 5.584 | nll_loss 3.809 | ppl 14.02 | num_updates 120583 | best_loss 5.56259 | length_loss 5.75644
| epoch 109 | valid on 'valid' subset | loss 5.561 | nll_loss 3.800 | ppl 13.93 | num_updates 121700 | best_loss 5.56063 | length_loss 5.1544
| epoch 110 | valid on 'valid' subset | loss 5.631 | nll_loss 3.865 | ppl 14.57 | num_updates 122000 | best_loss 5.56063 | length_loss 5.7186
| epoch 110 | valid on 'valid' subset | loss 5.561 | nll_loss 3.804 | ppl 13.97 | num_updates 122817 | best_loss 5.56063 | length_loss 5.30358
| epoch 111 | valid on 'valid' subset | loss 5.586 | nll_loss 3.824 | ppl 14.16 | num_updates 123933 | best_loss 5.56063 | length_loss 5.35073
| epoch 112 | valid on 'valid' subset | loss 5.584 | nll_loss 3.824 | ppl 14.16 | num_updates 124000 | best_loss 5.56063 | length_loss 5.15542
| epoch 112 | valid on 'valid' subset | loss 5.587 | nll_loss 3.817 | ppl 14.10 | num_updates 125050 | best_loss 5.56063 | length_loss 5.61057
| epoch 113 | valid on 'valid' subset | loss 5.583 | nll_loss 3.825 | ppl 14.18 | num_updates 126000 | best_loss 5.56063 | length_loss 5.05128
| epoch 113 | valid on 'valid' subset | loss 5.608 | nll_loss 3.859 | ppl 14.51 | num_updates 126166 | best_loss 5.56063 | length_loss 5.15769
| epoch 114 | valid on 'valid' subset | loss 5.607 | nll_loss 3.853 | ppl 14.45 | num_updates 127283 | best_loss 5.56063 | length_loss 5.09543
| epoch 115 | valid on 'valid' subset | loss 5.560 | nll_loss 3.808 | ppl 14.00 | num_updates 128000 | best_loss 5.55957 | length_loss 5.00717
| epoch 115 | valid on 'valid' subset | loss 5.609 | nll_loss 3.847 | ppl 14.39 | num_updates 128400 | best_loss 5.55957 | length_loss 5.17661
| epoch 116 | valid on 'valid' subset | loss 5.554 | nll_loss 3.800 | ppl 13.93 | num_updates 129517 | best_loss 5.55379 | length_loss 5.1794
| epoch 117 | valid on 'valid' subset | loss 5.608 | nll_loss 3.856 | ppl 14.48 | num_updates 130000 | best_loss 5.55379 | length_loss 5.50267
| epoch 117 | valid on 'valid' subset | loss 5.566 | nll_loss 3.816 | ppl 14.09 | num_updates 130633 | best_loss 5.55379 | length_loss 5.10008
| epoch 118 | valid on 'valid' subset | loss 5.597 | nll_loss 3.849 | ppl 14.41 | num_updates 131749 | best_loss 5.55379 | length_loss 5.26575
| epoch 119 | valid on 'valid' subset | loss 5.590 | nll_loss 3.827 | ppl 14.19 | num_updates 132000 | best_loss 5.55379 | length_loss 5.60259
| epoch 119 | valid on 'valid' subset | loss 5.594 | nll_loss 3.838 | ppl 14.30 | num_updates 132866 | best_loss 5.55379 | length_loss 5.10372
| epoch 120 | valid on 'valid' subset | loss 5.568 | nll_loss 3.817 | ppl 14.09 | num_updates 133983 | best_loss 5.55379 | length_loss 5.09884
| epoch 121 | valid on 'valid' subset | loss 5.575 | nll_loss 3.814 | ppl 14.07 | num_updates 134000 | best_loss 5.55379 | length_loss 5.60332
| epoch 121 | valid on 'valid' subset | loss 5.576 | nll_loss 3.818 | ppl 14.10 | num_updates 135099 | best_loss 5.55379 | length_loss 5.19808
| epoch 122 | valid on 'valid' subset | loss 5.559 | nll_loss 3.802 | ppl 13.95 | num_updates 136000 | best_loss 5.55379 | length_loss 5.27703
| epoch 122 | valid on 'valid' subset | loss 5.570 | nll_loss 3.806 | ppl 13.98 | num_updates 136216 | best_loss 5.55379 | length_loss 5.33263
| epoch 123 | valid on 'valid' subset | loss 5.581 | nll_loss 3.830 | ppl 14.22 | num_updates 137333 | best_loss 5.55379 | length_loss 5.15172
| epoch 124 | valid on 'valid' subset | loss 5.584 | nll_loss 3.829 | ppl 14.21 | num_updates 138000 | best_loss 5.55379 | length_loss 5.29541
| epoch 124 | valid on 'valid' subset | loss 5.586 | nll_loss 3.835 | ppl 14.28 | num_updates 138449 | best_loss 5.55379 | length_loss 5.17188
| epoch 125 | valid on 'valid' subset | loss 5.574 | nll_loss 3.810 | ppl 14.02 | num_updates 139565 | best_loss 5.55379 | length_loss 5.61704
| epoch 126 | valid on 'valid' subset | loss 5.571 | nll_loss 3.803 | ppl 13.96 | num_updates 140000 | best_loss 5.55379 | length_loss 5.74625
| epoch 126 | valid on 'valid' subset | loss 5.573 | nll_loss 3.815 | ppl 14.07 | num_updates 140682 | best_loss 5.55379 | length_loss 5.47755
| epoch 127 | valid on 'valid' subset | loss 5.588 | nll_loss 3.823 | ppl 14.15 | num_updates 141799 | best_loss 5.55379 | length_loss 5.45464
| epoch 128 | valid on 'valid' subset | loss 5.611 | nll_loss 3.847 | ppl 14.39 | num_updates 142000 | best_loss 5.55379 | length_loss 5.65676
| epoch 128 | valid on 'valid' subset | loss 5.589 | nll_loss 3.831 | ppl 14.23 | num_updates 142916 | best_loss 5.55379 | length_loss 5.35568
| epoch 129 | valid on 'valid' subset | loss 5.572 | nll_loss 3.817 | ppl 14.10 | num_updates 144000 | best_loss 5.55379 | length_loss 5.08733
| epoch 129 | valid on 'valid' subset | loss 5.579 | nll_loss 3.828 | ppl 14.20 | num_updates 144032 | best_loss 5.55379 | length_loss 5.20082
| epoch 130 | valid on 'valid' subset | loss 5.582 | nll_loss 3.828 | ppl 14.20 | num_updates 145149 | best_loss 5.55379 | length_loss 5.31697
| epoch 131 | valid on 'valid' subset | loss 5.574 | nll_loss 3.814 | ppl 14.07 | num_updates 146000 | best_loss 5.55379 | length_loss 5.58609
| epoch 131 | valid on 'valid' subset | loss 5.557 | nll_loss 3.799 | ppl 13.92 | num_updates 146265 | best_loss 5.55379 | length_loss 5.25248
| epoch 132 | valid on 'valid' subset | loss 5.570 | nll_loss 3.805 | ppl 13.97 | num_updates 147382 | best_loss 5.55379 | length_loss 5.62696
| epoch 133 | valid on 'valid' subset | loss 5.565 | nll_loss 3.812 | ppl 14.04 | num_updates 148000 | best_loss 5.55379 | length_loss 5.23231
| epoch 133 | valid on 'valid' subset | loss 5.543 | nll_loss 3.782 | ppl 13.76 | num_updates 148498 | best_loss 5.54307 | length_loss 5.483
| epoch 134 | valid on 'valid' subset | loss 5.581 | nll_loss 3.816 | ppl 14.08 | num_updates 149615 | best_loss 5.54307 | length_loss 5.53838
| epoch 135 | valid on 'valid' subset | loss 5.571 | nll_loss 3.822 | ppl 14.14 | num_updates 150000 | best_loss 5.54307 | length_loss 5.07699
| epoch 135 | valid on 'valid' subset | loss 5.563 | nll_loss 3.812 | ppl 14.05 | num_updates 150731 | best_loss 5.54307 | length_loss 5.20817
| epoch 136 | valid on 'valid' subset | loss 5.555 | nll_loss 3.792 | ppl 13.85 | num_updates 151848 | best_loss 5.54307 | length_loss 5.36711
| epoch 137 | valid on 'valid' subset | loss 5.586 | nll_loss 3.829 | ppl 14.22 | num_updates 152000 | best_loss 5.54307 | length_loss 5.23819
| epoch 137 | valid on 'valid' subset | loss 5.559 | nll_loss 3.793 | ppl 13.86 | num_updates 152964 | best_loss 5.54307 | length_loss 5.49783
| epoch 138 | valid on 'valid' subset | loss 5.585 | nll_loss 3.824 | ppl 14.16 | num_updates 154000 | best_loss 5.54307 | length_loss 5.2778
| epoch 138 | valid on 'valid' subset | loss 5.581 | nll_loss 3.826 | ppl 14.18 | num_updates 154081 | best_loss 5.54307 | length_loss 5.51582
| epoch 139 | valid on 'valid' subset | loss 5.541 | nll_loss 3.780 | ppl 13.74 | num_updates 155197 | best_loss 5.54051 | length_loss 5.34901
| epoch 140 | valid on 'valid' subset | loss 5.599 | nll_loss 3.841 | ppl 14.33 | num_updates 156000 | best_loss 5.54051 | length_loss 5.25924
| epoch 140 | valid on 'valid' subset | loss 5.561 | nll_loss 3.808 | ppl 14.01 | num_updates 156314 | best_loss 5.54051 | length_loss 5.46697
| epoch 141 | valid on 'valid' subset | loss 5.585 | nll_loss 3.826 | ppl 14.18 | num_updates 157431 | best_loss 5.54051 | length_loss 5.43181
| epoch 142 | valid on 'valid' subset | loss 5.570 | nll_loss 3.817 | ppl 14.09 | num_updates 158000 | best_loss 5.54051 | length_loss 5.17516
| epoch 142 | valid on 'valid' subset | loss 5.564 | nll_loss 3.792 | ppl 13.85 | num_updates 158547 | best_loss 5.54051 | length_loss 5.67563
| epoch 143 | valid on 'valid' subset | loss 5.553 | nll_loss 3.799 | ppl 13.92 | num_updates 159664 | best_loss 5.54051 | length_loss 5.26242
| epoch 144 | valid on 'valid' subset | loss 5.563 | nll_loss 3.806 | ppl 13.99 | num_updates 160000 | best_loss 5.54051 | length_loss 5.31434
| epoch 144 | valid on 'valid' subset | loss 5.571 | nll_loss 3.816 | ppl 14.08 | num_updates 160781 | best_loss 5.54051 | length_loss 5.52447
| epoch 145 | valid on 'valid' subset | loss 5.553 | nll_loss 3.794 | ppl 13.87 | num_updates 161897 | best_loss 5.54051 | length_loss 5.19175
| epoch 146 | valid on 'valid' subset | loss 5.573 | nll_loss 3.821 | ppl 14.13 | num_updates 162000 | best_loss 5.54051 | length_loss 5.47153
| epoch 146 | valid on 'valid' subset | loss 5.567 | nll_loss 3.806 | ppl 13.99 | num_updates 163014 | best_loss 5.54051 | length_loss 5.28773
| epoch 147 | valid on 'valid' subset | loss 5.560 | nll_loss 3.798 | ppl 13.91 | num_updates 164000 | best_loss 5.54051 | length_loss 5.77107
| epoch 147 | valid on 'valid' subset | loss 5.563 | nll_loss 3.803 | ppl 13.96 | num_updates 164131 | best_loss 5.54051 | length_loss 5.38077
| epoch 148 | valid on 'valid' subset | loss 5.549 | nll_loss 3.777 | ppl 13.70 | num_updates 165247 | best_loss 5.54051 | length_loss 6.14763
| epoch 149 | valid on 'valid' subset | loss 5.576 | nll_loss 3.821 | ppl 14.13 | num_updates 166000 | best_loss 5.54051 | length_loss 5.17569
| epoch 149 | valid on 'valid' subset | loss 5.549 | nll_loss 3.796 | ppl 13.89 | num_updates 166363 | best_loss 5.54051 | length_loss 5.11687
| epoch 150 | valid on 'valid' subset | loss 5.567 | nll_loss 3.813 | ppl 14.05 | num_updates 167480 | best_loss 5.54051 | length_loss 5.09933
| epoch 151 | valid on 'valid' subset | loss 5.557 | nll_loss 3.804 | ppl 13.97 | num_updates 168000 | best_loss 5.54051 | length_loss 5.26535
| epoch 151 | valid on 'valid' subset | loss 5.567 | nll_loss 3.816 | ppl 14.08 | num_updates 168597 | best_loss 5.54051 | length_loss 5.31387
| epoch 152 | valid on 'valid' subset | loss 5.568 | nll_loss 3.809 | ppl 14.02 | num_updates 169714 | best_loss 5.54051 | length_loss 5.39712
| epoch 153 | valid on 'valid' subset | loss 5.606 | nll_loss 3.845 | ppl 14.37 | num_updates 170000 | best_loss 5.54051 | length_loss 5.62737
| epoch 153 | valid on 'valid' subset | loss 5.573 | nll_loss 3.814 | ppl 14.07 | num_updates 170831 | best_loss 5.54051 | length_loss 5.19096
| epoch 154 | valid on 'valid' subset | loss 5.565 | nll_loss 3.806 | ppl 13.99 | num_updates 171947 | best_loss 5.54051 | length_loss 5.34469
| epoch 155 | valid on 'valid' subset | loss 5.598 | nll_loss 3.841 | ppl 14.33 | num_updates 172000 | best_loss 5.54051 | length_loss 5.44766
| epoch 155 | valid on 'valid' subset | loss 5.555 | nll_loss 3.795 | ppl 13.88 | num_updates 173064 | best_loss 5.54051 | length_loss 5.15003
| epoch 156 | valid on 'valid' subset | loss 5.595 | nll_loss 3.843 | ppl 14.35 | num_updates 174000 | best_loss 5.54051 | length_loss 5.46855
| epoch 156 | valid on 'valid' subset | loss 5.567 | nll_loss 3.813 | ppl 14.05 | num_updates 174181 | best_loss 5.54051 | length_loss 5.4195
| epoch 157 | valid on 'valid' subset | loss 5.568 | nll_loss 3.814 | ppl 14.07 | num_updates 175297 | best_loss 5.54051 | length_loss 5.37113
| epoch 158 | valid on 'valid' subset | loss 5.595 | nll_loss 3.844 | ppl 14.36 | num_updates 176000 | best_loss 5.54051 | length_loss 5.27404
| epoch 158 | valid on 'valid' subset | loss 5.566 | nll_loss 3.808 | ppl 14.00 | num_updates 176414 | best_loss 5.54051 | length_loss 5.30657
| epoch 159 | valid on 'valid' subset | loss 5.570 | nll_loss 3.799 | ppl 13.92 | num_updates 177530 | best_loss 5.54051 | length_loss 5.51189
| epoch 160 | valid on 'valid' subset | loss 5.567 | nll_loss 3.814 | ppl 14.06 | num_updates 178000 | best_loss 5.54051 | length_loss 5.24147
| epoch 160 | valid on 'valid' subset | loss 5.555 | nll_loss 3.797 | ppl 13.90 | num_updates 178647 | best_loss 5.54051 | length_loss 5.17881
| epoch 161 | valid on 'valid' subset | loss 5.584 | nll_loss 3.818 | ppl 14.10 | num_updates 179764 | best_loss 5.54051 | length_loss 5.69451
| epoch 162 | valid on 'valid' subset | loss 5.580 | nll_loss 3.827 | ppl 14.20 | num_updates 180000 | best_loss 5.54051 | length_loss 5.3949
| epoch 162 | valid on 'valid' subset | loss 5.568 | nll_loss 3.811 | ppl 14.03 | num_updates 180880 | best_loss 5.54051 | length_loss 5.4461
| epoch 163 | valid on 'valid' subset | loss 5.550 | nll_loss 3.789 | ppl 13.82 | num_updates 181997 | best_loss 5.54051 | length_loss 5.37923
| epoch 164 | valid on 'valid' subset | loss 5.595 | nll_loss 3.838 | ppl 14.30 | num_updates 182000 | best_loss 5.54051 | length_loss 5.31286
| epoch 164 | valid on 'valid' subset | loss 5.570 | nll_loss 3.814 | ppl 14.07 | num_updates 183113 | best_loss 5.54051 | length_loss 5.35665
| epoch 165 | valid on 'valid' subset | loss 5.561 | nll_loss 3.797 | ppl 13.90 | num_updates 184000 | best_loss 5.54051 | length_loss 5.50869
| epoch 165 | valid on 'valid' subset | loss 5.565 | nll_loss 3.806 | ppl 13.98 | num_updates 184230 | best_loss 5.54051 | length_loss 5.54872
| epoch 166 | valid on 'valid' subset | loss 5.598 | nll_loss 3.844 | ppl 14.36 | num_updates 185346 | best_loss 5.54051 | length_loss 5.27286
| epoch 167 | valid on 'valid' subset | loss 5.597 | nll_loss 3.829 | ppl 14.21 | num_updates 186000 | best_loss 5.54051 | length_loss 5.60072
| epoch 167 | valid on 'valid' subset | loss 5.577 | nll_loss 3.818 | ppl 14.10 | num_updates 186463 | best_loss 5.54051 | length_loss 5.34891
| epoch 168 | valid on 'valid' subset | loss 5.551 | nll_loss 3.783 | ppl 13.77 | num_updates 187579 | best_loss 5.54051 | length_loss 5.42081
| epoch 169 | valid on 'valid' subset | loss 5.573 | nll_loss 3.811 | ppl 14.04 | num_updates 188000 | best_loss 5.54051 | length_loss 5.46544
| epoch 169 | valid on 'valid' subset | loss 5.562 | nll_loss 3.801 | ppl 13.94 | num_updates 188696 | best_loss 5.54051 | length_loss 5.46789
| epoch 170 | valid on 'valid' subset | loss 5.564 | nll_loss 3.799 | ppl 13.92 | num_updates 189812 | best_loss 5.54051 | length_loss 5.81281
| epoch 171 | valid on 'valid' subset | loss 5.569 | nll_loss 3.818 | ppl 14.10 | num_updates 190000 | best_loss 5.54051 | length_loss 5.47333
| epoch 171 | valid on 'valid' subset | loss 5.561 | nll_loss 3.806 | ppl 13.99 | num_updates 190929 | best_loss 5.54051 | length_loss 5.39761
| epoch 172 | valid on 'valid' subset | loss 5.571 | nll_loss 3.813 | ppl 14.05 | num_updates 192000 | best_loss 5.54051 | length_loss 5.6447
| epoch 172 | valid on 'valid' subset | loss 5.556 | nll_loss 3.801 | ppl 13.94 | num_updates 192046 | best_loss 5.54051 | length_loss 5.54948
| epoch 173 | valid on 'valid' subset | loss 5.579 | nll_loss 3.825 | ppl 14.17 | num_updates 193161 | best_loss 5.54051 | length_loss 5.43502
| epoch 174 | valid on 'valid' subset | loss 5.530 | nll_loss 3.762 | ppl 13.57 | num_updates 194000 | best_loss 5.53033 | length_loss 5.45161
| epoch 174 | valid on 'valid' subset | loss 5.567 | nll_loss 3.811 | ppl 14.03 | num_updates 194278 | best_loss 5.53033 | length_loss 5.56833
| epoch 175 | valid on 'valid' subset | loss 5.557 | nll_loss 3.797 | ppl 13.90 | num_updates 195395 | best_loss 5.53033 | length_loss 5.41813
| epoch 176 | valid on 'valid' subset | loss 5.570 | nll_loss 3.794 | ppl 13.87 | num_updates 196000 | best_loss 5.53033 | length_loss 5.71809
| epoch 176 | valid on 'valid' subset | loss 5.540 | nll_loss 3.786 | ppl 13.80 | num_updates 196512 | best_loss 5.53033 | length_loss 5.37814
| epoch 177 | valid on 'valid' subset | loss 5.566 | nll_loss 3.802 | ppl 13.95 | num_updates 197629 | best_loss 5.53033 | length_loss 5.41428
| epoch 178 | valid on 'valid' subset | loss 5.572 | nll_loss 3.814 | ppl 14.06 | num_updates 198000 | best_loss 5.53033 | length_loss 5.61653
| epoch 178 | valid on 'valid' subset | loss 5.566 | nll_loss 3.804 | ppl 13.96 | num_updates 198745 | best_loss 5.53033 | length_loss 5.30565
| epoch 179 | valid on 'valid' subset | loss 5.536 | nll_loss 3.782 | ppl 13.75 | num_updates 199862 | best_loss 5.53033 | length_loss 5.24064
| epoch 180 | valid on 'valid' subset | loss 5.568 | nll_loss 3.796 | ppl 13.89 | num_updates 200000 | best_loss 5.53033 | length_loss 5.66345
| epoch 180 | valid on 'valid' subset | loss 5.551 | nll_loss 3.795 | ppl 13.88 | num_updates 200977 | best_loss 5.53033 | length_loss 5.14049
| epoch 181 | valid on 'valid' subset | loss 5.567 | nll_loss 3.805 | ppl 13.98 | num_updates 202000 | best_loss 5.53033 | length_loss 5.57159
| epoch 181 | valid on 'valid' subset | loss 5.544 | nll_loss 3.791 | ppl 13.84 | num_updates 202094 | best_loss 5.53033 | length_loss 5.06756
| epoch 182 | valid on 'valid' subset | loss 5.578 | nll_loss 3.819 | ppl 14.11 | num_updates 203211 | best_loss 5.53033 | length_loss 5.28628
| epoch 183 | valid on 'valid' subset | loss 5.546 | nll_loss 3.800 | ppl 13.93 | num_updates 204000 | best_loss 5.53033 | length_loss 5.08014
| epoch 183 | valid on 'valid' subset | loss 5.572 | nll_loss 3.814 | ppl 14.07 | num_updates 204328 | best_loss 5.53033 | length_loss 5.37596
| epoch 184 | valid on 'valid' subset | loss 5.581 | nll_loss 3.817 | ppl 14.09 | num_updates 205444 | best_loss 5.53033 | length_loss 5.59484
| epoch 185 | valid on 'valid' subset | loss 5.570 | nll_loss 3.808 | ppl 14.01 | num_updates 206000 | best_loss 5.53033 | length_loss 5.45787
| epoch 185 | valid on 'valid' subset | loss 5.562 | nll_loss 3.810 | ppl 14.03 | num_updates 206560 | best_loss 5.53033 | length_loss 5.41936
| epoch 186 | valid on 'valid' subset | loss 5.547 | nll_loss 3.787 | ppl 13.81 | num_updates 207677 | best_loss 5.53033 | length_loss 5.4934
| epoch 187 | valid on 'valid' subset | loss 5.546 | nll_loss 3.785 | ppl 13.79 | num_updates 208000 | best_loss 5.53033 | length_loss 5.46068
| epoch 187 | valid on 'valid' subset | loss 5.552 | nll_loss 3.801 | ppl 13.94 | num_updates 208793 | best_loss 5.53033 | length_loss 5.20748
| epoch 188 | valid on 'valid' subset | loss 5.586 | nll_loss 3.826 | ppl 14.18 | num_updates 209910 | best_loss 5.53033 | length_loss 5.44483
| epoch 189 | valid on 'valid' subset | loss 5.591 | nll_loss 3.821 | ppl 14.13 | num_updates 210000 | best_loss 5.53033 | length_loss 5.74206
| epoch 189 | valid on 'valid' subset | loss 5.568 | nll_loss 3.811 | ppl 14.04 | num_updates 211027 | best_loss 5.53033 | length_loss 5.5314
| epoch 190 | valid on 'valid' subset | loss 5.557 | nll_loss 3.809 | ppl 14.01 | num_updates 212000 | best_loss 5.53033 | length_loss 5.15352
| epoch 190 | valid on 'valid' subset | loss 5.560 | nll_loss 3.792 | ppl 13.85 | num_updates 212144 | best_loss 5.53033 | length_loss 5.76931
| epoch 191 | valid on 'valid' subset | loss 5.553 | nll_loss 3.799 | ppl 13.92 | num_updates 213260 | best_loss 5.53033 | length_loss 5.39174
| epoch 192 | valid on 'valid' subset | loss 5.561 | nll_loss 3.809 | ppl 14.02 | num_updates 214000 | best_loss 5.53033 | length_loss 5.37401
| epoch 192 | valid on 'valid' subset | loss 5.573 | nll_loss 3.818 | ppl 14.10 | num_updates 214377 | best_loss 5.53033 | length_loss 5.42767
| epoch 193 | valid on 'valid' subset | loss 5.575 | nll_loss 3.804 | ppl 13.97 | num_updates 215494 | best_loss 5.53033 | length_loss 5.91402
| epoch 194 | valid on 'valid' subset | loss 5.563 | nll_loss 3.801 | ppl 13.94 | num_updates 216000 | best_loss 5.53033 | length_loss 5.6633
| epoch 194 | valid on 'valid' subset | loss 5.578 | nll_loss 3.824 | ppl 14.16 | num_updates 216610 | best_loss 5.53033 | length_loss 5.41596
| epoch 195 | valid on 'valid' subset | loss 5.565 | nll_loss 3.810 | ppl 14.02 | num_updates 217727 | best_loss 5.53033 | length_loss 5.25721
| epoch 196 | valid on 'valid' subset | loss 5.563 | nll_loss 3.806 | ppl 13.98 | num_updates 218000 | best_loss 5.53033 | length_loss 5.53871
| epoch 196 | valid on 'valid' subset | loss 5.551 | nll_loss 3.795 | ppl 13.88 | num_updates 218844 | best_loss 5.53033 | length_loss 5.42202
| epoch 197 | valid on 'valid' subset | loss 5.585 | nll_loss 3.823 | ppl 14.15 | num_updates 219960 | best_loss 5.53033 | length_loss 5.49439
| epoch 198 | valid on 'valid' subset | loss 5.573 | nll_loss 3.810 | ppl 14.02 | num_updates 220000 | best_loss 5.53033 | length_loss 5.55596
| epoch 198 | valid on 'valid' subset | loss 5.572 | nll_loss 3.807 | ppl 14.00 | num_updates 221077 | best_loss 5.53033 | length_loss 5.66251
| epoch 199 | valid on 'valid' subset | loss 5.550 | nll_loss 3.783 | ppl 13.77 | num_updates 222000 | best_loss 5.53033 | length_loss 5.73069
| epoch 199 | valid on 'valid' subset | loss 5.561 | nll_loss 3.806 | ppl 13.98 | num_updates 222194 | best_loss 5.53033 | length_loss 5.51685
| epoch 200 | valid on 'valid' subset | loss 5.561 | nll_loss 3.802 | ppl 13.95 | num_updates 223311 | best_loss 5.53033 | length_loss 5.36789
| epoch 201 | valid on 'valid' subset | loss 5.584 | nll_loss 3.829 | ppl 14.22 | num_updates 224000 | best_loss 5.53033 | length_loss 5.7705
| epoch 201 | valid on 'valid' subset | loss 5.544 | nll_loss 3.785 | ppl 13.78 | num_updates 224426 | best_loss 5.53033 | length_loss 5.46047
| epoch 202 | valid on 'valid' subset | loss 5.554 | nll_loss 3.803 | ppl 13.96 | num_updates 225543 | best_loss 5.53033 | length_loss 5.27932
| epoch 203 | valid on 'valid' subset | loss 5.568 | nll_loss 3.820 | ppl 14.12 | num_updates 226000 | best_loss 5.53033 | length_loss 5.474
| epoch 203 | valid on 'valid' subset | loss 5.540 | nll_loss 3.776 | ppl 13.70 | num_updates 226659 | best_loss 5.53033 | length_loss 5.5055
| epoch 204 | valid on 'valid' subset | loss 5.579 | nll_loss 3.821 | ppl 14.13 | num_updates 227776 | best_loss 5.53033 | length_loss 5.52887
| epoch 205 | valid on 'valid' subset | loss 5.556 | nll_loss 3.800 | ppl 13.92 | num_updates 228000 | best_loss 5.53033 | length_loss 5.48365
| epoch 205 | valid on 'valid' subset | loss 5.563 | nll_loss 3.810 | ppl 14.02 | num_updates 228893 | best_loss 5.53033 | length_loss 5.45851
| epoch 206 | valid on 'valid' subset | loss 5.568 | nll_loss 3.819 | ppl 14.11 | num_updates 230000 | best_loss 5.53033 | length_loss 5.33926
| epoch 206 | valid on 'valid' subset | loss 5.555 | nll_loss 3.802 | ppl 13.95 | num_updates 230010 | best_loss 5.53033 | length_loss 5.2936
| epoch 207 | valid on 'valid' subset | loss 5.545 | nll_loss 3.787 | ppl 13.80 | num_updates 231127 | best_loss 5.53033 | length_loss 5.35306
| epoch 208 | valid on 'valid' subset | loss 5.546 | nll_loss 3.789 | ppl 13.83 | num_updates 232000 | best_loss 5.53033 | length_loss 5.25107
| epoch 208 | valid on 'valid' subset | loss 5.554 | nll_loss 3.784 | ppl 13.78 | num_updates 232243 | best_loss 5.53033 | length_loss 5.83934
| epoch 209 | valid on 'valid' subset | loss 5.554 | nll_loss 3.790 | ppl 13.83 | num_updates 233360 | best_loss 5.53033 | length_loss 5.65368
| epoch 210 | valid on 'valid' subset | loss 5.568 | nll_loss 3.812 | ppl 14.04 | num_updates 234000 | best_loss 5.53033 | length_loss 5.36371
| epoch 210 | valid on 'valid' subset | loss 5.603 | nll_loss 3.843 | ppl 14.35 | num_updates 234477 | best_loss 5.53033 | length_loss 5.53959
| epoch 211 | valid on 'valid' subset | loss 5.553 | nll_loss 3.791 | ppl 13.84 | num_updates 235593 | best_loss 5.53033 | length_loss 5.80747
| epoch 212 | valid on 'valid' subset | loss 5.545 | nll_loss 3.784 | ppl 13.77 | num_updates 236000 | best_loss 5.53033 | length_loss 5.5965
| epoch 212 | valid on 'valid' subset | loss 5.574 | nll_loss 3.807 | ppl 14.00 | num_updates 236710 | best_loss 5.53033 | length_loss 5.66404
| epoch 213 | valid on 'valid' subset | loss 5.575 | nll_loss 3.815 | ppl 14.08 | num_updates 237827 | best_loss 5.53033 | length_loss 5.90345
| epoch 214 | valid on 'valid' subset | loss 5.583 | nll_loss 3.826 | ppl 14.18 | num_updates 238000 | best_loss 5.53033 | length_loss 5.63592
| epoch 214 | valid on 'valid' subset | loss 5.567 | nll_loss 3.801 | ppl 13.94 | num_updates 238943 | best_loss 5.53033 | length_loss 5.82184
| epoch 215 | valid on 'valid' subset | loss 5.543 | nll_loss 3.791 | ppl 13.84 | num_updates 240000 | best_loss 5.53033 | length_loss 5.10817
| epoch 215 | valid on 'valid' subset | loss 5.515 | nll_loss 3.762 | ppl 13.57 | num_updates 240060 | best_loss 5.51535 | length_loss 5.38116
| epoch 216 | valid on 'valid' subset | loss 5.531 | nll_loss 3.761 | ppl 13.56 | num_updates 241177 | best_loss 5.51535 | length_loss 5.93191
| epoch 217 | valid on 'valid' subset | loss 5.579 | nll_loss 3.815 | ppl 14.08 | num_updates 242000 | best_loss 5.51535 | length_loss 5.58746
| epoch 217 | valid on 'valid' subset | loss 5.551 | nll_loss 3.791 | ppl 13.85 | num_updates 242293 | best_loss 5.51535 | length_loss 5.70344
| epoch 218 | valid on 'valid' subset | loss 5.566 | nll_loss 3.806 | ppl 13.99 | num_updates 243410 | best_loss 5.51535 | length_loss 5.62662
| epoch 219 | valid on 'valid' subset | loss 5.560 | nll_loss 3.803 | ppl 13.96 | num_updates 244000 | best_loss 5.51535 | length_loss 5.2603
| epoch 219 | valid on 'valid' subset | loss 5.570 | nll_loss 3.812 | ppl 14.05 | num_updates 244527 | best_loss 5.51535 | length_loss 5.65762
| epoch 220 | valid on 'valid' subset | loss 5.540 | nll_loss 3.780 | ppl 13.74 | num_updates 245643 | best_loss 5.51535 | length_loss 5.83115
| epoch 221 | valid on 'valid' subset | loss 5.574 | nll_loss 3.814 | ppl 14.06 | num_updates 246000 | best_loss 5.51535 | length_loss 5.54657
| epoch 221 | valid on 'valid' subset | loss 5.548 | nll_loss 3.775 | ppl 13.69 | num_updates 246760 | best_loss 5.51535 | length_loss 5.85829
| epoch 222 | valid on 'valid' subset | loss 5.581 | nll_loss 3.823 | ppl 14.16 | num_updates 247876 | best_loss 5.51535 | length_loss 5.45996
| epoch 223 | valid on 'valid' subset | loss 5.555 | nll_loss 3.800 | ppl 13.93 | num_updates 248000 | best_loss 5.51535 | length_loss 5.22607
| epoch 223 | valid on 'valid' subset | loss 5.562 | nll_loss 3.808 | ppl 14.01 | num_updates 248993 | best_loss 5.51535 | length_loss 5.30127
| epoch 224 | valid on 'valid' subset | loss 5.561 | nll_loss 3.799 | ppl 13.91 | num_updates 250000 | best_loss 5.51535 | length_loss 5.53859
| epoch 224 | valid on 'valid' subset | loss 5.548 | nll_loss 3.800 | ppl 13.93 | num_updates 250109 | best_loss 5.51535 | length_loss 5.0822
| epoch 225 | valid on 'valid' subset | loss 5.559 | nll_loss 3.809 | ppl 14.01 | num_updates 251226 | best_loss 5.51535 | length_loss 5.16153
| epoch 226 | valid on 'valid' subset | loss 5.535 | nll_loss 3.778 | ppl 13.72 | num_updates 252000 | best_loss 5.51535 | length_loss 5.45856
| epoch 226 | valid on 'valid' subset | loss 5.544 | nll_loss 3.782 | ppl 13.76 | num_updates 252342 | best_loss 5.51535 | length_loss 5.49948
| epoch 227 | valid on 'valid' subset | loss 5.555 | nll_loss 3.802 | ppl 13.95 | num_updates 253459 | best_loss 5.51535 | length_loss 5.27929
| epoch 228 | valid on 'valid' subset | loss 5.556 | nll_loss 3.808 | ppl 14.01 | num_updates 254000 | best_loss 5.51535 | length_loss 5.24114
| epoch 228 | valid on 'valid' subset | loss 5.537 | nll_loss 3.776 | ppl 13.69 | num_updates 254576 | best_loss 5.51535 | length_loss 5.73599
| epoch 229 | valid on 'valid' subset | loss 5.557 | nll_loss 3.801 | ppl 13.94 | num_updates 255692 | best_loss 5.51535 | length_loss 5.54465
| epoch 230 | valid on 'valid' subset | loss 5.581 | nll_loss 3.823 | ppl 14.15 | num_updates 256000 | best_loss 5.51535 | length_loss 5.37077
| epoch 230 | valid on 'valid' subset | loss 5.560 | nll_loss 3.802 | ppl 13.94 | num_updates 256808 | best_loss 5.51535 | length_loss 5.352
| epoch 231 | valid on 'valid' subset | loss 5.569 | nll_loss 3.816 | ppl 14.08 | num_updates 257925 | best_loss 5.51535 | length_loss 5.41897
| epoch 232 | valid on 'valid' subset | loss 5.579 | nll_loss 3.827 | ppl 14.19 | num_updates 258000 | best_loss 5.51535 | length_loss 5.303
| epoch 232 | valid on 'valid' subset | loss 5.546 | nll_loss 3.792 | ppl 13.85 | num_updates 259042 | best_loss 5.51535 | length_loss 5.50189
| epoch 233 | valid on 'valid' subset | loss 5.559 | nll_loss 3.802 | ppl 13.94 | num_updates 260000 | best_loss 5.51535 | length_loss 5.48539
| epoch 233 | valid on 'valid' subset | loss 5.568 | nll_loss 3.815 | ppl 14.08 | num_updates 260159 | best_loss 5.51535 | length_loss 5.46542
| epoch 234 | valid on 'valid' subset | loss 5.572 | nll_loss 3.827 | ppl 14.19 | num_updates 261275 | best_loss 5.51535 | length_loss 5.45701
| epoch 235 | valid on 'valid' subset | loss 5.549 | nll_loss 3.787 | ppl 13.80 | num_updates 262000 | best_loss 5.51535 | length_loss 5.63183
| epoch 235 | valid on 'valid' subset | loss 5.535 | nll_loss 3.782 | ppl 13.76 | num_updates 262392 | best_loss 5.51535 | length_loss 5.47303
| epoch 236 | valid on 'valid' subset | loss 5.547 | nll_loss 3.786 | ppl 13.80 | num_updates 263508 | best_loss 5.51535 | length_loss 5.28952
| epoch 237 | valid on 'valid' subset | loss 5.558 | nll_loss 3.804 | ppl 13.97 | num_updates 264000 | best_loss 5.51535 | length_loss 5.38988
| epoch 237 | valid on 'valid' subset | loss 5.570 | nll_loss 3.811 | ppl 14.03 | num_updates 264624 | best_loss 5.51535 | length_loss 5.70354
| epoch 238 | valid on 'valid' subset | loss 5.562 | nll_loss 3.802 | ppl 13.95 | num_updates 265741 | best_loss 5.51535 | length_loss 5.42643
| epoch 239 | valid on 'valid' subset | loss 5.575 | nll_loss 3.815 | ppl 14.07 | num_updates 266000 | best_loss 5.51535 | length_loss 5.5593
| epoch 239 | valid on 'valid' subset | loss 5.564 | nll_loss 3.799 | ppl 13.92 | num_updates 266858 | best_loss 5.51535 | length_loss 5.75153
| epoch 240 | valid on 'valid' subset | loss 5.554 | nll_loss 3.801 | ppl 13.94 | num_updates 267975 | best_loss 5.51535 | length_loss 5.27644
| epoch 241 | valid on 'valid' subset | loss 5.580 | nll_loss 3.829 | ppl 14.21 | num_updates 268000 | best_loss 5.51535 | length_loss 5.52113
| epoch 241 | valid on 'valid' subset | loss 5.548 | nll_loss 3.788 | ppl 13.82 | num_updates 269091 | best_loss 5.51535 | length_loss 5.5829
| epoch 242 | valid on 'valid' subset | loss 5.577 | nll_loss 3.808 | ppl 14.00 | num_updates 270000 | best_loss 5.51535 | length_loss 5.60181
| epoch 242 | valid on 'valid' subset | loss 5.557 | nll_loss 3.802 | ppl 13.95 | num_updates 270208 | best_loss 5.51535 | length_loss 5.33101
| epoch 243 | valid on 'valid' subset | loss 5.542 | nll_loss 3.783 | ppl 13.77 | num_updates 271324 | best_loss 5.51535 | length_loss 5.42662
| epoch 244 | valid on 'valid' subset | loss 5.548 | nll_loss 3.801 | ppl 13.94 | num_updates 272000 | best_loss 5.51535 | length_loss 5.3026
| epoch 244 | valid on 'valid' subset | loss 5.558 | nll_loss 3.799 | ppl 13.92 | num_updates 272441 | best_loss 5.51535 | length_loss 5.47595
| epoch 245 | valid on 'valid' subset | loss 5.542 | nll_loss 3.777 | ppl 13.71 | num_updates 273558 | best_loss 5.51535 | length_loss 5.84733
| epoch 246 | valid on 'valid' subset | loss 5.557 | nll_loss 3.806 | ppl 13.98 | num_updates 274000 | best_loss 5.51535 | length_loss 5.25389
| epoch 246 | valid on 'valid' subset | loss 5.576 | nll_loss 3.811 | ppl 14.03 | num_updates 274674 | best_loss 5.51535 | length_loss 5.73085
| epoch 247 | valid on 'valid' subset | loss 5.572 | nll_loss 3.815 | ppl 14.08 | num_updates 275791 | best_loss 5.51535 | length_loss 5.33448
| epoch 248 | valid on 'valid' subset | loss 5.575 | nll_loss 3.821 | ppl 14.13 | num_updates 276000 | best_loss 5.51535 | length_loss 5.31691
| epoch 248 | valid on 'valid' subset | loss 5.536 | nll_loss 3.780 | ppl 13.74 | num_updates 276907 | best_loss 5.51535 | length_loss 5.04319
| epoch 249 | valid on 'valid' subset | loss 5.562 | nll_loss 3.801 | ppl 13.94 | num_updates 278000 | best_loss 5.51535 | length_loss 5.4161
| epoch 249 | valid on 'valid' subset | loss 5.553 | nll_loss 3.793 | ppl 13.86 | num_updates 278024 | best_loss 5.51535 | length_loss 5.47107
| epoch 250 | valid on 'valid' subset | loss 5.551 | nll_loss 3.796 | ppl 13.89 | num_updates 279139 | best_loss 5.51535 | length_loss 5.29736
| epoch 251 | valid on 'valid' subset | loss 5.580 | nll_loss 3.823 | ppl 14.15 | num_updates 280000 | best_loss 5.51535 | length_loss 5.49814
| epoch 251 | valid on 'valid' subset | loss 5.556 | nll_loss 3.791 | ppl 13.84 | num_updates 280256 | best_loss 5.51535 | length_loss 5.53517
| epoch 252 | valid on 'valid' subset | loss 5.557 | nll_loss 3.798 | ppl 13.91 | num_updates 281373 | best_loss 5.51535 | length_loss 5.41261
| epoch 253 | valid on 'valid' subset | loss 5.554 | nll_loss 3.800 | ppl 13.93 | num_updates 282000 | best_loss 5.51535 | length_loss 5.37349
| epoch 253 | valid on 'valid' subset | loss 5.578 | nll_loss 3.820 | ppl 14.13 | num_updates 282490 | best_loss 5.51535 | length_loss 5.45404
| epoch 254 | valid on 'valid' subset | loss 5.580 | nll_loss 3.808 | ppl 14.01 | num_updates 283606 | best_loss 5.51535 | length_loss 6.01807
| epoch 255 | valid on 'valid' subset | loss 5.584 | nll_loss 3.833 | ppl 14.25 | num_updates 284000 | best_loss 5.51535 | length_loss 5.65133
| epoch 255 | valid on 'valid' subset | loss 5.544 | nll_loss 3.787 | ppl 13.80 | num_updates 284723 | best_loss 5.51535 | length_loss 5.46523
| epoch 256 | valid on 'valid' subset | loss 5.545 | nll_loss 3.796 | ppl 13.89 | num_updates 285839 | best_loss 5.51535 | length_loss 5.35542
| epoch 257 | valid on 'valid' subset | loss 5.540 | nll_loss 3.787 | ppl 13.81 | num_updates 286000 | best_loss 5.51535 | length_loss 5.39352
| epoch 257 | valid on 'valid' subset | loss 5.541 | nll_loss 3.777 | ppl 13.71 | num_updates 286956 | best_loss 5.51535 | length_loss 5.56823
| epoch 258 | valid on 'valid' subset | loss 5.562 | nll_loss 3.802 | ppl 13.94 | num_updates 288000 | best_loss 5.51535 | length_loss 5.55644
| epoch 258 | valid on 'valid' subset | loss 5.564 | nll_loss 3.811 | ppl 14.03 | num_updates 288072 | best_loss 5.51535 | length_loss 5.36939
| epoch 259 | valid on 'valid' subset | loss 5.555 | nll_loss 3.798 | ppl 13.91 | num_updates 289189 | best_loss 5.51535 | length_loss 5.56483
| epoch 260 | valid on 'valid' subset | loss 5.568 | nll_loss 3.806 | ppl 13.99 | num_updates 290000 | best_loss 5.51535 | length_loss 5.45672
| epoch 260 | valid on 'valid' subset | loss 5.557 | nll_loss 3.802 | ppl 13.95 | num_updates 290306 | best_loss 5.51535 | length_loss 5.32805
| epoch 261 | valid on 'valid' subset | loss 5.557 | nll_loss 3.807 | ppl 13.99 | num_updates 291422 | best_loss 5.51535 | length_loss 5.14117
| epoch 262 | valid on 'valid' subset | loss 5.557 | nll_loss 3.801 | ppl 13.94 | num_updates 292000 | best_loss 5.51535 | length_loss 5.38264
| epoch 262 | valid on 'valid' subset | loss 5.543 | nll_loss 3.790 | ppl 13.84 | num_updates 292539 | best_loss 5.51535 | length_loss 5.41206
| epoch 263 | valid on 'valid' subset | loss 5.547 | nll_loss 3.783 | ppl 13.76 | num_updates 293656 | best_loss 5.51535 | length_loss 5.43538
| epoch 264 | valid on 'valid' subset | loss 5.543 | nll_loss 3.778 | ppl 13.72 | num_updates 294000 | best_loss 5.51535 | length_loss 5.71871
| epoch 264 | valid on 'valid' subset | loss 5.555 | nll_loss 3.799 | ppl 13.92 | num_updates 294772 | best_loss 5.51535 | length_loss 5.51499
| epoch 265 | valid on 'valid' subset | loss 5.550 | nll_loss 3.791 | ppl 13.84 | num_updates 295889 | best_loss 5.51535 | length_loss 5.4894
| epoch 266 | valid on 'valid' subset | loss 5.568 | nll_loss 3.813 | ppl 14.05 | num_updates 296000 | best_loss 5.51535 | length_loss 5.53522
| epoch 266 | valid on 'valid' subset | loss 5.584 | nll_loss 3.836 | ppl 14.28 | num_updates 297005 | best_loss 5.51535 | length_loss 5.15856
| epoch 267 | valid on 'valid' subset | loss 5.560 | nll_loss 3.805 | ppl 13.98 | num_updates 298000 | best_loss 5.51535 | length_loss 5.30485
| epoch 267 | valid on 'valid' subset | loss 5.566 | nll_loss 3.810 | ppl 14.03 | num_updates 298122 | best_loss 5.51535 | length_loss 5.3212
| epoch 268 | valid on 'valid' subset | loss 5.578 | nll_loss 3.828 | ppl 14.20 | num_updates 299239 | best_loss 5.51535 | length_loss 5.62171
| epoch 269 | valid on 'valid' subset | loss 5.545 | nll_loss 3.788 | ppl 13.82 | num_updates 300000 | best_loss 5.51535 | length_loss 5.35381
| epoch 269 | valid on 'valid' subset | loss 5.562 | nll_loss 3.807 | ppl 13.99 | num_updates 300000 | best_loss 5.51535 | length_loss 5.35381
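Because the validation numbers move up and down from one check to the next, it can help to parse the log and look at a smoothed trend rather than individual lines. Below is a minimal sketch, assuming the line format shown above; `parse_valid_lines` and `moving_average` are hypothetical helper names, not part of fairseq. It also checks that the logged `ppl` values are consistent with `2 ** nll_loss`.

```python
import re

# Pattern for validation lines in the format shown in the log above.
LINE_RE = re.compile(
    r"\| epoch (\d+) \| valid on '(\w+)' subset \| loss ([\d.]+) "
    r"\| nll_loss ([\d.]+) \| ppl ([\d.]+) \| num_updates (\d+)"
)

def parse_valid_lines(lines):
    """Return one dict per validation line; non-matching lines are skipped."""
    rows = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue
        rows.append({
            "epoch": int(m.group(1)),
            "loss": float(m.group(3)),
            "nll_loss": float(m.group(4)),
            "ppl": float(m.group(5)),
            "num_updates": int(m.group(6)),
        })
    return rows

def moving_average(values, window):
    """Trailing mean over the most recent `window` values (fewer at the start)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Three lines copied from the log above.
sample = [
    "| epoch 215 | valid on 'valid' subset | loss 5.543 | nll_loss 3.791 | ppl 13.84 | num_updates 240000 | best_loss 5.53033 | length_loss 5.10817",
    "| epoch 215 | valid on 'valid' subset | loss 5.515 | nll_loss 3.762 | ppl 13.57 | num_updates 240060 | best_loss 5.51535 | length_loss 5.38116",
    "| epoch 216 | valid on 'valid' subset | loss 5.531 | nll_loss 3.761 | ppl 13.56 | num_updates 241177 | best_loss 5.51535 | length_loss 5.93191",
]

rows = parse_valid_lines(sample)
for r in rows:
    # Compare the logged ppl with 2 ** nll_loss.
    print(r["num_updates"], r["ppl"], round(2 ** r["nll_loss"], 2))

# Smoothed nll_loss over a window of 2 validation checks.
print(moving_average([r["nll_loss"] for r in rows], window=2))
```

Applied to the full log with a larger window (for example 10), the smoothed `nll_loss` makes it easier to judge whether the run is still improving or has flattened out.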
I then simply averaged the last 5 checkpoints, which gave ~27.3 BLEU.
Thanks again for your helpful suggestions for reproducing the results. I will close this issue :)
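For anyone unfamiliar with checkpoint averaging: it takes the element-wise mean of the parameters across several saved checkpoints. The thread does not say which tool was used here (upstream fairseq includes `scripts/average_checkpoints.py` for this purpose). The sketch below only illustrates the arithmetic on toy data; `average_checkpoints` is a hypothetical function, and plain Python lists stand in for real parameter tensors.

```python
def average_checkpoints(state_dicts):
    """Element-wise mean of parameter dicts that share the same keys and shapes."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("checkpoints have different parameter names")
    n = len(state_dicts)
    averaged = {}
    for k in keys:
        # Group the i-th entry of parameter k across all checkpoints, then average.
        columns = zip(*(sd[k] for sd in state_dicts))
        averaged[k] = [sum(col) / n for col in columns]
    return averaged

# Toy stand-ins for saved checkpoints: parameter name -> flat list of floats.
checkpoints = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [5.0, 6.0], "b": [2.0]},
]

avg = average_checkpoints(checkpoints)
print(avg)
```

With real checkpoints the same mean is taken over every tensor in the model state, and the averaged model is then used for decoding.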
Great! The result seems very reasonable. Thank you for the update.
Hi, Kasai ~ Great work!
For EnDe, I can get ~27.3 BLEU with your pretrained model, but when I train from scratch on the distilled data you provided, I get only ~24 BLEU, with a weird loss curve. Could you give me some advice on how to properly reproduce your results?
I trained it on 8 V100 GPUs with the following script:
The valid loss during training looks weird: