Closed: feralvam closed this issue 4 years ago.
Skimming your provided code, it looks alright. Does the model training look stable? Is the perplexity decreasing? If you decode on the training set instead of on test/valid, can the model produce the target sentences?
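For example, a quick way to check this (a sketch; the data and checkpoint paths are illustrative):

```bash
# Decode the training split instead of test/valid (--gen-subset selects the split).
fairseq-generate data-bin/wikilarge \
    --path models/wikilarge/transformer/checkpoints/checkpoint_best.pt \
    --gen-subset train --beam 5 --remove-bpe | head -n 50
```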
In addition to @huihuifan's suggestions, can you also double-check that your $TEXT/train.orig and $TEXT/train.simp have the same number of lines and that they are properly aligned?
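A quick sanity check might look like this (sketch):

```bash
# Line counts should match, and randomly sampled pairs should be parallel.
wc -l $TEXT/train.orig $TEXT/train.simp
paste $TEXT/train.orig $TEXT/train.simp | shuf -n 5
```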
I also noticed that you did not control the vocabulary size when running preprocess; you might try setting `fairseq-preprocess --nwordssrc $BPE_CODE --nwordstgt $BPE_CODE (...)`.
Thank you for your quick replies.
| model transformer, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 59748352 (num. trained: 59748352)
| training on 1 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None
| no existing checkpoint found models/wikilarge/transformer/checkpoints/checkpoint_last.pt
| epoch 001: 1000 / 2737 loss=10.638, nll_loss=10.191, ppl=1169.35, wps=12889, ups=4.8, wpb=2611, bsz=109, num_updates=1001, lr=0.0001252, gnorm=2.391, clip=0%, oom=0, wall=207, train_wall=193
| epoch 001: 2000 / 2737 loss=9.936, nll_loss=9.378, ppl=665.34, wps=12548, ups=4.8, wpb=2584, bsz=108, num_updates=2001, lr=0.000250175, gnorm=1.961, clip=0%, oom=0, wall=416, train_wall=391
| epoch 001 | loss 9.623 | nll_loss 9.013 | ppl 516.48 | wps 12445 | ups 4.8 | wpb 2578 | bsz 108 | num_updates 2737 | lr 0.000342157 | gnorm 1.779 | clip 0% | oom 0 | wall 571 | train_wall 538
| epoch 001 | valid on 'valid' subset | valid_loss 8.52098 | valid_nll_loss 7.68015 | valid_ppl 205.10 | num_updates 2737
| epoch 002: 1000 / 2737 loss=8.433, nll_loss=7.627, ppl=197.74, wps=11899, ups=4.6, wpb=2522, bsz=106, num_updates=3738, lr=0.000467257, gnorm=1.141, clip=0%, oom=0, wall=787, train_wall=738
| epoch 002: 2000 / 2737 loss=8.249, nll_loss=7.416, ppl=170.74, wps=11942, ups=4.6, wpb=2557, bsz=108, num_updates=4738, lr=0.000459412, gnorm=1.087, clip=0%, oom=0, wall=1003, train_wall=942
| epoch 002 | loss 8.132 | nll_loss 7.280 | ppl 155.45 | wps 11998 | ups 4.6 | wpb 2578 | bsz 108 | num_updates 5474 | lr 0.000427413 | gnorm 1.055 | clip 0% | oom 0 | wall 1163 | train_wall 1093
| epoch 002 | valid on 'valid' subset | valid_loss 7.65196 | valid_nll_loss 6.66805 | valid_ppl 101.69 | num_updates 5474 | best 7.65196
| epoch 003: 1000 / 2737 loss=7.568, nll_loss=6.633, ppl=99.22, wps=11839, ups=4.4, wpb=2531, bsz=108, num_updates=6475, lr=0.000392989, gnorm=0.987, clip=0%, oom=0, wall=1389, train_wall=1295
| epoch 003: 2000 / 2737 loss=7.493, nll_loss=6.546, ppl=93.44, wps=12108, ups=4.6, wpb=2571, bsz=108, num_updates=7475, lr=0.000365758, gnorm=0.973, clip=0%, oom=0, wall=1600, train_wall=1494
| epoch 003 | loss 7.422 | nll_loss 6.464 | ppl 88.30 | wps 12164 | ups 4.6 | wpb 2578 | bsz 108 | num_updates 8211 | lr 0.000348981 | gnorm 0.974 | clip 0% | oom 0 | wall 1755 | train_wall 1640
| epoch 003 | valid on 'valid' subset | valid_loss 7.14075 | valid_nll_loss 6.06695 | valid_ppl 67.04 | num_updates 8211 | best 7.14075
| epoch 004: 1000 / 2737 loss=7.097, nll_loss=6.089, ppl=68.06, wps=12301, ups=4.5, wpb=2579, bsz=108, num_updates=9212, lr=0.000329475, gnorm=1.008, clip=0%, oom=0, wall=1978, train_wall=1838
| epoch 004: 2000 / 2737 loss=7.053, nll_loss=6.037, ppl=65.68, wps=12269, ups=4.6, wpb=2576, bsz=108, num_updates=10212, lr=0.000312928, gnorm=1.015, clip=0%, oom=0, wall=2188, train_wall=2035
| epoch 004 | loss 7.017 | nll_loss 5.996 | ppl 63.82 | wps 12238 | ups 4.6 | wpb 2578 | bsz 108 | num_updates 10948 | lr 0.000302227 | gnorm 1.018 | clip 0% | oom 0 | wall 2344 | train_wall 2182
| epoch 004 | valid on 'valid' subset | valid_loss 6.8459 | valid_nll_loss 5.71993 | valid_ppl 52.71 | num_updates 10948 | best 6.8459
| epoch 005: 1000 / 2737 loss=6.801, nll_loss=5.746, ppl=53.68, wps=11845, ups=4.4, wpb=2526, bsz=107, num_updates=11949, lr=0.000289291, gnorm=1.072, clip=0%, oom=0, wall=2571, train_wall=2382
| epoch 005: 2000 / 2737 loss=6.771, nll_loss=5.711, ppl=52.38, wps=12056, ups=4.5, wpb=2574, bsz=108, num_updates=12949, lr=0.000277896, gnorm=1.065, clip=0%, oom=0, wall=2785, train_wall=2582
| epoch 005 | loss 6.758 | nll_loss 5.696 | ppl 51.84 | wps 12105 | ups 4.6 | wpb 2578 | bsz 108 | num_updates 13685 | lr 0.00027032 | gnorm 1.071 | clip 0% | oom 0 | wall 2940 | train_wall 2728
| epoch 005 | valid on 'valid' subset | valid_loss 6.72075 | valid_nll_loss 5.56908 | valid_ppl 47.47 | num_updates 13685 | best 6.72075
| epoch 006: 1000 / 2737 loss=6.579, nll_loss=5.488, ppl=44.88, wps=12265, ups=4.4, wpb=2598, bsz=108, num_updates=14686, lr=0.000260945, gnorm=1.080, clip=0%, oom=0, wall=3165, train_wall=2929
| epoch 006: 2000 / 2737 loss=6.576, nll_loss=5.484, ppl=44.76, wps=12589, ups=4.7, wpb=2589, bsz=109, num_updates=15686, lr=0.00025249, gnorm=1.105, clip=0%, oom=0, wall=3365, train_wall=3120
| epoch 006 | loss 6.576 | nll_loss 5.485 | ppl 44.78 | wps 12643 | ups 4.8 | wpb 2578 | bsz 108 | num_updates 16422 | lr 0.000246767 | gnorm 1.117 | clip 0% | oom 0 | wall 3511 | train_wall 3260
| epoch 006 | valid on 'valid' subset | valid_loss 6.58137 | valid_nll_loss 5.40499 | valid_ppl 42.37 | num_updates 16422 | best 6.58137
| epoch 007: 1000 / 2737 loss=6.450, nll_loss=5.339, ppl=40.47, wps=12743, ups=4.7, wpb=2559, bsz=108, num_updates=17423, lr=0.000239573, gnorm=1.163, clip=0%, oom=0, wall=3726, train_wall=3452
| epoch 007: 2000 / 2737 loss=6.450, nll_loss=5.338, ppl=40.46, wps=12840, ups=4.8, wpb=2574, bsz=109, num_updates=18423, lr=0.000232981, gnorm=1.160, clip=0%, oom=0, wall=3926, train_wall=3644
| epoch 007 | loss 6.440 | nll_loss 5.326 | ppl 40.10 | wps 12861 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 19159 | lr 0.000228462 | gnorm 1.166 | clip 0% | oom 0 | wall 4073 | train_wall 3785
| epoch 007 | valid on 'valid' subset | valid_loss 6.48309 | valid_nll_loss 5.2846 | valid_ppl 38.98 | num_updates 19159 | best 6.48309
| epoch 008: 1000 / 2737 loss=6.306, nll_loss=5.172, ppl=36.04, wps=12870, ups=4.7, wpb=2587, bsz=109, num_updates=20160, lr=0.000222718, gnorm=1.180, clip=0%, oom=0, wall=4288, train_wall=3977
| epoch 008: 2000 / 2737 loss=6.322, nll_loss=5.188, ppl=36.47, wps=12877, ups=4.8, wpb=2585, bsz=109, num_updates=21160, lr=0.000217391, gnorm=1.190, clip=0%, oom=0, wall=4488, train_wall=4169
| epoch 008 | loss 6.326 | nll_loss 5.193 | ppl 36.58 | wps 12876 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 21896 | lr 0.000213706 | gnorm 1.201 | clip 0% | oom 0 | wall 4634 | train_wall 4309
| epoch 008 | valid on 'valid' subset | valid_loss 6.43844 | valid_nll_loss 5.23288 | valid_ppl 37.61 | num_updates 21896 | best 6.43844
| epoch 009: 1000 / 2737 loss=6.220, nll_loss=5.070, ppl=33.60, wps=12800, ups=4.7, wpb=2568, bsz=108, num_updates=22897, lr=0.000208983, gnorm=1.242, clip=0%, oom=0, wall=4847, train_wall=4501
| epoch 009: 2000 / 2737 loss=6.234, nll_loss=5.086, ppl=33.96, wps=12811, ups=4.8, wpb=2571, bsz=108, num_updates=23897, lr=0.000204564, gnorm=1.242, clip=0%, oom=0, wall=5048, train_wall=4693
| epoch 009 | loss 6.233 | nll_loss 5.085 | ppl 33.94 | wps 12857 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 24633 | lr 0.000201484 | gnorm 1.241 | clip 0% | oom 0 | wall 5195 | train_wall 4834
| epoch 009 | valid on 'valid' subset | valid_loss 6.37204 | valid_nll_loss 5.14732 | valid_ppl 35.44 | num_updates 24633 | best 6.37204
| epoch 010: 1000 / 2737 loss=6.144, nll_loss=4.982, ppl=31.61, wps=12673, ups=4.7, wpb=2534, bsz=106, num_updates=25634, lr=0.000197511, gnorm=1.279, clip=0%, oom=0, wall=5408, train_wall=5026
| epoch 010: 2000 / 2737 loss=6.146, nll_loss=4.983, ppl=31.62, wps=12814, ups=4.8, wpb=2570, bsz=107, num_updates=26634, lr=0.000193768, gnorm=1.267, clip=0%, oom=0, wall=5609, train_wall=5218
| epoch 010 | loss 6.153 | nll_loss 4.991 | ppl 31.80 | wps 12846 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 27370 | lr 0.000191145 | gnorm 1.275 | clip 0% | oom 0 | wall 5757 | train_wall 5360
| epoch 010 | valid on 'valid' subset | valid_loss 6.34336 | valid_nll_loss 5.11528 | valid_ppl 34.66 | num_updates 27370 | best 6.34336
| epoch 011: 1000 / 2737 loss=6.064, nll_loss=4.888, ppl=29.62, wps=12740, ups=4.7, wpb=2549, bsz=106, num_updates=28371, lr=0.000187743, gnorm=1.309, clip=0%, oom=0, wall=5970, train_wall=5551
| epoch 011: 2000 / 2737 loss=6.078, nll_loss=4.903, ppl=29.93, wps=12829, ups=4.8, wpb=2577, bsz=108, num_updates=29371, lr=0.000184519, gnorm=1.313, clip=0%, oom=0, wall=6171, train_wall=5744
| epoch 011 | loss 6.085 | nll_loss 4.911 | ppl 30.09 | wps 12840 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 30107 | lr 0.000182249 | gnorm 1.310 | clip 0% | oom 0 | wall 6319 | train_wall 5885
| epoch 011 | valid on 'valid' subset | valid_loss 6.29881 | valid_nll_loss 5.06415 | valid_ppl 33.45 | num_updates 30107 | best 6.29881
| epoch 012: 1000 / 2737 loss=6.017, nll_loss=4.832, ppl=28.48, wps=12836, ups=4.7, wpb=2587, bsz=108, num_updates=31108, lr=0.000179293, gnorm=1.317, clip=0%, oom=0, wall=6533, train_wall=6079
| epoch 012: 2000 / 2737 loss=6.022, nll_loss=4.838, ppl=28.60, wps=12854, ups=4.8, wpb=2583, bsz=108, num_updates=32108, lr=0.000176479, gnorm=1.341, clip=0%, oom=0, wall=6733, train_wall=6270
| epoch 012 | loss 6.021 | nll_loss 4.837 | ppl 28.58 | wps 12834 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 32844 | lr 0.000174491 | gnorm 1.342 | clip 0% | oom 0 | wall 6881 | train_wall 6411
| epoch 012 | valid on 'valid' subset | valid_loss 6.2966 | valid_nll_loss 5.05321 | valid_ppl 33.20 | num_updates 32844 | best 6.2966
| epoch 013: 1000 / 2737 loss=5.943, nll_loss=4.747, ppl=26.85, wps=12613, ups=4.6, wpb=2562, bsz=111, num_updates=33845, lr=0.000171891, gnorm=1.368, clip=0%, oom=0, wall=7097, train_wall=6605
| epoch 013: 2000 / 2737 loss=5.959, nll_loss=4.764, ppl=27.17, wps=12500, ups=4.7, wpb=2553, bsz=109, num_updates=34845, lr=0.000169406, gnorm=1.384, clip=0%, oom=0, wall=7303, train_wall=6800
| epoch 013 | loss 5.968 | nll_loss 4.775 | ppl 27.38 | wps 12650 | ups 4.8 | wpb 2578 | bsz 108 | num_updates 35581 | lr 0.000167645 | gnorm 1.373 | clip 0% | oom 0 | wall 7452 | train_wall 6942
| epoch 013 | valid on 'valid' subset | valid_loss 6.28413 | valid_nll_loss 5.03823 | valid_ppl 32.86 | num_updates 35581 | best 6.28413
| epoch 014: 1000 / 2737 loss=5.885, nll_loss=4.678, ppl=25.61, wps=12695, ups=4.6, wpb=2607, bsz=108, num_updates=36582, lr=0.000165336, gnorm=1.369, clip=0%, oom=0, wall=7671, train_wall=7137
| epoch 014: 2000 / 2737 loss=5.898, nll_loss=4.693, ppl=25.86, wps=12745, ups=4.8, wpb=2593, bsz=109, num_updates=37582, lr=0.000163121, gnorm=1.420, clip=0%, oom=0, wall=7873, train_wall=7330
| epoch 014 | loss 5.915 | nll_loss 4.712 | ppl 26.21 | wps 12747 | ups 4.8 | wpb 2578 | bsz 108 | num_updates 38318 | lr 0.000161547 | gnorm 1.426 | clip 0% | oom 0 | wall 8019 | train_wall 7470
| epoch 014 | valid on 'valid' subset | valid_loss 6.24927 | valid_nll_loss 5.00439 | valid_ppl 32.10 | num_updates 38318 | best 6.24927
| epoch 015: 1000 / 2737 loss=5.843, nll_loss=4.630, ppl=24.76, wps=12640, ups=4.7, wpb=2544, bsz=109, num_updates=39319, lr=0.000159477, gnorm=1.425, clip=0%, oom=0, wall=8233, train_wall=7663
| epoch 015: 2000 / 2737 loss=5.859, nll_loss=4.648, ppl=25.07, wps=12776, ups=4.8, wpb=2566, bsz=109, num_updates=40319, lr=0.000157487, gnorm=1.433, clip=0%, oom=0, wall=8434, train_wall=7855
| epoch 015 | loss 5.872 | nll_loss 4.662 | ppl 25.32 | wps 12843 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 41055 | lr 0.000156069 | gnorm 1.437 | clip 0% | oom 0 | wall 8581 | train_wall 7996
| epoch 015 | valid on 'valid' subset | valid_loss 6.25129 | valid_nll_loss 5.00187 | valid_ppl 32.04 | num_updates 41055 | best 6.24927
| epoch 016: 1000 / 2737 loss=5.805, nll_loss=4.586, ppl=24.01, wps=12739, ups=4.8, wpb=2542, bsz=110, num_updates=42056, lr=0.000154201, gnorm=1.476, clip=0%, oom=0, wall=8788, train_wall=8187
| epoch 016: 2000 / 2737 loss=5.820, nll_loss=4.602, ppl=24.28, wps=13033, ups=5.0, wpb=2569, bsz=109, num_updates=43056, lr=0.000152399, gnorm=1.470, clip=0%, oom=0, wall=8982, train_wall=8373
| epoch 016 | loss 5.830 | nll_loss 4.613 | ppl 24.47 | wps 13040 | ups 5.0 | wpb 2578 | bsz 108 | num_updates 43792 | lr 0.000151113 | gnorm 1.464 | clip 0% | oom 0 | wall 9129 | train_wall 8514
| epoch 016 | valid on 'valid' subset | valid_loss 6.23814 | valid_nll_loss 4.9816 | valid_ppl 31.59 | num_updates 43792 | best 6.23814
| epoch 017: 1000 / 2737 loss=5.774, nll_loss=4.548, ppl=23.40, wps=12709, ups=4.7, wpb=2550, bsz=106, num_updates=44793, lr=0.000149415, gnorm=1.486, clip=0%, oom=0, wall=9343, train_wall=8706
| epoch 017: 2000 / 2737 loss=5.787, nll_loss=4.563, ppl=23.65, wps=12796, ups=4.8, wpb=2570, bsz=108, num_updates=45793, lr=0.000147775, gnorm=1.486, clip=0%, oom=0, wall=9544, train_wall=8898
| epoch 017 | loss 5.789 | nll_loss 4.566 | ppl 23.68 | wps 12825 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 46529 | lr 0.000146601 | gnorm 1.493 | clip 0% | oom 0 | wall 9692 | train_wall 9040
| epoch 017 | valid on 'valid' subset | valid_loss 6.23362 | valid_nll_loss 4.97762 | valid_ppl 31.51 | num_updates 46529 | best 6.23362
| epoch 018: 1000 / 2737 loss=5.704, nll_loss=4.469, ppl=22.14, wps=12795, ups=4.7, wpb=2575, bsz=108, num_updates=47530, lr=0.000145049, gnorm=1.505, clip=0%, oom=0, wall=9905, train_wall=9233
| epoch 018: 2000 / 2737 loss=5.740, nll_loss=4.509, ppl=22.76, wps=12805, ups=4.8, wpb=2576, bsz=108, num_updates=48530, lr=0.000143547, gnorm=1.517, clip=0%, oom=0, wall=10106, train_wall=9426
| epoch 018 | loss 5.754 | nll_loss 4.525 | ppl 23.02 | wps 12844 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 49266 | lr 0.000142471 | gnorm 1.529 | clip 0% | oom 0 | wall 10253 | train_wall 9566
| epoch 018 | valid on 'valid' subset | valid_loss 6.20331 | valid_nll_loss 4.94124 | valid_ppl 30.72 | num_updates 49266 | best 6.20331
| epoch 019 | loss 5.671 | nll_loss 4.430 | ppl 21.55 | wps 12874 | ups 4.6 | wpb 2592 | bsz 110 | num_updates 50000 | lr 0.000141421 | gnorm 1.534 | clip 0% | oom 0 | wall 10413 | train_wall 9707
| epoch 019 | valid on 'valid' subset | valid_loss 6.22494 | valid_nll_loss 4.96167 | valid_ppl 31.16 | num_updates 50000 | best 6.20331
| done training in 10417.7 seconds
No, even when decoding on the training set the model generates completely unrelated sentences. What could be the cause of this?
The files have the same number of lines and are properly aligned.
I will try the last suggestion given by @myleott and I'll let you know what happens.
@myleott Sorry, I don't quite understand what you mean by setting `fairseq-preprocess --nwordssrc $BPE_CODE --nwordstgt $BPE_CODE`. `--nwordssrc` and `--nwordstgt` are integer parameters, and `$BPE_CODE` is a file. Do you mean to set the values of those parameters to the number of lines in the codes file?
Yep, I meant the number of codes. Also, that perplexity is quite high. For translation problems we usually get perplexities of ~4-5, so I suspect the model is not well trained.
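In other words, something like this (a sketch; file names are illustrative and the other preprocess flags stay the same):

```bash
# The "number of codes" is the line count of the learned BPE codes file; pass it
# as the source/target vocabulary size when binarizing.
NUM_CODES=$(wc -l < $BPE_CODE)
fairseq-preprocess --source-lang orig --target-lang simp \
    --trainpref $TEXT/bpe/train --validpref $TEXT/bpe/valid --testpref $TEXT/bpe/test \
    --nwordssrc $NUM_CODES --nwordstgt $NUM_CODES \
    --destdir data-bin/wikilarge
```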
I see. I'll start changing the parameters and see what happens. Any suggestion on what to try first would be welcome; I'm not very experienced in this. This paper uses the Transformer (tensor2tensor) on the same data, so I'll try to use the same configuration in fairseq as a starting point.
I would try increasing the number of words per batch. Since you're training on 1 GPU, try setting `--update-freq 8`, which will make the effective batch size 8 times bigger (and simulates training on 8 GPUs).
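Concretely, something along these lines (a sketch; flags other than `--update-freq` just mirror the original training command and are illustrative):

```bash
fairseq-train data-bin/wikilarge \
    --arch transformer --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4000 \
    --update-freq 8   # accumulate gradients over 8 batches per update
```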
@myleott I'm using fairseq for summarization. You said that a ppl of ~4-5 is considered good for translation. Mine decreased to a ppl of 14.83 at epoch 14 and then increased to 21.73 by epoch 45. Is that normal?
Hello, I am trying to use the Transformer on a sentence simplification dataset. Training seems to run without problems, but at generation time the hypothesis sentences do not make any sense. I was wondering if you could help me figure out what I am doing wrong.
I tried to follow this example that you provide for translation using the transformer.
1. Pre-processing: The dataset I am using for training contains aligned sentences such as this pair:
Since the sentences in the dataset are already tokenised, for pre-processing I only lowercased all sentences and learned/applied BPE using the following script:
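A minimal sketch of this step, assuming subword-nmt is used for BPE (file names, directories, and the number of merge operations are illustrative, not necessarily the exact commands used):

```bash
# Lowercase all splits, then learn a joint BPE model on the training data and
# apply it to every split.
mkdir -p $TEXT/bpe
for f in train valid test; do
    for l in orig simp; do
        tr '[:upper:]' '[:lower:]' < $TEXT/$f.$l > $TEXT/$f.lc.$l
    done
done

cat $TEXT/train.lc.orig $TEXT/train.lc.simp | subword-nmt learn-bpe -s 10000 > $BPE_CODE
for f in train valid test; do
    for l in orig simp; do
        subword-nmt apply-bpe -c $BPE_CODE < $TEXT/$f.lc.$l > $TEXT/bpe/$f.$l
    done
done
```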
Then I proceeded to binarize the dataset:
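Roughly as follows (a sketch; paths are illustrative, and note that no vocabulary size limits were set, which is what @myleott points out above):

```bash
fairseq-preprocess --source-lang orig --target-lang simp \
    --trainpref $TEXT/bpe/train --validpref $TEXT/bpe/valid --testpref $TEXT/bpe/test \
    --destdir data-bin/wikilarge
```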
2. Training: For training, I used the same command as in the example provided. I am aware that I'd need to adapt the parameters to suit the dataset, but I thought it was a good starting point.
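For reference, a sketch of what such a command might look like (not necessarily the exact flags used; the hyperparameters follow the translation example, and the save directory matches the log above):

```bash
fairseq-train data-bin/wikilarge \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4000 --max-update 50000 \
    --save-dir models/wikilarge/transformer/checkpoints
```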
3. Generation: As in the example, I executed the following commands:
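A sketch of the generation step, mirroring the translation example (paths are illustrative):

```bash
fairseq-generate data-bin/wikilarge \
    --path models/wikilarge/transformer/checkpoints/checkpoint_best.pt \
    --batch-size 128 --beam 5 --remove-bpe
```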
Most output sentences I get are like this:
As can be seen, the generated hypotheses (the H lines) make no sense, as they are not related at all to the corresponding input.
Am I doing something wrong at training or generation time that causes this? Maybe I am not understanding the parameters properly?
I hope this is the right place to ask this type of question. Thank you.