facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Transformer generates unrelated sentences to the input #514

Closed. feralvam closed this issue 4 years ago.

feralvam commented 5 years ago

Hello, I am trying to use the transformer on a sentence simplification dataset. Training seems to run without problems, but at generation time the hypothesis sentences do not make any sense. I was wondering if you could help me figure out what I am doing wrong.

I tried to follow this example that you provide for translation using the transformer.

1. Pre-processing: The dataset I am using for training contains aligned sentences such as this pair:

Original: In Holland they were called Stadspijpers , in Germany Stadtpfeifer and in Italy Pifferi .
Simplified: They were called Stadtpfeifer in Germany and Pifferi in Italy .

Since the sentences in the dataset are already tokenised, for pre-processing I only lowercased all sentences and learned/applied BPE using the following script:

src=orig
tgt=simp
prep=data/wikilarge/prep
tmp=$prep/tmp
orig=data/wikilarge

mkdir -p $prep $tmp

for d in train dev test; do
    for l in $src $tgt; do
        perl $LC < $orig/wikilarge.$d.$l > $tmp/wikilarge.$d.low.$l
    done
done

TRAIN=$tmp/train.wikilarge
BPE_CODE=$prep/code
rm -f $TRAIN
for l in $src $tgt; do
    cat $tmp/wikilarge.train.low.$l >> $TRAIN
done

python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE

for L in $src $tgt; do
    for d in train dev test; do
        echo "apply_bpe.py to wikilarge.${d}.low.${L}..."
        python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/wikilarge.$d.low.$L > $prep/$d.$L
    done
done
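
(Not shown above: the helper variables $LC, $BPEROOT and $BPE_TOKENS. They follow the fairseq translation example scripts, roughly as below; the BPE size here is just a placeholder, not necessarily the value I used.)

SCRIPTS=mosesdecoder/scripts
LC=$SCRIPTS/tokenizer/lowercase.perl   # Moses lowercasing script
BPEROOT=subword-nmt/subword_nmt        # provides learn_bpe.py and apply_bpe.py
BPE_TOKENS=10000                       # number of BPE merge operations to learn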

Then I proceeded to binarize the dataset:

TEXT=data/wikilarge/prep
fairseq-preprocess --source-lang orig --target-lang simp \
  --trainpref $TEXT/train --validpref $TEXT/dev --testpref $TEXT/test \
  --destdir data/wikilarge/bin/

2. Training: For training, I used the same command as in the example provided. I am aware that I'd need to adapt the parameters to suit the dataset, but I thought it was a good starting point.

mkdir -p models/wikilarge/transformer/checkpoints/
CUDA_VISIBLE_DEVICES=0 fairseq-train data/wikilarge/bin \
  -a transformer --optimizer adam --lr 0.0005 -s orig -t simp \
  --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 \
  --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
  --criterion label_smoothed_cross_entropy --max-update 50000 \
  --warmup-updates 4000 --warmup-init-lr '1e-07' \
  --adam-betas '(0.9, 0.98)' --save-dir models/wikilarge/transformer/checkpoints/

3. Generation: As in the example, I executed the following commands:

# Average 10 latest checkpoints:
python scripts/average_checkpoints.py --inputs models/wikilarge/transformer/checkpoints \
   --num-epoch-checkpoints 10 --output models/wikilarge/transformer/checkpoints/model.pt

# Generate
fairseq-generate data/wikilarge/bin \
  --path models/wikilarge/transformer/checkpoints/model.pt \
  --batch-size 128 --beam 5 --remove-bpe

Most output sentences I get are like this:

S-124   the two former presidents were later separately charged with mutiny and treason for their roles in the 1979 coup and the 1980 gwangju massacre .
T-124   the two former presidents were later charged , each on their own , with mutiny and treason for their roles in the 1979 coup and the 1980 gwangju massacre .
H-124   -1.1352218389511108     he was the first woman to win the tour de france .
P-124   -2.4326 -1.1815 -1.0359 -1.1694 -1.9666 -0.0793 -2.0569 -0.5309 -2.4636 -0.2983 -0.0907 -1.3463 -0.1060
S-258   a town may be correctly described as a market town or as having market rights even if it no longer holds a market , provided the right to do so still exists .
T-258   a town may be correctly identified by a market or as having market rights even if it no longer holds a market , provided the right to do so still exists .
H-258   -0.9187995195388794     this is a list of people who live in the city .
P-258   -3.2003 -0.9018 -1.4129 -0.1210 -0.0787 -1.7663 -0.2098 -1.9090 -0.2615 -0.8027 -0.7472 -0.4241 -0.1091

As can be seen, the generated H sentences make no sense, as they are not related at all to the corresponding input.

Am I doing something wrong at training or generation time that causes this? Maybe I am not understanding the parameters properly?

I hope this is the right place to ask this type of question. Thank you.

huihuifan commented 5 years ago

Skimming your provided code, it looks alright. Does the model training look stable? Is the perplexity decreasing? If you decode on the training set instead of on test/valid, can the model produce the target sentences?
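
For the last check, you can point your existing generation command at the training data with --gen-subset, e.g. something like:

fairseq-generate data/wikilarge/bin \
  --path models/wikilarge/transformer/checkpoints/model.pt \
  --gen-subset train --batch-size 128 --beam 5 --remove-bpe

If the H lines are still unrelated to the S lines there, the problem is likely in the data or training rather than in generation.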

myleott commented 5 years ago

In addition to @huihuifan's suggestions, can you also double-check that your $TEXT/train.orig and $TEXT/train.simp have the same number of lines and that they are properly aligned?

I also noticed that you did not control the vocabulary size when running preprocess; you might try setting fairseq-preprocess --nwordssrc $BPE_CODE --nwordstgt $BPE_CODE (...)
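
A quick sanity check for the first point, assuming the paths from your preprocessing script:

wc -l $TEXT/train.orig $TEXT/train.simp               # line counts must match
paste $TEXT/train.orig $TEXT/train.simp | shuf -n 5   # eyeball a few random source/target pairs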

feralvam commented 5 years ago

Thank you for your quick replies.

  1. The perplexity IS decreasing:
| model transformer, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 59748352 (num. trained: 59748352)
| training on 1 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None
| no existing checkpoint found models/wikilarge/transformer/checkpoints/checkpoint_last.pt
| epoch 001:   1000 / 2737 loss=10.638, nll_loss=10.191, ppl=1169.35, wps=12889, ups=4.8, wpb=2611, bsz=109, num_updates=1001, lr=0.0001252, gnorm=2.391, clip=0%, oom=0, wall=207, train_wall=193
| epoch 001:   2000 / 2737 loss=9.936, nll_loss=9.378, ppl=665.34, wps=12548, ups=4.8, wpb=2584, bsz=108, num_updates=2001, lr=0.000250175, gnorm=1.961, clip=0%, oom=0, wall=416, train_wall=391
| epoch 001 | loss 9.623 | nll_loss 9.013 | ppl 516.48 | wps 12445 | ups 4.8 | wpb 2578 | bsz 108 | num_updates 2737 | lr 0.000342157 | gnorm 1.779 | clip 0% | oom 0 | wall 571 | train_wall 538
| epoch 001 | valid on 'valid' subset | valid_loss 8.52098 | valid_nll_loss 7.68015 | valid_ppl 205.10 | num_updates 2737
| epoch 002:   1000 / 2737 loss=8.433, nll_loss=7.627, ppl=197.74, wps=11899, ups=4.6, wpb=2522, bsz=106, num_updates=3738, lr=0.000467257, gnorm=1.141, clip=0%, oom=0, wall=787, train_wall=738
| epoch 002:   2000 / 2737 loss=8.249, nll_loss=7.416, ppl=170.74, wps=11942, ups=4.6, wpb=2557, bsz=108, num_updates=4738, lr=0.000459412, gnorm=1.087, clip=0%, oom=0, wall=1003, train_wall=942
| epoch 002 | loss 8.132 | nll_loss 7.280 | ppl 155.45 | wps 11998 | ups 4.6 | wpb 2578 | bsz 108 | num_updates 5474 | lr 0.000427413 | gnorm 1.055 | clip 0% | oom 0 | wall 1163 | train_wall 1093
| epoch 002 | valid on 'valid' subset | valid_loss 7.65196 | valid_nll_loss 6.66805 | valid_ppl 101.69 | num_updates 5474 | best 7.65196
| epoch 003:   1000 / 2737 loss=7.568, nll_loss=6.633, ppl=99.22, wps=11839, ups=4.4, wpb=2531, bsz=108, num_updates=6475, lr=0.000392989, gnorm=0.987, clip=0%, oom=0, wall=1389, train_wall=1295
| epoch 003:   2000 / 2737 loss=7.493, nll_loss=6.546, ppl=93.44, wps=12108, ups=4.6, wpb=2571, bsz=108, num_updates=7475, lr=0.000365758, gnorm=0.973, clip=0%, oom=0, wall=1600, train_wall=1494
| epoch 003 | loss 7.422 | nll_loss 6.464 | ppl 88.30 | wps 12164 | ups 4.6 | wpb 2578 | bsz 108 | num_updates 8211 | lr 0.000348981 | gnorm 0.974 | clip 0% | oom 0 | wall 1755 | train_wall 1640
| epoch 003 | valid on 'valid' subset | valid_loss 7.14075 | valid_nll_loss 6.06695 | valid_ppl 67.04 | num_updates 8211 | best 7.14075
| epoch 004:   1000 / 2737 loss=7.097, nll_loss=6.089, ppl=68.06, wps=12301, ups=4.5, wpb=2579, bsz=108, num_updates=9212, lr=0.000329475, gnorm=1.008, clip=0%, oom=0, wall=1978, train_wall=1838
| epoch 004:   2000 / 2737 loss=7.053, nll_loss=6.037, ppl=65.68, wps=12269, ups=4.6, wpb=2576, bsz=108, num_updates=10212, lr=0.000312928, gnorm=1.015, clip=0%, oom=0, wall=2188, train_wall=2035
| epoch 004 | loss 7.017 | nll_loss 5.996 | ppl 63.82 | wps 12238 | ups 4.6 | wpb 2578 | bsz 108 | num_updates 10948 | lr 0.000302227 | gnorm 1.018 | clip 0% | oom 0 | wall 2344 | train_wall 2182
| epoch 004 | valid on 'valid' subset | valid_loss 6.8459 | valid_nll_loss 5.71993 | valid_ppl 52.71 | num_updates 10948 | best 6.8459
| epoch 005:   1000 / 2737 loss=6.801, nll_loss=5.746, ppl=53.68, wps=11845, ups=4.4, wpb=2526, bsz=107, num_updates=11949, lr=0.000289291, gnorm=1.072, clip=0%, oom=0, wall=2571, train_wall=2382
| epoch 005:   2000 / 2737 loss=6.771, nll_loss=5.711, ppl=52.38, wps=12056, ups=4.5, wpb=2574, bsz=108, num_updates=12949, lr=0.000277896, gnorm=1.065, clip=0%, oom=0, wall=2785, train_wall=2582
| epoch 005 | loss 6.758 | nll_loss 5.696 | ppl 51.84 | wps 12105 | ups 4.6 | wpb 2578 | bsz 108 | num_updates 13685 | lr 0.00027032 | gnorm 1.071 | clip 0% | oom 0 | wall 2940 | train_wall 2728
| epoch 005 | valid on 'valid' subset | valid_loss 6.72075 | valid_nll_loss 5.56908 | valid_ppl 47.47 | num_updates 13685 | best 6.72075
| epoch 006:   1000 / 2737 loss=6.579, nll_loss=5.488, ppl=44.88, wps=12265, ups=4.4, wpb=2598, bsz=108, num_updates=14686, lr=0.000260945, gnorm=1.080, clip=0%, oom=0, wall=3165, train_wall=2929
| epoch 006:   2000 / 2737 loss=6.576, nll_loss=5.484, ppl=44.76, wps=12589, ups=4.7, wpb=2589, bsz=109, num_updates=15686, lr=0.00025249, gnorm=1.105, clip=0%, oom=0, wall=3365, train_wall=3120
| epoch 006 | loss 6.576 | nll_loss 5.485 | ppl 44.78 | wps 12643 | ups 4.8 | wpb 2578 | bsz 108 | num_updates 16422 | lr 0.000246767 | gnorm 1.117 | clip 0% | oom 0 | wall 3511 | train_wall 3260
| epoch 006 | valid on 'valid' subset | valid_loss 6.58137 | valid_nll_loss 5.40499 | valid_ppl 42.37 | num_updates 16422 | best 6.58137
| epoch 007:   1000 / 2737 loss=6.450, nll_loss=5.339, ppl=40.47, wps=12743, ups=4.7, wpb=2559, bsz=108, num_updates=17423, lr=0.000239573, gnorm=1.163, clip=0%, oom=0, wall=3726, train_wall=3452
| epoch 007:   2000 / 2737 loss=6.450, nll_loss=5.338, ppl=40.46, wps=12840, ups=4.8, wpb=2574, bsz=109, num_updates=18423, lr=0.000232981, gnorm=1.160, clip=0%, oom=0, wall=3926, train_wall=3644
| epoch 007 | loss 6.440 | nll_loss 5.326 | ppl 40.10 | wps 12861 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 19159 | lr 0.000228462 | gnorm 1.166 | clip 0% | oom 0 | wall 4073 | train_wall 3785
| epoch 007 | valid on 'valid' subset | valid_loss 6.48309 | valid_nll_loss 5.2846 | valid_ppl 38.98 | num_updates 19159 | best 6.48309
| epoch 008:   1000 / 2737 loss=6.306, nll_loss=5.172, ppl=36.04, wps=12870, ups=4.7, wpb=2587, bsz=109, num_updates=20160, lr=0.000222718, gnorm=1.180, clip=0%, oom=0, wall=4288, train_wall=3977
| epoch 008:   2000 / 2737 loss=6.322, nll_loss=5.188, ppl=36.47, wps=12877, ups=4.8, wpb=2585, bsz=109, num_updates=21160, lr=0.000217391, gnorm=1.190, clip=0%, oom=0, wall=4488, train_wall=4169
| epoch 008 | loss 6.326 | nll_loss 5.193 | ppl 36.58 | wps 12876 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 21896 | lr 0.000213706 | gnorm 1.201 | clip 0% | oom 0 | wall 4634 | train_wall 4309
| epoch 008 | valid on 'valid' subset | valid_loss 6.43844 | valid_nll_loss 5.23288 | valid_ppl 37.61 | num_updates 21896 | best 6.43844
| epoch 009:   1000 / 2737 loss=6.220, nll_loss=5.070, ppl=33.60, wps=12800, ups=4.7, wpb=2568, bsz=108, num_updates=22897, lr=0.000208983, gnorm=1.242, clip=0%, oom=0, wall=4847, train_wall=4501
| epoch 009:   2000 / 2737 loss=6.234, nll_loss=5.086, ppl=33.96, wps=12811, ups=4.8, wpb=2571, bsz=108, num_updates=23897, lr=0.000204564, gnorm=1.242, clip=0%, oom=0, wall=5048, train_wall=4693
| epoch 009 | loss 6.233 | nll_loss 5.085 | ppl 33.94 | wps 12857 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 24633 | lr 0.000201484 | gnorm 1.241 | clip 0% | oom 0 | wall 5195 | train_wall 4834
| epoch 009 | valid on 'valid' subset | valid_loss 6.37204 | valid_nll_loss 5.14732 | valid_ppl 35.44 | num_updates 24633 | best 6.37204
| epoch 010:   1000 / 2737 loss=6.144, nll_loss=4.982, ppl=31.61, wps=12673, ups=4.7, wpb=2534, bsz=106, num_updates=25634, lr=0.000197511, gnorm=1.279, clip=0%, oom=0, wall=5408, train_wall=5026
| epoch 010:   2000 / 2737 loss=6.146, nll_loss=4.983, ppl=31.62, wps=12814, ups=4.8, wpb=2570, bsz=107, num_updates=26634, lr=0.000193768, gnorm=1.267, clip=0%, oom=0, wall=5609, train_wall=5218
| epoch 010 | loss 6.153 | nll_loss 4.991 | ppl 31.80 | wps 12846 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 27370 | lr 0.000191145 | gnorm 1.275 | clip 0% | oom 0 | wall 5757 | train_wall 5360
| epoch 010 | valid on 'valid' subset | valid_loss 6.34336 | valid_nll_loss 5.11528 | valid_ppl 34.66 | num_updates 27370 | best 6.34336
| epoch 011:   1000 / 2737 loss=6.064, nll_loss=4.888, ppl=29.62, wps=12740, ups=4.7, wpb=2549, bsz=106, num_updates=28371, lr=0.000187743, gnorm=1.309, clip=0%, oom=0, wall=5970, train_wall=5551
| epoch 011:   2000 / 2737 loss=6.078, nll_loss=4.903, ppl=29.93, wps=12829, ups=4.8, wpb=2577, bsz=108, num_updates=29371, lr=0.000184519, gnorm=1.313, clip=0%, oom=0, wall=6171, train_wall=5744
| epoch 011 | loss 6.085 | nll_loss 4.911 | ppl 30.09 | wps 12840 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 30107 | lr 0.000182249 | gnorm 1.310 | clip 0% | oom 0 | wall 6319 | train_wall 5885
| epoch 011 | valid on 'valid' subset | valid_loss 6.29881 | valid_nll_loss 5.06415 | valid_ppl 33.45 | num_updates 30107 | best 6.29881
| epoch 012:   1000 / 2737 loss=6.017, nll_loss=4.832, ppl=28.48, wps=12836, ups=4.7, wpb=2587, bsz=108, num_updates=31108, lr=0.000179293, gnorm=1.317, clip=0%, oom=0, wall=6533, train_wall=6079
| epoch 012:   2000 / 2737 loss=6.022, nll_loss=4.838, ppl=28.60, wps=12854, ups=4.8, wpb=2583, bsz=108, num_updates=32108, lr=0.000176479, gnorm=1.341, clip=0%, oom=0, wall=6733, train_wall=6270
| epoch 012 | loss 6.021 | nll_loss 4.837 | ppl 28.58 | wps 12834 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 32844 | lr 0.000174491 | gnorm 1.342 | clip 0% | oom 0 | wall 6881 | train_wall 6411
| epoch 012 | valid on 'valid' subset | valid_loss 6.2966 | valid_nll_loss 5.05321 | valid_ppl 33.20 | num_updates 32844 | best 6.2966
| epoch 013:   1000 / 2737 loss=5.943, nll_loss=4.747, ppl=26.85, wps=12613, ups=4.6, wpb=2562, bsz=111, num_updates=33845, lr=0.000171891, gnorm=1.368, clip=0%, oom=0, wall=7097, train_wall=6605
| epoch 013:   2000 / 2737 loss=5.959, nll_loss=4.764, ppl=27.17, wps=12500, ups=4.7, wpb=2553, bsz=109, num_updates=34845, lr=0.000169406, gnorm=1.384, clip=0%, oom=0, wall=7303, train_wall=6800
| epoch 013 | loss 5.968 | nll_loss 4.775 | ppl 27.38 | wps 12650 | ups 4.8 | wpb 2578 | bsz 108 | num_updates 35581 | lr 0.000167645 | gnorm 1.373 | clip 0% | oom 0 | wall 7452 | train_wall 6942
| epoch 013 | valid on 'valid' subset | valid_loss 6.28413 | valid_nll_loss 5.03823 | valid_ppl 32.86 | num_updates 35581 | best 6.28413
| epoch 014:   1000 / 2737 loss=5.885, nll_loss=4.678, ppl=25.61, wps=12695, ups=4.6, wpb=2607, bsz=108, num_updates=36582, lr=0.000165336, gnorm=1.369, clip=0%, oom=0, wall=7671, train_wall=7137
| epoch 014:   2000 / 2737 loss=5.898, nll_loss=4.693, ppl=25.86, wps=12745, ups=4.8, wpb=2593, bsz=109, num_updates=37582, lr=0.000163121, gnorm=1.420, clip=0%, oom=0, wall=7873, train_wall=7330
| epoch 014 | loss 5.915 | nll_loss 4.712 | ppl 26.21 | wps 12747 | ups 4.8 | wpb 2578 | bsz 108 | num_updates 38318 | lr 0.000161547 | gnorm 1.426 | clip 0% | oom 0 | wall 8019 | train_wall 7470
| epoch 014 | valid on 'valid' subset | valid_loss 6.24927 | valid_nll_loss 5.00439 | valid_ppl 32.10 | num_updates 38318 | best 6.24927
| epoch 015:   1000 / 2737 loss=5.843, nll_loss=4.630, ppl=24.76, wps=12640, ups=4.7, wpb=2544, bsz=109, num_updates=39319, lr=0.000159477, gnorm=1.425, clip=0%, oom=0, wall=8233, train_wall=7663
| epoch 015:   2000 / 2737 loss=5.859, nll_loss=4.648, ppl=25.07, wps=12776, ups=4.8, wpb=2566, bsz=109, num_updates=40319, lr=0.000157487, gnorm=1.433, clip=0%, oom=0, wall=8434, train_wall=7855
| epoch 015 | loss 5.872 | nll_loss 4.662 | ppl 25.32 | wps 12843 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 41055 | lr 0.000156069 | gnorm 1.437 | clip 0% | oom 0 | wall 8581 | train_wall 7996
| epoch 015 | valid on 'valid' subset | valid_loss 6.25129 | valid_nll_loss 5.00187 | valid_ppl 32.04 | num_updates 41055 | best 6.24927
| epoch 016:   1000 / 2737 loss=5.805, nll_loss=4.586, ppl=24.01, wps=12739, ups=4.8, wpb=2542, bsz=110, num_updates=42056, lr=0.000154201, gnorm=1.476, clip=0%, oom=0, wall=8788, train_wall=8187
| epoch 016:   2000 / 2737 loss=5.820, nll_loss=4.602, ppl=24.28, wps=13033, ups=5.0, wpb=2569, bsz=109, num_updates=43056, lr=0.000152399, gnorm=1.470, clip=0%, oom=0, wall=8982, train_wall=8373
| epoch 016 | loss 5.830 | nll_loss 4.613 | ppl 24.47 | wps 13040 | ups 5.0 | wpb 2578 | bsz 108 | num_updates 43792 | lr 0.000151113 | gnorm 1.464 | clip 0% | oom 0 | wall 9129 | train_wall 8514
| epoch 016 | valid on 'valid' subset | valid_loss 6.23814 | valid_nll_loss 4.9816 | valid_ppl 31.59 | num_updates 43792 | best 6.23814
| epoch 017:   1000 / 2737 loss=5.774, nll_loss=4.548, ppl=23.40, wps=12709, ups=4.7, wpb=2550, bsz=106, num_updates=44793, lr=0.000149415, gnorm=1.486, clip=0%, oom=0, wall=9343, train_wall=8706
| epoch 017:   2000 / 2737 loss=5.787, nll_loss=4.563, ppl=23.65, wps=12796, ups=4.8, wpb=2570, bsz=108, num_updates=45793, lr=0.000147775, gnorm=1.486, clip=0%, oom=0, wall=9544, train_wall=8898
| epoch 017 | loss 5.789 | nll_loss 4.566 | ppl 23.68 | wps 12825 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 46529 | lr 0.000146601 | gnorm 1.493 | clip 0% | oom 0 | wall 9692 | train_wall 9040
| epoch 017 | valid on 'valid' subset | valid_loss 6.23362 | valid_nll_loss 4.97762 | valid_ppl 31.51 | num_updates 46529 | best 6.23362
| epoch 018:   1000 / 2737 loss=5.704, nll_loss=4.469, ppl=22.14, wps=12795, ups=4.7, wpb=2575, bsz=108, num_updates=47530, lr=0.000145049, gnorm=1.505, clip=0%, oom=0, wall=9905, train_wall=9233
| epoch 018:   2000 / 2737 loss=5.740, nll_loss=4.509, ppl=22.76, wps=12805, ups=4.8, wpb=2576, bsz=108, num_updates=48530, lr=0.000143547, gnorm=1.517, clip=0%, oom=0, wall=10106, train_wall=9426
| epoch 018 | loss 5.754 | nll_loss 4.525 | ppl 23.02 | wps 12844 | ups 4.9 | wpb 2578 | bsz 108 | num_updates 49266 | lr 0.000142471 | gnorm 1.529 | clip 0% | oom 0 | wall 10253 | train_wall 9566
| epoch 018 | valid on 'valid' subset | valid_loss 6.20331 | valid_nll_loss 4.94124 | valid_ppl 30.72 | num_updates 49266 | best 6.20331
| epoch 019 | loss 5.671 | nll_loss 4.430 | ppl 21.55 | wps 12874 | ups 4.6 | wpb 2592 | bsz 110 | num_updates 50000 | lr 0.000141421 | gnorm 1.534 | clip 0% | oom 0 | wall 10413 | train_wall 9707
| epoch 019 | valid on 'valid' subset | valid_loss 6.22494 | valid_nll_loss 4.96167 | valid_ppl 31.16 | num_updates 50000 | best 6.20331
| done training in 10417.7 seconds
  2. No. Even when decoding on the training set, the model generates completely unrelated sentences. What could be the cause of this?

  3. The files have the same number of lines and are properly aligned.

I will try the last suggestion given by @myleott and I'll let you know what happens.

feralvam commented 5 years ago

@myleott Sorry, I don't quite understand what you mean by setting fairseq-preprocess --nwordssrc $BPE_CODE --nwordstgt $BPE_CODE. --nwordssrc and --nwordstgt are integer parameters, and $BPE_CODE is a file. Do you mean to set the values of those parameters as the number of lines in the code file?

myleott commented 5 years ago

Yep, I meant the number of codes. Also, that perplexity is quite high. For translation problems we usually get perplexities of ~4-5, so I suspect the model is not well trained.
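
For example, with 10000 BPE codes (substitute the number you actually learned), the preprocessing call would look roughly like:

fairseq-preprocess --source-lang orig --target-lang simp \
  --nwordssrc 10000 --nwordstgt 10000 \
  --trainpref $TEXT/train --validpref $TEXT/dev --testpref $TEXT/test \
  --destdir data/wikilarge/bin/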

feralvam commented 5 years ago

I see. I'll start changing the parameters and see what happens. Any suggestion on what to try first would be welcome; I'm not very experienced in this. This paper uses the transformer (tensor2tensor) on the same data, so I'll try to use their configuration in fairseq as a starting point.

myleott commented 5 years ago

I would try increasing the number of words per batch. Since you're training on 1 GPU, try setting --update-freq 8, which will make the effective batch size 8 times bigger (and simulates training on 8 GPUs).
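
Concretely, that would just be your original training command with the extra flag, e.g.:

CUDA_VISIBLE_DEVICES=0 fairseq-train data/wikilarge/bin \
  -a transformer --optimizer adam --lr 0.0005 -s orig -t simp \
  --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 --update-freq 8 \
  --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
  --criterion label_smoothed_cross_entropy --max-update 50000 \
  --warmup-updates 4000 --warmup-init-lr '1e-07' \
  --adam-betas '(0.9, 0.98)' --save-dir models/wikilarge/transformer/checkpoints/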

sajastu commented 5 years ago

@myleott I'm using fairseq for summarization. You said that a ppl of ~4-5 is considered good for translation. Mine decreased to 14.83 at epoch 14 and then increased to 21.73 by epoch 45. Is that normal?