facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

How to train an en2de model on the IWSLT dataset #1134

Closed · jzhoubu closed this issue 5 years ago

jzhoubu commented 5 years ago

Hi,

I was trying to train an English-to-German model on the IWSLT dataset. I trained the model for about 80 epochs (the parameters are shown below), but during inference I obtain weird results like the ones below:

S-1411  you see ?
T-1411  ja ?
H-1411  -2.637051582336426      , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
P-1411  -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -2.6324 -3.5593

I preprocess the data and train the model using the code below:

SRC_LANG=en
TGT_LANG=de  
SOURCE_DATA=$FAIRSEQ_DIR/examples/translation/iwslt14.tokenized.de-en

fairseq-preprocess --source-lang $SRC_LANG --target-lang $TGT_LANG \
    --trainpref $SOURCE_DATA/train --validpref $SOURCE_DATA/valid --testpref $SOURCE_DATA/test \
    --destdir $FAIRSEQ_DIR/data-bin/iwslt14.tokenized.${SRC_LANG}-${TGT_LANG} \
    --workers 20

ARCH=transformer_iwslt_de_en
DATA_BIN=$FAIRSEQ_DIR/data-bin/iwslt14.tokenized.${SRC_LANG}-${TGT_LANG}
CKPT_DIR=$FAIRSEQ_DIR/checkpoints/iwslt14-${ARCH}-${SRC_LANG}-${TGT_LANG}

fairseq-train \
    $DATA_BIN \
    --arch $ARCH --share-decoder-input-output-embed \
    --save-dir $CKPT_DIR \
    --source-lang $SRC_LANG \
    --target-lang $TGT_LANG \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-3 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --save-interval 2
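
After training, I run inference with fairseq-generate along these lines (the checkpoint file name and beam settings here are illustrative, not my exact invocation):

# Illustrative generation command; checkpoint_best.pt and the beam/batch
# settings are placeholders rather than the exact values from my run.
fairseq-generate $DATA_BIN \
    --path $CKPT_DIR/checkpoint_best.pt \
    --source-lang $SRC_LANG --target-lang $TGT_LANG \
    --batch-size 128 --beam 5 --remove-bpe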

Everything goes fine when I train the de2en model, i.e. SRC_LANG=de and TGT_LANG=en. However, when I train this en2de model, I get the same weird result for every sentence, as shown above. I am wondering whether I have made a mistake in the parameter settings for training the reversed direction on the original dataset.

huihuifan commented 5 years ago

I think your learning rate is too high; try 5e-4:

CUDA_VISIBLE_DEVICES=0 fairseq-train \
    data-bin/iwslt14.tokenized.de-en \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096
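
To see why this matters: with the inverse_sqrt scheduler, --lr is the peak rate reached at the end of warmup, after which it decays with the inverse square root of the update number, so 5e-3 peaks 10x higher than 5e-4. A quick sketch of the post-warmup decay (my reading of the scheduler; values are illustrative):

# Approximate post-warmup schedule: lr(t) = peak_lr * sqrt(warmup_updates / t).
# peak_lr=5e-4 and warmup_updates=4000 match the command above; treat as a sketch.
awk 'BEGIN { for (t = 4000; t <= 20000; t += 4000)
               printf "update %6d  lr = %.6f\n", t, 0.0005 * sqrt(4000 / t) }'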

jzhoubu commented 5 years ago

Problem solved. I re-trained the model with lr=5e-4, reaching BLEU4=9.04 at epoch 5, and so far I don't see any weird results.