facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Unable to reproduce the machine translation results (WMT16) reported by Ott et al. (2018), Wu et al. (2019), and Fan et al. (2020) #3828

Closed. netw0rkf10w closed this issue 2 years ago.

netw0rkf10w commented 3 years ago

❓ Questions and Help

I have been trying, and failing, to reproduce the results on the WMT16 En-De translation task reported in Ott et al. (2018), Scaling Neural Machine Translation, which were later used as a baseline in subsequent work: Wu et al. (2019), Pay Less Attention with Lightweight and Dynamic Convolutions, and Fan et al. (2020), Reducing Transformer Depth on Demand with Structured Dropout.

I closely followed the instructions in the documentation to train a transformer_vaswani_wmt_en_de_big model on 16 GPUs, using the following command:

fairseq-train \
    --distributed-port 12345 \
    --distributed-world-size 16 \
    wmt16/data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big \
    --share-all-embeddings \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 0.0 \
    --lr 0.001 \
    --lr-scheduler inverse_sqrt \
    --warmup-updates 4000 \
    --warmup-init-lr 1e-07 \
    --dropout 0.3 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 3584 \
    --update-freq 8 \
    --fp16

After 20 hours of training (>150k updates), and using checkpoint averaging as instructed, I obtained only 27.79 BLEU4 (compound split) instead of the reported 29.29.
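For completeness, the averaging and scoring steps I used follow the Scaling NMT example in this repo; roughly the following, where the checkpoint path is a placeholder and the number of averaged checkpoints (10) is taken from that example:

# average the last 10 epoch checkpoints
python scripts/average_checkpoints.py \
    --inputs /path/to/checkpoints \
    --num-epoch-checkpoints 10 \
    --output checkpoint.avg10.pt

# generate with the averaged model and score with compound splitting
fairseq-generate \
    wmt16/data-bin/wmt16_en_de_bpe32k \
    --path checkpoint.avg10.pt \
    --beam 4 --lenpen 0.6 --remove-bpe > gen.out
bash scripts/compound_split_bleu.sh gen.out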

Since it is not clear how many training updates Ott et al. (2018) and Fan et al. (2020) used in their experiments, I followed Wu et al. (2019) and used exactly the same training configuration as theirs (see the screenshot below):

[Screenshot: training hyperparameters reported by Wu et al. (2019)]

Therefore, I trained the model on 8 GPUs using the following command:

fairseq-train \
    --distributed-port 12345 \
    --distributed-world-size 8 \
    wmt16/data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big \
    --share-all-embeddings \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 0.0 \
    --lr 0.001 \
    --lr-scheduler inverse_sqrt \
    --warmup-updates 10000 \
    --max-update 30000 \
    --warmup-init-lr 1e-7 \
    --dropout 0.3 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 3584 \
    --update-freq 16 \
    --fp16

(Note that Wu et al. used 10K warmup steps and trained the model for only 30K updates. With 8 GPUs and --update-freq 16, the effective batch size matches the 16-GPU configuration above: 8 x 16 = 16 x 8 = 128 accumulated per-GPU batches per update.)

Using checkpoint averaging and a beam size of 8, I obtained only 27.53 BLEU.
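(Concretely, the generation step for this run was along the following lines; only the beam size differs from the sketch above, and the 0.6 length penalty is an assumption carried over from the Scaling NMT example:)

fairseq-generate \
    wmt16/data-bin/wmt16_en_de_bpe32k \
    --path checkpoint.avg10.pt \
    --beam 8 --lenpen 0.6 --remove-bpe > gen.out
bash scripts/compound_split_bleu.sh gen.out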

Could you please check this?

What's your environment?

netw0rkf10w commented 3 years ago

I've realized that there have been some major changes to the Transformer implementation recently; this might be the cause...

netw0rkf10w commented 3 years ago

@myleott @huihuifan Could you please have a look at this issue?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!