I've realized that there have been some major changes in the Transformer implementation recently; this might be the cause...
@myleott @huihuifan Could you please have a look at this issue?
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!
❓ Questions and Help
I have been trying and failing to reproduce the results on the WMT16 En-De translation task that were reported in Ott et al. (2018), Scaling Neural Machine Translation, and later used as a baseline in Wu et al. (2019), Pay Less Attention with Lightweight and Dynamic Convolutions, and Fan et al. (2020), Reducing Transformer Depth on Demand with Structured Dropout.
I followed closely the instructions in the documentation to train a `transformer_vaswani_wmt_en_de_big` model on 16 GPUs using the command below. After 20 hours of training (>150k steps) and checkpoint averaging (as instructed), I obtained only 27.79 BLEU4 (compound split) instead of the reported 29.29.
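For concreteness, here is a sketch of that command, following the scaling NMT example in the fairseq documentation; the data path and save directory are placeholders rather than my exact setup:

```bash
# Sketch of the documented recipe for transformer_vaswani_wmt_en_de_big
# (data path and save directory are placeholders).
fairseq-train \
    data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 --fp16 \
    --save-dir checkpoints/wmt16_en_de_big
```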
As it is not clear how many training steps Ott et al. (2018) and Fan et al. (2020) used in their experiments, I followed Wu et al. (2019) and made sure to use exactly the same training configuration as theirs.
Therefore, I trained the model on 8 GPUs using the following command:
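This was along the lines of the sketch below, with Wu et al.'s 10K warmup updates and 30K total updates; the learning rate, `--update-freq`, and paths are assumptions rather than values taken from their paper:

```bash
# Sketch of the 8-GPU run with Wu et al. (2019)'s warmup/step budget
# (learning rate, --update-freq, and paths are assumptions).
fairseq-train \
    data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.001 --lr-scheduler inverse_sqrt \
    --warmup-init-lr 1e-07 --warmup-updates 10000 \
    --max-update 30000 --update-freq 16 \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 --fp16 \
    --keep-last-epochs 10 \
    --save-dir checkpoints/wmt16_en_de_big_8gpu
```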
(Note that Wu et al. used 10K warmup steps and trained the model only for 30K steps.)
Using checkpoint averaging and a beam size of 8, I obtained only 27.53 BLEU.
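The averaging and evaluation steps were roughly as follows; the checkpoint directory, the number of averaged checkpoints, and the length penalty are assumptions, while the rest follows the scaling NMT example:

```bash
# Average the last checkpoints and evaluate with compound-split BLEU
# (checkpoint directory and number of averaged checkpoints are assumptions).
python scripts/average_checkpoints.py \
    --inputs checkpoints/wmt16_en_de_big_8gpu \
    --num-epoch-checkpoints 10 \
    --output checkpoints/checkpoint.avg10.pt

fairseq-generate data-bin/wmt16_en_de_bpe32k \
    --path checkpoints/checkpoint.avg10.pt \
    --beam 8 --lenpen 0.6 --remove-bpe > gen.out
bash scripts/compound_split_bleu.sh gen.out
```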
Could you please check this?
What's your environment?

- How you installed fairseq (`pip`, source): source (`pip install --editable ./`)