facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Reproduce results from mBART paper on IWSLT15 En-Vi dataset #3410

Open · longdct opened this issue 3 years ago

longdct commented 3 years ago


What is your question?

I tried to follow the mBART example and fine-tune on the IWSLT15 En-Vi dataset to reproduce the results from the paper. However, after running multiple times with different seeds, I only obtained about 34.1 BLEU for En->Vi and 34.6 BLEU for Vi->En, compared with 36.1 and 35.4 reported in the paper.

Code

# Download data
wget https://raw.githubusercontent.com/tensorflow/nmt/master/nmt/scripts/download_iwslt15.sh
sh download_iwslt15.sh
# Apply sentencepiece
spm_encode --model=$MODEL < $DATA/$TRAIN.$SRC > $DATA/$TRAIN.spm.$SRC &
spm_encode --model=$MODEL < $DATA/$TRAIN.$TGT > $DATA/$TRAIN.spm.$TGT &
spm_encode --model=$MODEL < $DATA/$VALID.$SRC > $DATA/$VALID.spm.$SRC &
spm_encode --model=$MODEL < $DATA/$VALID.$TGT > $DATA/$VALID.spm.$TGT &
spm_encode --model=$MODEL < $DATA/$TEST.$SRC > $DATA/$TEST.spm.$SRC &
spm_encode --model=$MODEL < $DATA/$TEST.$TGT > $DATA/$TEST.spm.$TGT &
wait  # all six encodes run in the background; wait for them before binarizing
fairseq-preprocess \
  --source-lang $SRC \
  --target-lang $TGT \
  --trainpref $DATA/$TRAIN.spm \
  --validpref $DATA/$VALID.spm \
  --testpref $DATA/$TEST.spm \
  --destdir $DEST/$NAME \
  --thresholdtgt 0 \
  --thresholdsrc 0 \
  --workers 20 \
  --srcdict $DICT \
  --tgtdict $DICT
fairseq-train $DEST/$NAME \
  --encoder-normalize-before --decoder-normalize-before \
  --arch mbart_large --layernorm-embedding \
  --task translation_from_pretrained_bart \
  --source-lang $SRC --target-lang $TGT \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
  --lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 2500 --total-num-update 40000 \
  --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
  --max-tokens 1024 --update-freq 2 \
  --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints \
  --seed 222 --log-format simple --log-interval 2 \
  --restore-file $PRETRAIN \
  --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
  --langs $langs \
  --ddp-backend legacy_ddp
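
The generation step itself is not shown in the post; a minimal sketch of it, modeled on the mBART example's fairseq-generate command (the checkpoint path, the gen.out filename, and the reuse of $MODEL and $langs here are assumptions, not the poster's exact invocation):

# Sketch of the generation step (not in the original post), following the mBART example.
# checkpoints/checkpoint_best.pt is fairseq-train's default save location (assumed here).
fairseq-generate $DEST/$NAME \
  --path checkpoints/checkpoint_best.pt \
  --task translation_from_pretrained_bart \
  --gen-subset test \
  -s $SRC -t $TGT \
  --bpe 'sentencepiece' --sentencepiece-model $MODEL \
  --sacrebleu --remove-bpe 'sentencepiece' \
  --batch-size 32 --langs $langs > gen.out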

Since fairseq-generate from the example outputs hypothesis sentences without spaces (issue #3103), I log the output of fairseq-generate, extract the sentences from the S, T, and D lines, and calculate BLEU using sacrebleu.
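
For reference, a minimal sketch of that extraction and scoring step, assuming the generate log was saved to gen.out (as in the sketch above): fairseq-generate writes the reference on T-<id> lines (field 2) and the detokenized hypothesis on D-<id> lines (field 3), so the pairs can be re-aligned by id and scored with sacrebleu.

# Extract aligned references and detokenized hypotheses from the generate log,
# then score with sacrebleu (references as file argument, hypotheses on stdin).
grep ^T gen.out | LC_ALL=C sort -V | cut -f2 > ref.txt   # T-<id> <tab> reference
grep ^D gen.out | LC_ALL=C sort -V | cut -f3 > hyp.txt   # D-<id> <tab> score <tab> hypothesis
sacrebleu ref.txt < hyp.txt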

What's your environment?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!