facebookresearch / Mask-Predict

A masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation.
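
For context, the sketch below illustrates this objective: a random subset of target positions is replaced with a mask token, and the model is trained to predict exactly those positions, conditioned on the source and the remaining unmasked target tokens. This is a minimal illustration under assumed names (`mask_target_subset`, `mask_idx`, and `pad_idx` are hypothetical), not the repository's actual implementation.

import torch

def mask_target_subset(tgt_tokens, mask_idx, pad_idx):
    """Illustrative sketch of the masked-prediction objective: pick a
    random subset of target positions, replace them with the mask token,
    and keep the original tokens as prediction targets only there.
    Hypothetical helper, not this repository's implementation."""
    lengths = (tgt_tokens != pad_idx).sum(dim=1)          # true length per sentence
    masked = tgt_tokens.clone()                           # partially masked decoder input
    targets = torch.full_like(tgt_tokens, pad_idx)        # loss is ignored at pad_idx
    for i, n in enumerate(lengths.tolist()):
        k = torch.randint(1, n + 1, (1,)).item()          # how many tokens to mask
        positions = torch.randperm(n)[:k]                 # which positions to mask
        targets[i, positions] = tgt_tokens[i, positions]  # predict only the masked tokens
        masked[i, positions] = mask_idx
    return masked, targets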

Why is the BLEU obtained from the provided trained model much higher than the value in the paper? #11

Open · PanXiebit opened this issue 4 years ago

PanXiebit commented 4 years ago

I downloaded the provided trained model and tested it on the test set, but I get a much higher BLEU score than the values reported in the paper.

I use the provided scripts without changing anything:

python preprocess.py \
  --source-lang de \
  --target-lang en \
  --trainpref data/wmt14.en-de/train \
  --validpref data/wmt14.en-de/valid \
  --testpref data/wmt14.en-de/test \
  --destdir output/data-bin/wmt14.de-en \
  --srcdict output/maskPredict_de_en/dict.de.txt \
  --tgtdict output/maskPredict_de_en/dict.en.txt

python generate_cmlm.py output/data-bin/wmt14.${src}-${tgt}  \
    --path ${model_dir}/checkpoint_best.pt \
    --task translation_self \
    --remove-bpe True \
    --max-sentences 20 \
    --decoding-iterations ${iteration} \
    --decoding-strategy mask_predict 

I get 34.42 on WMT14 DE->EN, 35.20 on WMT16 EN->RO, and 35.62 on WMT16 RO->EN. These values are much higher than those reported in the original paper. This is strange; what happened?
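
As a sanity check when comparing numbers like these against the paper, one could rescore the detokenized output with sacreBLEU, which is independent of the tokenized BLEU that the generate script prints; tokenization and BPE-handling differences are a common source of BLEU mismatches. A minimal sketch, assuming the hypotheses and references have been extracted to plain-text files hyp.txt and ref.txt (hypothetical names), one sentence per line:

import sacrebleu

# hyp.txt / ref.txt are hypothetical file names: detokenized system
# output and the matching reference, one sentence per line.
with open("hyp.txt") as f:
    hyps = [line.strip() for line in f]
with open("ref.txt") as f:
    refs = [line.strip() for line in f]

# corpus_bleu takes the system stream and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"sacreBLEU: {bleu.score:.2f}")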