Not able to reproduce the BLEU score with the saved model for en-de or de-en

I tried reproducing the results with the provided saved model on the newstest corpora found here: https://nlp.stanford.edu/projects/nmt/. I ran the following commands:

python preprocess.py --source-lang ${src} --target-lang ${tgt} --testpref ${raw_text}/newstest2014 --destdir ${data_dir}/data-bin --workers 60 --srcdict ${modelpath}/maskPredict${src}_${tgt}/dict.${src}.txt --tgtdict ${modelpath}/maskPredict${src}_${tgt}/dict.${tgt}.txt

python generate_cmlm.py ${datadir}/data-bin --path ./maskPredict${src}_${tgt}/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 2 --decoding-iterations 10 --dehyphenate --decoding-strategy mask_predict

at first I got a BLEU score of only 9, then I noticed that the dictionary in data-bin was different than the dictionary in the saved model so I manually removed the "finalize" call when saving the dictionary from preproess.py (it seemed like an optimization only anyway). This made the dictionaries identical, but the BLEU score was still only 16. Looking manually at the results, they seem reasonable in the sense that there doesn't seem to be a discrepancy between vocabulary of trained model and vocab used in evaluation. The one issue that is salient is the fact that there are around 10% UNK tokens in both source and target. 10% UNKs seems rather high.

Does anyone know what the possible issue is? Is it possible that the uploaded model is not the one that got the BLEU scores reported in the paper?

facebookresearch / Mask-Predict

Not able to reproduce the BLEU score with the saved model for en-de or de-en #10