facebookresearch / Mask-Predict

A masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation.

Not able to reproduce the BLEU score with the saved model for en-de or de-en #10

Closed mdragus closed 4 years ago

mdragus commented 4 years ago

I tried reproducing the results with the provided saved model on the newstest corpora found here: https://nlp.stanford.edu/projects/nmt/. I ran the following commands:

python preprocess.py --source-lang ${src} --target-lang ${tgt} --testpref ${raw_text}/newstest2014 --destdir ${data_dir}/data-bin --workers 60 --srcdict ${modelpath}/maskPredict${src}_${tgt}/dict.${src}.txt --tgtdict ${modelpath}/maskPredict${src}_${tgt}/dict.${tgt}.txt

python generate_cmlm.py ${data_dir}/data-bin --path ./maskPredict${src}_${tgt}/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 2 --decoding-iterations 10 --dehyphenate --decoding-strategy mask_predict

At first I got a BLEU score of only 9. Then I noticed that the dictionary in data-bin was different from the dictionary in the saved model, so I manually removed the "finalize" call made when saving the dictionary in preprocess.py (it seemed to be only an optimization anyway). This made the dictionaries identical, but the BLEU score was still only 16. Looking through the output manually, the translations seem reasonable, in the sense that there doesn't seem to be a discrepancy between the vocabulary of the trained model and the vocabulary used in evaluation. The one salient issue is that around 10% of the tokens in both source and target are UNK, which seems rather high.
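For anyone who wants to repeat the two checks described above (whether two dictionaries are identical, and what fraction of tokens fall outside the vocabulary), here is a minimal sketch. It assumes the dictionary files contain one "<symbol> <count>" entry per line and that the text is already whitespace-tokenized. The function names are hypothetical and are not part of this repository, and special symbols that the toolkit adds implicitly are not handled.

```python
from typing import Iterable, List, Set


def load_symbols(dict_lines: Iterable[str]) -> List[str]:
    """Return the dictionary symbols in file order.

    Assumption: each non-empty line looks like "<symbol> <count>".
    """
    symbols = []
    for line in dict_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Keep everything before the last space as the symbol.
        symbols.append(line.rsplit(" ", 1)[0])
    return symbols


def same_dictionary(a: Iterable[str], b: Iterable[str]) -> bool:
    """True if both dictionaries list the same symbols in the same order.

    Order matters because a symbol's position determines its integer id.
    """
    return load_symbols(a) == load_symbols(b)


def unk_rate(text_lines: Iterable[str], vocab: Set[str]) -> float:
    """Fraction of whitespace-separated tokens that are not in vocab."""
    total = 0
    unknown = 0
    for line in text_lines:
        for token in line.split():
            total += 1
            if token not in vocab:
                unknown += 1
    return unknown / total if total else 0.0


if __name__ == "__main__":
    # Toy in-memory data; with real files, pass open(path, encoding="utf-8").
    model_dict = ["the 10", "cat 5", "s@@ 3"]
    binarized_dict = ["cat 5", "the 10", "s@@ 3"]
    test_text = ["the cat s@@ at", "the dog"]

    print("identical dictionaries:", same_dictionary(model_dict, binarized_dict))
    rate = unk_rate(test_text, set(load_symbols(model_dict)))
    print(f"UNK rate: {rate:.1%}")
```

Running the UNK check on the text before binarizing it is a quick way to see whether the text was prepared the same way as the data the dictionary was built from.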

Does anyone know what the possible issue is? Is it possible that the uploaded model is not the one that got the BLEU scores reported in the paper?

mdragus commented 4 years ago

Actually, if you download the data referenced by get_data.sh, you can reproduce the results from the paper almost perfectly: https://github.com/facebookresearch/Mask-Predict/blob/master/get_data.sh. It is unfortunate that there seems to be a pre-tokenization step that is not included in this repository.