facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Difference between BLEU and SacreBLEU #637

Closed. AlexGrinch closed this issue 5 years ago

AlexGrinch commented 5 years ago

Hello!

Could you please elaborate on the difference between the BLEU and SacreBLEU scores reported in the Fairseq paper? How can I calculate SacreBLEU, for example, for the output of the DynamicConv model? I can reproduce 29.7 BLEU with fairseq-score, but when I run fairseq-score with the --sacrebleu flag, I get a ridiculously high score of 33.8.

Thanks

edunov commented 5 years ago

In general, SacreBLEU is the number obtained with this script: https://github.com/mjpost/sacreBLEU and BLEU is the number obtained with this one: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl

You can read about the difference between them here: https://arxiv.org/abs/1804.08771
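To make the tokenization point concrete, here is a minimal sketch in plain Python. It is not the code of either script: `ngram_precision` and `split_punct` are hypothetical helpers written for illustration, and the toy tokenizer only stands in for the kind of preprocessing that can differ between setups. It shows that the same hypothesis/reference pair yields a different clipped unigram precision depending on how the text is split into tokens.

```python
import re
from collections import Counter


def ngram_precision(hyp_tokens, ref_tokens, n=1):
    """Clipped (modified) n-gram precision for a single sentence pair."""
    hyp = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


def split_punct(text):
    """Toy tokenizer (illustrative assumption): separate punctuation from words."""
    return re.findall(r"\w+|[^\w\s]", text)


hyp = "Hello, world!"
ref = "Hello, world."

# Whitespace split keeps punctuation glued to the words.
p_whitespace = ngram_precision(hyp.split(), ref.split())
# Splitting punctuation off creates more tokens, and more of them match.
p_punct = ngram_precision(split_punct(hyp), split_punct(ref))

print(p_whitespace, p_punct)
```

Because the score depends on the tokens it is computed over, two BLEU numbers are only comparable when both were produced with the same preprocessing.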

I'm not sure why you're seeing such a ridiculously high BLEU, but maybe you're missing --remove-bpe?
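As a rough illustration of what removing BPE means, the sketch below merges subword pieces back into words. It assumes subword-nmt-style segmentation, where `@@ ` marks a token that continues into the next one. `remove_bpe` here is a hypothetical helper, not fairseq's own implementation, and this is not a confirmed diagnosis of the score above.

```python
def remove_bpe(line, bpe_symbol="@@ "):
    """Merge BPE pieces back into words (assumes '@@ ' continuation markers)."""
    # The appended space lets a trailing '@@' at the end of the line be stripped too.
    return (line + " ").replace(bpe_symbol, "").rstrip()


segmented = "the new@@ est mod@@ el"
merged = remove_bpe(segmented)

print(merged)
print(len(segmented.split()), "tokens before,", len(merged.split()), "tokens after")
```

If the subword pieces are left in place, the scorer counts n-grams over pieces instead of words, so the result is not comparable to a word-level BLEU.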