facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Clarification on your paper's reported scores #435

Closed vince62s closed 5 years ago

vince62s commented 5 years ago

Guys, I've got a simple question. In Table 2, you report 29.3 BLEU for EN-DE, compared with 28.4 from Vaswani's original paper. Unless I am mistaken, all scores reported by the original paper are detokenized scores, as is most often the case in WMT runs. If so, it should be compared to the 28.6 SacreBLEU score (which is similar to multi-bleu-detok.pl). I have not checked the other paper you cite, Shaw et al. 2018. Cheers,

myleott commented 5 years ago

We only report detokenized BLEU scores in that paper :)

The 29.3 figure is detokenized BLEU, but with compound splitting, which is what Vaswani's original paper used as well. See https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/get_ende_bleu.sh.
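For concreteness, here is a minimal Python sketch of the hyphen splitting that get_ende_bleu.sh performs; the original is a perl one-liner applied to both hypotheses and references before multi-bleu.perl:

```python
import re

def split_compounds(line: str) -> str:
    """Mirror the perl rule s{(\\S)-(\\S)}{$1 ##AT##-##AT## $2}g from
    get_ende_bleu.sh: hyphenated compounds are split so that each
    part is scored as a separate token."""
    return re.sub(r"(\S)-(\S)", r"\1 ##AT##-##AT## \2", line)

print(split_compounds("ein U-Boot"))
# ein U ##AT##-##AT## Boot
```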

The 28.6 sacrebleu is without compound splitting.
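For reference, this is roughly how that detokenized score is computed with the sacrebleu Python API (a sketch with toy data; sacrebleu applies its own mteval-v13a-style tokenization internally, so no compound splitting or Moses tokenization is applied to the inputs):

```python
import sacrebleu  # pip install sacrebleu

# Detokenized system outputs and raw references, one sentence per line.
hyps = ["The submarine surfaced near the coast."]      # toy data
refs = ["The submarine surfaced close to the coast."]  # toy data

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(bleu.score)  # corpus-level BLEU on sacrebleu's own tokenization
```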

vince62s commented 5 years ago

oh ok, then why this comment at the end of 3.1: "WMT data for each Paracrawl language pair. We measure case-sensitive tokenized BLEU with multi-bleu.pl and de-tokenized BLEU with SacreBLEU (Post, 2018)"

myleott commented 5 years ago

Yes, it's compound split, so it's technically "tokenized", but really it's detokenized with compound splitting applied in postprocessing.

vince62s commented 5 years ago

well, besides the compound split, it is tokenized in the sense of Moses tokenization, which is different from the NIST one (hence the sacrebleu one). But if the original paper used the same, I better understand the difficulty of exactly matching the reported scores, now that people mostly use the official NIST/sacrebleu scoring, which embeds its own tokenization.
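To make the distinction concrete, here is a small comparison of the two tokenizers (a sketch; it assumes the sacremoses package and a sacrebleu version ≥ 2.0 where the 13a tokenizer is importable directly):

```python
from sacremoses import MosesTokenizer  # pip install sacremoses
from sacrebleu.tokenizers.tokenizer_13a import Tokenizer13a

line = "It's a well-known trade-off, isn't it?"

moses = MosesTokenizer(lang="en").tokenize(line, return_str=True, escape=False)
nist = Tokenizer13a()(line)

print(moses)  # Moses splits contractions, e.g. "It 's ... isn 't it ?"
print(nist)   # 13a (NIST-style) keeps apostrophes attached: "It's ... isn't it ?"
```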

myleott commented 5 years ago

Ah sorry, you're right. I just checked the WMT'16 data in the Google Drive (which is what we used as well) and it's already tokenized :/ I misremembered that version as being raw.

To be clear, here's what we reported:

Direct comparison to Vaswani et al. (reported in our paper as tokenized BLEU)

1) We downloaded the preprocessed WMT'16 data from the Google Drive link in tensor2tensor. No further preprocessing is applied. This is detailed here: https://github.com/pytorch/fairseq/tree/master/examples/translation#replicating-results-from-scaling-neural-machine-translation.
2) Train a model.
3) Remove BPE, apply compound splitting, and report BLEU (see the sketch below). This is tokenized BLEU because the underlying WMT'16 data (both train and test) was already tokenized by Google.
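A minimal sketch of the postprocessing in step 3, assuming the "@@ " BPE continuation marker that fairseq's generation output uses by default (the input string is a hypothetical toy example):

```python
import re

def remove_bpe(line: str, bpe_symbol: str = "@@ ") -> str:
    # Undo subword segmentation: "sub@@ word" -> "subword".
    return (line + " ").replace(bpe_symbol, "").rstrip()

def split_compounds(line: str) -> str:
    # Same hyphen rule as get_ende_bleu.sh (see the sketch above).
    return re.sub(r"(\S)-(\S)", r"\1 ##AT##-##AT## \2", line)

hyp = "das U@@ -Boot tau@@ chte auf"
print(split_compounds(remove_bpe(hyp)))
# das U ##AT##-##AT## Boot tauchte auf
```

The resulting text is then scored against the (already tokenized) references with multi-bleu.perl.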

Detokenized BLEU computed via sacrebleu

1) Take the above model.
2) Remove BPE, apply some hand-crafted regex to try to remove tokenization, pass through sacrebleu, and report detokenized BLEU (a hypothetical illustration follows).
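The hand-crafted regexes aren't shown in this thread, so the following is only a hypothetical illustration of that kind of rule-based detokenization:

```python
import re

def crude_detokenize(line: str) -> str:
    # Hypothetical rules; the actual regexes used for the paper are
    # not given in this thread.
    line = re.sub(r" ([.,;:!?%)\]])", r"\1", line)            # attach punctuation left
    line = re.sub(r"([(\[]) ", r"\1", line)                   # attach brackets right
    line = re.sub(r" ' (s|re|ve|ll|m|d|t)\b", r"'\1", line)   # rejoin contractions
    return line

print(crude_detokenize("Hello , world ! It ' s tokenized ."))
# Hello, world! It's tokenized.
```

The detokenized output is then passed to sacrebleu as in the earlier snippet.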