Possibly wrong sentBLEU scores for tedtalks dataset

sted19 commented 2 years ago

Trying to reproduce the correlations obtained in the paper (recomputing actual metric scores) for BLEU, I noticed some strange values in the file with BLEU scores at the segment level. There are pairs of candidate-reference sentences that are identical, whose BLEU score should be 100.0 (and it is, when recomputed), but the score that is in BLEU.seg.score.gz is 0.0.

As an example, take a specific language pair and system, say zh-en and metricsystem5. In BLEU.seg.score.gz, the scores of segments 842 and 843, when scored against ref-B, are both 0.0. The actual candidate and reference sentences, though, are very short and identical:

842 --> candidate: "Thank you." - reference: "Thank you."
843 --> candidate: "(Applause)" - reference: "(Applause)"

For these two examples, computing BLEU with sacrebleu returns 100.0. As a consequence, the segment-level correlations that I obtain for sentBLEU on this dataset differ from those reported in the results paper.

ricardorei commented 2 years ago

Hmmm, I just looked at those examples and the score is not 0.

[Screenshot, 2022-03-21 at 17:05:10]

ricardorei commented 2 years ago

ahh wait, you mean in the TED Talks?

ricardorei commented 2 years ago

Alright, I found the error.

In the script that I used to extract the baseline results, I used the corpus_bleu function to compute sentence-level BLEU scores. corpus_bleu returns 0 for those cases (probably because there are no 4-gram matches: the segments are too short to contain any 4-grams).
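The zero can be illustrated with a simplified, unsmoothed BLEU written from scratch. This is only a sketch and not sacrebleu's code: `toy_bleu`, its `effective_order` flag, and the three-token split of "Thank you." are assumptions made for illustration. It shows that when a missing n-gram order is counted as zero precision, an identical three-token segment scores 0, while averaging only over the orders that exist gives 100.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def toy_bleu(candidate, reference, max_order=4, effective_order=False):
    """Simplified single-reference BLEU on pre-tokenized input, no smoothing."""
    precisions = []
    for n in range(1, max_order + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        if total == 0:
            # The candidate is too short to contain any n-gram of this order.
            if effective_order:
                break  # average only over the orders that exist
            precisions.append(0.0)  # otherwise the missing order counts as zero
            continue
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        precisions.append(matched / total)
    if not precisions or min(precisions) == 0.0:
        return 0.0
    # Brevity penalty: 1 when the candidate is at least as long as the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return 100.0 * bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))


# Assumed tokenization of segment 842: "Thank you." -> three tokens.
segment = ["Thank", "you", "."]
print(toy_bleu(segment, segment))                        # 0.0
print(toy_bleu(segment, segment, effective_order=True))  # 100.0
```

"(Applause)" behaves the same way if it is split into "(", "Applause", ")": three tokens, so no 4-grams exist on either side.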

This is not a problem for system-level scores or other baseline metrics. I am adding a link to the script here.

To run the script you have to install mt-telescope.

The sacrebleu version that we used was 1.5.

Basically, this might have an impact on segment-level results in the Appendix tables.

ricardorei commented 2 years ago

@sted19 thanks for finding and reporting this! Please tell me if you are having problems reproducing anything else or if you find something suspicious.

WMT-Metrics-task / wmt21-metrics-data, issue #1