ELITR / SLTev

SLTev is a tool for comprehensive evaluation of (simultaneous) spoken language translation.

Time-span BLEU vs. sacreBLEU #4

Closed obo closed 1 year ago

obo commented 3 years ago

NLTK BLEU is strangely low for mwerSegmented output. On the other hand, sacreBLEU is strangely low for the span-based scoring. This needs to be explained. (Add Mohammad to this issue once he accepts the invitation.)

obo commented 3 years ago

I know that we are dropping NLTK BLEU, but I would still like this discrepancy to be resolved. Can you put the explanation here as comments?

obo commented 3 years ago

@mohammad2928 Could you please use the discussion here to add an example of such a discrepancy and explain why the scores differ?

mohammad2928 commented 3 years ago

> NLTK BLEU is strangely low for mwerSegmented output.

When we used mwerSegmenter, its output sometimes contained empty lines, and those were the cause of the problem. Since the number of lines in the hypothesis and reference is equal, we used corpus BLEU from both the NLTK and sacrebleu modules. See the following example: when the hypothesis contains an empty line, the NLTK BLEU score comes out low.

```python
hypo = [
    ['It', 'is', 'a', 'guide', 'to', 'action', 'which', 'ensures', 'that', 'the', 'military', 'always'],
    []
]
ref = [
    ['It', 'is', 'a', 'guide', 'to', 'action', 'that', 'ensures', 'that', 'the', 'military', 'will'],
    ['forever', 'heed', 'Party', 'commands'],
]
```

Then we have:

```
NLTK bleu 41.441088962133435
sacre bleu 45.43142611141303
```
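Part of the drop caused by the empty line can be seen directly in the corpus-level brevity penalty: the empty hypothesis segment contributes no tokens, while its non-empty reference still counts toward the total reference length, so the corpus looks "too short". This is a minimal pure-Python sketch using the standard corpus-level brevity-penalty formula (the remaining gap between the NLTK and sacrebleu numbers presumably comes from their different smoothing and tokenization choices):

```python
from math import exp

def brevity_penalty(hyp_len, ref_len):
    # Standard corpus-level BLEU brevity penalty.
    return 1.0 if hyp_len >= ref_len else exp(1 - ref_len / hyp_len)

hypo = [
    ['It', 'is', 'a', 'guide', 'to', 'action', 'which', 'ensures', 'that', 'the', 'military', 'always'],
    []  # empty segment: contributes 0 tokens to the hypothesis length
]
ref = [
    ['It', 'is', 'a', 'guide', 'to', 'action', 'that', 'ensures', 'that', 'the', 'military', 'will'],
    ['forever', 'heed', 'Party', 'commands'],  # but its reference still counts
]

hyp_len = sum(len(seg) for seg in hypo)  # 12
ref_len = sum(len(seg) for seg in ref)   # 16
print(brevity_penalty(hyp_len, ref_len))  # exp(1 - 16/12), roughly 0.72
```

So even with perfect n-gram precisions, the empty line alone caps the corpus score at about 72% of its value.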

> sacreBLEU is strangely low for the span-based scoring.

On the other hand, in the time-span method the hypothesis is sometimes very short, and for short hypotheses the sacrebleu score can be exactly zero: without smoothing, a single zero higher-order n-gram precision makes the whole geometric mean zero. For more information, see: https://github.com/mjpost/sacreBLEU/issues/48#issuecomment-534731984

Example:

```python
hypo = ['It', 'is', 'a', 'guide', 'to', 'action']
ref = ['It', 'is', 'is' 'a', 'guide', 'to', 'action', 'that']
```

```
NLTK bleu 32.44629234142678
sacre bleu 0.0
```
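The zero can be reproduced by hand. BLEU is the geometric mean of the 1- to 4-gram modified precisions, so one zero precision zeroes the whole score. Note that in the example above the adjacent string literals `'is' 'a'` concatenate into the single token `'isa'`, so the hypothesis shares no 4-gram with the reference. This is a simplified pure-Python sketch (helper names `ngrams` and `modified_precision` are ours, and the brevity penalty is omitted since it cannot rescue a zero):

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, ref, n):
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return clipped / max(sum(hyp_counts.values()), 1)

hyp = ['It', 'is', 'a', 'guide', 'to', 'action']
ref = ['It', 'is', 'isa', 'guide', 'to', 'action', 'that']  # 'is' 'a' merged

precisions = [modified_precision(hyp, ref, n) for n in range(1, 5)]
print(precisions)  # 4-gram precision is 0.0

# Geometric mean of the four precisions: 0 as soon as any one of them is 0.
bleu = 0.0 if 0.0 in precisions else exp(sum(log(p) for p in precisions) / 4)
print(bleu)  # 0.0
```

NLTK still reports a nonzero score here only because `SmoothingFunction().method4` replaces the zero precision with a small positive value before taking the mean.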

You can use the following Python code to trace the behaviour:


```python
import nltk
import sacrebleu
from nltk.translate.bleu_score import SmoothingFunction

smoothie = SmoothingFunction().method4

# Normal state: hypothesis and reference are non-empty and of comparable
# length; NLTK and sacrebleu agree reasonably well.
hypo = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
        'ensures', 'that', 'the', 'military', 'always',
        'obeys', 'the', 'commands', 'of', 'the', 'party']
ref = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
       'ensures', 'that', 'the', 'military', 'will', 'forever',
       'heed', 'Party', 'commands']
print("NLTK bleu", 100 * nltk.translate.bleu_score.sentence_bleu([ref], hypo))
print("sacre bleu", sacrebleu.sentence_bleu(' '.join(hypo), [' '.join(ref)]).score)

# When empty lines exist, the NLTK BLEU score will be low.
hypo = [
    ['It', 'is', 'a', 'guide', 'to', 'action', 'which', 'ensures', 'that', 'the', 'military', 'always'],
    []
]
ref = [
    ['It', 'is', 'a', 'guide', 'to', 'action', 'that', 'ensures', 'that', 'the', 'military', 'will'],
    ['forever', 'heed', 'Party', 'commands'],
]
print("NLTK bleu", 100 * nltk.translate.bleu_score.corpus_bleu([[r] for r in ref], hypo, smoothing_function=smoothie))
print("sacre bleu", sacrebleu.corpus_bleu([' '.join(h) for h in hypo], [[' '.join(r) for r in ref]]).score)

# When the hypothesis is short, the sacrebleu score will be 0.
hypo = ['It', 'is', 'a', 'guide', 'to', 'action']
# Note: the adjacent string literals 'is' 'a' concatenate into the single
# token 'isa', so the reference shares no 4-gram with the hypothesis and
# unsmoothed sacreBLEU collapses to 0.
ref = ['It', 'is', 'is' 'a', 'guide', 'to', 'action', 'that']
print("NLTK bleu", 100 * nltk.translate.bleu_score.sentence_bleu([ref], hypo, smoothing_function=smoothie))
print("sacre bleu", sacrebleu.sentence_bleu(' '.join(hypo), [' '.join(ref)]).score)
```