According to the paper, each generation should be compared against all of the other generations as references when computing Self-BLEU.
The current Self-BLEU implementation, however, includes the selected hypothesis in the list of references. This inflates the Self-BLEU scores, since there will always be an exact match between the hypothesis and one of the references.
import nltk
from nltk.translate.bleu_score import SmoothingFunction

def get_bleu(self):
    ngram = self.gram
    bleu = list()
    # get_reference() returns every tokenized sentence from test_data,
    # so the hypothesis being scored is also part of the reference set.
    reference = self.get_reference()
    weight = tuple((1. / ngram for _ in range(ngram)))
    with open(self.test_data) as test_data:
        for hypothesis in test_data:
            hypothesis = nltk.word_tokenize(hypothesis)
            bleu.append(nltk.translate.bleu_score.sentence_bleu(reference, hypothesis, weight,
                                                                smoothing_function=SmoothingFunction().method1))
    return sum(bleu) / len(bleu)
Should we remove the target hypothesis from the set of references, or am I missing something here?
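If it helps, here is a minimal sketch of the leave-one-out variant I have in mind (the self_bleu function name, the standalone sentences argument, and the ngram default are illustrative, not part of the existing class):

import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(sentences, ngram=3):
    """Self-BLEU where each sentence is scored only against the *other* sentences."""
    weights = tuple(1. / ngram for _ in range(ngram))
    tokenized = [nltk.word_tokenize(s) for s in sentences]
    scores = []
    for i, hypothesis in enumerate(tokenized):
        # Leave the current hypothesis out of the reference set.
        references = tokenized[:i] + tokenized[i + 1:]
        scores.append(sentence_bleu(references, hypothesis, weights,
                                    smoothing_function=SmoothingFunction().method1))
    return sum(scores) / len(scores)

With this change the hypothesis can no longer match itself, so the score only reflects similarity to the other generations.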
Thanks in advance for the help.