According to the paper, each generation should be compared against all of the other generations as references when computing Self-BLEU.
The current Self-BLEU implementation, however, includes the selected hypothesis in the list of references. This inflates the Self-BLEU scores, since there will always be an exact match between the hypothesis and one of the references.
import nltk
from nltk.translate.bleu_score import SmoothingFunction

def get_bleu(self):
    ngram = self.gram
    bleu = list()
    # get_reference() returns every tokenized sentence from test_data,
    # so the hypothesis being scored is also part of the reference set.
    reference = self.get_reference()
    weight = tuple((1. / ngram for _ in range(ngram)))
    with open(self.test_data) as test_data:
        for hypothesis in test_data:
            hypothesis = nltk.word_tokenize(hypothesis)
            bleu.append(nltk.translate.bleu_score.sentence_bleu(reference, hypothesis, weight,
                                                                smoothing_function=SmoothingFunction().method1))
    return sum(bleu) / len(bleu)
Should we remove the target hypothesis from the set of references, or am I missing something here?
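If it helps, here is a minimal sketch of the leave-one-out variant I have in mind (the self_bleu function name, the standalone sentences argument, and the ngram default are illustrative, not part of the existing class):

import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(sentences, ngram=3):
    """Self-BLEU where each sentence is scored only against the *other* sentences."""
    weights = tuple(1. / ngram for _ in range(ngram))
    tokenized = [nltk.word_tokenize(s) for s in sentences]
    scores = []
    for i, hypothesis in enumerate(tokenized):
        # Leave the current hypothesis out of the reference set.
        references = tokenized[:i] + tokenized[i + 1:]
        scores.append(sentence_bleu(references, hypothesis, weights,
                                    smoothing_function=SmoothingFunction().method1))
    return sum(scores) / len(scores)

With this change the hypothesis can no longer match itself, so the score only reflects similarity to the other generations.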
Thanks in advance for the help.