weilinie opened this issue 6 years ago
I also found this problem: when the test data and reference data are the same, self-BLEU is always 1. However, many papers in this domain use it as a diversity metric, which is quite misleading. I think it only measures the extent to which the samples generated with GAN training differ from the MLE training samples, and lower is not necessarily better. What do you think about the Forward-Backward BLEU metric used in the "Toward Diverse Text Generation with Inverse Reinforcement Learning" paper?
Fully agree! Thank you for your follow-up. As for the Forward-Backward BLEU metric you mentioned, I haven't tried it, but since it is based on (self-)BLEU scores, I suspect the issues we observed exist there as well.
As far as I know, the basic idea of self-BLEU is to take each sentence in the set of generated sentences as the hypothesis and the remaining sentences as references, compute the BLEU score, and then average these scores over all generated sentences.
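For concreteness, here is a minimal sketch of that definition using NLTK's `sentence_bleu` (the helper name `self_bleu` and the toy corpus are mine, not from Texygen):

```python
# A minimal sketch of self-BLEU as defined above: score each generated
# sentence as the hypothesis against all *other* generated sentences as
# references, then average the scores.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(sentences, max_ngram=3):
    """sentences: list of token lists (the generated set)."""
    smoother = SmoothingFunction().method1
    weights = tuple(1.0 / max_ngram for _ in range(max_ngram))
    scores = []
    for i, hypothesis in enumerate(sentences):
        references = sentences[:i] + sentences[i + 1:]  # leave the hypothesis out
        scores.append(sentence_bleu(references, hypothesis,
                                    weights=weights,
                                    smoothing_function=smoother))
    return sum(scores) / len(scores)

corpus = [s.split() for s in ["the cat sat on the mat",
                              "the dog sat on the rug",
                              "a bird flew over the lake"]]
print(self_bleu(corpus))  # well below 1.0 for a diverse set
```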
However, when looking into the implementation of self-BLEU scores (https://github.com/geek-ai/Texygen/blob/master/utils/metrics/SelfBleu.py), I found an issue with how self-BLEU is evaluated over training: only on the first evaluation do the reference and hypothesis come from the same "test data" (i.e. the set of generated sentences). After that, the hypothesis keeps being updated while the reference stays fixed (because of "is_first=False"), so the hypothesis and reference no longer come from the same "test data", and the scores obtained under this implementation are therefore not self-BLEU scores.
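To make the pattern I mean explicit, here is a stripped-down illustration of that caching behaviour (not the actual Texygen code, just the shape of it; the attribute names follow the ones discussed in this issue):

```python
# Stripped-down illustration of the caching behaviour described above
# (not the actual Texygen code; attribute names follow the issue text).
class SelfBleuCached:
    def __init__(self, test_data):
        self.test_data = test_data      # list of generated (tokenized) sentences
        self.reference = None
        self.is_first = True

    def get_reference(self):
        # References are read from test_data only once; every later call
        # reuses the cached list, even if test_data has been replaced with
        # a newer set of generated sentences.
        if self.is_first:
            self.reference = list(self.test_data)
            self.is_first = False
        return self.reference
```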
To address this, I modified the implementation so that the hypothesis and reference always come from the same "test data" (simply by removing the variables "self.reference" and "self.is_first") and found that the self-BLEU (2-5) scores are always 1 when evaluating all the models.
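If it helps, that result is what one would expect whenever the current sentence is not removed from its own reference set: an exact match among the references makes every modified n-gram precision 1 and the brevity penalty 1, so BLEU is 1 no matter how diverse the set is, which seems consistent with what we both observe. A quick toy check with NLTK (illustrative sentences only, not from the repo):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smoother = SmoothingFunction().method1
sentences = [s.split() for s in ["the cat sat on the mat",
                                 "a dog ran in the park",
                                 "birds fly over the lake"]]
hypothesis = sentences[0]

# Hypothesis left inside its own reference set: exact match, so BLEU == 1.0
print(sentence_bleu(sentences, hypothesis, smoothing_function=smoother))

# Hypothesis excluded ("the others as reference"): a much lower score
print(sentence_bleu(sentences[1:], hypothesis, smoothing_function=smoother))
```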
Please let me know whether my concern makes sense, or whether I have simply misunderstood the definition of self-BLEU scores.