AranKomat closed this issue 6 years ago.
For the first question: mathematically, diversity is usually measured by information entropy, and RSBLEU can be viewed as an n-gram-level approximation of KL(P||G). As analyzed in Section 2, assigning a high penalty to a distribution with RSBLEU higher than 1.0 is reasonable, since such a distribution clearly has lower entropy than the original data distribution. When measuring diversity, such a case should not score better than one that is at least as diversified as the original data. The other constraint on RSBLEU (lower than and close to 1.0) penalizes exposure bias. We agree these constraints are somewhat involved, so we plan to replace the metric with an estimated Wasserstein-1 distance in the coming version.
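To make the n-gram-level view of KL(P||G) concrete, here is a minimal sketch (this is an illustration, not our actual implementation; the helper names, the plug-in frequency estimates, and the smoothing constant `eps` are all assumptions):

```python
import math
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams across a corpus of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def ngram_kl(p_sents, g_sents, n=2, eps=1e-12):
    """Crude n-gram approximation of KL(P || G): compare empirical
    n-gram frequencies of the data distribution P against those of
    the generator G, smoothing unseen n-grams with eps."""
    p = ngram_counts(p_sents, n)
    g = ngram_counts(g_sents, n)
    p_total = sum(p.values())
    g_total = sum(g.values())
    kl = 0.0
    for gram, count in p.items():
        p_prob = count / p_total
        g_prob = (g.get(gram, 0) / g_total) if g_total else 0.0
        kl += p_prob * math.log(p_prob / max(g_prob, eps))
    return kl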
For the second question: in the CoT experiments we used a formatting strategy different from Texygen's, but we kept the same setting for all models listed in the table. Thanks for pointing out the lower-case/upper-case formatting issue; we will update the experimental results.
Thanks for your generous suggestions!
Isn't RSBLEU being close to 1.0 much more important than being lower than 1.0, since lack of diversity (mode collapse) and too much diversity (exposure bias) are equally bad? Aren't MaliGAN and RankGAN superior to CoT in Table 3, even though their test losses are much worse?
In fact, I just realized that Texygen has two options, get_bleu_fast and get_bleu; the latter uses the whole test dataset as references rather than 500 sentences sampled from it. I hope all the published BLEU scores for WMT News came from get_bleu: the original BLEU paper by Papineni et al. notes that different numbers of reference sentences produce different results. Also, Texygen lower-cases all sentences, which I hope you did too.

I calculated the Self-BLEU-2 of the WMT test dataset and obtained 0.862. On the other hand, from the BLEU-2 of MLE in your survey paper and the Self-BLEU-2 of MLE in your CoT paper, I calculated your Self-BLEU-2 of the test dataset to be 0.875. This is strange, since the values should match exactly. What do you think is the cause of this discrepancy? If you don't mind, could you tell me the Self-BLEU-n of the test dataset for other n?
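For reference, here is roughly what I mean by Self-BLEU-2: each sentence is scored as a hypothesis against all the other sentences as references. This is only a sketch to pin down the definition (no brevity penalty or smoothing, and it is not Texygen's actual get_bleu code), but it shows why both the lowercasing and the size of the reference pool change the number:

```python
import math
from collections import Counter

def modified_precision(refs, hyp, n):
    """Clipped n-gram precision of hyp against a set of references."""
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        for gram, c in ref_counts.items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[gram]) for gram, c in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

def self_bleu2(sentences):
    """Self-BLEU-2 sketch: lowercase every sentence, then score each one
    against all the others as references.  Shrinking the reference pool
    (the get_bleu_fast shortcut) changes the clipped precisions, hence
    the score."""
    sents = [[tok.lower() for tok in s] for s in sentences]
    scores = []
    for i, hyp in enumerate(sents):
        refs = sents[:i] + sents[i + 1:]
        precisions = [modified_precision(refs, hyp, n) for n in (1, 2)]
        if min(precisions) == 0:
            scores.append(0.0)  # geometric mean is zero if any precision is
        else:
            scores.append(math.exp(sum(math.log(p) for p in precisions) / 2))
    return sum(scores) / len(scores)
```

A corpus of identical sentences gives a Self-BLEU-2 of 1.0 (no diversity), while fully disjoint sentences give 0.0.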