https://github.com/geek-ai/Texygen/blob/08c67a1fc37d9b3ec923ac9e3b6daeabce79fa3f/utils/metrics/Bleu.py#L65
When you calculate BLEU, the size of the reference list should match exactly, since a reference list with more sentences produces a higher BLEU score, as noted in the paper that introduced BLEU. The line linked above truncates the original reference list of 10k sentences to 500 sentences, which results in a lower BLEU score. The same applies to self-BLEU. Did you calculate the COCO (self-)BLEU scores with get_bleu_fast?
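To make the effect concrete, here is a minimal sketch in plain Python of the clipped n-gram precision that BLEU is built on. It is not Texygen's code: the function names and the toy sentences are made up for illustration, and it leaves out the brevity penalty and smoothing. The point is only that cutting the reference list can remove matching n-grams but never add any, so this precision term cannot go up after truncation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hypothesis, references, n):
    """Clipped n-gram precision of one hypothesis against a list of references."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    # For each n-gram, keep the highest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each hypothesis count by that maximum reference count.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


# Hypothetical toy data standing in for the 10k-sentence reference list.
full_references = [
    "a man rides a horse on the beach".split(),
    "a dog runs across the grass".split(),
    "two people sit on a bench".split(),
    "a cat sleeps on the sofa".split(),
]
# Same kind of truncation as slicing the reference list to its first 500 entries.
truncated_references = full_references[:2]

hypothesis = "a cat sits on a bench".split()

p_full = modified_precision(hypothesis, full_references, 2)
p_truncated = modified_precision(hypothesis, truncated_references, 2)
print("bigram precision, full references:     ", p_full)
print("bigram precision, truncated references:", p_truncated)
```

Scores computed against reference lists of different sizes are therefore not directly comparable, which is why the sizes need to match when reproducing reported numbers.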