Closed by friskit-china 5 years ago
I think it's because for BLEU we're using option='closest': https://github.com/Maluuba/nlg-eval/blob/master/nlgeval/pycocoevalcap/bleu/bleu.py#L40
You can try with option='average'.
OR in your code example:
bleu_1_result = np.asarray(bleu_1_list).min()
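For context, a minimal sketch of where that option enters, using the BleuScorer that Bleu wraps (the sentences here are just placeholders):

```python
from nlgeval.pycocoevalcap.bleu.bleu_scorer import BleuScorer

# Toy data: each hypothesis string paired with a list of reference strings.
pairs = [
    ("the cat sat on the mat", ["a cat is sitting on the mat"]),
    ("there is a dog in the park", ["a dog runs in the park"]),
]

def corpus_bleu(option):
    # Build a fresh scorer per option to keep the comparison clean.
    scorer = BleuScorer(n=4)
    for hyp, refs in pairs:
        scorer += (hyp, refs)  # accumulate per-sentence n-gram statistics
    # option controls how the effective reference length for the brevity
    # penalty is chosen: 'closest' = reference length closest to the
    # hypothesis, 'average' = mean reference length.
    score, _ = scorer.compute_score(option=option, verbose=0)
    return score  # [Bleu_1, Bleu_2, Bleu_3, Bleu_4]

print("closest:", corpus_bleu("closest"))
print("average:", corpus_bleu("average"))
```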
You can look at https://github.com/Maluuba/nlg-eval/issues/10#issuecomment-369888781
Your first approach computes corpus-level Bleu and the second approach computes sentence-level Bleu. These are expected to be different.
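To make that concrete, here is a small sketch with toy sentences (not your data) that compares the corpus-level scores returned by the repo's Bleu scorer with the mean of the per-sentence scores it also returns:

```python
import numpy as np
from nlgeval.pycocoevalcap.bleu.bleu import Bleu

# Toy corpus in the pycocoevalcap format: id -> list of sentences.
refs = {0: ["a cat is sitting on the mat"], 1: ["a dog runs in the park"]}
hyps = {0: ["the cat sat on the mat"], 1: ["there is a dog in the park"]}

corpus_scores, sentence_scores = Bleu(4).compute_score(refs, hyps)

# corpus_scores: Bleu_1..Bleu_4 computed from n-gram counts pooled over all
# sentences. sentence_scores: four lists of per-sentence values.
mean_sentence_scores = [float(np.mean(s)) for s in sentence_scores]
print("corpus-level  :", corpus_scores)
print("mean sentence :", mean_sentence_scores)
# In general these differ: corpus-level Bleu is not the average of
# sentence-level Bleu scores.
```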
Hi,
I used this repository to evaluate my results. However, I received two different results.
First I tried the standalone command (I have only one reference for each hypothesis):
nlg-eval --hypothesis=hyp.txt --references=ref.txt
and I got these results: Bleu_1: 0.225156 Bleu_2: 0.124906 Bleu_3: 0.071296 Bleu_4: 0.042867
But when I use the following code:
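(Sketched from memory: a per-sample loop around nlgeval's compute_individual_metrics; the list names, the file reading, and the final mean are illustrative, so my actual script may differ in the details.)

```python
import numpy as np
from nlgeval import compute_individual_metrics

# One hypothesis and one reference per line.
with open('hyp.txt') as f:
    hyps = [line.strip() for line in f]
with open('ref.txt') as f:
    refs = [line.strip() for line in f]

bleu_1_list, bleu_2_list, bleu_3_list, bleu_4_list = [], [], [], []
for hyp, ref in zip(hyps, refs):
    # Sentence-level metrics for one hypothesis against its single reference.
    metrics = compute_individual_metrics([ref], hyp)
    bleu_1_list.append(metrics['Bleu_1'])
    bleu_2_list.append(metrics['Bleu_2'])
    bleu_3_list.append(metrics['Bleu_3'])
    bleu_4_list.append(metrics['Bleu_4'])

# Aggregate the per-sentence scores over the whole file.
bleu_1_result = np.asarray(bleu_1_list).mean()
bleu_2_result = np.asarray(bleu_2_list).mean()
bleu_3_result = np.asarray(bleu_3_list).mean()
bleu_4_result = np.asarray(bleu_4_list).mean()
print(bleu_1_result, bleu_2_result, bleu_3_result, bleu_4_result)
```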
I get these BLEU results: Bleu_1=0.187041729287695 Bleu_2=0.08295724762832153 Bleu_3=0.029044826012821708 Bleu_4=0.012461408083621252
Why did I get different results?
The only difference is that I pass all the hypotheses and references together in the first case (standalone command) and calculate the result for each sample one by one in the second case (Python API).
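For reference, the Python call I would expect to line up with the command line is the file-based one, something like this (a sketch using nlgeval.compute_metrics on the same files):

```python
from nlgeval import compute_metrics

# Corpus-level scores over the whole files; this is what I expect to match
# the standalone `nlg-eval` command above (one reference file per hypothesis file).
metrics_dict = compute_metrics(hypothesis='hyp.txt', references=['ref.txt'])
print(metrics_dict['Bleu_1'], metrics_dict['Bleu_2'],
      metrics_dict['Bleu_3'], metrics_dict['Bleu_4'])
```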