Maluuba / nlg-eval

Evaluation code for various unsupervised automated metrics for Natural Language Generation.
http://arxiv.org/abs/1706.09799

Got different results when using standalone and Python API #84

Closed friskit-china closed 5 years ago

friskit-china commented 5 years ago

Hi,

I used this repository to evaluate my results.

However, I got two different results.

First, I ran the standalone command (I have only one reference for each hypothesis): nlg-eval --hypothesis=hyp.txt --references=ref.txt

and I got the result: Bleu_1: 0.225156 Bleu_2: 0.124906 Bleu_3: 0.071296 Bleu_4: 0.042867

But when I use the following code:

from nlgeval import NLGEval
import numpy as np

# Set up the evaluator; GloVe- and skip-thoughts-based metrics are disabled.
nlgeval = NLGEval(no_glove=True, no_skipthoughts=True, metrics_to_omit={'CIDEr'})

bleu_1_list = []
bleu_2_list = []
bleu_3_list = []
bleu_4_list = []

# gt_list holds the reference sentences, label_list the generated hypotheses.
for gt, pred in zip(gt_list, label_list):
    # compute_individual_metrics takes a list of references and a single hypothesis.
    nlg_eval_result = nlgeval.compute_individual_metrics([gt], pred)
    bleu_1_list.append(nlg_eval_result['Bleu_1'])
    bleu_2_list.append(nlg_eval_result['Bleu_2'])
    bleu_3_list.append(nlg_eval_result['Bleu_3'])
    bleu_4_list.append(nlg_eval_result['Bleu_4'])

# Average the per-sentence scores over the whole set.
bleu_1_result = np.asarray(bleu_1_list).mean()
bleu_2_result = np.asarray(bleu_2_list).mean()
bleu_3_result = np.asarray(bleu_3_list).mean()
bleu_4_result = np.asarray(bleu_4_list).mean()

print(bleu_1_result, bleu_2_result, bleu_3_result, bleu_4_result)

I get these BLEU results: Bleu_1=0.187041729287695 Bleu_2=0.08295724762832153 Bleu_3=0.029044826012821708 Bleu_4=0.012461408083621252

Why did I get different results?

The only difference is that the first approach (the standalone command) takes all hypotheses and references together, while the second (the Python API) computes the result for each sample one by one.
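To make the comparison concrete, this is roughly what the first approach does through the Python API; a minimal sketch, assuming the file-based compute_metrics helper accepts the same no_skipthoughts / no_glove switches as the command line:

from nlgeval import compute_metrics

# Corpus-level scores computed from the same files as the standalone command.
metrics_dict = compute_metrics(hypothesis='hyp.txt', references=['ref.txt'],
                               no_skipthoughts=True, no_glove=True)
print(metrics_dict['Bleu_1'], metrics_dict['Bleu_2'], metrics_dict['Bleu_3'], metrics_dict['Bleu_4'])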

juharris commented 5 years ago

I think it's because for BLEU we're using option='closest' (https://github.com/Maluuba/nlg-eval/blob/master/nlgeval/pycocoevalcap/bleu/bleu.py#L40). You can try with option='average'.

Or, in your code example: bleu_1_result = np.asarray(bleu_1_list).min()
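If you want to experiment without editing the package, something like this should let you pick the option yourself; a rough sketch, assuming the bundled BleuScorer is importable from that path and that gt_list / label_list are the lists from your snippet:

from nlgeval.pycocoevalcap.bleu.bleu_scorer import BleuScorer

# Accumulate every (hypothesis, references) pair, then score the corpus once.
bleu_scorer = BleuScorer(n=4)
for gt, pred in zip(gt_list, label_list):
    bleu_scorer += (pred, [gt])  # hypothesis string, list of reference strings

# 'closest' (the nlg-eval default) uses the closest reference length for the
# brevity penalty; 'average' uses the average reference length instead.
score, scores = bleu_scorer.compute_score(option='average', verbose=0)
print(score)  # corpus-level [Bleu_1, Bleu_2, Bleu_3, Bleu_4]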

kracwarlock commented 5 years ago

You can look at https://github.com/Maluuba/nlg-eval/issues/10#issuecomment-369888781

Your first approach computes corpus-level Bleu and the second approach computes sentence-level Bleu. These are expected to be different.
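If you want the corpus-level number from the Python API, score everything in one call instead of averaging per-sentence scores. A rough sketch, assuming I remember the compute_metrics argument layout correctly (the first argument is a list of reference lists, one inner list per reference set, each parallel to the hypothesis list):

from nlgeval import NLGEval

nlgeval = NLGEval(no_glove=True, no_skipthoughts=True, metrics_to_omit={'CIDEr'})

# [gt_list] = a single reference set parallel to label_list (the hypotheses).
corpus_scores = nlgeval.compute_metrics([gt_list], label_list)
print(corpus_scores['Bleu_1'], corpus_scores['Bleu_2'], corpus_scores['Bleu_3'], corpus_scores['Bleu_4'])

That should match, or come very close to, the numbers from the standalone command.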