DLiquor opened this issue 1 year ago (status: Open)
My BLEU-4 score is around 27.
I ran into the same problem, same question here. @Robert-xiaoqiang
Hi all. I actually use the pycocoevalcap repo as the evaluation script to assess the syntactic quality of the generations; it appears to have the same implementation of BLEU-4 and ROUGE-L as nlg-eval. The key difference may lie in how the script is used: I call the function compute_individual_metrics to compute segment-level quality and then average it over the test split. I hope this answer is helpful to you.
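In rough terms, the segment-level averaging looks like the sketch below. This is only an illustration with pycocoevalcap, not the actual evaluation script; the `segment_level_scores` helper and the `hyps` / `refs` data are placeholders:

```python
# Minimal sketch: segment-level BLEU-4 / ROUGE-L with pycocoevalcap,
# averaged over the test split. hyps / refs are placeholder data.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge

def segment_level_scores(hyps, refs):
    """hyps: list of generated strings; refs: list of reference-string lists."""
    bleu_scorer, rouge_scorer = Bleu(4), Rouge()
    bleu4_total, rouge_total = 0.0, 0.0
    for i, (hyp, ref_set) in enumerate(zip(hyps, refs)):
        # pycocoevalcap expects {id: [sentence, ...]} dicts
        gts, res = {i: ref_set}, {i: [hyp]}
        bleu_scores, _ = bleu_scorer.compute_score(gts, res)  # [BLEU-1..BLEU-4]
        rouge_score, _ = rouge_scorer.compute_score(gts, res)
        bleu4_total += bleu_scores[3]  # BLEU-4 for this single segment
        rouge_total += rouge_score
    # macro-average over segments, which differs from one corpus-level call
    return bleu4_total / len(hyps), rouge_total / len(hyps)

hyps = ["a man is playing a guitar"]
refs = [["a man plays the guitar", "someone is playing a guitar"]]
print(segment_level_scores(hyps, refs))
```

Note that averaging per-segment BLEU like this generally gives a different (often lower) number than a single corpus-level BLEU computed over the whole split at once, which may account for the gap you are seeing.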
Thank you very much!!
I have tried running a simple baseline with bart-base, but I got a higher BLEU-4 score with the nlg-eval package. I am wondering whether I used the wrong package, since the metric directory in your repo is empty.
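For reference, here is roughly what I mean by the two ways of calling nlg-eval. This is a sketch based on my reading of the NLGEval class API; `refs` and `hyps` are placeholder lists:

```python
# Sketch: two nlg-eval entry points that can yield different BLEU-4 numbers.
from nlgeval import NLGEval

hyps = ["a man is playing a guitar"]
refs = ["a man plays the guitar"]  # one reference per hypothesis here

# skip the embedding-based metrics; METEOR still requires a Java runtime
n = NLGEval(no_skipthoughts=True, no_glove=True)

# corpus-level: one call over the whole split; ref_list is a list of
# reference "columns", one inner list per reference set
corpus_scores = n.compute_metrics(ref_list=[refs], hyp_list=hyps)
print("corpus-level BLEU-4:", corpus_scores["Bleu_4"])

# segment-level: score each sample alone, then average, which is what the
# authors describe doing with compute_individual_metrics
seg_bleu4 = sum(
    n.compute_individual_metrics([ref], hyp)["Bleu_4"]
    for ref, hyp in zip(refs, hyps)
) / len(hyps)
print("averaged segment-level BLEU-4:", seg_bleu4)
```

If the reported numbers were produced the segment-level way, then comparing them against a corpus-level nlg-eval run would explain why my score comes out higher.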