Meaning of the evaluation metrics - Githubissues

RUC-NLPIR / FlashRAG

⚡FlashRAG: A Python Toolkit for Efficient RAG Research

https://arxiv.org/abs/2405.13576

MIT License

891 stars 69 forks

Meaning of the evaluation metrics #15

Closed lwj2001 closed 1 month ago

lwj2001 commented 1 month ago

metrics: ['em','f1','sub_em','precision','recall'] Hello, I have a question: f1, precision, and recall are fairly common metrics, but what do 'em' and 'sub_em' stand for?

lwj2001 commented 1 month ago

The paper says: "For evaluating the quality of generation, we support five metrics including token-level F1 score, exact match, accuracy, BLEU [69], and ROUGE-L [70]." Do the metrics in the config above correspond to the metrics described in the paper?

DaoD commented 1 month ago

The paper uses em (exact match) and f1 (token-level F1 score).

lwj2001 commented 1 month ago

Then what do the em and sub_em metrics used in the repository mean, and how are they computed?

ignorejjj commented 1 month ago

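@lwj2001 em checks whether the model's output exactly matches the gold answer; sub_em (called acc in the paper) checks whether the model's output contains the gold answer.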

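For cases with multiple gold answers, the model's output is scored against each gold answer in turn, and the highest score is taken as the final score.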
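
The two definitions and the max-over-answers rule can be sketched in Python roughly as follows. This is only a minimal illustration, not FlashRAG's actual code: the names `normalize`, `em`, `sub_em`, and `best_score` are hypothetical, and the SQuAD-style normalization step (lowercasing, dropping punctuation and articles) is an assumption that this thread does not confirm.

```python
import re
import string


def normalize(text: str) -> str:
    """SQuAD-style normalization. This step is an assumption, not confirmed in the thread."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def em(prediction: str, golden: str) -> float:
    """1.0 if the prediction is exactly the gold answer after normalization."""
    return float(normalize(prediction) == normalize(golden))


def sub_em(prediction: str, golden: str) -> float:
    """1.0 if the normalized gold answer appears as a substring of the prediction."""
    gold = normalize(golden)
    # Guard against an empty gold answer, which would otherwise match anything.
    return float(bool(gold) and gold in normalize(prediction))


def best_score(metric, prediction: str, golden_answers: list[str]) -> float:
    """Score the prediction against every gold answer and keep the maximum."""
    return max(metric(prediction, golden) for golden in golden_answers)


# Hypothetical example data.
prediction = "The capital is Paris."
golden_answers = ["Paris", "Paris, France"]

em_score = best_score(em, prediction, golden_answers)          # 0.0
sub_em_score = best_score(sub_em, prediction, golden_answers)  # 1.0
```

Here the prediction is not identical to either gold answer, so em is 0, but it contains "Paris", so sub_em is 1. A longer, more verbose generation can therefore score well on sub_em while scoring 0 on em.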