RUCKBReasoning / TableLLM

TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios

Issues with using CritiqueLLM for evaluating text-based data #14

Closed jasonz5 closed 3 months ago

jasonz5 commented 3 months ago

Hi, I ran into some problems when running the evaluation step shown below:

```
cd evaluation/text-eval
python grading.py --infer_data ../../inference/results/TableLLM-13b/Infer_wtq.jsonl
```

The responses of CritiqueLLM seem a little weird. Here are five of its responses:

  1. "\nRating: [[8]] Question: Which sport has an average attendance below 10,000?\n\n### Answer:\n\n[Reference answer]: ['National Basketball League']\n[AI assistant's answer]: The only sport in the table with an average attendance below 10,000 is the NBL. The NBL's average attendance is 4,031, the only figure in the table below 10,000.\n\nRating: [[8]]\n\nExplanation: The answer accurately identifies the NBL as the only sport with an average attendance below 10,000, consistent with the reference answer. The AI assistant's answer performs well on accuracy and relevance, hence the high score.",
  2. "\n{'Factual correctness': 10, 'Meets user needs': 10, 'Clarity': 9, 'Completeness': 9, 'Creativity': 7, 'Overall score': 9} Factual correctness: 10. The information in the assistant's answer is accurate and correctly identifies the NBL as the sport with the lowest average attendance.\n\nMeets user needs: 10. The assistant's answer directly and accurately satisfies the user's need, namely finding the sport with the lowest attendance.\n\nClarity: 9. The assistant's answer is clearly expressed and concise; the user can understand it easily.\n\nCompleteness: 9. The assistant's answer provides enough information to answer the user's question without omitting important aspects.\n\nCreativity: 7. Although this task is mainly a fact-based query that does not require much creativity, the assistant's answer is direct and effective, without offering additional innovative explanations or methods.\n\nOverall evaluation: 9. The assistant's answer performs well on all important dimensions, answers the user's question accurately, and is clearly expressed. Although it contributes nothing extra on creativity, that is not a key requirement of this task.\n\n{{'Factual correctness': 10, 'Meets user needs': 10, 'Clarity': 9, 'Completeness': 9, 'Creativity': 7, 'Overall score': 9}}",
  3. "\nRating: [7] [National Basketball League] Factual correctness: 10. The assistant's answer is factually accurate and correctly identifies the National Basketball League (NBL) as the sport with the lowest average attendance.\n\nMeets user needs: 10. The assistant's answer directly addresses the user's question, clearly stating which sport has the lowest average attendance.\n\nClarity: 9. The assistant's answer is clear and easy to understand, using concise language to answer the question directly.\n\nCompleteness: 9. The assistant's answer provides enough information without omitting important aspects, though it could be slightly more detailed, for example by giving each league's specific attendance figure.\n\nOverall, I would give the assistant's answer a 9. It performs well on all important dimensions, answers the question accurately, and provides enough information to support its conclusion. On completeness, although the answer is already good, there is still room for improvement, for example by providing each league's specific attendance figure.\n\n{{'Factual correctness': 10, 'Meets user needs': 10, 'Clarity': 9, 'Completeness': 9, 'Overall score': 9}}",
  4. "{'Average match attendance': '4,031'} Rating: [[7]], 'Question': 'Rugby Championship'} The AI assistant's answer failed to accurately identify which sporting event has the lowest average match attendance. Although the assistant correctly pointed out that the National Basketball League (NBL) is the event with the lowest attendance, it did not provide the correct average attendance figure. The correct average attendance should be 4,031, rather than being left unprovided. The assistant's answer is therefore deficient in accuracy.\n\nThe assistant's answer does fairly well at meeting the user's needs, since it correctly identifies the sporting event asked about in the question. However, because it did not provide the correct average attendance figure, it falls slightly short on this dimension.\n\nThe assistant's answer is clear, using concise language and structure that the user can easily understand.\n\nOn completeness, although the assistant's answer names the correct event, it is incomplete because it does not give the specific average attendance figure.\n\nThe assistant's answer is average on creativity; it simply answers the question without providing extra information or unique insight.\n\nTaking all of the above dimensions into account, the quality of the assistant's answer is moderate, but the absence of the correct average attendance figure, a key piece of information, hurts its overall quality. The overall score is therefore 7.",
  5. "8\n\nRating Explanation:\nThe assistant's answer was accurate in identifying the National Basketball League (NBL) as the only sport with an average match attendance below 10,000. It directly addressed the question and provided the correct information.\n\n1. Accuracy: 10. There were no inaccuracies in the assistant's response.\n2. Helpfulness: 10. The assistant's response was extremely helpful, as it identified the correct sport and provided the exact attendance figure.\n3. Relevance: 10. The response was perfectly relevant, as it answered the specific question asked.\n4. Depth: 8. Whilst the assistant's response was accurate and relevant, it did not provide any additional information beyond what was needed to answer the question, such as the total spectatorship figures for comparison.\n5. Creativity: 7. There was no particular need for creativity in answering the question, as it was a matter of identifying the sport with the lowest attendance figure. However, the assistant's response was formulated in a way that made it clear and unambiguous.\n6. Level of Detail: 8. The assistant's response was detailed enough to answer the question, providing both the sport name and the average attendance figure. Nevertheless, it would have been beneficial to include the total spectatorship figures in the response to give a fuller picture.\n\nRating: 8.\n\nThe assistant's response was accurate, relevant, and helpful in answering the question. It directly identified the correct sport and provided the required information. While it could have been rated higher, the lack of additional context, such as total spectatorship figures, held back the rating."

In get_sum_grade.py, the extraction pattern is `pattern = r'\[\[(.*?)\]\]'`, which matches only 2 of the 5 responses above. In grading.py, I found the critique prompt format as follows:

[Question]
{question}

[The Start of Reference Answer]
{ref_answer}
[The End of Reference Answer]

[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]
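For reference, filling this template in Python might look like the following sketch. The template text is copied from the critique prompt format above; the constant name, keyword arguments, and sample values are my assumptions, not the actual code in grading.py.

```python
# Template copied from the critique prompt format shown above.
CRITIQUE_TEMPLATE = """[Question]
{question}

[The Start of Reference Answer]
{ref_answer}
[The End of Reference Answer]

[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]"""

# Sample values for illustration only.
prompt = CRITIQUE_TEMPLATE.format(
    question="Which sport has an average attendance below 10,000?",
    ref_answer="['National Basketball League']",
    answer="The NBL, with an average attendance of 4,031.",
)
```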

Is there any new format for the prompt of CritiqueLLM?
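As a workaround until that is clarified, a more permissive extraction could try several patterns in turn. This is only a sketch based on the five sample responses above (each regex alternative is an assumption derived from those samples, not from the repo's code):

```python
import re

# Ordered from most to least specific; the single pattern in
# get_sum_grade.py only handles the [[N]] form.
RATING_PATTERNS = [
    r"\[\[(\d+(?:\.\d+)?)\]\]",                          # "Rating: [[8]]"
    r"'(?:综合得分|Overall score)':\s*(\d+(?:\.\d+)?)",    # dict-style scores
    r"Rating:\s*\[(\d+(?:\.\d+)?)\]",                    # "Rating: [7]"
    r"^\s*(\d+(?:\.\d+)?)\b",                            # bare leading number
]

def extract_rating(text):
    """Return the first rating found, or None if no pattern matches."""
    for pat in RATING_PATTERNS:
        m = re.search(pat, text)
        if m:
            return float(m.group(1))
    return None
```

Against the five responses above, this would recover a score from each, instead of 2/5.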

Kaka23333 commented 3 months ago

Thanks for your comment! The CritiqueLLM published by thu-coai is a different version from the critique model we use, so you may have to change the prompt format and the scoring threshold on line 25 of evaluation/text-eval/get_sum_grade.py. We also provide the judgement results of the critique model used in our experiments.
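For illustration, a threshold-based aggregation could look like the sketch below. This is hypothetical: the actual computation on line 25 of evaluation/text-eval/get_sum_grade.py may differ, and `threshold=8` is an assumed value, not necessarily the repo's setting.

```python
def pass_rate(scores, threshold=8):
    """Fraction of successfully parsed ratings at or above the threshold.

    `scores` holds one entry per judged answer; None marks responses
    whose rating could not be extracted. `threshold=8` is an assumed
    value for illustration.
    """
    graded = [s for s in scores if s is not None]
    if not graded:
        return 0.0
    return sum(s >= threshold for s in graded) / len(graded)
```

Lowering or raising the threshold here is what changes which judged answers count as passing when you swap in a different critique model.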