RAISEDAL / RAISEReadingList

This repository contains a reading list of Software Engineering papers and articles!

Paper Review: A Survey on Evaluation of Large Language Models #78

Open mehilshah opened 1 month ago

mehilshah commented 1 month ago

Publisher

ACM TIST (ACM Transactions on Intelligent Systems and Technology)

Link to The Paper

https://dl.acm.org/doi/pdf/10.1145/3641289

Name of The Authors

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie

Year of Publication

2024

Summary

This paper presents a comprehensive survey on the evaluation of large language models (LLMs), organized around three key dimensions: what to evaluate, where to evaluate, and how to evaluate. The authors aim to provide valuable insights to researchers working on LLM evaluation and to support the development of more effective LLMs. The introduction emphasizes the importance of evaluating LLMs, as they play a crucial role in both research and daily use: proper evaluation is necessary to understand their capabilities and potential risks. The paper also highlights the need for novel evaluation protocols that keep pace with the models' rapid evolution.

In the "What to Evaluate" section, the authors summarize existing evaluation tasks across various areas, including general natural language processing, reasoning, medical usage, ethics, education, natural and social sciences, and agent applications, and draw conclusions about where LLMs succeed and where they fail on these tasks. The "Where to Evaluate" section covers evaluation metrics, datasets, and benchmarks, giving a clear picture of current LLM evaluation practice; the authors stress that selecting appropriate datasets and benchmarks is essential for meaningful evaluation. The "How to Evaluate" section reviews current evaluation protocols and summarizes emerging evaluation approaches.
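
To make the "where to evaluate" discussion concrete, here is a minimal sketch of a static, benchmark-style evaluation loop with an exact-match metric. It is not from the paper: the two-item in-line dataset and the `query_model` stub are hypothetical placeholders for a real benchmark and a real model call.

```python
def query_model(prompt: str) -> str:
    """Placeholder for the call to the LLM being evaluated."""
    raise NotImplementedError("plug in your model or API call here")

# A static benchmark: the same fixed items are reused for every model evaluated.
BENCHMARK = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many days are in a leap year?", "answer": "366"},
]

def exact_match_accuracy(benchmark=BENCHMARK) -> float:
    """Score each model answer by exact match and return mean accuracy."""
    correct = 0
    for item in benchmark:
        prediction = query_model(item["question"]).strip().lower()
        correct += int(prediction == item["answer"].lower())
    return correct / len(benchmark)
```

The novel protocols the paper discusses, such as dynamic and interactive evaluation, move beyond this fixed-dataset setup.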

The paper also discusses future challenges in evaluating LLMs, such as moving toward more dynamic and interactive evaluation methods, assessing open-ended generation tasks, evaluating emergent abilities, and accounting for the societal impact of LLMs.

Key findings of the survey include:

  1. LLMs exhibit robust performance in various natural language tasks and reasoning scenarios but still face limitations in certain areas like temporal and spatial reasoning.
  2. Existing static evaluation protocols may not sufficiently capture the true capabilities and potential risks of LLMs, necessitating the development of novel, dynamic evaluation approaches.
  3. Evaluating LLMs' performance in specific domains such as healthcare, social sciences, and agent applications is crucial for their safe and effective deployment in real-world scenarios.
  4. Comparing LLMs' performance to human baselines is important for assessing their reasoning capabilities and identifying areas for improvement.

The authors maintain an open-source repository of LLM evaluation resources to foster collaboration and advancement in the field.

Contributions of The Paper

  1. Comprehensive overview of LLM evaluation: The paper examines LLM evaluation from three key aspects: what to evaluate, where to evaluate, and how to evaluate. This categorization encompasses the entire life cycle of LLM evaluation, offering a structured approach for researchers to understand and advance the field.

  2. Summarization of evaluation tasks: The authors summarize existing evaluation tasks across various domains, including general natural language processing, reasoning, medical usage, ethics, education, natural and social sciences, and agent applications. This provides a broad understanding of the current state of LLM evaluation and the specific areas where LLMs have been applied and assessed.

  3. Insights on LLM successes and failures: By reviewing the literature on LLM evaluation, the authors offer valuable insights into the successes and failures of LLMs in different tasks. This information can guide future research efforts, highlighting areas where LLMs excel and where they need improvement.

Comments

Task Ideas for our Dynamic Benchmarking Idea!

  1. Code reasoning tasks: Design dynamic evaluation scenarios that test the model's ability to reason about code semantics, logic, and functionality. Generate test cases that cover a range of programming languages, domains, and complexity levels.
  2. Code generation tasks: Evaluate the model's ability to generate code from natural language descriptions, comments, or incomplete code snippets, and assess the generated code's correctness, efficiency, and adherence to best practices (a minimal harness sketch follows this list).
  3. Code translation and refactoring: Test the model's capability to translate code between different programming languages or refactor existing code to improve its readability, maintainability, or performance.
  4. Code security and robustness: Generate evaluation tasks that test the model's ability to identify and mitigate security vulnerabilities, handle edge cases, and generate robust code that can withstand various input scenarios.
  5. Benchmarking against human-written code: Compare the performance of the code-generating LLM against code written by human programmers of varying skill levels.
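
As a concrete illustration of ideas 1 and 2, here is a minimal sketch of a dynamic code-generation evaluation harness. It is not from the surveyed paper: the `generate_code` stub stands in for whatever LLM is under test, and the task generator and hidden tests are hypothetical examples of how specs and checks could be produced fresh on every run.

```python
import random

def generate_code(task_description: str) -> str:
    """Stand-in for the LLM under test; replace this with a real model/API call.
    The dummy below returns one hard-coded guess so the harness runs end to end."""
    return "def scale(xs):\n    return [x * 2 for x in xs]\n"

def make_task(rng: random.Random):
    """Dynamically create a small spec plus hidden test cases."""
    k = rng.randint(2, 9)
    description = (
        f"Write a Python function `scale(xs)` that returns a new list with "
        f"every element of `xs` multiplied by {k}."
    )
    tests = [([1, 2, 3], [k, 2 * k, 3 * k]), ([], []), ([-1, 0], [-k, 0])]
    return description, tests

def run_task(rng: random.Random) -> bool:
    """Ask the model for code, execute it, and check it against the hidden tests."""
    description, tests = make_task(rng)
    source = generate_code(description)
    namespace: dict = {}
    try:
        exec(source, namespace)  # NOTE: sandbox this in any real harness
        fn = namespace["scale"]
        return all(fn(list(xs)) == expected for xs, expected in tests)
    except Exception:
        return False  # generated code crashed or did not define `scale`

def pass_rate(n_tasks: int = 50, seed: int = 0) -> float:
    """Fraction of freshly generated tasks the model solves."""
    rng = random.Random(seed)
    return sum(run_task(rng) for _ in range(n_tasks)) / n_tasks

if __name__ == "__main__":
    print(f"pass rate: {pass_rate():.2f}")
```

Because the multiplier changes on every generated task, a model cannot score well by memorizing a fixed benchmark, which is the point of the dynamic setup.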
parvezmrobin commented 1 month ago

Did they perform any evaluation specific to domains like healthcare, where you are simply not allowed to be wrong? By now, we are all aware of hallucinating models, which can pose serious risks when applied to the medical domain.

So I'm just curious: do they evaluate medical usage like any other domain, or do they give it special attention?

mehilshah commented 1 month ago

Many of these studies have focused on three aspects of medical applications: medical queries, medical examinations, and medical assistants. In fact, medicine received the most attention in the domain of Natural Sciences and Engineering.

  1. Medical Queries: testing the accuracy and reliability of generated answers (Q&A-based testing)
  2. Medical Examinations: performance of LLMs in medical examination assessments (standardized tests such as the USMLE)
  3. Medical Assistants: evaluating LLMs' understanding of specific subfields of medicine (understanding surgical clinical information, translating radiology reports, etc.)

If you want to learn more about eval of LLMs in Medicine, Section 3.5 is where you should look :)