Summary
The paper proposes Meta Probing Agents (MPA), a novel dynamic evaluation protocol for large language models (LLMs) that addresses two major challenges: data contamination in existing benchmarks and the lack of interpretability of LLMs' abilities.
MPA takes inspiration from psychometric theory, categorising cognitive abilities into three core areas: language understanding, problem-solving, and domain knowledge. It employs two types of agents:
Probing agents: LLMs that dynamically transform existing evaluation samples into new ones based on principles corresponding to the three cognitive abilities.
Judging agents: LLMs that validate the consistency and correctness of the transformed samples.
This multi-agent approach with judges allows MPA to dynamically generate evaluation benchmarks while maintaining relevance to the original tasks (a sketch of the probe-then-judge loop follows).
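A minimal sketch of that loop, assuming a generic `llm(prompt)` completion function and illustrative prompt templates; the function, the prompts, and the retry policy are our assumptions, not the paper's exact implementation, and one probing principle (question paraphrasing) stands in for the full set of ability-specific transformations:

```python
# Minimal sketch of the MPA probe-then-judge loop.
# Assumptions (not from the paper): a generic llm(prompt) -> str completion
# function and illustrative prompt templates for one probing principle.

def llm(prompt: str) -> str:
    """Placeholder for any chat/completion API call."""
    raise NotImplementedError

PROBE_PROMPT = (
    "Paraphrase the following question without changing its answer.\n"
    "Question: {question}"
)

JUDGE_PROMPT = (
    "Original question: {original}\n"
    "Rewritten question: {rewritten}\n"
    "Do both questions have the same correct answer? Reply YES or NO."
)

def probe_and_judge(question: str, max_retries: int = 3) -> str | None:
    """Transform a sample with a probing agent, then validate it with a judging agent."""
    for _ in range(max_retries):
        rewritten = llm(PROBE_PROMPT.format(question=question))
        verdict = llm(JUDGE_PROMPT.format(original=question, rewritten=rewritten))
        if verdict.strip().upper().startswith("YES"):
            return rewritten  # judged consistent with the original sample
    return None  # no validated transformation within the retry budget
```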
Key Findings:
Significant performance degradation of LLMs on MPA benchmarks compared to original datasets like MMLU and ARC-C, indicating potential data contamination issues with static benchmarks.
Strong correlations between the three core cognitive abilities, with language understanding and problem-solving exhibiting the highest correlation.
An implicit "Matthew effect" - larger LLMs tend to have stronger correlations between the three abilities.
Error analysis revealed different failure modes of LLMs corresponding to each cognitive ability, highlighting room for improvement.
MPA can also serve as a data augmentation approach to improve LLM performance when fine-tuned on its generated data.
Contributions of The Paper
The core contribution is the psychometrics-inspired MPA protocol, which provides a general, flexible, and dynamic approach to evaluating LLMs while mitigating data contamination. MPA supports diverse tasks, enables multifaceted analysis of core cognitive abilities, and opens new ways to interpret and improve LLM capabilities.
Comments
An interesting way of adding/modifying knowledge in existing tasks, grounded in psychometric theory.
Can be integrated with adaptive testing/curriculum learning to simulate user behaviour and increase difficulty.
Extensible to code-based tasks: the framework is flexible, and the psychometric principles/transformations could be defined by us and made code-specific (see the sketch below).
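To illustrate that last point, here is a hypothetical code-specific probing transformation (our own sketch, not from the paper): rename identifiers in a code sample so its behaviour is unchanged while surface memorisation cues are removed; re-running the original unit tests on the rewritten sample would then play the judging agent's role.

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Rename variable names while preserving program semantics."""

    def __init__(self, mapping: dict[str, str]):
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Only rename identifiers listed in the mapping; leave builtins untouched.
        node.id = self.mapping.get(node.id, node.id)
        return node

def probe_code_sample(source: str, mapping: dict[str, str]) -> str:
    """Produce a semantics-preserving variant of a code evaluation sample."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+

# Example: the rewritten snippet should still pass the same unit tests.
print(probe_code_sample("total = a + b\nprint(total)",
                        {"total": "s", "a": "x", "b": "y"}))
# -> s = x + y
#    print(s)
```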
Publisher
ICML
Link to The Paper
https://arxiv.org/abs/2402.14865
Name of The Authors
Kaijie Zhu, Jindong Wang, Qinlin Zhao, Ruochen Xu, Xing Xie
Year of Publication
2024