Summary
The paper introduces DYVAL, a novel dynamic evaluation protocol for assessing large language models (LLMs) on reasoning tasks. The main motivations are to mitigate potential data contamination in existing static benchmarks and to enable evaluation of LLMs on samples with dynamically controlled complexity.
The core idea of DYVAL is to dynamically generate evaluation samples on the fly using generation algorithms and complexity constraints rather than relying on a fixed dataset. For reasoning tasks, the authors design a "graph-informed" instantiation of DYVAL that leverages directed acyclic graphs (DAGs) to compose fundamental elements into more complex reasoning problems in fields like mathematics, logic, and algorithms.
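To make the graph-informed idea concrete, here is a minimal sketch for the arithmetic case (not the authors' implementation; the function names, the natural-language template, and the use of tree depth as the sole complexity knob are illustrative assumptions): leaves hold random values, internal nodes hold operations, evaluating the graph yields the ground-truth answer, and a walk over the nodes yields the problem text.

```python
# Hypothetical sketch of graph-informed sample generation in the spirit of
# DYVAL. A random tree-shaped DAG of arithmetic operations is built, with
# depth acting as the complexity knob; evaluating the DAG gives the
# ground-truth answer, and describing its nodes gives the problem text.
import random
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def build_dag(depth):
    """Recursively build a binary arithmetic DAG of the given depth."""
    if depth == 0:
        return {"value": random.randint(1, 10)}  # leaf node holds a number
    return {
        "op": random.choice(list(OPS)),
        "children": [build_dag(depth - 1), build_dag(depth - 1)],
    }

def evaluate(node):
    """Compute the ground-truth answer by evaluating the DAG bottom-up."""
    if "value" in node:
        return node["value"]
    left, right = (evaluate(c) for c in node["children"])
    return OPS[node["op"]](left, right)

def describe(node, name="a"):
    """Render the DAG as natural-language premises, one line per node."""
    if "value" in node:
        return [f"{name} = {node['value']}"], name
    lines, child_names = [], []
    for i, child in enumerate(node["children"]):
        child_lines, child_name = describe(child, name + str(i))
        lines += child_lines
        child_names.append(child_name)
    lines.append(f"{name} = {child_names[0]} {node['op']} {child_names[1]}")
    return lines, name

dag = build_dag(depth=3)  # larger depth => harder, more multi-step problem
premises, root = describe(dag)
print("\n".join(premises))
print(f"Question: what is the value of {root}?  Answer: {evaluate(dag)}")
```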
The authors then use DYVAL to evaluate state-of-the-art LLMs such as GPT-4, GPT-3.5, and LLaMA on 7 reasoning tasks across increasing complexity levels. Some key findings include:
LLM performance degrades as problem complexity increases on DYVAL, highlighting the limited compositional reasoning abilities of current models on multi-step problems.
Models such as Phi and WizardMath, which report large gains on existing benchmarks, perform poorly on DYVAL, suggesting potential data contamination or quality issues in their training data.
Error analysis reveals LLM failure modes such as mistakes in intermediate calculation steps, incorrect logical reasoning, and unsubstantiated answers that may stem from memorization.
No single prompting technique works best across all DYVAL tasks.
DYVAL-generated samples can also be used to fine-tune LLMs, improving their performance on existing benchmarks.
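Since the last finding concerns fine-tuning on generated samples, here is a hypothetical sketch of packaging fresh samples as training records, reusing the generator helpers from the sketch above; the chat-style JSONL schema is an assumption, not the paper's exact pipeline.

```python
# Hypothetical sketch: turning freshly generated DYVAL-style samples into
# fine-tuning records (JSONL in a common chat format). Builds on the
# build_dag/describe/evaluate helpers sketched earlier.
import json

def make_finetune_records(n_samples, depth, path="dyval_train.jsonl"):
    with open(path, "w") as f:
        for _ in range(n_samples):
            dag = build_dag(depth)
            premises, root = describe(dag)
            prompt = "\n".join(premises) + f"\nWhat is the value of {root}?"
            record = {
                "messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": str(evaluate(dag))},
                ]
            }
            f.write(json.dumps(record) + "\n")

make_finetune_records(n_samples=1000, depth=3)
```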
Contributions of The Paper
A dynamic evaluation protocol (DYVAL): The main contribution is proposing a general and flexible protocol called DYVAL for dynamically generating evaluation samples for large language models, rather than using fixed datasets. This mitigates issues of potential data contamination and static complexity in existing benchmarks.
Graph-informed DYVAL for reasoning tasks: The authors design a specific instantiation of DYVAL that uses directed acyclic graphs (DAGs) to dynamically compose and control the complexity of reasoning problems across various domains like mathematics, logic, and algorithms.
Extensive evaluation and analysis of state-of-the-art LLMs: The paper conducts comprehensive experiments evaluating a range of large language models, including GPT-4, GPT-3.5, LLaMA, and Vicuna, on 7 different reasoning tasks generated by DYVAL at varying complexity levels.
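The varying-complexity evaluation in the last contribution can be pictured as a simple loop (again a hypothetical sketch reusing the helpers above): fresh samples are drawn at each complexity level on every run, so there is no fixed test set to leak into training data; query_model is a stand-in for whatever LLM API is under test.

```python
# Hypothetical dynamic evaluation loop in the DYVAL spirit: generate fresh
# samples per complexity level on each run and measure exact-match accuracy.
def query_model(prompt: str) -> str:
    raise NotImplementedError("replace with a call to the LLM under test")

def dynamic_eval(depths=(1, 2, 3, 4), n_per_level=50):
    accuracy = {}
    for depth in depths:  # complexity grows with DAG depth
        correct = 0
        for _ in range(n_per_level):
            dag = build_dag(depth)
            premises, root = describe(dag)
            prompt = "\n".join(premises) + f"\nWhat is the value of {root}?"
            if query_model(prompt).strip() == str(evaluate(dag)):
                correct += 1
        accuracy[depth] = correct / n_per_level
    return accuracy  # per the paper's findings, expect accuracy to drop with depth
```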
Comments
More relevant to us than DyVal2. The graph-based structure seems interesting for reasoning, and the constraint-based control of difficulty is in line with our intuition about increasing difficulty (similar to how developers evaluate algorithms).
Publisher
ICLR
Link to The Paper
https://arxiv.org/abs/2309.17167
Name of The Authors
Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, Xing Xie
Year of Publication
2024