RAISEDAL / RAISEReadingList

This repository contains a reading list of Software Engineering papers and articles!

Paper Review: DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks #87

Open mehilshah opened 1 month ago

mehilshah commented 1 month ago

Publisher

ICLR

Link to The Paper

https://arxiv.org/abs/2309.17167

Name of The Authors

Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, Xing Xie

Year of Publication

2024

Summary

The paper introduces DYVAL, a novel dynamic evaluation protocol for assessing large language models (LLMs) on reasoning tasks. The main motivations are to mitigate potential data contamination issues in existing static benchmarks and to enable evaluating LLMs on samples whose complexity is dynamically controlled.

The core idea of DYVAL is to dynamically generate evaluation samples on the fly using generation algorithms and complexity constraints rather than relying on a fixed dataset. For reasoning tasks, the authors design a "graph-informed" instantiation of DYVAL that leverages directed acyclic graphs (DAGs) to compose fundamental elements into more complex reasoning problems in fields like mathematics, logic, and algorithms.
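As a rough illustration of the graph-informed idea, the sketch below generates a small arithmetic reasoning sample from a randomly built computation graph, with depth serving as the complexity knob. This is a hypothetical simplification, not the paper's actual code: the names (`Node`, `build_dag`, `describe`) are illustrative, and for brevity it builds a tree (a special case of a DAG, without shared subexpressions).

```python
import random

class Node:
    """One vertex of the computation graph."""
    def __init__(self, name, op=None, children=None, value=None):
        self.name = name              # identifier used in the question text
        self.op = op                  # '+' or '*' for internal nodes, None for leaves
        self.children = children or []
        self.value = value            # set for leaves; internal values are derived

def build_dag(depth, rng):
    """Recursively build an arithmetic graph; depth controls complexity."""
    counter = [0]
    def fresh():
        counter[0] += 1
        return f"x{counter[0]}"
    def build(d):
        if d == 0:
            return Node(fresh(), value=rng.randint(1, 9))
        op = rng.choice(['+', '*'])
        return Node(fresh(), op=op, children=[build(d - 1), build(d - 1)])
    return build(depth)

def evaluate(node):
    """Ground-truth answer, computed directly from the graph."""
    if node.op is None:
        return node.value
    a, b = (evaluate(c) for c in node.children)
    return a + b if node.op == '+' else a * b

def describe(node, lines=None):
    """Flatten the graph into premise sentences in topological order."""
    if lines is None:
        lines = []
    if node.op is None:
        lines.append(f"{node.name} = {node.value}")
    else:
        for c in node.children:
            describe(c, lines)
        a, b = node.children
        lines.append(f"{node.name} = {a.name} {node.op} {b.name}")
    return lines

rng = random.Random(0)
root = build_dag(depth=2, rng=rng)
question = "\n".join(describe(root)) + f"\nWhat is the value of {root.name}?"
answer = evaluate(root)
```

Because samples are generated on the fly from a seeded generator, evaluation sets are both reproducible and effectively unlimited, and raising `depth` yields progressively harder problems without touching any static dataset.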

The authors then use DYVAL to evaluate state-of-the-art LLMs such as GPT-4, GPT-3.5, and LLaMA on seven reasoning tasks across increasing complexity levels. A central finding is that model performance degrades as the complexity of the dynamically generated samples increases.

Contributions of The Paper

Comments