EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Multi-round evaluation for chat models #1816


YilunZhou commented 4 months ago

Is it possible to conduct multi-round evaluations for chat models? For example, I want to study how a chat model can take hints to solve math problems. Say I have a multiple-choice math problem with one correct choice and one hint for each wrong choice. I first ask the model the question and get an answer (using generate_until and some parsing), in the following workflow:

User: Solve the following math question, and output a single letter corresponding to your selection. [question]
Agent: B

If this is correct, I stop here. If not, I continue where the agent left off:

User: Solve the following math question, and output a single letter corresponding to your selection. [question]
Agent: C
User: This is not correct. Consider the hint and try again: [hint corresponding to choice C]
Agent: B

In the evaluation, I want to calculate the percentage of correct first-round answers, as well as the percentage of correct second-round answers among the cases where the first-round answer was incorrect.
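
For concreteness, outside of any framework the protocol is just a loop like this minimal sketch (`two_round_eval`, `generate`, and the sample fields are placeholder names of mine, not harness API):

```python
def two_round_eval(samples, generate):
    """Run the two-round hint protocol and compute both metrics.

    Assumes each sample is a dict with "question", "answer" (the correct
    letter), and "hints" (a dict mapping each wrong letter to a hint);
    `generate` is any callable that maps a chat history (a list of
    {"role", "content"} dicts) to a parsed single-letter reply.
    """
    prompt = ("Solve the following math question, and output a single "
              "letter corresponding to your selection. ")
    round1_correct, round2_correct, round1_wrong = 0, 0, 0

    for sample in samples:
        messages = [{"role": "user", "content": prompt + sample["question"]}]
        reply = generate(messages)
        if reply == sample["answer"]:
            round1_correct += 1
            continue  # first round correct: stop here

        # Second round: continue where the agent left off, appending the
        # hint that corresponds to the wrong choice it picked.
        round1_wrong += 1
        messages.append({"role": "assistant", "content": reply})
        messages.append({
            "role": "user",
            "content": "This is not correct. Consider the hint and try "
                       "again: " + sample["hints"].get(reply, ""),
        })
        if generate(messages) == sample["answer"]:
            round2_correct += 1

    return {
        "round1_accuracy": round1_correct / len(samples),
        # conditioned on the first-round answer being incorrect
        "round2_accuracy": round2_correct / round1_wrong if round1_wrong else 0.0,
    }
```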

Is this possible in the current framework, or is the library not suitable for this kind of evaluation?

artemorloff commented 3 months ago

Right now I'm working on a solution to your problem! Take a look at #1571, where I suggest a way that multi-step and multi-round tasks may be handled. The magic lies in the update_request and update_storage functions, which you can customize to cover your problem. It would be great if you could take a look at that PR and leave your feedback so that I can make it better, and also draw @haileyschoelkopf's and @lintangsutawika's attention to the desired feature.
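
Roughly, the idea is that a task keeps per-document state across rounds and rewrites the next request from it. The sketch below is heavily simplified and hypothetical; the actual hook signatures and plumbing live in #1571, so treat it as an illustration of the concept rather than the PR's real API:

```python
def update_storage(storage, doc, response):
    # Hypothetical sketch -- see PR #1571 for the actual hook signatures.
    # Record the parsed answer from the round that just finished, so the
    # metric can later condition round-2 accuracy on round 1 being wrong.
    storage.setdefault("responses", []).append(response)
    return storage

def update_request(storage, doc):
    # Hypothetical sketch -- see PR #1571 for the actual hook signatures.
    # Decide whether another round is needed and, if so, build its prompt.
    last = storage["responses"][-1]
    if last == doc["answer"] or len(storage["responses"]) >= 2:
        return None  # stop: either correct, or both rounds are used up
    hint = doc["hints"].get(last, "")
    return "This is not correct. Consider the hint and try again: " + hint
```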