EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Multi-round evaluation for chat models #1816


YilunZhou commented 4 months ago

Is it possible to conduct multi-round evaluations for chat models? For example, I want to study how a chat model can take hints to solve math problems. Say I have a multiple-choice math problem with one correct choice and one hint for each wrong choice. I first ask the model the question and get an answer (using generate_until and some parsing), in the following workflow:

User: Solve the following math question, and output a single letter corresponding to your selection. [question]
Agent: B

If this is correct, I stop here. If not, I continue where the agent left off:

User: Solve the following math question, and output a single letter corresponding to your selection. [question]
Agent: C
User: This is not correct. Consider the hint and try again: [hint corresponding to choice C]
Agent: B

In the evaluation, I want to calculate the percentage of correct first-round answers, as well as the percentage of correct second-round answers among the cases where the first-round answer was incorrect.
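
For concreteness, outside of any framework the protocol is just a loop like this minimal sketch (`two_round_eval`, `generate`, and the sample fields are placeholder names of mine, not harness API):

```python
def two_round_eval(samples, generate):
    """Run the two-round hint protocol and compute both metrics.

    Assumes each sample is a dict with "question", "answer" (the correct
    letter), and "hints" (a dict mapping each wrong letter to a hint);
    `generate` is any callable that maps a chat history (a list of
    {"role", "content"} dicts) to a parsed single-letter reply.
    """
    prompt = ("Solve the following math question, and output a single "
              "letter corresponding to your selection. ")
    round1_correct, round2_correct, round1_wrong = 0, 0, 0

    for sample in samples:
        messages = [{"role": "user", "content": prompt + sample["question"]}]
        reply = generate(messages)
        if reply == sample["answer"]:
            round1_correct += 1
            continue  # first round correct: stop here

        # Second round: continue where the agent left off, appending the
        # hint that corresponds to the wrong choice it picked.
        round1_wrong += 1
        messages.append({"role": "assistant", "content": reply})
        messages.append({
            "role": "user",
            "content": "This is not correct. Consider the hint and try "
                       "again: " + sample["hints"].get(reply, ""),
        })
        if generate(messages) == sample["answer"]:
            round2_correct += 1

    return {
        "round1_accuracy": round1_correct / len(samples),
        # conditioned on the first-round answer being incorrect
        "round2_accuracy": round2_correct / round1_wrong if round1_wrong else 0.0,
    }
```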

Is this possible in the current framework, or is the library not suitable for this kind of evaluation?

artemorloff commented 3 months ago

Right now I'm working on a solution to your problem! Take a look at #1571, where I suggest a way that multi-step and multi-round tasks may be handled. The magic lies in the update_request and update_storage functions, which you can customize to cover your problem. It would be great if you could take a look at that PR and leave your feedback so that I can make it better, and also draw @haileyschoelkopf's and @lintangsutawika's attention to the desired feature.
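
Roughly, the idea is that a task keeps per-document state across rounds and rewrites the next request from it. The sketch below is heavily simplified and hypothetical; the actual hook signatures and plumbing live in #1571, so treat it as an illustration of the concept rather than the PR's real API:

```python
def update_storage(storage, doc, response):
    # Hypothetical sketch -- see PR #1571 for the actual hook signatures.
    # Record the parsed answer from the round that just finished, so the
    # metric can later condition round-2 accuracy on round 1 being wrong.
    storage.setdefault("responses", []).append(response)
    return storage

def update_request(storage, doc):
    # Hypothetical sketch -- see PR #1571 for the actual hook signatures.
    # Decide whether another round is needed and, if so, build its prompt.
    last = storage["responses"][-1]
    if last == doc["answer"] or len(storage["responses"]) >= 2:
        return None  # stop: either correct, or both rounds are used up
    hint = doc["hints"].get(last, "")
    return "This is not correct. Consider the hint and try again: " + hint
```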