UKGovernmentBEIS / inspect_ai

Inspect: A framework for large language model evaluations
https://inspect.ai-safety-institute.org.uk/

Occasional incorrect grading by choice scorer with chain-of-thought prompting #721

Closed lennijusten closed 1 week ago

lennijusten commented 1 week ago

When running multiple-choice evaluations with chain-of-thought prompting and the choice scorer, there are occasional instances where the scorer incorrectly grades the model's response, despite the model providing a clear final answer in the specified format.

I've seen two failure modes:

  1. The scorer fails to extract any answer at all.
  2. The scorer extracts an incorrect answer, either interpreting the entire response as the answer or extracting something like "A, B, C, D" despite a clear, single-letter answer.

My setup

I'm running multiple-choice evals with chain-of-thought prompting, using the choice scorer and the following prompt template:

MULTIPLE_CHOICE_TEMPLATE_COT = r"""
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of {letters}. Before answering, reason in a step-by-step manner so as to get the right answer.

{question}

{choices}
""".strip()

I'm using Inspect v0.3.25.
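
For reference, here's a minimal sketch of how this setup is wired together. The sample data is hypothetical, and the exact Task/solver argument names may differ slightly between Inspect versions:

```python
# Minimal sketch of the setup described above (hypothetical sample data;
# argument names may vary slightly between Inspect versions).
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

@task
def toy_mcq():
    return Task(
        dataset=[
            Sample(
                input="How many moons does Earth have?",
                choices=["Zero", "One", "Two", "Four"],
                target="B",
            )
        ],
        # multiple_choice() substitutes {question}, {choices}, and {letters}
        # into the template before sending the prompt to the model
        solver=multiple_choice(template=MULTIPLE_CHOICE_TEMPLATE_COT),
        scorer=choice(),
    )
```

The task can then be run with `inspect eval`, e.g. `inspect eval toy_mcq.py --model openai/gpt-4o`.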

Expected behavior

The scorer should correctly identify and grade the final answer provided by the model, formatted as "ANSWER: $LETTER", even when reasoning text precedes it.

Actual behavior

In rare cases (approximately 1 in 100 samples), the scorer appears to incorrectly grade the response. It may interpret the entire response as the answer or fail to recognize the correct final answer line.

Below are two toy examples illustrating the issue. I'm using toy examples to avoid leaking actual eval data onto GitHub, but I'm happy to share logs where this is happening privately via email.

No answer extracted

Model Response: 
Let's approach this step-by-step:
1. [reasoning steps...]
...
Therefore, the most likely scenario is that the grass is green.
ANSWER: D

Scorer Output:
ANSWER: 
EXPLANATION: [full model response repeated]
SCORE: I

Incorrect answer extracted

Model Response:
To determine the number of moons that Earth has, I will reason step-by-step...
[reasoning steps...]
Conclusion:
The Earth has one moon according to my most recent knowledge cutoff. 
ANSWER: B

Scorer Output:
ANSWER: A, B, C, D
EXPLANATION: [full model response repeated]
SCORE: I
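
For context, the behavior I'd expect is roughly the following pattern-based extraction. This is purely illustrative and not Inspect's actual choice-scorer code:

```python
import re

def extract_answer(completion: str, letters: str = "ABCD") -> str | None:
    # Illustrative only: pull the single letter from a final 'ANSWER: X'
    # line. Not Inspect's actual scorer implementation.
    matches = re.findall(rf"ANSWER\s*:\s*([{letters}])\b", completion, re.IGNORECASE)
    # Take the last match, since the answer line should come last.
    return matches[-1].upper() if matches else None

# Both toy responses above end with a well-formed answer line:
assert extract_answer("...reasoning...\nANSWER: D") == "D"
assert extract_answer("...reasoning...\nANSWER: B") == "B"
```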

dragonstyle commented 1 week ago

Hello! It would be great to get a look at the log (and ideally a specific sample) where you're seeing this behavior, so I can form a bit more of a theory about what's tripping up the scorer. You can email me directly at either cteague at gmail or cteague at rand.org.

Thanks and sorry for the issue!

lennijusten commented 1 week ago

Sent you an email!

dragonstyle commented 1 week ago

Thank you! I can see what the issue is based on those logs. I will get a fix put together ASAP!