Closed: lennijusten closed this 1 week ago
Hello! It would be great to get a look at the log (and ideally a specific sample) where you're seeing this behavior to see if I can form a bit more of a theory as to what about the sample is tripping up the scorer. You can email me directly at either cteague at gmail or cteague at rand.org.
Thanks and sorry for the issue!
Sent you an email!
Thank you! I can see what the issue is based upon those logs. I will get a fix put together ASAP!
When running multiple-choice evaluations using chain-of-thought prompting and the `choice` scorer, there are occasional instances where the scorer appears to incorrectly grade the model's response. This occurs despite the model providing a clear final answer in the specified format.
## My setup
I'm running multiple-choice evals with chain-of-thought prompting using the `choice` scorer, with a prompt template that instructs the model to finish its reply with a final line of the form "ANSWER: $LETTER". I'm using Inspect v0.3.25.
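For reference, here is a minimal sketch of the kind of task I'm running (the template and sample are illustrative stand-ins, not my actual template or eval data, and parameter names may vary slightly across Inspect versions):

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

# Illustrative chain-of-thought template (a stand-in, not the one from my
# run): it asks for step-by-step reasoning followed by a final line of the
# form "ANSWER: $LETTER".
COT_TEMPLATE = """
Answer the following multiple choice question. Think step by step, then
finish your reply with "ANSWER: $LETTER" (without quotes) where LETTER is
one of {letters}.

{question}

{choices}
""".strip()

@task
def toy_mcq():
    return Task(
        dataset=[Sample(input="What is 2 + 2?", choices=["3", "4", "5"], target="B")],
        # newer Inspect releases call this `solver`; some older 0.3.x releases used `plan`
        solver=multiple_choice(template=COT_TEMPLATE),
        scorer=choice(),
    )
```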
## Expected behavior
The scorer should correctly identify and grade the final answer provided by the model, formatted as "ANSWER: $LETTER", even when reasoning text precedes the final answer.
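Concretely, I'd expect extraction to behave roughly like this sketch (my own illustration of the intended behavior, not the actual implementation of `choice`):

```python
import re

def extract_answer(completion: str) -> str | None:
    # Take the last "ANSWER: X" occurrence, so that any reasoning text
    # (which may itself mention answer letters) before it is ignored.
    matches = re.findall(r"ANSWER\s*:\s*([A-Z])", completion)
    return matches[-1] if matches else None

assert extract_answer("B seems plausible... on reflection,\nANSWER: C") == "C"
```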
## Actual behavior
In rare cases (approximately 1 in 100 samples), the scorer appears to incorrectly grade the response. It may interpret the entire response as the answer or fail to recognize the correct final answer line.
Below are two toy examples illustrating the issue. I'm using toy examples to avoid leaking actual eval data onto GitHub, but I'm happy to share logs where this is happening privately via email.
### No answer extracted
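A hypothetical transcript of this failure mode (illustrative only, not one of the real samples):

```
Q: What is 2 + 2?  (A) 3  (B) 4  (C) 5

Model: Let's work through this step by step. 2 + 2 = 4,
which corresponds to option B.
ANSWER: B

Scorer: answer=None -> graded incorrect (no answer extracted)
```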
### Incorrect answer extracted
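And a hypothetical transcript of the second failure mode (again illustrative, not one of the real samples):

```
Q: What is 2 + 2?  (A) 3  (B) 4  (C) 5

Model: Option C would be 5, which is too large; option A is too small.
2 + 2 = 4, so the answer is B.
ANSWER: B

Scorer: answer="C" -> graded incorrect (letter picked up from the
reasoning text instead of the final ANSWER line)
```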