To reproduce these (though options ordering isn't exactly the same):
In [1]: from paperqa.litqa import LitQAEvaluation

In [2]: qa_prompt, eval_fn = LitQAEvaluation.from_question(
   ...:     ideal="42", distractors=["-84", "11", "cheesecake"],
   ...:     question="What is the meaning of life?", seed=0
   ...: )

In [3]: await eval_fn("14")
Out[3]: <LitQAEvaluation.CORRECT: 0>

In [4]: qa_prompt2, eval_fn2 = LitQAEvaluation.from_question(
   ...:     ideal="94107", distractors=["-8", "94106", "cheesecake"],
   ...:     question="What is my office's zip code?", seed=0
   ...: )

In [5]: await eval_fn2("the answer is 14004")
Out[5]: <LitQAEvaluation.INCORRECT: 1>
We have observed two incorrect edge cases when the evaluation language model is provided an answer not in the options list.

Here is one example:

The issues here are twofold: the LM ignores 14 (the supplied answer) and tries to answer directly, and the longform response then breaks our single-letter-extraction regex, which returns "T" (not an option). By chance, we correctly labelled this "incorrect". This ultimately leads to us marking the answer as INCORRECT (as expected), but for the wrong reason.
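For illustration, here is roughly how that parse can go wrong. This is a minimal sketch assuming a naive first-capital-letter regex; it is not paperqa's actual extraction code, but it reproduces the same failure mode on a longform response.

import re

# Hypothetical single-letter-extraction step (an assumption, not paperqa's real regex).
LETTER_RE = re.compile(r"([A-Z])")

def extract_option_letter(response: str) -> str | None:
    # Grabs the first capital letter anywhere in the response.
    match = LETTER_RE.search(response)
    return match.group(1) if match else None

print(extract_option_letter("A) 42"))                       # "A" -- the intended case
print(extract_option_letter("The meaning of life is 42."))  # "T" -- not an option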
Here is another example:

Again, the supplied answer is ignored and another one is chosen (E). The longform response is parsed as "T", and so we mark this as INCORRECT, not UNSURE.
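The regex half of the problem could be made less silent with a validity check along the lines of the sketch below; the names, enum values, and fallback-to-UNSURE behavior are all assumptions for illustration, not paperqa's API. The other half, the LM ignoring the supplied answer, is a prompting issue that a parser-side check cannot fix.

from enum import IntEnum

class Grade(IntEnum):
    # Stand-in for LitQAEvaluation; the UNSURE value here is an assumption.
    CORRECT = 0
    INCORRECT = 1
    UNSURE = 2

def grade_letter(letter: str | None, correct_letter: str,
                 unsure_letter: str, option_letters: set[str]) -> Grade:
    # A parse like "T" that is not a real option should not be treated as a
    # definite pick of a wrong option.
    if letter is None or letter not in option_letters:
        return Grade.UNSURE
    if letter == correct_letter:
        return Grade.CORRECT
    if letter == unsure_letter:
        return Grade.UNSURE
    return Grade.INCORRECT

print(grade_letter("T", "A", "E", {"A", "B", "C", "D", "E"}))  # Grade.UNSURE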