EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Wrong calculation of score when there are ties? #2007

Open apohllo opened 2 months ago

apohllo commented 2 months ago

Even though this seems unlikely, for a multiple-choice task a model might return identical scores for some of the options. If those tied options share the highest score and the reference label is among them, the result is treated as a true positive.

I think this is not a good strategy, since the model has not actually singled out which option is the correct one.

I would suggest changing the algorithm to treat the result as a false negative whenever there is a tie between the top-scoring answers, as sketched below.
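
For illustration, a minimal sketch of the proposed tie-aware scoring (this is not the harness's actual code; the function name and inputs are hypothetical):

```python
import numpy as np

def tie_aware_acc(lls: np.ndarray, gold: int) -> float:
    """Score one multiple-choice example, counting top-score ties as incorrect.

    lls  -- per-option scores (e.g. log-likelihoods) assigned by the model
    gold -- index of the reference (correct) option
    """
    top = lls.max()
    # Indices of all options that share the maximum score.
    tied = np.flatnonzero(lls == top)
    if len(tied) > 1:
        # Ambiguous prediction: the model does not single out one option,
        # so under this proposal the example counts as wrong regardless
        # of whether gold is among the tied options.
        return 0.0
    return float(tied[0] == gold)

# Options B and C tie at the top; even though gold is 1 (B),
# the tie-aware metric scores the example as 0.
print(tie_aware_acc(np.array([-2.3, -1.1, -1.1, -4.0]), gold=1))  # 0.0
```

By contrast, a plain `np.argmax` silently resolves ties toward the lowest index, so the standard accuracy metric can count such ambiguous examples as correct.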

apohllo commented 2 months ago

I have run some tests with the QuALITY dataset (my own task definition) and the openchat and PHI models. It seems this is not as big an issue as I thought. Out of 100 samples there were 13 cases for the PHI model with tied values, but only 2 of them were cases where the tie involved the highest score.
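
A short sketch of how such counts could be reproduced, assuming the per-option scores have been collected into a matrix (the function name and input layout are hypothetical):

```python
import numpy as np

def count_ties(score_matrix: np.ndarray) -> tuple[int, int]:
    """Count examples containing any tied scores, and ties at the top score.

    score_matrix -- shape (n_examples, n_options), per-option scores
                    gathered from an evaluation run.
    """
    any_ties = 0
    top_ties = 0
    for scores in score_matrix:
        _, counts = np.unique(scores, return_counts=True)
        if (counts > 1).any():
            any_ties += 1  # some pair of options received identical scores
        if (scores == scores.max()).sum() > 1:
            top_ties += 1  # the tie involves the highest score
    return any_ties, top_ties

m = np.array([[-1.0, -1.0, -2.0],
              [-0.5, -1.5, -1.5],
              [-0.1, -0.2, -0.3]])
print(count_ties(m))  # (2, 1): two rows contain ties, one ties at the top
```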

StellaAthena commented 2 months ago

Why is tiebreaking "incorrect" better than tiebreaking "correct"?