Open apohllo opened 2 months ago
I have run some tests with the Quality dataset (my own definition) and the openchat and PHI models. It seems this is not as big an issue as I thought. For 100 samples there were 13 cases where the PHI model returned identical values, but only 2 of them were cases where the tie occurred among the highest scores.
Why is tiebreaking "incorrect" better than tiebreaking "correct"?
Even though this seems very unlikely, for the multiple-choice task the models might return the same scores for some of the options. If these tied options have the highest score and the reference label is among them, the result will be treated as a true positive.
I think this is not a good strategy, since the model is not sure which of the options is the correct one.
I would suggest changing the algorithm to treat the result as a false negative whenever there's a tie between the top-scoring answers.
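The proposed behavior could be sketched like this (a minimal illustration with a hypothetical `is_correct` helper, not the repo's actual code): a prediction counts as correct only when a single option has the strictly highest score and it matches the reference label; any tie among the top-scoring options is treated as a miss.

```python
def is_correct(scores, reference):
    """scores: mapping from option label to model score.

    Returns True only if exactly one option has the highest score
    and that option equals the reference label.
    """
    top = max(scores.values())
    winners = [label for label, s in scores.items() if s == top]
    # A tie among top-scoring options counts as a false negative,
    # even if the reference label is among the tied winners.
    if len(winners) > 1:
        return False
    return winners[0] == reference

# A tie between A and B is rejected even though the reference "A" is among them:
print(is_correct({"A": 0.4, "B": 0.4, "C": 0.2}, "A"))  # False
# A unique top score matching the reference is accepted:
print(is_correct({"A": 0.6, "B": 0.3, "C": 0.1}, "A"))  # True
```

With exact float comparison, ties from rounded log-probabilities are caught, but scores differing by tiny noise are not; whether to compare within a tolerance is a separate design choice.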