[Closed] Kong-Aobo closed this issue 5 months ago.
Hi - I think your concern is well-founded, and if you want an interesting read on the many ways you can implement MMLU and what it can do, I recommend this piece on the HF blog. Our implementation is designed to hew closer to the original MMLU repository and match the numbers reported in the original Llama paper, but this is not necessarily the "one correct" way to do these things!
In the future, we hope to add more evaluations so that a model can be assessed from many perspectives, which helps in cases like this, where looking at a single number for one task with one prompt is unfair to some models.
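To make the contrast concrete, here is a minimal sketch of the alternative scoring style the HF blog post discusses: rather than parsing generated text, each candidate option is scored by the (log-)likelihood the model assigns to it, and the argmax is taken. The `pick_by_loglikelihood` helper and the example scores are hypothetical, not code from this repository.

```python
# Hypothetical sketch of likelihood-based MMLU scoring (one of the
# implementation variants discussed in the HF blog post, NOT this
# repository's method). A real harness would obtain per-option
# log-likelihoods from the model; here they are supplied directly.

def pick_by_loglikelihood(option_scores: dict) -> str:
    """Return the option letter whose continuation the model
    considers most probable (highest log-likelihood)."""
    return max(option_scores, key=option_scores.get)

# Toy scores standing in for real model outputs.
scores = {"A": -3.2, "B": -1.1, "C": -4.0, "D": -2.7}
print(pick_by_loglikelihood(scores))  # -> B
```

This style never needs to parse free-form text, which is why it sidesteps the answer-extraction problem raised below, at the cost of diverging from how the original MMLU repository reports numbers.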
From reading the source code, I believe the evaluation method for MMLU works as follows: the first character generated by the model is compared with the true label. I have a question about this. Although you append "Answer: " to the question, the model usually does not output an option letter directly. Below is an example of the related code.
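The comparison described above can be sketched as follows. This is a hypothetical illustration of first-character matching, not the repository's actual code; the function name `score_mmlu_answer` is invented for this example.

```python
# Hypothetical sketch of first-character answer matching for MMLU
# (an illustration of the behavior described above, not the actual
# implementation in this repository).

def score_mmlu_answer(generation: str, gold_label: str) -> bool:
    """Return True if the first non-whitespace character of the
    model's generation matches the gold option letter (A/B/C/D)."""
    stripped = generation.strip()
    if not stripped:
        return False
    return stripped[0].upper() == gold_label.upper()

# The concern raised here: a model that answers in a full sentence
# is marked wrong even though the answer is correct.
print(score_mmlu_answer("B", "B"))                 # -> True
print(score_mmlu_answer("The answer is B.", "B"))  # -> False
```

The second call shows exactly the failure mode in question: a chatty but correct completion is scored as incorrect because the letter is not the first character.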
Is my concern reasonable? Could you provide a relatively detailed explanation? It seems that, in the field of instruction tuning, there are no very reliable methods for extracting answers from free-form generations without supervision.