Open eldarkurtic opened 2 weeks ago
Hi there, as our paper describes, each question is first augmented to 10 options, and then our strict human and machine quality checks remove the low-quality options. That is why about 17% of the questions end up with fewer than 10 choices.
Do you plan to remove those 17% from the HF Hub, or do you plan to augment them somehow to get back to 10 choices? I am asking because the Open LLM Leaderboard v2 uses MMLU-Pro as one of its tasks, and score normalization is affected: it was implemented following the claim that each question has 10 choices (so scores were normalized against a random baseline of 1/10).
We have an ongoing discussion about it here https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/947#66f3e0bae2e1cb781da1c769
I see. A simple solution would be to pad the 17% of questions with "N/A" options so that each question physically reaches 10 options. Would that approach work for the normalization strategy?
Unfortunately no, because the idea of normalization is to subtract the random baseline accuracy first, and then to rescale it back to 0-100 (more details here: https://huggingface.co/spaces/open-llm-leaderboard/blog).
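For reference, here is a minimal sketch of that normalization idea (my own illustration of the blog's description, not the leaderboard's actual code; the function name is made up):

```python
def normalize_score(accuracy: float, num_choices: int) -> float:
    """Subtract the random-guess baseline, then rescale to 0-100.

    With 10 choices the baseline is 1/10, so a raw accuracy of 0.1
    maps to 0 and a raw accuracy of 1.0 maps to 100.
    """
    baseline = 1.0 / num_choices
    return max(0.0, (accuracy - baseline) / (1.0 - baseline)) * 100.0
```

So the 1/10 only enters through the assumed baseline; if questions actually have fewer choices, random guessing scores above that baseline.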
I have read the blog. I still think that padding the 17% of questions with "N/A" options up to 10 options should work. Would you mind pointing out what the issue is here?
The issue with "N/A" options is that answering a question with 10 actual choices is much harder than answering a question with 3 actual choices and 7 "N/A", and therefore shouldn't be weighted the same way.
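To make that concrete, here is a rough sketch (again my own illustration, with an assumed 83/17 split, not leaderboard code) of how the true random baseline shifts when option counts vary:

```python
def mean_random_baseline(choice_counts):
    """Average random-guess accuracy over questions with varying option counts."""
    return sum(1.0 / n for n in choice_counts) / len(choice_counts)

# Hypothetical split: 83 questions with 10 choices, 17 with only 3.
counts = [10] * 83 + [3] * 17
baseline = mean_random_baseline(counts)  # noticeably above the assumed 0.1
```

Padding with "N/A" does not change this for a capable model: it can simply dismiss the padded options, so a 3-choice question effectively keeps its 1/3 guessing baseline even though it "physically" has 10 options.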
Hi folks, thanks for creating the dataset. In your paper and the dataset card, you claim that MMLU-Pro has 10 choices for each question, which seems to be false. By opening the Viewer tab and selecting the `test` split, one can see that only 83% of questions have 10 choices, and the remaining ones have anywhere from 3 to 10. What is happening here?