TIGER-AI-Lab / MMLU-Pro

The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]
Apache License 2.0

Paper claims there are 10 choices but the test split has a varying number of choices (anywhere from 3 to 10) #24

Open eldarkurtic opened 2 weeks ago

eldarkurtic commented 2 weeks ago

Hi folks, thanks for creating the dataset. In your paper and the dataset card, you claim that MMLU-Pro has 10 choices for each question, which seems to be false. By opening the Viewer tab and selecting the test split, one can see that only 83% of questions have 10 choices; the remaining ones have anywhere from 3 to 10. What is happening here?

wenhuchen commented 2 weeks ago

Hi there, our paper says that each question is augmented to 10 options. Our strict human and machine quality checks then remove low-quality options, so 17% of the questions end up with fewer than 10.

eldarkurtic commented 2 weeks ago

Do you plan to remove those 17% from the HF Hub, or do you plan to augment them somehow to reach 10 choices? I am asking because the Open LLM Leaderboard v2 uses MMLU-Pro as one of its tasks, and score normalization is affected by this: it was implemented following the claim that each question has 10 choices (so scores were normalized against a 1/10 random baseline).

We have an ongoing discussion about it here https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/947#66f3e0bae2e1cb781da1c769

wenhuchen commented 2 weeks ago

I see. A simple solution would be to pad the 17% of questions with "N/A" options so that each question physically reaches 10 options. Would that approach work for the normalization strategy?
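A minimal sketch of that padding (hypothetical helper name; assumes the options are a plain list of answer strings):

```python
def pad_options(options, target=10, filler="N/A"):
    # Extend the option list with "N/A" fillers until it has `target`
    # entries; questions that already have `target` options are unchanged.
    return options + [filler] * max(0, target - len(options))
```

For example, a question with 3 real choices would become a 10-entry list with 7 "N/A" fillers appended.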

eldarkurtic commented 2 weeks ago

Unfortunately no, because the idea of normalization is to subtract the random baseline accuracy first, and then to rescale it back to 0-100 (more details here: https://huggingface.co/spaces/open-llm-leaderboard/blog).
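For reference, the normalization described in the blog works roughly like this (a sketch under my reading of the blog, not the leaderboard's actual code):

```python
def normalize(acc, n_choices=10):
    # Subtract the random-guessing baseline, then rescale so the result
    # spans 0-100 (clamping below-chance accuracy at 0).
    baseline = 1.0 / n_choices
    return max(0.0, (acc - baseline) / (1.0 - baseline)) * 100.0
```

With `n_choices=10`, an accuracy of 10% maps to 0 and 100% maps to 100.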

wenhuchen commented 2 weeks ago

I have read the blog. I still think that padding the 17% of questions with "N/A" options up to 10 should work. Would you mind pointing out what the issue is?

eldarkurtic commented 1 week ago

The issue with "N/A" options is that answering a question with 10 actual choices is much harder than answering a question with 3 actual choices and 7 "N/A" fillers, so the two shouldn't be weighted the same way.
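Concretely: if a model treats only the real (non-"N/A") options as plausible candidates, the effective random baseline differs from the 1/10 the normalization assumes (hypothetical helper, for illustration only):

```python
def effective_baseline(options, filler="N/A"):
    # Chance of a random guess being correct when only the real,
    # non-filler options are considered live candidates.
    real = [o for o in options if o != filler]
    return 1.0 / len(real)
```

A question with 3 real choices padded with 7 "N/A" fillers has an effective baseline of 1/3 rather than the 1/10 used for normalization, so scoring both kinds of questions identically over-rewards the padded ones.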