TIGER-AI-Lab / MMLU-Pro

The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]
Apache License 2.0

Duplicates in test split #6

Closed Pupy101 closed 4 months ago

Pupy101 commented 4 months ago

Hello, can you help me? There are 159 questions with duplicates in the test split. Here is the code I used to find them:

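from collections import defaultdict

import datasets

test = datasets.load_dataset("TIGER-Lab/MMLU-Pro", split="test")

# Count how often each (category, question, options, answer) combination occurs.
mapping = defaultdict(int)

for item in test:
    # Keep the options as a tuple so option boundaries stay part of the key.
    key = (item["category"], item["question"], tuple(item["options"]), item["answer"])
    mapping[key] += 1

# Print every combination that appears more than once, then the total.
count_doubles = 0
for (category, question, *_), count in mapping.items():
    if count > 1:
        print(category, repr(question))
        count_doubles += 1
print(count_doubles)

Wyyyb commented 4 months ago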

Thank you for pointing out these duplicates. They have minimal impact on the evaluation results, so we have decided not to remove them for the time being, in order to keep the number of test questions consistent.
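
Since the duplicates stay in the published test split, anyone who wants to evaluate without them would have to filter locally. The following is only a minimal sketch of one way to do that, not something the maintainers provide: the helper names `duplicate_key` and `drop_duplicates` are hypothetical, the sample rows are made up, and the key assumes the same four fields used in the detection script above.

```python
from typing import Iterable


def duplicate_key(item: dict) -> tuple:
    """Hypothetical helper: the fields that decide whether two rows match."""
    return (item["category"], item["question"], tuple(item["options"]), item["answer"])


def drop_duplicates(rows: Iterable[dict]) -> list[dict]:
    """Keep the first occurrence of each key, preserving the original order."""
    seen = set()
    unique = []
    for item in rows:
        key = duplicate_key(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Made-up rows that only mimic the dataset fields used in this thread.
sample = [
    {"category": "math", "question": "1 + 1 = ?", "options": ["1", "2"], "answer": "B"},
    {"category": "math", "question": "1 + 1 = ?", "options": ["1", "2"], "answer": "B"},
    {"category": "law", "question": "Which applies?", "options": ["X", "Y"], "answer": "A"},
]
unique_rows = drop_duplicates(sample)
print(len(sample) - len(unique_rows), "duplicate row(s) removed")
```

On the real data, the same function could be called on the `test` split loaded in the script above. Note that the number of rows removed can exceed the number of duplicated questions, because a question that appears three times contributes two extra rows.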