EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

New Task: `mmlu` professionally translated by OpenAI as part of o1 release #2305

giuliolovisotto opened this issue 6 days ago

giuliolovisotto commented 6 days ago

From the OpenAI o1 System Card:

"we translated MMLU’s[39] test set into 14 languages using professional human translators. This approach differs from the GPT-4 Paper where MMLU was machine translated with Azure Translate [14]. Relying on human translators for this evaluation increases confidence in the accuracy of the translations"

The datasets are included in OpenAI's simple-evals repository: https://github.com/openai/simple-evals

Is anyone working on this? I'd be interested in adding these to lm-evaluation-harness. What's a good way to structure this new task so it co-exists with the MMLU variants already in the harness (kmmlu, cmmlu, arabicmmlu, ...)?
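For concreteness, a per-language config could follow the pattern of the harness's existing multiple-choice MMLU YAMLs. The sketch below is only illustrative: the `openai/MMMLU` dataset id, the `FR_FR` config name, and the `Question`/`A`/`B`/`C`/`D`/`Answer` columns are assumptions about how the translated sets are published and would need to be verified against the released files.

```yaml
# Hypothetical per-language config, e.g. openai_mmlu_fr.yaml.
# Dataset id, config name, and column names are assumptions to check
# against what OpenAI actually released.
task: openai_mmlu_fr
dataset_path: openai/MMMLU   # assumed Hugging Face dataset id
dataset_name: FR_FR          # assumed per-language config name
output_type: multiple_choice
test_split: test
doc_to_text: "{{Question.strip()}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: "{{Answer}}"  # letter label, resolved to a choice index by the harness
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
```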

Tagging @baberabb 😄

baberabb commented 5 days ago

Hi! This would be great! We should be able to use the same nomenclature they use in simple-evals, maybe prefixed with openai?

This script to generate the task boilerplate should come in handy; let me know if I can help!
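If it's useful, a group config in the style of the harness's existing `mmlu` grouping could then tie the per-language tasks together so a single task name runs all of them. The group and subtask names below are placeholders:

```yaml
# Hypothetical group config aggregating the per-language tasks.
group: openai_mmlu
task:
  - openai_mmlu_ar
  - openai_mmlu_de
  - openai_mmlu_fr
  # ... one entry per translated language (14 total per the o1 System Card)
```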