EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

New Task: `openai_mmmlu` — MMLU professionally translated by OpenAI as part of the o1 release #2305


giuliolovisotto commented 2 months ago

From the OpenAI o1 System Card:

"we translated MMLU’s[39] test set into 14 languages using professional human translators. This approach differs from the GPT-4 Paper where MMLU was machine translated with Azure Translate [14]. Relying on human translators for this evaluation increases confidence in the accuracy of the translations"

The datasets are included in OpenAI's simple-evals library: https://github.com/openai/simple-evals

Is anyone working on this? I'd be interested in adding these to lm-evaluation-harness. What's a good way to structure this new task so that it co-exists with the MMLU variants already present (kmmlu, cmmlu, arabicmmlu, ...)?
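
For context, here is a minimal sketch of how one might inspect the translated data before settling on per-language task names. It assumes the translated sets are mirrored on the Hugging Face Hub as `openai/MMMLU` (they are also shipped as CSVs in simple-evals); the subset and column names in the comments are guesses, not confirmed:

```python
# Hypothetical sketch: list the available language subsets and inspect the schema.
# "openai/MMMLU" and the subset/column names are assumptions to be verified.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("openai/MMMLU")  # e.g. ["AR_XY", "FR_FR", ...]
print(configs)

# Load one language to check the columns (likely Question, A, B, C, D, Answer, Subject).
ds = load_dataset("openai/MMMLU", configs[0], split="test")
print(ds[0])
```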

Tagging @baberabb 😄

baberabb commented 2 months ago

Hi! This would be great! We should be able to use the same nomenclature they use, maybe prepending the task names with openai?

There's a script for generating the task boilerplate that should come in handy; see the sketch below, and let me know if I can help!
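
As a rough illustration (not the final layout), a generator in the spirit of the config-generator scripts already used for other multilingual MMLU variants could emit one YAML per language. The locale codes, task names, dataset path, and prompt fields below are assumptions to be checked against the actual data:

```python
# Hypothetical per-language task config generator; all names/fields are illustrative.
import yaml

# Locale codes as used in OpenAI's MMMLU release; adjust to match the actual data.
LOCALES = [
    "AR_XY", "BN_BD", "DE_DE", "ES_LA", "FR_FR", "HI_IN", "ID_ID",
    "IT_IT", "JA_JP", "KO_KR", "PT_BR", "SW_KE", "YO_NG", "ZH_CN",
]

for locale in LOCALES:
    task_name = f"openai_mmmlu_{locale.lower()}"
    config = {
        "task": task_name,
        "dataset_path": "openai/MMMLU",  # assumed Hugging Face mirror of the CSVs
        "dataset_name": locale,
        "test_split": "test",
        "output_type": "multiple_choice",
        # Assumed column names: Question, A, B, C, D, Answer
        "doc_to_text": "{{Question.strip()}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nAnswer:",
        "doc_to_choice": ["A", "B", "C", "D"],
        "doc_to_target": "Answer",
        "metric_list": [{"metric": "acc", "aggregation": "mean", "higher_is_better": True}],
    }
    with open(f"{task_name}.yaml", "w", encoding="utf-8") as f:
        yaml.dump(config, f, sort_keys=False, allow_unicode=True)
```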