EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Central repository for results from running the evaluations #662

Open c1505 opened 1 year ago

c1505 commented 1 year ago

Motivation

I want to use MMLU results broken down by task to better understand the characteristics of LLMs. I am curious to see the differences between architectures and how performance on individual tasks changes as the parameter count increases. I have found plenty of reporting of a single aggregate MMLU score, but I can't find per-task data for most of these models. I see that some results are available here: https://github.com/EleutherAI/lm-evaluation-harness/tree/master/results

The results that do exist elsewhere often conflict with one another.

If full result data were available, it would be easier to spot these discrepancies. Hopefully it would also encourage other groups running the evaluations to publish their full results as well.
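
For reference, here is a minimal sketch of how a per-task MMLU breakdown can be produced with the harness itself. It assumes the master-branch Python API (lm_eval.evaluator.simple_evaluate, the hf-causal model type, and MMLU subtasks registered as hendrycksTest-<subject>); the model named below is only an example.

```python
import json

import lm_eval.evaluator
import lm_eval.tasks

# Collect every MMLU subtask registered with the harness
# (assumes the master-branch naming scheme hendrycksTest-<subject>).
mmlu_tasks = [t for t in lm_eval.tasks.ALL_TASKS if t.startswith("hendrycksTest-")]

results = lm_eval.evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=EleutherAI/pythia-1b",  # example model, swap in your own
    tasks=mmlu_tasks,
    num_fewshot=5,
)

# results["results"] maps each subtask to its metrics (acc, acc_norm, stderrs),
# i.e. exactly the per-task breakdown that is hard to find for published models.
with open("mmlu_by_task.json", "w") as f:
    json.dump(results["results"], f, indent=2)
```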

Suggestions

c1505 commented 1 year ago

It does look like Hugging Face is working on making the results of running their evaluations public. I'm not sure how long that will take. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/73#64a8483ab35f48e37df1a7c8

haileyschoelkopf commented 1 year ago

Thanks for raising this! This is something we might hope to do with the repo, but we don't currently have the manpower to maintain it.

I'll see about talking to folks from HF about working together on setting this up and whether that might be feasible! Likewise, if you have any ideas or are willing to help set up a system for this, I'm definitely open to that as well.

c1505 commented 1 year ago

Thank you for your attention to this issue. I understand the bandwidth concerns and am happy to help with any part of the effort to make more evaluation results public.

c1505 commented 1 year ago

Preliminary data analysis is here: https://coreymorrisdata.medium.com/preliminary-analysis-of-mmlu-evaluation-data-insights-from-500-open-source-models-e67885aa364b

haileyschoelkopf commented 1 year ago

May be of interest to you: we have a project we are hoping to push forward where we want to measure how models' performance and predictions change under (and are often not robust to) small variations in task formatting or in how answer choices are scored, including on MMLU. This would give us a sense of which benchmarks are relatively robust to such evaluation decisions and which are brittle and lack construct validity.

Link to the thread on our Discord where we're organizing this: https://discord.com/channels/729741769192767510/1120714014964588637

c1505 commented 1 year ago

Thanks! I'll check it out :)

c1505 commented 1 year ago

> May be of interest to you: we have a project we are hoping to push forward where we want to measure how models' performance and predictions change under (and are often not robust to) small variations in task formatting or in how answer choices are scored, including on MMLU. This would give us a sense of which benchmarks are relatively robust to such evaluation decisions and which are brittle and lack construct validity.
>
> Link to the thread on our Discord where we're organizing this: https://discord.com/channels/729741769192767510/1120714014964588637

I found some formatting-related issues with the moral scenarios task, and there are other MMLU questions with formats similar to the ones shown to be problematic there: https://medium.com/@coreymorrisdata/is-it-really-about-morality-74fd6e512521
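
As a purely illustrative sketch (not the methodology of the project above), one way to quantify this kind of format sensitivity is to run the same tasks under a few prompt-format variants and compare the per-task scores. The file names below are hypothetical, and the files are assumed to follow the harness's results JSON layout.

```python
import json

# Hypothetical result files, each produced by running the same tasks under a
# different prompt-format variant; assumed layout: {"results": {"<task>": {"acc": ...}}}
variant_files = ["mmlu_format_a.json", "mmlu_format_b.json", "mmlu_format_c.json"]

per_variant = []
for path in variant_files:
    with open(path) as f:
        per_variant.append(json.load(f)["results"])

# Only compare tasks present in every variant.
tasks = sorted(set.intersection(*(set(r) for r in per_variant)))

for task in tasks:
    accs = [r[task]["acc"] for r in per_variant]
    spread = max(accs) - min(accs)
    # A large spread suggests the task's score depends heavily on formatting choices.
    print(f"{task}: acc {min(accs):.3f} to {max(accs):.3f} (spread {spread:.3f})")
```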