EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Central repository for results from running the evaluations #662

Open c1505 opened 1 year ago

c1505 commented 1 year ago

Motivation

I want to use MMLU results broken down by task to better understand the characteristics of LLMs. I am curious to see the differences between architectures and how performance on individual tasks changes as the parameter count increases. I have found plenty of reporting of a single aggregate MMLU score, but I can't find per-task data for most of these models. I see that some results are available here: https://github.com/EleutherAI/lm-evaluation-harness/tree/master/results

The results that do exist elsewhere often conflict with one another.

If full result data were available, it would be easier to spot these discrepancies. Hopefully it would also encourage other groups running the evaluations to publish their full results as well.
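
For reference, here is a minimal sketch of how a per-task MMLU breakdown can be produced with the harness itself. It assumes the master-branch Python API (lm_eval.evaluator.simple_evaluate, the hf-causal model type, and MMLU subtasks registered as hendrycksTest-<subject>); the model named below is only an example.

```python
import json

import lm_eval.evaluator
import lm_eval.tasks

# Collect every MMLU subtask registered with the harness
# (assumes the master-branch naming scheme hendrycksTest-<subject>).
mmlu_tasks = [t for t in lm_eval.tasks.ALL_TASKS if t.startswith("hendrycksTest-")]

results = lm_eval.evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=EleutherAI/pythia-1b",  # example model, swap in your own
    tasks=mmlu_tasks,
    num_fewshot=5,
)

# results["results"] maps each subtask to its metrics (acc, acc_norm, stderrs),
# i.e. exactly the per-task breakdown that is hard to find for published models.
with open("mmlu_by_task.json", "w") as f:
    json.dump(results["results"], f, indent=2)
```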

Suggestions

c1505 commented 1 year ago

It does look like Hugging Face is working on making the results of running their evaluations public. I'm not sure how long that will take. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/73#64a8483ab35f48e37df1a7c8

haileyschoelkopf commented 1 year ago

Thanks for raising this! This is something we might hope to do with the repo, but we don't currently have the manpower to maintain it.

I'll see about talking to folks from HF about working together on setting this up and whether that might be feasible! Likewise, if you have any ideas or are willing to help set up a system for this, I'm definitely open to that as well.

c1505 commented 1 year ago

Thank you for your attention to this issue. I understand the bandwidth concerns and am happy to help with any part of the effort to make more evaluation results public.

c1505 commented 1 year ago

Preliminary data analysis is here: https://coreymorrisdata.medium.com/preliminary-analysis-of-mmlu-evaluation-data-insights-from-500-open-source-models-e67885aa364b

haileyschoelkopf commented 1 year ago

May be of interest to you: we have a project we are hoping to push forward where we want to measure how models' performance and predictions change under (and are often not robust to) small variations in task formatting or in how answer choices are scored, including on MMLU. This would give us a sense of which benchmarks are relatively robust to such evaluation decisions and which are brittle and lack construct validity.

Link to the thread on our Discord where we're organizing this: https://discord.com/channels/729741769192767510/1120714014964588637

c1505 commented 1 year ago

Thanks! I'll check it out :)

c1505 commented 1 year ago

> May be of interest to you: we have a project we are hoping to push forward where we want to measure how models' performance and predictions change under (and are often not robust to) small variations in task formatting or in how answer choices are scored, including on MMLU. This would give us a sense of which benchmarks are relatively robust to such evaluation decisions and which are brittle and lack construct validity.
>
> Link to the thread on our Discord where we're organizing this: https://discord.com/channels/729741769192767510/1120714014964588637

I found some formatting-related issues with the moral scenarios task, and there are other MMLU questions with formats similar to the ones shown to be problematic there: https://medium.com/@coreymorrisdata/is-it-really-about-morality-74fd6e512521
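
As a purely illustrative sketch (not the methodology of the project above), one way to quantify this kind of format sensitivity is to run the same tasks under a few prompt-format variants and compare the per-task scores. The file names below are hypothetical, and the files are assumed to follow the harness's results JSON layout.

```python
import json

# Hypothetical result files, each produced by running the same tasks under a
# different prompt-format variant; assumed layout: {"results": {"<task>": {"acc": ...}}}
variant_files = ["mmlu_format_a.json", "mmlu_format_b.json", "mmlu_format_c.json"]

per_variant = []
for path in variant_files:
    with open(path) as f:
        per_variant.append(json.load(f)["results"])

# Only compare tasks present in every variant.
tasks = sorted(set.intersection(*(set(r) for r in per_variant)))

for task in tasks:
    accs = [r[task]["acc"] for r in per_variant]
    spread = max(accs) - min(accs)
    # A large spread suggests the task's score depends heavily on formatting choices.
    print(f"{task}: acc {min(accs):.3f} to {max(accs):.3f} (spread {spread:.3f})")
```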