Stability-AI / lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.
MIT License

Set up the ability to run eval suites #114

Closed · polm-stability closed this 10 months ago

polm-stability commented 10 months ago

This PR adds the ability to run eval suites with a single command. An example invocation looks like this:

```
python scripts/run_suite.py my_model my_eval_suite my_prompt
```

The suite is specified in a config file as a list of tasks, each with a version and few-shot setting. Because the spec lives in a file, it can be versioned and shared across models, while each model can vary the prompt it uses (as well as args related to loading the model). Prompts are specified by name rather than by number, to make it clear what they refer to and to avoid mistakes. A sketch of such a config is shown below.
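For concreteness, a suite config along these lines might look like the following. This is a hypothetical sketch: the file name, task names, and schema are assumptions for illustration, not taken from this PR.

```yaml
# my_eval_suite.yaml -- hypothetical sketch of a suite spec
tasks:
  - name: jsquad            # task name as registered in the harness
    version: "1.1"          # pinned task version
    fewshot: 2              # number of few-shot examples
  - name: jcommonsenseqa
    version: "1.1"
    fewshot: 3
# Note: the prompt is deliberately not part of the suite; it is passed
# per model on the command line, so the same suite can be shared.
```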

polm-stability commented 10 months ago

This is still pretty bare-bones, but it's functional and should be good for automating evals across different models. We can run the same eval we've been running with a simpler invocation, without the risk of copying versions or few-shot parameters over incorrectly.
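For orientation, the core loop of such a script might look something like the minimal sketch below. It assumes the fork keeps the upstream lm-evaluation-harness `evaluator.simple_evaluate` entry point and a YAML suite file like the one above; the config keys and the way the task name, version, and prompt name are composed are assumptions, not taken from this PR.

```python
# Hypothetical sketch of scripts/run_suite.py's core logic.
import argparse

import yaml
from lm_eval import evaluator  # assumes the upstream entry point is kept


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("model")   # model name understood by the harness
    parser.add_argument("suite")   # path to the suite config file
    parser.add_argument("prompt")  # prompt name, applied to every task
    args = parser.parse_args()

    with open(args.suite) as f:
        suite = yaml.safe_load(f)

    for task in suite["tasks"]:
        # The suite pins the task version and few-shot count; only the
        # model and prompt vary per run, so results stay comparable.
        results = evaluator.simple_evaluate(
            model=args.model,
            tasks=[f"{task['name']}-{task['version']}-{args.prompt}"],
            num_fewshot=task["fewshot"],
        )
        print(results["results"])


if __name__ == "__main__":
    main()
```

Looping per task keeps each task's own few-shot setting, since `simple_evaluate` takes a single `num_fewshot` per call.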

polm-stability commented 10 months ago

Good point, docs should be updated now.

mkshing commented 10 months ago

Awesome! LGTM!