polm-stability closed this 10 months ago
This is still pretty bare-bones, but it's functional and should be good for automating eval across different models. Basically we can run the same eval we've been running, but with a simpler invocation, and without the risk of miscopying task versions or fewshot parameters.
Good point, docs should be updated now.
Awesome! LGTM!
This PR includes changes to allow the running of eval suites with a single command. An example command looks like this:
The suite is specified as a list of tasks, with versions and fewshot specs, in a config file. Because the spec is in a file, it can be versioned and shared across models, while each model can vary the prompt it uses (as well as args related to loading the model). Prompts are specified using names rather than numbers to make it clear what they refer to and avoid mistakes.
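For anyone skimming, here is a rough sketch of the idea in Python. It is not the code in this PR: the config layout, the section and key names, the `PROMPT_IDS` table, and the `task-version-prompt` string format are all assumptions made up for illustration. It only shows how a shared suite spec can be combined with a per-model prompt name.

```python
import configparser

# Hypothetical suite spec: one section per task, each pinning a task version
# and a fewshot count. The layout and names are illustrative assumptions.
SUITE_SPEC = """
[tasks.example_qa]
version = 1.1
fewshot = 2

[tasks.example_nli]
version = 1.0
fewshot = 3
"""

# Hypothetical table mapping readable prompt names to numeric prompt IDs.
PROMPT_IDS = {"plain": "0.1", "instruct": "0.2"}


def load_suite(text):
    """Parse the suite spec into a list of task dicts, in file order."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    suite = []
    for section in parser.sections():
        if not section.startswith("tasks."):
            continue
        suite.append(
            {
                "task": section[len("tasks."):],
                "version": parser[section]["version"],
                "fewshot": parser[section].getint("fewshot"),
            }
        )
    return suite


def resolve(suite, prompt_name):
    """Combine the shared suite with one model's prompt, chosen by name."""
    if prompt_name not in PROMPT_IDS:
        raise ValueError(f"unknown prompt name: {prompt_name!r}")
    prompt_id = PROMPT_IDS[prompt_name]
    # Each entry is (task spec string, fewshot count); the string format
    # here is an assumption, not the harness's actual naming scheme.
    return [
        (f"{t['task']}-{t['version']}-{prompt_id}", t["fewshot"])
        for t in suite
    ]


if __name__ == "__main__":
    for spec, fewshot in resolve(load_suite(SUITE_SPEC), "instruct"):
        print(spec, fewshot)
```

The point of the split is that versions and fewshot counts live in one place and never get retyped per model, while an unknown prompt name fails loudly instead of silently picking the wrong prompt.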