EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Add Regression Testing #1883

Open haileyschoelkopf opened 1 month ago

haileyschoelkopf commented 1 month ago

As per the issue title.

It would be great to have regression tests set up for some core tasks, checking (1) that key tasks' scores don't regress or change, and (2) that the printed results tables stay the same.
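For the score check, something like the following could work (a minimal sketch only; the model choice `pretrained=EleutherAI/pythia-70m`, the task, and the pinned values here are placeholder assumptions, not agreed-upon references):

```python
import math

import lm_eval

# Hypothetical pinned reference values; a real test would record these once
# from a trusted run and commit them alongside the test.
REFERENCE_SCORES = {
    ("lambada_openai", "acc,none"): 0.0,  # placeholder, not a real score
}


def test_core_task_scores_do_not_regress():
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=EleutherAI/pythia-70m",  # assumed small CI model
        tasks=["lambada_openai"],
        limit=10,  # keep CI cheap; a real check might use more samples
    )
    for (task, metric), expected in REFERENCE_SCORES.items():
        observed = results["results"][task][metric]
        assert math.isclose(observed, expected, abs_tol=1e-4), (task, metric)
```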

ideally, we'd have:

If any contributors are interested in taking this on, it'd be a huge help! Otherwise, I hope to get to this.

zafstojano commented 1 month ago

@haileyschoelkopf

  1. Which tasks would you consider as core?
  2. Which models do you believe are most effective/efficient for doing this?

giorgossideris commented 1 month ago

Hello @haileyschoelkopf, I would be interested in helping with this.

haileyschoelkopf commented 1 month ago

Thanks all for the interest!

@clefourrier and @KonradSzafer are actually tackling some regression testing now; if anything can be passed off, maybe they can share.

models that would be useful to test would be

and core tasks would be

Something that would be helpful, and isn't currently being done, is checking equivalence of the printed results tables, to ensure that their formatting or the printing of task groupings does not get modified by a new PR. This could be done by extending the current tests/test_evaluator.py tests.
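A rough sketch of such a test, assuming `lm_eval.utils.make_table` is the helper that renders the results table and that a golden copy of its output lives at a (hypothetical) `tests/testdata/results_table_golden.txt`:

```python
from pathlib import Path

import lm_eval
from lm_eval.utils import make_table

# Hypothetical golden file, regenerated deliberately whenever the table
# format is meant to change.
GOLDEN_TABLE = Path("tests/testdata/results_table_golden.txt")


def test_results_table_formatting_is_stable():
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=EleutherAI/pythia-70m",  # assumed small CI model
        tasks=["lambada_openai"],
        limit=10,
    )
    rendered = make_table(results)
    assert rendered == GOLDEN_TABLE.read_text()
```

In practice the numeric cells would likely need to be rounded or masked (or the run fully seeded) so that only the layout, headers, and task-grouping rows are compared rather than exact metric values.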

Something which could be taken on, and is not yet being worked on, would be to implement a greater number of tests for the evaluate() and simple_evaluate() functions, or to mock the CLI and test cli_evaluate() with various options to ensure these components function as intended. I could describe a number of tests that would be useful if this is of interest! The CLI testing would be a big one.
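For the CLI, a smoke test could patch `sys.argv` and call `cli_evaluate()` directly (a sketch only; it assumes `cli_evaluate` parses `sys.argv` when called without arguments and that the built-in "dummy" model is acceptable here, both worth double-checking):

```python
import sys

import pytest

from lm_eval.__main__ import cli_evaluate


@pytest.mark.parametrize(
    "extra_args",
    [
        [],
        ["--num_fewshot", "0"],
        ["--batch_size", "1"],
    ],
)
def test_cli_evaluate_runs(monkeypatch, tmp_path, extra_args):
    argv = [
        "lm_eval",
        "--model", "dummy",            # assumed built-in dummy model for speed
        "--tasks", "lambada_openai",
        "--limit", "2",
        "--output_path", str(tmp_path),
    ] + extra_args
    monkeypatch.setattr(sys, "argv", argv)
    cli_evaluate()  # should complete without raising for each option combination
```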

giorgossideris commented 1 month ago

@haileyschoelkopf could you expand on the new tests that you consider useful?