Open haileyschoelkopf opened 1 month ago
Hello @haileyschoelkopf, I would be interested in helping with this.
Thanks all for the interest!
@clefourrier @KonradSzafer are tackling some regression testing now actually--if anything can be passed off maybe they can share.
Models that would be useful to test would be

and core tasks would be:
- mmlu
- arc_easy
- arc_challenge
- lambada_openai
- wikitext
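A score-regression check over tasks like these could be sketched as follows. This is a minimal, self-contained illustration: the baseline numbers are placeholders, and in a real test `new_results` would come from running the harness (e.g. via `simple_evaluate`) rather than being constructed by hand.

```python
import math

# Hypothetical baseline scores recorded from a known-good commit.
# Task names come from the list above; the metric values are placeholders.
BASELINES = {
    "arc_easy": {"acc": 0.50},
    "lambada_openai": {"acc": 0.45, "perplexity": 10.0},
}

def check_no_regression(new_results, baselines, rel_tol=1e-4):
    """Compare freshly computed metrics against stored baselines.

    Returns a list of (task, metric, expected, got) tuples for any
    metric that drifted beyond the relative tolerance.
    """
    failures = []
    for task, metrics in baselines.items():
        for metric, expected in metrics.items():
            got = new_results[task][metric]
            if not math.isclose(got, expected, rel_tol=rel_tol):
                failures.append((task, metric, expected, got))
    return failures

# Feeding the baselines back in should report no regressions.
assert check_no_regression(BASELINES, BASELINES) == []
```

A CI test would simply assert that the returned failure list is empty, so any drift in a core task's score fails loudly with the offending task and metric named.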
Something that would be helpful, and isn't currently being done, is checking equivalence of the printed results tables, to ensure that the formatting there or the printing of task groupings does not get modified by a new PR. This could be done by extending the current tests/test_evaluator.py tests.
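One way such a table-equivalence check might look: compare the rendered table against a stored golden string after normalizing cosmetic whitespace, so only real formatting changes fail. The table contents below are illustrative, not the harness's actual output.

```python
def normalize_table(text: str) -> str:
    """Strip trailing whitespace and blank lines so purely cosmetic
    differences don't fail the comparison."""
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

def tables_equivalent(expected: str, actual: str) -> bool:
    return normalize_table(expected) == normalize_table(actual)

# Illustrative golden table (not the harness's real format).
expected = """
|   Task   |Metric|Value|
|----------|------|-----|
|arc_easy  |acc   | 0.50|
"""

# The same table with trailing spaces added should still match.
assert tables_equivalent(expected, expected.replace("|\n", "|  \n"))
```

A real test would capture the table the evaluator prints (e.g. by capturing stdout in pytest) and compare it against a checked-in golden file.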
Something which could be taken on, and is not yet being worked on, would be to implement a greater number of tests for the evaluate() and simple_evaluate() functions, or to mock the CLI and test cli_evaluate() with various options to ensure that these components are functioning as intended. I could describe a number of tests that would be useful if this is of interest! The CLI testing would be a big one.
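The CLI-mocking idea could be sketched like this: patch `sys.argv` and call the entry point directly, asserting on the parsed configuration. The `toy_cli_main` function and its flags below are a hypothetical stand-in for cli_evaluate(), just to show the pattern.

```python
import argparse
import sys
from unittest import mock

def toy_cli_main(argv=None):
    """Hypothetical stand-in for the harness's CLI entry point."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tasks")
    parser.add_argument("--limit", type=int, default=None)
    args = parser.parse_args(argv if argv is not None else sys.argv[1:])
    return {"tasks": args.tasks.split(","), "limit": args.limit}

# Patch sys.argv exactly as a CLI test would, then invoke the entry point.
with mock.patch.object(sys, "argv",
                       ["lm_eval", "--tasks", "arc_easy,wikitext", "--limit", "10"]):
    result = toy_cli_main()

assert result == {"tasks": ["arc_easy", "wikitext"], "limit": 10}
```

The same pattern applied to the real cli_evaluate() would let each supported flag combination be exercised without spawning subprocesses.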
@haileyschoelkopf could you expand on the new tests that you consider useful?
As per the issue title.
It would be great to have regression tests for some core tasks set up, checking that (1) some key tasks' scores don't regress or change, and (2) the printouts for results (results tables) stay the same.
Ideally, we'd have a fast version (run with limit=10) that can always run. If any contributors are interested in taking this on, it'd be a huge help! Otherwise, I hope to get to this.