EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Add Regression Testing #1883

Open haileyschoelkopf opened 1 month ago

haileyschoelkopf commented 1 month ago

As per the issue title.

It would be great to have regression tests set up for some core tasks, checking (1) that key tasks' scores don't regress or change, and (2) that the printed results tables stay the same.
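For the score check, something like the following could work (a minimal sketch only; the model choice `pretrained=EleutherAI/pythia-70m`, the task, and the pinned values here are placeholder assumptions, not agreed-upon references):

```python
import math

import lm_eval

# Hypothetical pinned reference values; a real test would record these once
# from a trusted run and commit them alongside the test.
REFERENCE_SCORES = {
    ("lambada_openai", "acc,none"): 0.0,  # placeholder, not a real score
}


def test_core_task_scores_do_not_regress():
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=EleutherAI/pythia-70m",  # assumed small CI model
        tasks=["lambada_openai"],
        limit=10,  # keep CI cheap; a real check might use more samples
    )
    for (task, metric), expected in REFERENCE_SCORES.items():
        observed = results["results"][task][metric]
        assert math.isclose(observed, expected, abs_tol=1e-4), (task, metric)
```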

ideally, we'd have:

If any contributors are interested in taking this on, it'd be a huge help! Otherwise, I hope to get to this.

zafstojano commented 1 month ago

@haileyschoelkopf

  1. Which tasks would you consider as core?
  2. Which models do you believe are most effective/efficient for doing this?

giorgossideris commented 1 month ago

Hello @haileyschoelkopf, I would be interested in helping with this.

haileyschoelkopf commented 1 month ago

Thanks all for the interest!

@clefourrier and @KonradSzafer are actually tackling some regression testing now; if anything can be passed off, maybe they can share.

models that would be useful to test would be

and core tasks would be

Something that would be helpful, and isn't currently being done, is checking equivalence of the printed results tables, to ensure that their formatting or the printing of task groupings does not get modified by a new PR. This could be done by extending the current tests/test_evaluator.py tests.
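A rough sketch of such a test, assuming `lm_eval.utils.make_table` is the helper that renders the results table and that a golden copy of its output lives at a (hypothetical) `tests/testdata/results_table_golden.txt`:

```python
from pathlib import Path

import lm_eval
from lm_eval.utils import make_table

# Hypothetical golden file, regenerated deliberately whenever the table
# format is meant to change.
GOLDEN_TABLE = Path("tests/testdata/results_table_golden.txt")


def test_results_table_formatting_is_stable():
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=EleutherAI/pythia-70m",  # assumed small CI model
        tasks=["lambada_openai"],
        limit=10,
    )
    rendered = make_table(results)
    assert rendered == GOLDEN_TABLE.read_text()
```

In practice the numeric cells would likely need to be rounded or masked (or the run fully seeded) so that only the layout, headers, and task-grouping rows are compared rather than exact metric values.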

Something which could be taken on, and is not yet being worked on, would be to implement a greater number of tests for the evaluate() and simple_evaluate() functions, or to mock the CLI and test cli_evaluate() with various options to ensure these components function as intended. I could describe a number of tests that would be useful if this is of interest! The CLI testing would be a big one.
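For the CLI, a smoke test could patch `sys.argv` and call `cli_evaluate()` directly (a sketch only; it assumes `cli_evaluate` parses `sys.argv` when called without arguments and that the built-in "dummy" model is acceptable here, both worth double-checking):

```python
import sys

import pytest

from lm_eval.__main__ import cli_evaluate


@pytest.mark.parametrize(
    "extra_args",
    [
        [],
        ["--num_fewshot", "0"],
        ["--batch_size", "1"],
    ],
)
def test_cli_evaluate_runs(monkeypatch, tmp_path, extra_args):
    argv = [
        "lm_eval",
        "--model", "dummy",            # assumed built-in dummy model for speed
        "--tasks", "lambada_openai",
        "--limit", "2",
        "--output_path", str(tmp_path),
    ] + extra_args
    monkeypatch.setattr(sys, "argv", argv)
    cli_evaluate()  # should complete without raising for each option combination
```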

giorgossideris commented 1 month ago

@haileyschoelkopf could you expand on the new tests that you consider useful?