Open · mtasic85 opened this issue 1 month ago
Thanks for the suggestion. That's a good idea, in my opinion. I was just reading through https://github.com/EleutherAI/lm-evaluation-harness/issues/1157, and HumanEval and MBPP might eventually come to lm-evaluation-harness, but it's hard to say when.
So, in the meantime, I think it's a good idea to add support as you suggested, with `--framework "lm-evaluation-harness"` as the default. (Please feel free to open a PR if you are interested and have time.)
Code evaluation tasks/benchmarks such as HumanEval and MBPP are missing from lm-evaluation-harness, but they are present and maintained in bigcode-evaluation-harness:
https://github.com/bigcode-project/bigcode-evaluation-harness
Since we would need to parse the tasks and check whether they belong to lm-evaluation-harness or bigcode-evaluation-harness, I propose keeping `litgpt evaluate` but adding a `--framework` argument that accepts `"lm-evaluation-harness"` (the default if not specified) or `"bigcode-evaluation-harness"`.
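Roughly, the dispatch I have in mind would look something like the sketch below. The helper functions and the `evaluate` signature here are hypothetical placeholders for illustration, not the actual litgpt internals or harness APIs:

```python
# Sketch of the proposed --framework dispatch. run_lm_eval_harness() and
# run_bigcode_harness() are hypothetical placeholders, not existing APIs.

def run_lm_eval_harness(checkpoint_dir: str, tasks: str) -> None:
    # Placeholder: would call into EleutherAI/lm-evaluation-harness here.
    print(f"lm-evaluation-harness: {tasks} on {checkpoint_dir}")


def run_bigcode_harness(checkpoint_dir: str, tasks: str) -> None:
    # Placeholder: would call into bigcode-evaluation-harness here
    # (HumanEval, MBPP, and other code benchmarks).
    print(f"bigcode-evaluation-harness: {tasks} on {checkpoint_dir}")


def evaluate(
    checkpoint_dir: str,
    tasks: str,
    framework: str = "lm-evaluation-harness",  # default when not specified
) -> None:
    """Dispatch `litgpt evaluate` to the selected evaluation harness."""
    if framework == "lm-evaluation-harness":
        run_lm_eval_harness(checkpoint_dir, tasks)
    elif framework == "bigcode-evaluation-harness":
        run_bigcode_harness(checkpoint_dir, tasks)
    else:
        raise ValueError(
            f"Unknown framework: {framework!r}; expected "
            "'lm-evaluation-harness' or 'bigcode-evaluation-harness'"
        )


if __name__ == "__main__":
    evaluate("checkpoints/my-model", "humaneval", framework="bigcode-evaluation-harness")
```

The only user-facing change would be the new `--framework` argument; existing `litgpt evaluate` invocations would keep working unchanged because it defaults to lm-evaluation-harness.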