Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Code evaluation using bigcode-evaluation-harness framework #1776

Open mtasic85 opened 1 month ago

mtasic85 commented 1 month ago

Code evaluation tasks/benchmarks such as HumanEval and MBPP are missing from lm-evaluation-harness, but they are present and maintained in bigcode-evaluation-harness.

https://github.com/bigcode-project/bigcode-evaluation-harness

Since we would otherwise need to parse task names and check whether each one lives in lm-evaluation-harness or bigcode-evaluation-harness, I propose keeping litgpt evaluate as-is but adding a --framework argument that accepts "lm-evaluation-harness" (the default if not specified) or "bigcode-evaluation-harness". A rough sketch of the dispatch is below.
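For illustration only, here is a minimal sketch of what the dispatch could look like; the function signature and backend calls are placeholders I made up for this issue, not litgpt's actual implementation:

```python
# Hypothetical sketch: a --framework flag that selects the evaluation backend
# while keeping the current lm-evaluation-harness behavior as the default.
#
# Illustrative CLI usage once the flag exists (not a real flag yet):
#   litgpt evaluate checkpoints/my-model --tasks "humaneval" --framework "bigcode-evaluation-harness"

def evaluate(checkpoint_dir: str, tasks: str,
             framework: str = "lm-evaluation-harness") -> None:
    if framework == "lm-evaluation-harness":
        # existing path: delegate to EleutherAI's lm-evaluation-harness
        print(f"Running lm-evaluation-harness tasks {tasks!r} on {checkpoint_dir}")
    elif framework == "bigcode-evaluation-harness":
        # new path: delegate to bigcode-evaluation-harness (HumanEval, MBPP, ...)
        print(f"Running bigcode-evaluation-harness tasks {tasks!r} on {checkpoint_dir}")
    else:
        raise ValueError(f"Unknown framework: {framework!r}")


if __name__ == "__main__":
    evaluate("checkpoints/my-model", "humaneval",
             framework="bigcode-evaluation-harness")
```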

rasbt commented 1 month ago

Thanks for the suggestion; that's a good idea in my opinion. I was just reading through https://github.com/EleutherAI/lm-evaluation-harness/issues/1157, and HumanEval and MBPP might eventually come to lm-evaluation-harness, but it's hard to say when.

So, in the meantime, I think it's a good idea to add support as you suggested, with --framework "lm-evaluation-harness" as the default. (Please feel free to open a PR if you are interested and have time.)