bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.

Refactor code to separate tasks #19

Closed · loubnabnl closed this 1 year ago

loubnabnl commented 1 year ago

This PR refactors the codebase so that each task is built in its own file, similar to the approach of lm-evaluation-harness.

Currently, the main Python tasks are added; I still need to add the other few-shot tasks.

I removed Spider, since it needed to change anyway (https://github.com/bigcode-project/bigcode-evaluation-harness/issues/9), and APPS few-shot (to be ported in another PR after more testing; current performance is low, but that could also be because the benchmark is hard).
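
For context, a rough sketch of what "one task per file" can look like under this kind of layout (class, file, and method names here are illustrative, not the exact code merged in this PR):

```python
# tasks/base.py -- a shared interface that every task file implements (illustrative)
from abc import ABC, abstractmethod

from datasets import load_dataset


class Task(ABC):
    DATASET_PATH: str = None   # Hugging Face Hub dataset id
    DATASET_NAME: str = None   # optional dataset configuration

    def __init__(self, stop_words=None, requires_execution=True):
        self.stop_words = stop_words or []
        self.requires_execution = requires_execution  # True when the metric executes generated code
        self.dataset = load_dataset(self.DATASET_PATH, self.DATASET_NAME)

    @abstractmethod
    def get_dataset(self):
        """Return the split containing the evaluation problems."""

    @abstractmethod
    def get_prompt(self, doc):
        """Build the prompt the model completes for one problem."""

    @abstractmethod
    def get_reference(self, doc):
        """Return the reference (e.g. unit tests) for one problem."""

    @abstractmethod
    def postprocess_generation(self, generation, idx):
        """Clean up one raw generation (truncate at stop words, etc.)."""

    @abstractmethod
    def process_results(self, generations, references):
        """Compute the task metric (e.g. pass@k) over all generations."""


# tasks/humaneval.py -- one concrete task per file, subclassing the shared base
# (would import Task from tasks.base)
from evaluate import load


class HumanEval(Task):
    DATASET_PATH = "openai_humaneval"

    def __init__(self):
        super().__init__(
            stop_words=["\nclass", "\ndef", "\nprint", "\nif"],
            requires_execution=True,
        )

    def get_dataset(self):
        return self.dataset["test"]

    def get_prompt(self, doc):
        return doc["prompt"]

    def get_reference(self, doc):
        # unit tests plus the call that runs them, consumed by the code_eval metric
        return doc["test"] + f"\ncheck({doc['entry_point']})"

    def postprocess_generation(self, generation, idx):
        # truncate the completion at the first stop word and prepend the prompt,
        # so the metric executes a complete function definition
        for stop in self.stop_words:
            generation = generation.split(stop)[0]
        return self.get_prompt(self.get_dataset()[idx]) + generation

    def process_results(self, generations, references):
        # code_eval executes each candidate against the tests and reports pass@k
        # (running it requires HF_ALLOW_CODE_EVAL=1 in the environment)
        pass_at_k, _ = load("code_eval").compute(
            references=references, predictions=generations
        )
        return pass_at_k
```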

loubnabnl commented 1 year ago

Thanks a lot @Muennighoff for the review! I will address your comments. I'll probably also change the refactoring a little: @lvwerra and I discussed integrating with evaluate, since it now supports evaluation suites.
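
For reference, an evaluation suite in evaluate has roughly this shape (a minimal sketch based on the library's documented interface; the text-classification sub-task is only a placeholder, code-generation tasks would need their own metrics and execution logic):

```python
# Sketch of an evaluate EvaluationSuite; the sub-task shown is a placeholder,
# not a code-generation task.
import evaluate
from evaluate.evaluation_suite import SubTask


class Suite(evaluate.EvaluationSuite):
    def __init__(self, name):
        super().__init__(name)
        # each SubTask bundles a dataset, a metric, and the evaluator arguments
        self.suite = [
            SubTask(
                task_type="text-classification",
                data="imdb",
                split="test[:10]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "text",
                    "label_column": "label",
                    "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0},
                },
            ),
        ]


# a suite published on the Hub can then be loaded and run against a model:
# suite = evaluate.EvaluationSuite.load("<user>/<suite-name>")
# results = suite.run("<model-name-or-pipeline>")
```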

loubnabnl commented 1 year ago

I continued the refactoring for the other tasks and updated the task guide and README (we can address the evaluate integration in another PR, either here or in the library).

For these added tasks, I did some short runs and they seem to work properly, but I didn't do a full evaluation to compare against the literature (fine-tuned models). The only task missing now is few-shot APPS, which I will add later. I didn't port Spider, since it needed to change anyway (issue).
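
Under this layout, adding a task presumably comes down to creating a new file and registering it; a hedged sketch of what such a registry could look like (names are illustrative):

```python
# tasks/__init__.py -- hypothetical registry mapping task names to task classes
from . import humaneval, mbpp  # one module per task file

TASK_REGISTRY = {
    "humaneval": humaneval.HumanEval,
    "mbpp": mbpp.MBPP,
}

ALL_TASKS = sorted(TASK_REGISTRY)


def get_task(task_name):
    """Instantiate a task by name so the evaluation entry point stays task-agnostic."""
    try:
        return TASK_REGISTRY[task_name]()
    except KeyError:
        raise KeyError(f"Unknown task {task_name!r}; available tasks: {ALL_TASKS}")
```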

loubnabnl commented 1 year ago

Going to merge :tada: Thanks for the reviews @Muennighoff!