Closed loubnabnl closed 1 year ago
Thanks a lot @Muennighoff for the review! I will address your comments. I'll probably change the refactoring a little: @lvwerra and I discussed integrating with evaluate, since it now supports evaluation suites.
I continued the refactoring for the other tasks and updated the task guide and README (we can address the evaluate integration in another PR, either here or in the library).
For these added tasks, I made some short runs and they seem to work properly, but I didn't run full evaluations to compare against the literature (fine-tuned models). The only task missing now is few-shot APPS, which I will add later. I didn't port Spider, since it needed to change anyway (issue).
Going to merge :tada: thanks for the reviews @Muennighoff
This PR refactors the codebase so that each task lives in its own file, similar to the approach of lm-evaluation-harness.
Currently the main Python tasks are added; I still need to add the other few-shot tasks.
[x] HumanEval, MBPP and APPS
[x] Fewshot tasks: code-to-text, conala, concode
I removed Spider since it needed to change anyway (https://github.com/bigcode-project/bigcode-evaluation-harness/issues/9), and also APPS few-shot (to be ported in another PR after more testing; current performance is low, but that could also be because the benchmark is hard).
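For readers unfamiliar with the lm-evaluation-harness pattern referenced above, the idea is that each task file defines a small class and registers itself in a shared registry, so adding a benchmark means adding one file. A minimal sketch, with illustrative names only (`Task`, `TASK_REGISTRY`, `register_task` and the example prompt logic are assumptions for this sketch, not the harness's actual API):

```python
# Hypothetical sketch of the per-file task layout; names and method
# signatures are illustrative, not bigcode-evaluation-harness's real API.
from abc import ABC, abstractmethod

TASK_REGISTRY = {}  # maps a task name to its class


def register_task(name):
    """Decorator so each task file can self-register on import."""
    def wrap(cls):
        TASK_REGISTRY[name] = cls
        return cls
    return wrap


class Task(ABC):
    """Base class that every task file would subclass."""

    @abstractmethod
    def get_prompt(self, doc):
        """Build the model prompt from one dataset example."""

    @abstractmethod
    def postprocess_generation(self, generation):
        """Clean up a raw model generation before scoring."""


# e.g. in a file like tasks/humaneval.py (toy logic for illustration)
@register_task("humaneval")
class HumanEval(Task):
    def get_prompt(self, doc):
        # HumanEval-style examples carry the prompt directly.
        return doc["prompt"]

    def postprocess_generation(self, generation):
        # Keep only the first function body, a common truncation heuristic.
        return generation.split("\ndef ")[0]


def get_task(name):
    """Look up and instantiate a task by name."""
    return TASK_REGISTRY[name]()
```

The benefit for a PR like this one is that the evaluation driver only ever touches the registry, so porting HumanEval, MBPP, APPS, or the few-shot tasks never requires edits to shared code.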