Before adding more tasks, it could be a good time to take a step back and see if it makes sense to do a bit of refactoring of the code. A few aspects to consider:
How can we make it as easy as possible to add new metrics? It's possible that we may want to add a few dozen more datasets, each with its own quirks. We could look at other frameworks like the lm-evaluation-harness to see how it's done there, and whether it makes sense to build on top of it or just take inspiration. For example, I think it would be nice if adding a new evaluation required changes in as few places as possible.
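One way to get the "changes in one place" property is a task registry, similar in spirit to how the lm-evaluation-harness organizes tasks. The sketch below is hypothetical (these are not the actual class or function names of either harness), just to illustrate the idea: a new dataset would only need one new decorated class.

```python
# Hypothetical sketch of a task registry: adding a new evaluation means
# writing one Task subclass in one file, nothing else.
TASK_REGISTRY = {}

def register_task(name):
    """Class decorator that adds a Task subclass to the global registry."""
    def wrap(cls):
        TASK_REGISTRY[name] = cls
        return cls
    return wrap

class Task:
    """Minimal interface every evaluation would implement."""
    def get_prompt(self, doc):
        raise NotImplementedError
    def postprocess(self, generation):
        raise NotImplementedError
    def compute_metrics(self, generations, references):
        raise NotImplementedError

@register_task("humaneval")
class HumanEval(Task):
    """Example task showing where dataset-specific quirks would live."""
    def get_prompt(self, doc):
        return doc["prompt"]
    def postprocess(self, generation):
        # Dataset-specific cleanup: keep only the first function body.
        return generation.split("\ndef ")[0]
    def compute_metrics(self, generations, references):
        return {"pass@1": 0.0}  # placeholder, not a real metric computation

def get_task(name):
    """Look up and instantiate a registered task by name."""
    return TASK_REGISTRY[name]()
```

The CLI would then only need a task name string, and everything dataset-specific stays inside the one registered class.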
Going for multilinguality, we might need to run the code execution in different environments. Maybe we should decouple generation and execution by saving the intermediate results to disk.
For the execution part, we will probably need to think about Docker environments to execute code in the different frameworks.
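For that, the execution step could just shell out to `docker run` with one image per language, plus the usual isolation flags since we're running untrusted generated code. A sketch of building such a command (image name, mount layout, and `run_tests.py` are placeholders, and the exact flag choices would need tuning):

```python
def docker_exec_cmd(image, workdir, script, memory="2g", cpus="2"):
    """Build a `docker run` argv that executes generated code in isolation:
    no network, capped memory/CPU, and a read-only mount of the saved
    generations. All concrete values here are illustrative."""
    return [
        "docker", "run", "--rm",
        "--network", "none",      # generated code gets no network access
        "--memory", memory,       # hard memory cap
        "--cpus", cpus,           # CPU quota
        "-v", f"{workdir}:/data:ro",
        image,
        "python", f"/data/{script}",
    ]

# One image per target language, e.g.:
cmd = docker_exec_cmd("python:3.10-slim", "/tmp/outputs", "run_tests.py")
```

Swapping the image (e.g. a `gcc` or `node` base) would then be all it takes to support another execution framework.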
These are just a few thoughts, let me know if you think this makes sense @loubnabnl.