Before adding more tasks, it could be a good time to take a step back and see if it makes sense to do a bit of refactoring of the code. A few aspects to consider:
How can we make it as easy as possible to add new metrics? It's possible that we may want to add a few dozen more datasets, each with its own quirks. We could look at other frameworks like the lm-evaluation-harness to see how it's done there, and whether it makes sense to build on top of it or just take inspiration. For example, I think it would be nice if adding a new evaluation required changes in as few places as possible.
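One way to get the "changes in one place" property is a task registry, similar in spirit to how the lm-evaluation-harness organizes tasks. The sketch below is hypothetical (these are not the actual class or function names of either harness), just to illustrate the idea: a new dataset would only need one new decorated class.

```python
# Hypothetical sketch of a task registry: adding a new evaluation means
# writing one Task subclass in one file, nothing else.
TASK_REGISTRY = {}

def register_task(name):
    """Class decorator that adds a Task subclass to the global registry."""
    def wrap(cls):
        TASK_REGISTRY[name] = cls
        return cls
    return wrap

class Task:
    """Minimal interface every evaluation would implement."""
    def get_prompt(self, doc):
        raise NotImplementedError
    def postprocess(self, generation):
        raise NotImplementedError
    def compute_metrics(self, generations, references):
        raise NotImplementedError

@register_task("humaneval")
class HumanEval(Task):
    """Example task showing where dataset-specific quirks would live."""
    def get_prompt(self, doc):
        return doc["prompt"]
    def postprocess(self, generation):
        # Dataset-specific cleanup: keep only the first function body.
        return generation.split("\ndef ")[0]
    def compute_metrics(self, generations, references):
        return {"pass@1": 0.0}  # placeholder, not a real metric computation

def get_task(name):
    """Look up and instantiate a registered task by name."""
    return TASK_REGISTRY[name]()
```

The CLI would then only need a task name string, and everything dataset-specific stays inside the one registered class.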
Going for multilinguality, we might need to run the code execution in different environments. Maybe we should decouple generation and execution by saving the intermediate results to disk.
For the execution part, we will probably need to think about Docker environments to execute code in the different frameworks.
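For that, the execution step could just shell out to `docker run` with one image per language, plus the usual isolation flags since we're running untrusted generated code. A sketch of building such a command (image name, mount layout, and `run_tests.py` are placeholders, and the exact flag choices would need tuning):

```python
def docker_exec_cmd(image, workdir, script, memory="2g", cpus="2"):
    """Build a `docker run` argv that executes generated code in isolation:
    no network, capped memory/CPU, and a read-only mount of the saved
    generations. All concrete values here are illustrative."""
    return [
        "docker", "run", "--rm",
        "--network", "none",      # generated code gets no network access
        "--memory", memory,       # hard memory cap
        "--cpus", cpus,           # CPU quota
        "-v", f"{workdir}:/data:ro",
        image,
        "python", f"/data/{script}",
    ]

# One image per target language, e.g.:
cmd = docker_exec_cmd("python:3.10-slim", "/tmp/outputs", "run_tests.py")
```

Swapping the image (e.g. a `gcc` or `node` base) would then be all it takes to support another execution framework.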
These are just a few thoughts, let me know if you think this makes sense @loubnabnl.