Thanks for this neat repo, it's very convenient for evaluating LLMs!
As a feature request, I would like to suggest adding an option to save the results of an evaluation for the implemented tasks to a file, to allow for easier analytics.
My understanding is that the current main.py only prints results.
It would also be useful to store scores per sub-task for tasks like MMLU or BBH.
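To make the idea concrete, here is a minimal sketch of what I have in mind, e.g. behind an `--output_path` flag in main.py. The `save_results` helper, the result layout, and the paths are all hypothetical, not from the repo:

```python
import json
import os

def save_results(results: dict, output_path: str) -> None:
    """Dump evaluation results, including per-sub-task scores, to a JSON file."""
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    with open(output_path, "w") as f:
        json.dump(results, f, indent=2)

# Hypothetical result shape: an aggregate score plus a per-sub-task
# breakdown, which would cover tasks like MMLU or BBH.
results = {
    "task": "mmlu",
    "aggregate": {"accuracy": None},             # placeholder, filled in by main.py
    "subtasks": {
        "abstract_algebra": {"accuracy": None},  # placeholder
        "anatomy": {"accuracy": None},           # placeholder
    },
}
save_results(results, "outputs/mmlu_results.json")
```

JSON is just one option; anything machine-readable (CSV, JSONL) that records the per-sub-task breakdown alongside the aggregate score would work for downstream analysis.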