We want to use LangTest to compare models across different benchmark datasets. The aim is to generate tables summarizing the comparison per dataset.
For that, I think we need to make it possible to pass `data` as a list of data dictionaries in the `Harness`.
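A minimal sketch of what the proposed API could look like. This assumes `data` accepts a list of the same dictionaries it accepts today; the dataset names are purely illustrative, and list support is the feature being requested, not current behavior:

```python
from langtest import Harness

# Hypothetical: today `data` takes a single dict; the proposal is to also
# accept a list of dataset dicts so one Harness covers several benchmarks.
harness = Harness(
    task="ner",
    model={"model": "dslim/bert-base-NER", "hub": "huggingface"},
    data=[
        {"data_source": "conll", "split": "test"},     # illustrative dataset
        {"data_source": "wikiner", "split": "test"},   # illustrative dataset
    ],
)
```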
We already have model comparison test by test; we need a good way to show not only accuracy but also the other evaluation dimensions (bias, robustness, etc.), summarized by dataset.
My first idea is to add the dataset name to the generated tests so that it is easy to group by and summarize them. Then we have to find a way to present the summaries (maybe also summarized by test category: bias, robustness, etc.), as in the sketch below.
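A rough sketch of the summarization step, assuming the generated report ends up as a pandas DataFrame with the added `dataset` column (the column names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical report shape: one row per generated test, tagged with the
# dataset it came from and its test category.
report = pd.DataFrame(
    {
        "dataset": ["conll", "conll", "wikiner", "wikiner"],
        "category": ["robustness", "bias", "robustness", "accuracy"],
        "pass_rate": [0.82, 0.91, 0.76, 0.88],
    }
)

# One summary row per dataset, one column per category: average the
# pass rates of all tests in each (dataset, category) group.
summary = report.groupby(["dataset", "category"])["pass_rate"].mean().unstack()
print(summary)
```

Running this across models would give one such table per model, which could then be laid side by side for the comparison view.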