We want to use LangTest to compare models across different benchmark datasets. The aim is to generate tables summarizing the comparison per dataset.
For that, I think we need to make it possible to pass `data` as a list of data dictionaries in the `Harness`.
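A minimal sketch of what the proposed API could look like. This assumes `data` accepts a list of the same dictionaries it accepts today; the dataset names are purely illustrative, and list support is the feature being requested, not current behavior:

```python
from langtest import Harness

# Hypothetical: today `data` takes a single dict; the proposal is to also
# accept a list of dataset dicts so one Harness covers several benchmarks.
harness = Harness(
    task="ner",
    model={"model": "dslim/bert-base-NER", "hub": "huggingface"},
    data=[
        {"data_source": "conll", "split": "test"},     # illustrative dataset
        {"data_source": "wikiner", "split": "test"},   # illustrative dataset
    ],
)
```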
We already have model comparison test by test; we need a good way to show not only accuracy but also the other evaluation dimensions (bias, robustness, etc.), summarized by dataset.
My first idea is to add the dataset name to the generated tests so that it is easy to group by and summarize them. Then we have to find a way to present the summaries (maybe also summarized by test category: bias, robustness, etc.), as in the sketch below.
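A rough sketch of the summarization step, assuming the generated report ends up as a pandas DataFrame with the added `dataset` column (the column names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical report shape: one row per generated test, tagged with the
# dataset it came from and its test category.
report = pd.DataFrame(
    {
        "dataset": ["conll", "conll", "wikiner", "wikiner"],
        "category": ["robustness", "bias", "robustness", "accuracy"],
        "pass_rate": [0.82, 0.91, 0.76, 0.88],
    }
)

# One summary row per dataset, one column per category: average the
# pass rates of all tests in each (dataset, category) group.
summary = report.groupby(["dataset", "category"])["pass_rate"].mean().unstack()
print(summary)
```

Running this across models would give one such table per model, which could then be laid side by side for the comparison view.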