JohnSnowLabs / langtest

Deliver safe & effective language models
http://langtest.org/
Apache License 2.0

Add feature to compare models on different benchmark datasets #952

Closed by dcecchini 6 months ago

dcecchini commented 8 months ago

We want to use LangTest to compare models across different benchmark datasets. The aim is to generate tables like:

[image: example model comparison table across benchmark datasets]

For that, I think we need to make it possible to pass the data as a list of dataset dictionaries to the Harness.
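
Something along these lines, where the list-of-dicts `data` argument is the proposed addition (not the current signature), and the dataset names and splits are just illustrative:

```python
from langtest import Harness

# Proposed: `data` accepts a list of dataset dictionaries instead of a single dict
harness = Harness(
    task="question-answering",
    model={"model": "gpt-3.5-turbo", "hub": "openai"},
    data=[
        {"data_source": "BoolQ", "split": "test-tiny"},
        {"data_source": "TruthfulQA", "split": "test-tiny"},
        {"data_source": "MMLU", "split": "test-tiny"},
    ],
)
```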

We already have model comparison test by test; we need to find a good way to show not only accuracy but also the other evaluation dimensions (bias, robustness, etc.) summarized by dataset.

My first idea is to add the dataset name to the generated tests so that it is easy to group by and summarize them. Then we have to find a way to represent the results (maybe also summarized by test category: bias, robustness, etc.), as in the sketch below.
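
A rough sketch of how the summary could work, continuing from the harness above after `generate()` and `run()`. It assumes `report()` keeps returning a pandas DataFrame and gains the proposed `dataset_name` column; the `pass_rate` handling is only illustrative:

```python
harness.generate().run()
report = harness.report()

# pass_rate may come back as a percentage string (e.g. "85%"), so normalize it
report["pass_rate"] = (
    report["pass_rate"].astype(str).str.rstrip("%").astype(float)
)

# One row per dataset, one column per test category (accuracy, bias, robustness, ...)
summary = (
    report.groupby(["dataset_name", "category"])["pass_rate"]
    .mean()
    .unstack("category")
)
print(summary)
```

That would give one comparison table per model; concatenating the summaries of several harnesses (one per model) would produce the kind of table shown in the image above.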