embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

standardize descriptive stats #1321

Open KennethEnevoldsen opened 6 days ago

KennethEnevoldsen commented 6 days ago

Currently, descriptive stats are quite inconsistent. This leads to problems, e.g. if we want to calculate the number of characters per task to estimate the number of compute tokens needed.

All of these calculations could be automated and already exist in `_calculate_metrics_from_split`; however, they are not calculated for all datasets. It would be great to have a test that verifies these are calculated consistently across all tasks.

Additionally, this data is currently included in the metadata, which might not be ideal (it often requires copy-paste, which could introduce errors). A solution could be to write it to a JSON file from which the data is fetched when needed. Tests can then fail if this cache is not complete.
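A minimal sketch of that idea follows. The names here (`descriptive_stats.json`, `compute_stats`, `build_cache`, `check_cache`) are hypothetical and not part of mteb's actual API; the point is only to show stats being computed once, written to a single JSON cache, and a test-style check failing when the cache is incomplete.

```python
import json
from pathlib import Path

CACHE_PATH = Path("descriptive_stats.json")  # hypothetical cache location


def compute_stats(texts):
    """Toy stand-in for per-split descriptive statistics."""
    return {
        "num_samples": len(texts),
        "num_characters": sum(len(t) for t in texts),
    }


def build_cache(tasks):
    """Compute stats for every task and write them all to one JSON file."""
    cache = {name: compute_stats(texts) for name, texts in tasks.items()}
    CACHE_PATH.write_text(json.dumps(cache, indent=2))
    return cache


def check_cache(task_names):
    """Fail (as a test would) if any task is missing from the cache."""
    cache = json.loads(CACHE_PATH.read_text())
    missing = [name for name in task_names if name not in cache]
    assert not missing, f"stats cache is incomplete: {missing}"
```

With a cache like this, adding a new task without regenerating the JSON would make `check_cache` fail in CI, which is the consistency guarantee discussed above.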

Samoed commented 6 days ago

I've been considering improvements to metadata_metrics as well.

> A solution could be to write it to a JSON file from which the data is fetched when needed. Tests can then fail if this cache is not complete.

That's a great suggestion! Are you suggesting storing a JSON file with all the metadata directly in the mteb repository?

KennethEnevoldsen commented 5 days ago

Yep - packaged into the package.
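Shipping the JSON inside the installed package would let it be read with `importlib.resources`. A hedged sketch, assuming a data file named `descriptive_stats.json` lives inside the package (mteb does not necessarily ship such a file today):

```python
import json
from importlib import resources


def load_packaged_stats(package: str = "mteb",
                        filename: str = "descriptive_stats.json"):
    """Read a JSON data file shipped inside an installed package.

    Both the package and filename here are illustrative defaults;
    the actual names would depend on how the cache is packaged.
    """
    with resources.files(package).joinpath(filename).open("r") as f:
        return json.load(f)
```

For this to work, the JSON file would need to be declared as package data (e.g. via `package-data` in `pyproject.toml`) so it is included in the wheel.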