embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Allow aggregated tasks within benchmarks #1231

Open KennethEnevoldsen opened 1 month ago

KennethEnevoldsen commented 1 month ago

We currently have only one aggregated task (CQADupstack), but we can definitely imagine more in the future (e.g. for CoIR in https://github.com/embeddings-benchmark/leaderboard/pull/27).

A proposed solution is to reuse the benchmark object (benchmarks are already a group of tasks) and allow a benchmark's tasks to be a list[task | benchmark]. A rough sketch of this is below.

This will require updates to MTEB.MTEB, as well as to create_meta and potentially the CLI.

This approach should also solve: https://github.com/embeddings-benchmark/mteb/issues/1171
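A minimal sketch of the recursive structure (class and method names such as `AbsTask` and `flatten_tasks` are illustrative, not the actual mteb API):

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union


class AbsTask:
    """Stand-in for mteb's task base class; illustrative only."""


@dataclass
class Benchmark:
    name: str
    # A benchmark entry can now be a plain task or a nested benchmark.
    tasks: list[Union[AbsTask, "Benchmark"]] = field(default_factory=list)

    def flatten_tasks(self) -> list[AbsTask]:
        """Recursively expand nested benchmarks into a flat list of tasks."""
        flat: list[AbsTask] = []
        for entry in self.tasks:
            if isinstance(entry, Benchmark):
                flat.extend(entry.flatten_tasks())
            else:
                flat.append(entry)
        return flat
```

MTEB.MTEB and create_meta could then call something like flatten_tasks() before running, while the leaderboard keeps the nested structure when aggregating scores.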

Samoed commented 1 month ago

I think we could also add an average result across subsets for multilingual datasets.

KennethEnevoldsen commented 1 month ago

Not entirely sure what is meant, @Samoed - should we add it for multilingual datasets? (Isn't that already there?)

Samoed commented 1 month ago

Yes, the author of the COIR benchmark wanted an average score for the task. I believe this can be done if all subsets of the task are included in the results. This could also be implemented in the results repository. Currently, there are some tasks where the average is calculated.
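A minimal sketch of that average, assuming each task result exposes a dict of per-subset main scores (the dict layout is an assumption, not the exact results-repository format):

```python
def average_over_subsets(subset_scores: dict[str, float]) -> float:
    """Unweighted mean of the main metric across subsets,
    e.g. {"python": 0.71, "java": 0.68} -> 0.695."""
    return sum(subset_scores.values()) / len(subset_scores)
```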

KennethEnevoldsen commented 1 month ago

This seems like a quick fix (which I am more than happy to add for now), but it does not specify, within the benchmark specification in mteb, how the scores should be aggregated.
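One way to close that gap would be to let the benchmark specification itself carry the aggregation, e.g. via a callable field. The class and field names here are hypothetical, not part of mteb:

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class AggregatedBenchmark:
    name: str
    task_names: list[str]
    # How per-task main scores are combined into one benchmark score;
    # defaults to an unweighted mean, but could be weighted, macro, etc.
    aggregate: Callable[[list[float]], float] = mean

    def score(self, task_scores: dict[str, float]) -> float:
        return self.aggregate([task_scores[name] for name in self.task_names])
```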