KennethEnevoldsen opened 1 month ago
I think we could add an average result across subsets for multilingual datasets.
Not entirely sure what is meant, @Samoed - should we add it for multilingual datasets? (isn't that already there?)
Yes, the author of the COIR benchmark wanted an average score for the task. I believe this can be done if all subsets of the task are included in the results. This could also be implemented in the results repository. Currently, there are some tasks where the average is calculated.
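A minimal sketch of what averaging subset (e.g. per-language) scores into a single task-level score could look like; the input shape here is an assumption for illustration, not mteb's actual results format:

```python
# Sketch: aggregate per-subset main scores into one task-level average.
# The dict-of-floats input shape is an assumption, not mteb's results schema.

def average_subset_scores(subset_scores: dict[str, float]) -> float:
    """Return the unweighted mean of the main score across all subsets."""
    if not subset_scores:
        raise ValueError("No subset scores to aggregate")
    return sum(subset_scores.values()) / len(subset_scores)

# Hypothetical per-language scores for a multilingual retrieval task.
scores = {"en": 0.71, "de": 0.64, "fr": 0.66}
print(round(average_subset_scores(scores), 4))
```

Note this only works if all subsets of the task are present in the results, as mentioned above; a partial result set would silently skew the unweighted mean.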
This seems like a quick fix (which I am more than happy to add for now), but the benchmark specification within mteb does not specify how the scores should be aggregated.
We currently have only one aggregated task (CQADupstack), however, we can definitely imagine more in the future (e.g. for CoIR in https://github.com/embeddings-benchmark/leaderboard/pull/27).
A proposed solution is to use the benchmark (benchmarks are already a group of tasks) and then allow a benchmark to be a `list[task | benchmark]`.

This will require updates to `MTEB.MTEB`, as well as to `create_meta`, and potentially to the CLI. This approach should also solve: https://github.com/embeddings-benchmark/mteb/issues/1171
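One way the recursive `list[task | benchmark]` idea could be typed, sketched with hypothetical class names rather than mteb's actual API:

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical sketch of a recursive benchmark definition; the class and
# field names are illustrative, not mteb's actual API.

@dataclass
class Task:
    name: str

@dataclass
class Benchmark:
    name: str
    members: list[Task | Benchmark]  # a benchmark can nest benchmarks

    def flatten(self) -> list[Task]:
        """Recursively collect all leaf tasks, so scores can be
        aggregated per nested benchmark (e.g. CQADupstack inside CoIR)."""
        tasks: list[Task] = []
        for member in self.members:
            if isinstance(member, Benchmark):
                tasks.extend(member.flatten())
            else:
                tasks.append(member)
        return tasks

# Usage: a nested benchmark flattens to its leaf tasks.
cqadupstack = Benchmark(
    "CQADupstackRetrieval",
    [Task("CQADupstackAndroid"), Task("CQADupstackEnglish")],
)
coir = Benchmark("CoIR", [Task("CodeSearchNet"), cqadupstack])
print([t.name for t in coir.flatten()])
```

The nesting also gives a natural place to hang an aggregation rule per benchmark node (e.g. unweighted mean over its direct members), rather than hard-coding it per task.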