Creating a standard format for the results would be great. I would probably suggest something like:
{
"dataset_revision": str,
"mteb_dataset_name": str,
"mteb_version": str
"scores" Dict[Split, Dict[Lang, Scores]]
}
E.g.
{
"dataset_revision": "f9bd92144ed76200d6eb3ce73a8bd4eba9ffdc85",
"mteb_dataset_name": "ArxivClassification",
"mteb_version": "1.6.36",
"scores": {"test": {"eng": {
"accuracy": 0.6315999999999999,
"accuracy_stderr": 0.01738597135624007,
"evaluation_time": 1743.42,
"f1": 0.6073746478668843,
"f1_stderr": 0.01929407142185538,
"main_score": 0.6315999999999999
}
}
}
}
The problem is that, at the moment, we don't have "Lang" as a language code but rather as the HuggingFace 'language' tag (e.g. "en-de"), which could be arbitrary. I would suggest that we just keep it as the HF language tag.
Yeah, I agree with @KennethEnevoldsen: we should keep the language code everywhere, even for monolingual tasks.
Maybe we can do something about mapping the languages to the MTEB standard before saving the results in the Evaluator functions? We already built a mapping between the HF langs and our standard in the task definitions, so maybe we can use it?
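For illustration, a rough sketch of what that normalization step could look like before saving (the mapping dict and function name here are hypothetical, not the actual mteb API):

HF_TO_ISO = {"en": "eng", "de": "deu", "fr": "fra"}  # assumed mapping; in practice taken from the task definitions

def normalize_hf_lang(hf_tag: str) -> str:
    # map an HF tag like "en-de" to "eng-deu"; monolingual tags map to a single code
    return "-".join(HF_TO_ISO.get(part, part) for part in hf_tag.split("-"))

normalize_hf_lang("en-de")  # -> "eng-deu"
normalize_hf_lang("en")     # -> "eng"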
We already have a mapping between the languages and the HF languages; it is in eval_langs. So we would only really need to handle the potential aggregation (e.g. if the dataset has multiple English subsets) or language pairs. In both cases it is always possible to do this afterwards from the raw format (whereas the reverse is not possible). I would probably implement a utility function for doing so. Something like:
res = MTEBResults(...)
res.to_disk(...)
res = MTEBResults.from_disk(...)
res.get_main_score(languages=["eng"]) # get the main score for all English subsets
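To make the aggregation concrete, here is a minimal sketch of what get_main_score could do under the proposed raw format (illustrative only, not the implemented class; it simply averages main_score over all subsets whose language tag contains a requested code):

from statistics import mean

class MTEBResults:
    def __init__(self, scores: dict):
        # scores follows the proposed format: {split: {lang_or_pair: {"main_score": ..., ...}}}
        self.scores = scores

    def get_main_score(self, languages: list[str] | None = None, split: str = "test") -> float:
        selected = []
        for lang, metrics in self.scores[split].items():
            # a language pair like "eng-deu" matches if either side is requested
            if languages is None or any(l in lang.split("-") for l in languages):
                selected.append(metrics["main_score"])
        return mean(selected)

res = MTEBResults({"test": {"eng": {"main_score": 0.63}, "eng-deu": {"main_score": 0.55}}})
res.get_main_score(languages=["eng"])  # -> 0.59, averaged over all subsets containing English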
Sometimes a dataset goes from monolingual to multilingual, e.g. https://github.com/embeddings-benchmark/mteb/pull/556
This changes the result file structure (it adds a language dict) and also how the dataset is represented in the leaderboard (with brackets and language code vs. without). Now there are some models that ran on MLSUMClusteringP2P when it was still monolingual, thus their metadata has no (fr), e.g. this one, while newer models now have (fr), e.g. this one. This currently prevents the voyage-law-2 MLSUMClusteringP2P clustering scores from showing up, as it is still defined without the (fr) here. If we change the leaderboard definition to (fr), then models with the old one like Solon-embeddings-large-0.1 will no longer show up. I'm not sure what is the best solution --- maybe always having a language code even for English datasets & always having the language dict structure? (also mentioned here: https://github.com/embeddings-benchmark/mteb/issues/251#issuecomment-2036569253)
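One way to keep both layouts working on the leaderboard side would be a small normalization step when loading results. A hedged sketch, with a hypothetical helper name (this is not the real leaderboard loader):

def normalize_split_scores(split_scores: dict, default_lang: str = "fr") -> dict:
    # old monolingual files put metrics directly under the split; newer files nest them under a language code
    if any(isinstance(v, (int, float)) for v in split_scores.values()):
        return {default_lang: split_scores}  # assume the task's single language for old-style results
    return split_scores  # already keyed by language

old = {"v_measure": 0.42, "evaluation_time": 12.3}
new = {"fr": {"v_measure": 0.45, "evaluation_time": 10.1}}
normalize_split_scores(old)  # -> {"fr": {"v_measure": 0.42, "evaluation_time": 12.3}}
normalize_split_scores(new)  # unchanged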
cc @tomaarsen @KennethEnevoldsen & anyone else who may have thoughts 😊