embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Standardizing results format #639

Closed Muennighoff closed 6 months ago

Muennighoff commented 6 months ago

Sometimes a dataset goes from monolingual to multilingual, e.g. https://github.com/embeddings-benchmark/mteb/pull/556

This changes the result file structure (a language dict is added) and also how the dataset is represented in the leaderboard (with brackets and a language code vs. without). Some models ran on MLSUMClusteringP2P when it was still monolingual, so their metadata has no (fr), e.g. this one, while newer models now have (fr), e.g. this one. This currently prevents the voyage-law-2 MLSUMClusteringP2P clustering scores from showing up, as the task is still defined without the (fr) here. If we change the leaderboard definition to (fr), then models with the old naming, like Solon-embeddings-large-0.1, will no longer show up.
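To make the structure change concrete, here is a rough sketch of the two result shapes (field names, metric names, and values are illustrative, not copied from actual result files):

# Hypothetical sketch of the two result-file shapes (illustrative values only).

# Old, monolingual shape: metrics sit directly under the split.
old_style = {
    "mteb_dataset_name": "MLSUMClusteringP2P",
    "test": {"v_measure": 0.42},
}

# New, multilingual shape: an extra language dict ("fr") wraps the metrics,
# and the leaderboard column becomes "MLSUMClusteringP2P (fr)".
new_style = {
    "mteb_dataset_name": "MLSUMClusteringP2P",
    "test": {"fr": {"v_measure": 0.42}},
}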

I'm not sure what the best solution is --- maybe always having a language code, even for English datasets, and always using the language dict structure? (also mentioned here: https://github.com/embeddings-benchmark/mteb/issues/251#issuecomment-2036569253)

cc @tomaarsen @KennethEnevoldsen & anyone else who may have thoughts 😊

KennethEnevoldsen commented 6 months ago

Creating a standard format for the results would be great. I would probably suggest something like:

{
  "dataset_revision": str,
  "mteb_dataset_name": str,
  "mteb_version": str,
  "scores": Dict[Split, Dict[Lang, Scores]]
}

E.g.

{
  "dataset_revision": "f9bd92144ed76200d6eb3ce73a8bd4eba9ffdc85",
  "mteb_dataset_name": "ArxivClassification",
  "mteb_version": "1.6.36",
  "scores": {
    "test": {
      "eng": {
        "accuracy": 0.6315999999999999,
        "accuracy_stderr": 0.01738597135624007,
        "evaluation_time": 1743.42,
        "f1": 0.6073746478668843,
        "f1_stderr": 0.01929407142185538,
        "main_score": 0.6315999999999999
      }
    }
  }
}

The problem is that, at the moment, "Lang" is not a language code but rather the HuggingFace 'language' tag (e.g. "en-de"), which could be arbitrary. I would suggest that we just keep it as the HF language tag.
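For illustration (subset names and numbers are made up, not from any real run), the raw scores keyed by the HF tag would then look something like:

# Hypothetical illustration: raw scores keyed by whatever subset/language
# tag the HuggingFace dataset uses, which may be a pair like "en-de".
scores = {
    "test": {
        "en-de": {"accuracy": 0.55, "main_score": 0.55},
        "en-fr": {"accuracy": 0.58, "main_score": 0.58},
    }
}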

imenelydiaker commented 6 months ago

Yeah, I agree with @KennethEnevoldsen: we should keep the language code everywhere, even for monolingual tasks.

Maybe we can handle the mapping from HF languages to the MTEB standard before saving the results in the Evaluator functions? We already built a mapping between HF langs and our standard in the task definitions, so maybe we can reuse it?
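Something along these lines (the exact shape of the mapping in the task definitions may differ; this is just a sketch):

# Rough sketch of a per-task mapping from HF subset names to standardized
# language codes; the real eval_langs format in mteb may differ.
eval_langs = {
    "de": ["deu-Latn"],
    "fr": ["fra-Latn"],
    "en-de": ["eng-Latn", "deu-Latn"],  # a language pair maps to two codes
}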

KennethEnevoldsen commented 6 months ago

We already have a mapping between the languages and the HF languages; it is in eval_langs. So we would only really need to handle the potential aggregation (e.g. if the dataset has multiple English subsets) or language pairs. In both cases this is always possible to do after the fact from the raw format (whereas the reverse is not possible). I would probably implement a utility function for doing so. Something like:

res = MTEBResults(...)
res.to_disk(...)
res = MTEBResults.from_disk(...)
res.get_main_score(languages=["eng"])  # get the main score over all English subsets
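A minimal sketch of how such a utility could look, assuming a JSON file on disk and simple averaging as the aggregation (both are assumptions for illustration, not decisions from this thread):

from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class MTEBResults:
    """Hypothetical container for the standardized result format sketched above."""

    dataset_revision: str
    mteb_dataset_name: str
    mteb_version: str
    # scores[split][language] -> {"main_score": ..., "accuracy": ..., ...}
    scores: dict[str, dict[str, dict[str, float]]]

    def to_disk(self, path: Path) -> None:
        # Serialize to JSON; the exact on-disk layout is an assumption here.
        path.write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def from_disk(cls, path: Path) -> "MTEBResults":
        return cls(**json.loads(path.read_text()))

    def get_main_score(self, languages: list[str] | None = None) -> float:
        # Aggregate main_score over all (split, language) entries,
        # optionally restricted to the requested language codes.
        values = [
            lang_scores["main_score"]
            for split_scores in self.scores.values()
            for lang, lang_scores in split_scores.items()
            if languages is None or lang in languages
        ]
        return sum(values) / len(values)

This keeps the raw per-language scores on disk and leaves any aggregation (multiple English subsets, language pairs) to the accessor, which matches the "raw format first, aggregate afterwards" idea above.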