embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

French MTEB is out of date #334

Open ManuelFay opened 3 months ago

ManuelFay commented 3 months ago

Hello, I tried running the French scripts and hit lots of errors out of the box.

I might be forgetting a few things; the best check would be to run a mock model through the run_mteb_french script on the main branch, then run the mteb_metadata script, upload the results to the HF Spaces leaderboard, and verify everything runs smoothly...
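
To be concrete, the kind of smoke test I have in mind is roughly this (a sketch only; the model and the two tasks are arbitrary examples, not an official test setup):

# Minimal smoke test: run one cheap multilingual model through a couple of
# French tasks and write results in the layout that mteb_metadata consumes.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
evaluation = MTEB(tasks=["SyntecRetrieval", "MLSUMClusteringS2S"], task_langs=["fr"])
evaluation.run(model, output_folder="results/smoke-test")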

In general, the underlying issue here seems to be dataset drift. Beyond fixing these problems for coherent score reporting, it might be nice to have some e2e testing on the whole suite, or locked datasets for testing. The issues are quick to fix, but they make the reported scores incoherent and somewhat arbitrary...

Some models might also need to be retested given the weird 100% scores on the pair classification tasks...

Thanks for the great work on the leaderboard, Best

KennethEnevoldsen commented 3 months ago

Thanks for the report @ManuelFay

@Muennighoff can you fix 1?

@imenelydiaker the other questions seem best passed on to you.

Re. testing on mteb:

We are currently working hard on updating mteb, which includes updating the test suite. It has already seen notable improvements (e.g. checking that dataset + revision is valid).

Currently, most benchmarks are so large that running a model through them would take far too long for tests to be feasible. We are examining approaches to reduce the size of the benchmark, though we realize this would invalidate all previous results, so we take such changes quite seriously.
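
To give an idea of the kind of lightweight check mentioned above (a sketch only, not the actual test suite; the helper and the example entry are hypothetical):

# Cheap CI-style check: verify that a task's dataset path + revision still
# resolves on the Hub, without downloading data or running a model.
from huggingface_hub import HfApi

api = HfApi()

def check_dataset(path, revision=None):
    try:
        api.dataset_info(path, revision=revision)
        return True
    except Exception as err:
        print(f"{path}@{revision}: {err}")
        return False

# in a real test this would iterate over every task's metadata
assert check_dataset("mlsum")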

ManuelFay commented 3 months ago

The reason I am not proposing a PR is that a lot of my fixes seem hacky and I am not sure I made the correct choices (typically, every time I force the use of the train split, I am fairly sure it would be better fixed at the dataset level).

If it helps, here is the git diff for the code changes that make the run_mteb_french script run; I fixed the rest by hand.


diff --git a/mteb/tasks/Clustering/fr/MLSUMClusteringP2P.py b/mteb/tasks/Clustering/fr/MLSUMClusteringP2P.py
--- a/mteb/tasks/Clustering/fr/MLSUMClusteringP2P.py
+++ b/mteb/tasks/Clustering/fr/MLSUMClusteringP2P.py
@@ -48,8 +48,9 @@ class MLSUMClusteringP2P(AbsTaskClustering):
         """
         self.dataset = self.dataset.map(self.create_description)
         self.dataset = self.dataset.remove_columns(["summary", "url", "date", "title"])
-        texts = self.dataset["text"]
-        topics = self.dataset["topic"]
+        print(self.dataset)
+        texts = self.dataset["test"]["text"]
+        topics = self.dataset["test"]["topic"]
         new_format = {
             "sentences": [split.tolist() for split in np.array_split(texts, 10)],
             "labels": [split.tolist() for split in np.array_split(topics, 10)],
diff --git a/mteb/tasks/Clustering/fr/MLSUMClusteringS2S.py b/mteb/tasks/Clustering/fr/MLSUMClusteringS2S.py
index 99667ed..32c3f90 100644
--- a/mteb/tasks/Clustering/fr/MLSUMClusteringS2S.py
+++ b/mteb/tasks/Clustering/fr/MLSUMClusteringS2S.py
@@ -45,10 +45,10 @@ class MLSUMClusteringS2S(AbsTaskClustering):
         self.dataset = self.dataset.remove_columns(["summary", "text", "url", "date"])
         new_format = {
             "sentences": [
-                split.tolist() for split in np.array_split(self.dataset["title"], 10)
+                split.tolist() for split in np.array_split(self.dataset["test"]["title"], 10)
             ],
             "labels": [
-                split.tolist() for split in np.array_split(self.dataset["topic"], 10)
+                split.tolist() for split in np.array_split(self.dataset["test"]["topic"], 10)
             ],
         }
         self.dataset = {
diff --git a/mteb/tasks/Retrieval/fr/AlloprofRetrieval.py b/mteb/tasks/Retrieval/fr/AlloprofRetrieval.py
index 5009377..1bdae46 100644
--- a/mteb/tasks/Retrieval/fr/AlloprofRetrieval.py
+++ b/mteb/tasks/Retrieval/fr/AlloprofRetrieval.py
@@ -18,7 +18,7 @@ class AlloprofRetrieval(AbsTaskRetrieval):
         },
         type="Retrieval",
         category="s2p",
-        eval_splits=["test"],
+        eval_splits=["train"],
         eval_langs=["fr"],
         main_score="ndcg_at_10",
         date=None,
@@ -53,7 +53,7 @@ class AlloprofRetrieval(AbsTaskRetrieval):
         }
         self.corpus = {
             eval_split: {
-                str(d["uuid"]): {"text": d["text"]} for d in corpus_raw["documents"]
+                str(d["uuid"]): {"text": d["text"]} for d in corpus_raw[eval_split]
diff --git a/mteb/tasks/Retrieval/fr/SyntecRetrieval.py b/mteb/tasks/Retrieval/fr/SyntecRetrieval.py
index 64d40a1..1916bc4 100644
--- a/mteb/tasks/Retrieval/fr/SyntecRetrieval.py
+++ b/mteb/tasks/Retrieval/fr/SyntecRetrieval.py
@@ -52,11 +52,11 @@ class SyntecRetrieval(AbsTaskRetrieval):

         self.queries = {
             self._EVAL_SPLITS[0]: {
-                str(i): q["Question"] for i, q in enumerate(queries_raw["queries"])
+                str(i): q["Question"] for i, q in enumerate(queries_raw["train"])
             }
         }

-        corpus_raw = corpus_raw["documents"]
+        corpus_raw = corpus_raw["train"]
         corpus_raw = corpus_raw.rename_column("content", "text")
         self.corpus = {
             self._EVAL_SPLITS[0]: {str(row["id"]): row for row in corpus_raw}
@@ -65,7 +65,7 @@ class SyntecRetrieval(AbsTaskRetrieval):
         self.relevant_docs = {
             self._EVAL_SPLITS[0]: {
                 str(i): {str(q["Article"]): 1}
-                for i, q in enumerate(queries_raw["queries"])
+                for i, q in enumerate(queries_raw["train"])
             }
         }

diff --git a/scripts/run_mteb_french.py b/scripts/run_mteb_french.py
index fdec20d..39de1dd 100644
--- a/scripts/run_mteb_french.py
+++ b/scripts/run_mteb_french.py
@@ -50,7 +50,7 @@ TASK_LIST_STS = ["SummEvalFr", "STSBenchmarkMultilingualSTS", "STS22", "SICKFr"]

 TASK_LIST_BITEXTMINING = [
     "DiaBLaBitextMining",
-    "FloresBitextMining",
+#    "FloresBitextMining",
     "TatoebaBitextMining",
     "BUCCBitextMining",
 ]
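
For what it's worth, a less hacky shape for the MLSUM fix might look roughly like this (untested; the method name and surrounding class are taken from the diff above, and I assume self.dataset is a DatasetDict with "text" and "topic" columns):

import numpy as np

# Build the clustering format for every split present in the DatasetDict
# instead of hard-coding ["test"], so the code keeps working if splits change.
def dataset_transform(self):
    self.dataset = self.dataset.map(self.create_description)
    self.dataset = self.dataset.remove_columns(["summary", "url", "date", "title"])
    self.dataset = {
        split_name: {
            "sentences": [c.tolist() for c in np.array_split(split["text"], 10)],
            "labels": [c.tolist() for c in np.array_split(split["topic"], 10)],
        }
        for split_name, split in self.dataset.items()
    }
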
Muennighoff commented 3 months ago

AlloprofRetrieval task has an updated dataset, so the code needs to be modified to access the dataset columns. Furthermore, the "train" split is the only available one, so either the dataset should be updated, or both the eval split and the mteb_metadata.py script should be adapted so that results reflect that. Currently, train results are ignored and thus not exportable to the leaderboard.

I am also not 100% sure what the right fix is here - @imenelydiaker / @wissam-sib / @GabrielSequeira could one of you or someone else take a look?

imenelydiaker commented 3 months ago

Hello @ManuelFay, thanks for the comments. I'll try to answer some of them and review the PR:

  • AlloprofRetrieval task has an updated dataset, so the code needs to be modified to access the dataset columns. Furthermore, the "train" split is the only available one, so either the dataset should be updated, or both the eval split and the mteb_metadata.py script should be adapted so that results reflect that. Currently, train results are ignored and thus not exportable to the leaderboard.

There was an error when uploading the dataset to HF here: the split should have been named test instead of train. We created the dataset ourselves and didn't create a train set since it was only meant for the benchmark. I'll let @MathieuCiancone complete or correct this if I'm mistaken.
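
A rough sketch of the fix on the Hub side (untested; the repo id and config name below are placeholders and would have to be adapted by whoever owns the dataset):

from datasets import DatasetDict, load_dataset

REPO = "lyon-nlp/alloprof"            # assumed repo id
ds = load_dataset(REPO, "documents")  # assumed config name
# republish the existing data under the split name "test"
DatasetDict({"test": ds["train"]}).push_to_hub(REPO, config_name="documents")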

  • OpusParcus pair classification exports results on splits named test.full and validation.full. They also seem weird compared to old results, in which many models have 100% scores, so probably something is off there... Similarly, since the exports are not named like the rest, the export scripts ignore them.

We're thinking about removing this dataset because it's too trivial. We've already discussed it with the team and we think it should be removed.

  • The BSARD retrieval task presents very small ndcg@10 scores for all models (the best is under 3%). Might be completely correct, but it seems like either a weird task or a main metric that could be better chosen?

Same thing here, the dataset is too hard. It's about legal text so maybe not very interesting for a language benchmark. The metric we used was ndcg@100. I'd prefer removing it.

  • Some datasets, notably Flores, seem to run forever (and are not even reported in the main leaderboard French panel?)

Yes, because bitext mining tasks should be run separately. We should have removed these tasks from the script since they need two input languages to run. I'll fix the script.
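
Something along these lines, run as a separate pass (a sketch; the way task_langs handles language pairs here is my assumption and may need adjusting):

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
# run the bitext mining tasks on their own, restricted to fr-en
for task in ["DiaBLaBitextMining", "FloresBitextMining", "TatoebaBitextMining", "BUCCBitextMining"]:
    evaluation = MTEB(tasks=[task], task_langs=["fr", "en"])
    evaluation.run(model, output_folder="results/bitext-mining-fr-en")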

KennethEnevoldsen commented 3 months ago

@imenelydiaker will you guys make a PR with the changes when you have the time?

It seems like a good idea to remove tasks that are too trivial. It might be relevant to keep really hard tasks, assuming the quality of the data is good and better models gain a meaningful performance increase on them?

imenelydiaker commented 3 months ago

@imenelydiaker will you guys make a PR with the changes when you have the time?

Yes I'll open a PR soon.

It seems like a good idea to remove tasks that are too trivial. It might be relevant to keep really hard tasks, assuming the quality of the data is good and better models gain a meaningful performance increase on them?

I'm removing OpusParcus, yes. For BSARD, the paper introducing the dataset was published at ACL, so I suppose the quality of the dataset is good: https://aclanthology.org/2022.acl-long.468/, https://huggingface.co/datasets/maastrichtlawtech/bsard. Although I find it really hard for all models (scores are very low), so I'm wondering whether it is a good idea to keep it?

ManuelFay commented 3 months ago

For OpusParcus, I am not even sure it's trivial: the models I uploaded recently (sentence-croissant-alpha) get 93% (as does Solon large, also in the top 5/10). What is weird is that all other models, up to the 40th position and including very small models, have 100%, so it might just be something in our training that degrades performance, but something feels off...

imenelydiaker commented 3 months ago

It may be due to the training of Solon and sentence-croissant-alpha. The other possibility is that the other models were either trained on the test set, or the train set contains samples from the test set and those models were trained on it.

I can try running the models again, but I'm almost sure there was no error when running them; they were all run on the test.full split (this split contains positive and negative pairs, as required by the task). I'll also try it on English and other languages to see whether it's also trivial there. I'll let you know.
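
To test the leakage hypothesis, a quick check along these lines could work (a sketch; the config and column names are my guesses and would need checking against the actual dataset):

from datasets import load_dataset

ds = load_dataset("GEM/opusparcus", "fr.100")  # assumed config
# count exact duplicate sentence pairs shared between train and test.full
test_pairs = {(r["input"], r["target"]) for r in ds["test.full"]}
train_pairs = {(r["input"], r["target"]) for r in ds["train"]}
print(f"{len(test_pairs & train_pairs)} of {len(test_pairs)} test pairs also appear in train")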

KennethEnevoldsen commented 3 months ago

As also mentioned in #342, I believe the pair classification formulation of the task is too easy. There is already a PR adding OpenSubtitles as a bitext mining task.

imenelydiaker commented 3 months ago

I was thinking about improving the French benchmark by adding the FQuAD [paper] QA dataset, which can be used for retrieval tasks. @ManuelFay would you be interested in helping? (I'm missing some splits on HF, is it okay to download it from here and re-upload it again?)

ManuelFay commented 3 months ago

You can use this one if you want:

https://huggingface.co/datasets/manu/fquad2_test

Although you probably need to filter out the "is_impossible" samples?
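
Something like this should do it (untested; the column name follows the usual SQuAD v2 layout, and I haven't checked which splits the repo actually exposes):

from datasets import load_dataset

ds = load_dataset("manu/fquad2_test")
# keep only answerable questions for the retrieval task
answerable = ds.filter(lambda ex: not ex["is_impossible"])
print({split: (len(ds[split]), len(answerable[split])) for split in ds})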

imenelydiaker commented 2 months ago

@ManuelFay: Regarding the pair classification task with the OpusParcus dataset, the results are updated here. The problem was, as mentioned in https://github.com/embeddings-benchmark/mteb/pull/342, the use of the wrong test split to run experiments (we mistakenly used the one with only positive pairs [test] instead of positive + negative ones [test.full]).

I believe the leaderboard will be updated after the PR is merged. @KennethEnevoldsen @Muennighoff the split name is test.full, will it be handled by the current leaderboard script?

imenelydiaker commented 2 months ago

Issues to fix:

KennethEnevoldsen commented 2 months ago

Seems like we should split these up into independent issues.

Fix split name of AlloprofRetrieval and SyntecRetrieval.

Anything holding this back?

imenelydiaker commented 2 months ago

Seems like we should split these up into independent issues.

You prefer creating different issues? I'm okay with keeping everything here and linking the PRs (most of them are on HF) to this issue.

Fix split name of AlloprofRetrieval and SyntecRetrieval.

Anything holding this back?

I'm checking. The test split was removed after we removed the custom loading script; my best guess is that it was just named train by default. I'm looking into it.

KennethEnevoldsen commented 2 months ago

You prefer creating different issues? I'm okay with keeping everything here and linking the PRs (most of them are on HF) to this issue.

At least if they are at a point where they can be assigned, it can be nice to split them out.

imenelydiaker commented 2 months ago

okay, I'll create different issues then 🙂