embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Leaks and duplications in the MTEB leaderboard #1036

Open lbourdois opened 2 days ago

lbourdois commented 2 days ago

Hello,

I've analyzed the quality of the datasets that make up the EN and FR leaderboards, i.e. whether they contain leaks, as well as duplicated data (especially in the test split).

Below is an example of how I proceeded with the amazon_massive_intent dataset. The same approach was applied to all datasets.

from datasets import load_dataset

dataset = load_dataset("mteb/amazon_massive_intent", "fr")

# Build a single key per sample so duplicates are counted on the (text, label) pair
def concatenate_columns(example):
    return {"concatenated_column": str(example["text"]) + " " + str(example["label"])}

dataset = dataset.map(concatenate_columns)

train_inputs = dataset["train"]["concatenated_column"]
val_inputs = dataset["validation"]["concatenated_column"]
test_inputs = dataset["test"]["concatenated_column"]

# Leaks = samples shared between the test split and the train/validation splits
leakage_train = set(train_inputs).intersection(set(test_inputs))
leakage_validation = set(val_inputs).intersection(set(test_inputs))

print("Leakage between train split and test split:", len(leakage_train))
print("Leakage between validation split and test split:", len(leakage_validation))
print("Duplicated lines in the train split:", len(train_inputs) - len(set(train_inputs)))
print("Duplicated lines in the validation split:", len(val_inputs) - len(set(val_inputs)))
print("Duplicated lines in the test split:", len(test_inputs) - len(set(test_inputs)))
biased = len(leakage_train) + len(leakage_validation) + (len(test_inputs) - len(set(test_inputs)))
print("Percentage of test split biased:", str(round(biased / len(dataset["test"]) * 100, 3)) + "%")

This code returns:

Leakage between train split and test split: 125
Leakage between validation split and test split: 36
Duplicated lines in the train split: 307
Duplicated lines in the validation split: 15
Duplicated lines in the test split: 30
Percentage of test split biased: 6.422%

We can therefore see that the dataset used in the French leaderboard contains 6.4% biased data (leaks + duplications).

The results for all analyzed datasets can be found at https://huggingface.co/datasets/lbourdois/MTEB_leaks_and_duplications (see column text_and_label_test_biased).
Rows containing "OK" correspond to datasets that only have a test split; in the absence of train or validation splits, there can be no leaks. "NR" values correspond to irrelevant columns (e.g. an "NR" is shown in a column related to the validation split if the dataset contains only a train and a test split).

I can observe that 24% of the MTEB EN datasets contain leaks, and 46% of the MTEB FR ones (the French figure is based on 24 of the 26 datasets available in MTEB, as there are two datasets I haven't managed to download, cf. the README).

It should be noted that the percentages reported are individual evaluations of the datasets. The actual bias may be greater than this. Indeed, if you concatenate datasets (for example, all the train splits available for the STS task in a given language), a sample in the train split of dataset A may not be present in the test split of A, but may be present in the test split of dataset B, thus creating a leak. The same logic applies to duplicated data.
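
To make this concrete, here is a minimal sketch of such a pairwise check (the dataset pair is only an example; any two datasets from the same leaderboard could be compared this way):

from datasets import load_dataset

# Example pair of FR datasets; in practice every (train of A, test of B) pair would be checked
dataset_a = load_dataset("mteb/amazon_massive_intent", "fr")
dataset_b = load_dataset("mteb/amazon_massive_scenario", "fr")

train_a = {str(ex["text"]) for ex in dataset_a["train"]}
test_b = {str(ex["text"]) for ex in dataset_b["test"]}

print("Samples from A's train split found in B's test split:", len(train_a & test_b))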

I open this issue to invite users to take care when training their models (and even to avoid using the train splits of all the datasets listed here as having leaks).
I also invite the MTEB maintainers to clean up their leaderboards to maintain users' confidence in their tool for evaluating or choosing a model for their practical use case. Re-evaluating the 600+ models available on the leaderboard can be time-consuming, so a quicker interim action might be to add a confidence interval to each model, along with a message about how to interpret it.

Note that I've limited myself here to English and French. Similar work should be carried out for Chinese and Polish (and probably the future languages planned for the MMTEB).

KennethEnevoldsen commented 2 days ago

Thanks for making us aware of this @lbourdois. I see a few directions to go from here.

It is worth noting that many of the tasks in MTEB are considered zero-shot tasks. However, this hasn't been clearly communicated (we e.g. plan to add a zero-shot tab to the leaderboard). Treating them as zero-shot has generally also been the case for most models benchmarked, with the exception of NVIDIA's embedding model (and 1-3 of the datasets). The addition of NVIDIA's model was what first prompted this debate.

However, I definitely agree that we should be notably more careful when evaluating and make these assumptions clear.

What do you think @lbourdois?

@Muennighoff, @imenelydiaker you should probably take a look at this one as well.

imenelydiaker commented 2 days ago

Hey @lbourdois, very interesting and relevant analysis!

Completing what @KennethEnevoldsen said:

Since we don't really have information about whether the train sets of MTEB datasets are used for training, we can still make some checks on our end to minimize the leakage (e.g., making sure no train sample is in the test set, removing similar texts with BM25 maybe?, etc.)

@KennethEnevoldsen for your suggestions:

i. ensure new datasets don't have leaks: We can do this by testing all available splits in a similar fashion to the bot you propose in the blog post.

Definitely!

ii. update previous existing tasks to not have leaks - we already have a versioning system for this. I see two ways of doing this:

  • removing them from the training set
  • removing them from the test set (probably the one I would go for)

The second option is the best. We can't really update public train sets, so it would be better to make sure test sets are clean on our end.

iii. Add functionality for checking if your training dataset contains near-duplicates from the test set (a set of tasks you wish to test on in MTEB)

It would be really nice to have a feature that checks that no train samples are present in the evaluation sets. If that's the case, then we should remove them from the test set, since benchmarks run on this split and we can't really change public train sets. I'm just wondering whether it's compute-efficient to run this check at each evaluation? Could it work with hashes of the texts? We've been doing this for some tasks such as MindSmall 🤔
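
As a rough illustration of the hashing idea (nothing here exists in mteb yet; the helper is made up):

import hashlib

def text_hashes(texts):
    # Keep only small digests in memory instead of the full texts
    return {hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts}

train_hashes = text_hashes(dataset["train"]["text"])
test_hashes = text_hashes(dataset["test"]["text"])

print("Train samples whose text also appears in the test split:", len(train_hashes & test_hashes))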

All points i, ii and iii are somewhat similar; we could implement them in one feature, I guess.

KennethEnevoldsen commented 2 days ago

For iii) the intention was for it to cover the full training set of the model. It would work something like:

my_training_data_generator = ...

EN_MTEB_TASKS = mteb.get_tasks(...)
checker = mteb.create_dataset_checker(EN_MTEB_TASKS) # encode all samples
checker.check(my_training_data_generator) # check if any of the samples are included in the test set.

So iii) as opposed to i) includes the full training set of the model.

imenelydiaker commented 2 days ago

For iii) the intention was for it to cover the full training set of the model. It would work something like:

my_training_data_generator = ...

EN_MTEB_TASKS = mteb.get_tasks(...)
checker = mteb.create_dataset_checker(EN_MTEB_TASKS) # encode all samples
checker.check(my_training_data_generator) # check if any of the samples are included in the test set.

So iii) as opposed to i) includes the full training set of the model.

Ah yeah, I misunderstood it, sorry.

So you suggest computing the "overlap" between one train set and all the test sets of the different datasets we have in the benchmark? It could be a great insight for authors to share with users, and it would also be useful for the leaderboard.

lbourdois commented 2 days ago

Hi @KennethEnevoldsen, @imenelydiaker

Since we don't really have information about whether the train sets of MTEB datasets are used for training, we can still make some checks on our end to minimize the leakage (e.g., making sure no train sample is in the test set, removing similar texts with BM25 maybe?, etc.)

In my previous message I indicated how to find leaks/duplications, but not how to remove them 😅 Below is the code I use when I have to clean some datasets:

dataset["train"] = dataset["train"].to_pandas()
dataset["train"] = dataset["train"].drop_duplicates(subset=['concatenated_column'], keep='first')
results = [row for row in dataset["train"]["concatenated_column"] if row in leakage_train]
dataset["train"] = dataset["train"][~dataset["train"]["concatenated_column"].isin(results)].dropna().reset_index(drop=True)

dataset["validation"] = dataset["validation"].to_pandas()
dataset["validation"] = dataset["validation"].drop_duplicates(subset=['concatenated_column'], keep='first')
results = [row for row in dataset["validation"]["concatenated_column"] if row in leakage_validation]
dataset["validation"] = dataset["validation"][~dataset["validation"]["concatenated_column"].isin(results)].dropna().reset_index(drop=True)

dataset["test"] = dataset["test"].to_pandas()
dataset["test"] = dataset["test"].drop_duplicates(subset=['concatenated_column'], keep='first')

dataset = DatasetDict({"train": Dataset.from_pandas(dataset['train']), "validation": Dataset.from_pandas(dataset['validation']), "test": Dataset.from_pandas(dataset['test'])})

Re-executing the code in my previous message, we obtain:

Leakage between train split and test split: 0
Leakage between validation split and test split: 0
Duplicated lines in the train split: 0
Duplicated lines in the validation split: 0
Duplicated lines in the test split: 0
Percentage of test split biased: 0.0%

(you just need to remember to delete the concatenated_column column once the cleanup is complete)

ii. update previous existing tasks to not have leaks - we already have a versioning system for this. I see two ways of doing this: • a) removing them from the training set • b) removing them from the test set (probably the one I would go for)

From a clean-up perspective, this seems to me the easiest approach to begin with. Option a) would mean having to re-train models, which can be time-consuming, whereas with option b) you "just" need to redo inferences. So I think b) is the simplest option at first. The only tricky point is that you may have to pay attention to the size of the test split, so that it doesn't end up too small if it wasn't very large to begin with and then shrinks after cleaning.
The code to do this should look like the one in this post. In it, I remove the leaks from the train and validation splits rather than from the test one; it would then have to be modified to do the opposite.
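
For reference, a minimal sketch of that opposite direction, reusing the leakage_train and leakage_validation sets computed in my first post (illustrative only):

from datasets import Dataset

# Drop duplicates and leaked samples from the *test* split instead of the train/validation ones
test_df = dataset["test"].to_pandas()
test_df = test_df.drop_duplicates(subset=["concatenated_column"], keep="first")
leaked = leakage_train | leakage_validation
test_df = test_df[~test_df["concatenated_column"].isin(leaked)].reset_index(drop=True)
dataset["test"] = Dataset.from_pandas(test_df)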

i. ensure new datasets don't have leaks: We can do this by testing all available splits in a similar fashion to the bot you propose in the blog post.

On this point, I tried to think about designing a bot a few months ago. The problem was that the columns of interest varied from one dataset to another. I also tried to get the Hugging Face team to handle this automatically when a user adds a dataset to the Hub, but without success. Maybe I should talk to them again.

iii. Add functionality for checking if your training dataset contains near-duplicates from the test set (a set of tasks you wish to test on in MTEB)

It would be very interesting to avoid leaks when concatenating several train splits. The code should look like the one in my first post, except that the dataset object is not an individual dataset but the concatenation of several. To automate things, we need to make sure that all the datasets share the same column names.
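
A possible sketch of that, pooling several train splits and checking them against each test split (the dataset names are only examples; the key is built as in my first post):

from datasets import load_dataset

# Example list of datasets; the real list would cover all datasets of the task/language
names = ["mteb/amazon_massive_intent", "mteb/amazon_massive_scenario"]
all_datasets = {name: load_dataset(name, "fr") for name in names}

def keys(split):
    # Same (text + label) key as in my first post
    return {str(ex["text"]) + " " + str(ex["label"]) for ex in split}

# Pool every available train split into one set
pooled_train = set().union(*(keys(ds["train"]) for ds in all_datasets.values() if "train" in ds))

# Check each test split against the pooled train data
for name, ds in all_datasets.items():
    print(name, "- test samples also present in some train split:", len(keys(ds["test"]) & pooled_train))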

KennethEnevoldsen commented 2 days ago

On this point, I tried to think about designing a bot a few months ago. The problem was that the columns of interest varied from one dataset to another.

An option here is to make some assumptions (e.g. columns named "text" and "label") and accept that you don't catch everything.

It would be very interesting to avoid leaks when concatenating several train splits. The code should look like the one in my first post, except that the dataset object is not an individual dataset but the concatenation of several. To automate things, we need to make sure that all the datasets share the same column names.

At least within MTEB, once loaded, all datasets of a given type have the same format, so it shouldn't be too problematic.

imenelydiaker commented 2 days ago

Hi @KennethEnevoldsen, @imenelydiaker

Since we don't really have information about whether the train sets of MTEB datasets are used for training, we can still make some checks on our end to minimize the leakage (e.g., making sure no train sample is in the test set, removing similar texts with BM25 maybe?, etc.)

In my previous message I indicated how to find leaks/duplications, but not how to remove them 😅 Below is the code I use when I have to clean some datasets:

Yeah, I've seen that; your code removes exact duplicates. I was pointing out that we can go further with other methods like BM25 to look for near-duplicates and remove them. The idea is to implement both to make the benchmark more relevant.

And in general, we don't know whether a model used MTEB train sets during its training unless the authors specify it. Even after removing duplicate samples, the distributions of the train and test sets are not as different as they would be in a truly zero-shot evaluation setting (e.g., we use classification and clustering datasets). So models will still overfit the benchmark when using train sets, even with deduplication.
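
As a minimal sketch of what a BM25-based near-duplicate check could look like, using the rank_bm25 package (the helper and the score threshold are illustrative and would need tuning per task and language):

from rank_bm25 import BM25Okapi

def near_duplicates(train_texts, test_texts, score_threshold=20.0):
    # Index the train texts and flag test texts whose best BM25 match scores above the threshold
    bm25 = BM25Okapi([t.lower().split() for t in train_texts])
    flagged = []
    for text in test_texts:
        scores = bm25.get_scores(text.lower().split())
        if scores.max() >= score_threshold:
            flagged.append(text)
    return flagged

suspicious = near_duplicates(dataset["train"]["text"], dataset["test"]["text"])
print("Test samples with a near-duplicate in the train split:", len(suspicious))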

imenelydiaker commented 2 days ago

i. ensure new datasets don't have leaks: We can do this by testing all available splits in a similar fashion to the bot you propose in the blog post.

On this point, I tried to think about designing a bot a few months ago. The problem was that the columns of interest varied from one dataset to another. I also tried to get the Hugging Face team to handle this automatically when a user adds a dataset to the Hub, but without success. Maybe I should talk to them again.

Columns differ from one task category to another. We have a data_transform function that users use to normalize the column names and their format, and a docstring at the start of each AbsTask specifying the expected format (see BitextMining for example).

So the dedup function should only need to handle the expected format of each task category (approx. 12 categories).

KennethEnevoldsen commented 2 days ago

I will give this issue a day for people to see it before we decide on the next thing to do.

However, my proposed first step is to create a test that checks whether a dataset contains duplicates between its eval_splits and the splits not in eval_splits, and writes a file with these statistics. We can then cache this test so that it is only run for new datasets, and allow exceptions for existing datasets (this will act as a list of datasets to fix). To begin with, we can do this with simple exact duplicates.
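
A rough sketch of what such a check could compute for one dataset (the helper and the output file are made up for illustration; none of this exists in mteb yet):

import json

def leakage_stats(dataset, eval_splits):
    # Count exact overlaps between the eval splits and every other split, plus within-split duplicates
    eval_samples = set()
    for split in eval_splits:
        eval_samples.update(str(row) for row in dataset[split])
    stats = {}
    for split in dataset:
        if split in eval_splits:
            continue
        samples = [str(row) for row in dataset[split]]
        stats[split] = {
            "overlap_with_eval_splits": len(set(samples) & eval_samples),
            "duplicates_within_split": len(samples) - len(set(samples)),
        }
    return stats

# Write the statistics to a file so the check can be cached and only re-run for new datasets
with open("leakage_stats.json", "w") as f:
    json.dump(leakage_stats(dataset, eval_splits=["test"]), f, indent=2)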

Muennighoff commented 2 days ago

However, my proposed first step is to create a test that checks whether a dataset contains duplicates between its eval_splits and the splits not in eval_splits, and writes a file with these statistics. We can then cache this test so that it is only run for new datasets, and allow exceptions for existing datasets (this will act as a list of datasets to fix). To begin with, we can do this with simple exact duplicates.

This makes sense to me! Maybe for MTEB lite we should ensure there's no contamination. cc @vaibhavad