embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Issues with stratified_subsampling() #519

Open imenelydiaker opened 2 months ago

imenelydiaker commented 2 months ago

I ran into the following error when using our subsampling function on the GreekLegalCodeClassification task:

... in train_test_split 
raise ValueError(
ValueError: The least populated class in label column has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

@isaac-chung it seems that train_test_split can't stratify when a class has only one sample. Any idea how to solve this?

isaac-chung commented 2 months ago

Re: Problem 1, we might have to fall back to a plain shuffle with a try/except:

self.dataset["test"] = (
    self.dataset["test"].shuffle(seed=self.seed).select(range(TEST_SAMPLES))
)

wdyt?
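
A minimal sketch of the try/except fallback suggested above, written in plain Python rather than against the `datasets` API. Both `stratified_subsample` and `subsample_with_fallback` are hypothetical helpers for illustration, not mteb functions, and rows are assumed to be dicts with a label key. The stand-in sampler only approximates the requested size because of per-class rounding.

```python
import random
from collections import defaultdict


def stratified_subsample(rows, label_key, n_samples, seed=42):
    """Hypothetical stand-in for a stratified sampler.

    Draws from each class roughly in proportion to its size and raises
    ValueError when any class has fewer than 2 members.
    """
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    if min(len(members) for members in by_class.values()) < 2:
        raise ValueError("The least populated class has fewer than 2 members.")
    rng = random.Random(seed)
    fraction = n_samples / len(rows)
    sample = []
    for members in by_class.values():
        # At least one row per class, never more than the class holds.
        k = min(len(members), max(1, round(fraction * len(members))))
        sample.extend(rng.sample(members, k))
    return sample


def subsample_with_fallback(rows, label_key, n_samples, seed=42):
    """Try stratified subsampling; fall back to a seeded shuffle on failure."""
    try:
        return stratified_subsample(rows, label_key, n_samples, seed), "stratified"
    except ValueError:
        shuffled = list(rows)
        random.Random(seed).shuffle(shuffled)
        return shuffled[:n_samples], "shuffled"


rows_balanced = [{"text": f"{c}{i}", "label": c} for c in "ab" for i in range(4)]
rows_with_singleton = rows_balanced + [{"text": "c0", "label": "c"}]

sample_balanced, mode_balanced = subsample_with_fallback(rows_balanced, "label", 4)
sample_singleton, mode_singleton = subsample_with_fallback(rows_with_singleton, "label", 4)
```

Returning the mode alongside the sample makes it visible when the result is no longer stratified, which is the concern raised in the next comment.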

imenelydiaker commented 2 months ago

I personally find it weird to have classes with only 1 sample; maybe we shouldn't handle them? We could filter the dataset and remove rows whose class has only 1 sample, wdyt? A plain shuffle just ignores the class imbalance, and I'm not sure it belongs in a function we named stratified_subsampling() 🤔

isaac-chung commented 2 months ago

Good point. It wouldn't be true to the name anymore. I think we can add a note to this method saying that it removes rows whose class has only 1 sample, so it should be used with caution.
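
A rough sketch of the filtering step agreed on here, again in plain Python with rows assumed to be dicts. `drop_rare_classes` is a hypothetical name, not an existing mteb helper; it also returns the dropped labels so the removal can be logged or documented.

```python
from collections import Counter


def drop_rare_classes(rows, label_key, min_count=2):
    """Remove rows whose class has fewer than `min_count` members.

    Returns the kept rows and the sorted list of dropped class labels.
    """
    counts = Counter(row[label_key] for row in rows)
    kept = [row for row in rows if counts[row[label_key]] >= min_count]
    dropped = sorted(label for label, n in counts.items() if n < min_count)
    return kept, dropped


rows = [
    {"text": "t1", "label": "tax"},
    {"text": "t2", "label": "tax"},
    {"text": "l1", "label": "labor"},
    {"text": "l2", "label": "labor"},
    {"text": "m1", "label": "maritime"},  # singleton class
]
kept_rows, dropped_labels = drop_rare_classes(rows, "label")
```

After this filter every remaining class has at least 2 members, so a stratified split no longer trips the minimum-group check.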