embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

stratified_subsampling not handling multilingual datasets yet #743

Closed isaac-chung closed 6 months ago

isaac-chung commented 6 months ago

While working on #742, this error appears when running mteb -t WikiClusteringFastP2P -m sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

Code:

class WikiClusteringFastP2P(AbsTaskClusteringFast, MultilingualTask):
    ...

    def dataset_transform(self):
        self.dataset = self.stratified_subsampling(
            self.dataset,
            self.seed,
            self.metadata.eval_splits,
            label="labels",
            n_samples=2048,
        )

Error:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/bin/mteb", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ubuntu/isaac/work/mteb/mteb/cmd.py", line 187, in main
    eval.run(
  File "/home/ubuntu/isaac/work/mteb/mteb/evaluation/MTEB.py", line 356, in run
    raise e
  File "/home/ubuntu/isaac/work/mteb/mteb/evaluation/MTEB.py", line 301, in run
    task.load_data(eval_splits=task_eval_splits, **kwargs)
  File "/home/ubuntu/isaac/work/mteb/mteb/abstasks/MultiSubsetLoader.py", line 18, in load_data
    self.dataset_transform()
  File "/home/ubuntu/isaac/work/mteb/mteb/tasks/Clustering/multilingual/WikiClusteringP2P.py", line 84, in dataset_transform
    self.dataset = self.stratified_subsampling(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/isaac/work/mteb/mteb/abstasks/AbsTask.py", line 68, in stratified_subsampling
    if not isinstance(dataset_dict[splits[0]].features[label], datasets.ClassLabel):
                      ~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'test'
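For context on the KeyError: multilingual tasks nest splits under language subsets, so the top-level keys of the dataset are language codes rather than split names, and stratified_subsampling's lookup of splits[0] fails. A simplified sketch with plain dicts (the real objects are datasets.DatasetDict instances):

```python
# Simplified sketch: monolingual vs. multilingual dataset layouts.
monolingual = {"test": ["sample1", "sample2"]}  # keys are split names
multilingual = {
    "de": {"test": ["satz1"]},    # keys are language codes...
    "en": {"test": ["sample1"]},  # ...with splits nested one level down
}

splits = ["test"]
assert splits[0] in monolingual         # works for monolingual tasks
assert splits[0] not in multilingual    # -> KeyError: 'test' in stratified_subsampling
assert splits[0] in multilingual["en"]  # splits live under each language subset
```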

I'm starting to notice that to handle multilingual datasets, the method will need access to self.is_multilingual and self.metadata_dict["eval_langs"]. So part of me would like to convert this from a static method to a class method. Any thoughts on this?
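The static-vs-instance distinction in question, minimally (illustrative class and method names only, not the real AbsTask API):

```python
class AbsTaskSketch:
    # Hypothetical stand-in: MultilingualTask sets this on real tasks.
    is_multilingual = True

    @staticmethod
    def subsample_static(dataset_dict):
        # A staticmethod has no access to self, so it cannot branch
        # on task attributes like is_multilingual.
        return dataset_dict

    def subsample_method(self, dataset_dict):
        # As an instance method, it can inspect task attributes.
        if self.is_multilingual:
            ...  # iterate per-language subsets here
        return dataset_dict
```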

cc @imenelydiaker @KennethEnevoldsen @x-tabdeveloping or anyone else who's interested.

imenelydiaker commented 6 months ago

For multilingual datasets you could do something like:

def dataset_transform(self):
    for lang in self.langs:  # it's self.hf_subsets now I think
        self.dataset[lang] = self.stratified_subsampling(
            self.dataset[lang],
            self.seed,
            self.metadata.eval_splits,
            label="labels",
            n_samples=2048,
        )

Do you find doing this in a loop redundant?

No strong preference if you'd prefer handling all languages at once. You may need to add a check on the languages you want to subsample, though, since not every language will necessarily have more than 2048 samples.
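That check could be sketched like this, capping n_samples at the split size per language (subsample below is a hypothetical stand-in for stratified_subsampling and just truncates; the real method does stratified sampling on a label column):

```python
# Hedged sketch: per-language subsampling with a size guard.
def subsample(rows, n_samples):
    # Stand-in for stratified_subsampling; simply truncates.
    return rows[:n_samples]

N_SAMPLES = 2048
dataset = {
    "en": {"test": list(range(5000))},
    "da": {"test": list(range(800))},  # fewer than 2048 samples
}

for lang, lang_dataset in dataset.items():
    for split, rows in lang_dataset.items():
        # Cap n_samples at the split size so small languages pass through whole.
        lang_dataset[split] = subsample(rows, min(N_SAMPLES, len(rows)))

assert len(dataset["en"]["test"]) == 2048
assert len(dataset["da"]["test"]) == 800
```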

isaac-chung commented 6 months ago

Sometimes the simplest solutions are right under our noses! I think this works better: we separate out the languages in the subsampling and set a slightly lower sample count where needed.

Thanks!