embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

FSTimeoutError for MLSUMClusteringP2P & MLSUMClusteringS2S #1311

Open bourdoiscatie opened 1 day ago

bourdoiscatie commented 1 day ago

Hi!

I've just trained a French embedding model and would like to evaluate it on MTEB_FR. I used the following code:

```python
import mteb

benchmark = mteb.get_benchmark("MTEB(fra)")
evaluation = mteb.MTEB(tasks=benchmark)
evaluation.run(my_model, eval_splits=["test"], output_folder="results")
```

and everything ran fine until MLSUMClusteringP2P, where I got the following error:

---------------------------------------------------------------------------
TimeoutError                              Traceback (most recent call last)
File ~/.local/lib/python3.12/site-packages/fsspec/asyn.py:56, in _runner(event, coro, result, timeout)
     55 try:
---> 56     result[0] = await coro
     57 except Exception as ex:

File ~/.local/lib/python3.12/site-packages/fsspec/implementations/http.py:254, in HTTPFileSystem._get_file(self, rpath, lpath, chunk_size, callback, **kwargs)
    253 while chunk:
--> 254     chunk = await r.content.read(chunk_size)
    255     outfile.write(chunk)

File /usr/lib/python3.12/site-packages/aiohttp/streams.py:393, in StreamReader.read(self, n)
    392 while not self._buffer and not self._eof:
--> 393     await self._wait("read")
    395 return self._read_nowait(n)

File /usr/lib/python3.12/site-packages/aiohttp/streams.py:311, in StreamReader._wait(self, func_name)
    310 try:
--> 311     with self._timer:
    312         await waiter

File /usr/lib/python3.12/site-packages/aiohttp/helpers.py:713, in TimerContext.__exit__(self, exc_type, exc_val, exc_tb)
    712 if exc_type is asyncio.CancelledError and self._cancelled:
--> 713     raise asyncio.TimeoutError from None
    714 return None

TimeoutError: 

The above exception was the direct cause of the following exception:

FSTimeoutError                            Traceback (most recent call last)
Cell In[9], line 5
      3 benchmark = mteb.get_benchmark("MTEB(fra)")
      4 evaluation = mteb.MTEB(tasks=benchmark)
----> 5 evaluation.run(model, eval_splits=["test"], output_folder=f"results")

File ~/.local/lib/python3.12/site-packages/mteb/evaluation/MTEB.py:465, in MTEB.run(self, model, verbosity, output_folder, eval_splits, overwrite_results, raise_error, co2_tracker, encode_kwargs, **kwargs)
    461 logger.error(
    462     f"Error while evaluating {task.metadata_dict['name']}: {e}"
    463 )
    464 if raise_error:
--> 465     raise e
    466 logger.error(
    467     f"Please check all the error logs at: {self.err_logs_path}"
    468 )
    469 with open(self.err_logs_path, "a") as f_out:

File ~/.local/lib/python3.12/site-packages/mteb/evaluation/MTEB.py:395, in MTEB.run(self, model, verbosity, output_folder, eval_splits, overwrite_results, raise_error, co2_tracker, encode_kwargs, **kwargs)
    393 logger.info(f"Loading dataset for {task.metadata_dict['name']}")
    394 task.check_if_dataset_is_superseeded()
--> 395 task.load_data(eval_splits=task_eval_splits, **kwargs)
    397 # run evaluation
    398 task_results = {}

File ~/.local/lib/python3.12/site-packages/mteb/tasks/Clustering/multilingual/MLSUMClusteringP2P.py:66, in MLSUMClusteringP2P.load_data(self, **kwargs)
     64 self.dataset = {}
     65 for lang in self.hf_subsets:
---> 66     self.dataset[lang] = datasets.load_dataset(
     67         name=lang,
     68         **self.metadata_dict["dataset"],
     69     )
     70     self.dataset_transform(lang)
     71 self.data_loaded = True

File ~/.local/lib/python3.12/site-packages/datasets/load.py:2096, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2093     return builder_instance.as_streaming_dataset(split=split)
   2095 # Download and prepare data
-> 2096 builder_instance.download_and_prepare(
   2097     download_config=download_config,
   2098     download_mode=download_mode,
   2099     verification_mode=verification_mode,
   2100     num_proc=num_proc,
   2101     storage_options=storage_options,
   2102 )
   2104 # Build dataset for splits
   2105 keep_in_memory = (
   2106     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   2107 )

File ~/.local/lib/python3.12/site-packages/datasets/builder.py:924, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, dl_manager, base_path, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    922 if num_proc is not None:
    923     prepare_split_kwargs["num_proc"] = num_proc
--> 924 self._download_and_prepare(
    925     dl_manager=dl_manager,
    926     verification_mode=verification_mode,
    927     **prepare_split_kwargs,
    928     **download_and_prepare_kwargs,
    929 )
    930 # Sync info
    931 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File ~/.local/lib/python3.12/site-packages/datasets/builder.py:1647, in GeneratorBasedBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs)
   1646 def _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs):
-> 1647     super()._download_and_prepare(
   1648         dl_manager,
   1649         verification_mode,
   1650         check_duplicate_keys=verification_mode == VerificationMode.BASIC_CHECKS
   1651         or verification_mode == VerificationMode.ALL_CHECKS,
   1652         **prepare_splits_kwargs,
   1653     )

File ~/.local/lib/python3.12/site-packages/datasets/builder.py:977, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
    975 split_dict = SplitDict(dataset_name=self.dataset_name)
    976 split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 977 split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
    979 # Checksums verification
    980 if verification_mode == VerificationMode.ALL_CHECKS and dl_manager.record_checksums:

File ~/.cache/huggingface/modules/datasets_modules/datasets/reciTAL--mlsum/1b2d4e3020a63e9423caeccd38a04b02b7472d5694a33f4b2ae5b09e11e1f4cb/mlsum.py:74, in Mlsum._split_generators(self, dl_manager)
     68 lang = self.config.name
     69 urls_to_download = {
     70     "train": f"{_URL}/{lang}_train.jsonl?inline=false",
     71     "validation": f"{_URL}/{lang}_val.jsonl?inline=false",
     72     "test": f"{_URL}/{lang}_test.jsonl?inline=false",
     73 }
---> 74 downloaded_files = dl_manager.download(urls_to_download)
     76 return [
     77     datasets.SplitGenerator(
     78         name=split,
   (...)
     83     for split in [datasets.Split.TRAIN, datasets.Split.VALIDATION, datasets.Split.TEST]
     84 ]

File ~/.local/lib/python3.12/site-packages/datasets/download/download_manager.py:159, in DownloadManager.download(self, url_or_urls)
    157 start_time = datetime.now()
    158 with stack_multiprocessing_download_progress_bars():
--> 159     downloaded_path_or_paths = map_nested(
    160         download_func,
    161         url_or_urls,
    162         map_tuple=True,
    163         num_proc=download_config.num_proc,
    164         desc="Downloading data files",
    165         batched=True,
    166         batch_size=-1,
    167     )
    168 duration = datetime.now() - start_time
    169 logger.info(f"Downloading took {duration.total_seconds() // 60} min")

File ~/.local/lib/python3.12/site-packages/datasets/utils/py_utils.py:512, in map_nested(function, data_struct, dict_only, map_list, map_tuple, map_numpy, num_proc, parallel_min_length, batched, batch_size, types, disable_tqdm, desc)
    509         batch_size = max(len(iterable) // num_proc + int(len(iterable) % num_proc > 0), 1)
    510     iterable = list(iter_batched(iterable, batch_size))
    511 mapped = [
--> 512     _single_map_nested((function, obj, batched, batch_size, types, None, True, None))
    513     for obj in hf_tqdm(iterable, disable=disable_tqdm, desc=desc)
    514 ]
    515 if batched:
    516     mapped = [mapped_item for mapped_batch in mapped for mapped_item in mapped_batch]

File ~/.local/lib/python3.12/site-packages/datasets/utils/py_utils.py:380, in _single_map_nested(args)
    373         return function(data_struct)
    374 if (
    375     batched
    376     and not isinstance(data_struct, dict)
    377     and isinstance(data_struct, types)
    378     and all(not isinstance(v, (dict, types)) for v in data_struct)
    379 ):
--> 380     return [mapped_item for batch in iter_batched(data_struct, batch_size) for mapped_item in function(batch)]
    382 # Reduce logging to keep things readable in multiprocessing with tqdm
    383 if rank is not None and logging.get_verbosity() < logging.WARNING:

File ~/.local/lib/python3.12/site-packages/datasets/download/download_manager.py:216, in DownloadManager._download_batched(self, url_or_filenames, download_config)
    202     return thread_map(
    203         download_func,
    204         url_or_filenames,
   (...)
    212         tqdm_class=tqdm,
    213     )
    214 else:
    215     return [
--> 216         self._download_single(url_or_filename, download_config=download_config)
    217         for url_or_filename in url_or_filenames
    218     ]

File ~/.local/lib/python3.12/site-packages/datasets/download/download_manager.py:225, in DownloadManager._download_single(self, url_or_filename, download_config)
    222 if is_relative_path(url_or_filename):
    223     # append the relative path to the base_path
    224     url_or_filename = url_or_path_join(self._base_path, url_or_filename)
--> 225 out = cached_path(url_or_filename, download_config=download_config)
    226 out = tracked_str(out)
    227 out.set_origin(url_or_filename)

File ~/.local/lib/python3.12/site-packages/datasets/utils/file_utils.py:205, in cached_path(url_or_filename, download_config, **download_kwargs)
    202             raise FileNotFoundError(str(e)) from e
    203     # Download external files
    204     else:
--> 205         output_path = get_from_cache(
    206             url_or_filename,
    207             cache_dir=cache_dir,
    208             force_download=download_config.force_download,
    209             user_agent=download_config.user_agent,
    210             use_etag=download_config.use_etag,
    211             token=download_config.token,
    212             storage_options=storage_options,
    213             download_desc=download_config.download_desc,
    214             disable_tqdm=download_config.disable_tqdm,
    215         )
    216 elif os.path.exists(url_or_filename):
    217     # File, and it exists.
    218     output_path = url_or_filename

File ~/.local/lib/python3.12/site-packages/datasets/utils/file_utils.py:415, in get_from_cache(url, cache_dir, force_download, user_agent, use_etag, token, storage_options, download_desc, disable_tqdm)
    413     logger.info(f"{url} not found in cache or force_download set to True, downloading to {temp_file.name}")
    414     # GET file object
--> 415     fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc, disable_tqdm=disable_tqdm)
    417 logger.info(f"storing {url} in cache at {cache_path}")
    418 shutil.move(temp_file.name, cache_path)

File ~/.local/lib/python3.12/site-packages/datasets/utils/file_utils.py:334, in fsspec_get(url, temp_file, storage_options, desc, disable_tqdm)
    321 fs, path = url_to_fs(url, **(storage_options or {}))
    322 callback = TqdmCallback(
    323     tqdm_kwargs={
    324         "desc": desc or "Downloading",
   (...)
    332     }
    333 )
--> 334 fs.get_file(path, temp_file.name, callback=callback)

File ~/.local/lib/python3.12/site-packages/fsspec/asyn.py:118, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
    115 @functools.wraps(func)
    116 def wrapper(*args, **kwargs):
    117     self = obj or args[0]
--> 118     return sync(self.loop, func, *args, **kwargs)

File ~/.local/lib/python3.12/site-packages/fsspec/asyn.py:101, in sync(loop, func, timeout, *args, **kwargs)
     98 return_result = result[0]
     99 if isinstance(return_result, asyncio.TimeoutError):
    100     # suppress asyncio.TimeoutError, raise FSTimeoutError
--> 101     raise FSTimeoutError from return_result
    102 elif isinstance(return_result, BaseException):
    103     raise return_result

FSTimeoutError: 

I then ran the code on each task individually and everything ran fine, except for MLSUMClusteringP2P and MLSUMClusteringS2S, where I got the same error. This suggests to me that there may be a problem with these two datasets, but I can't say what it is. I haven't found any other issue reporting this problem.

Note that I'm using version 1.16.1 of the library.

If you can enlighten me on this point, I'd be very grateful 🙏

isaac-chung commented 1 day ago

Hi @bourdoiscatie!

This looks like a TimeoutError that occurs when downloading a dataset takes too long, and MLSUM is a few GB in size. Without knowing what hardware you ran this on, I can only suggest checking your internet connectivity and rerunning the task.
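
If it helps, you can rerun just the two affected tasks instead of the whole benchmark. A rough sketch using the task API (with `my_model` standing in for your loaded model):

```python
import mteb

# Rerun only the two tasks that hit the timeout, rather than all of MTEB(fra).
tasks = mteb.get_tasks(tasks=["MLSUMClusteringP2P", "MLSUMClusteringS2S"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(my_model, eval_splits=["test"], output_folder="results")
```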

Here's my successful run on a linux machine:

printout:

```
$ mteb run -t MLSUMClusteringP2P -m sentence-transformers/all-MiniLM-L6-v2
2024-10-23 18:22:41.338893: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-10-23 18:22:41.356253: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-23 18:22:41.373459: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-23 18:22:41.380070: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-23 18:22:41.394350: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-23 18:22:42.270947: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
INFO:mteb.cli:Running with parameters: Namespace(model='sentence-transformers/all-MiniLM-L6-v2', task_types=None, categories=None, tasks=['MLSUMClusteringP2P'], languages=None, benchmarks=None, device=None, output_folder='results', verbosity=2, co2_tracker=False, eval_splits=None, model_revision=None, batch_size=None, overwrite=False, save_predictions=False, func=)
INFO:mteb.evaluation.MTEB:

## Evaluating 1 tasks:
────────────────────────── Selected tasks ──────────────────────────
Clustering
    - MLSUMClusteringP2P, p2p, multilingual 4 / 4 Subsets

INFO:mteb.evaluation.MTEB:

********************** Evaluating MLSUMClusteringP2P **********************
INFO:mteb.evaluation.MTEB:Loading dataset for MLSUMClusteringP2P
WARNING:mteb.abstasks.AbsTask:Dataset 'MLSUMClusteringP2P' is superseeded by 'MLSUMClusteringP2P.v2', you might consider using the newer version of the dataset.
Downloading builder script: 100%|██████████| 3.72k/3.72k [00:00<00:00, 19.6MB/s]
Downloading metadata: 100%|██████████| 12.7k/12.7k [00:00<00:00, 36.6MB/s]
Downloading readme: 100%|██████████| 11.0k/11.0k [00:00<00:00, 48.0MB/s]
Downloading data: 100%|██████████| 905M/905M [01:04<00:00, 13.9MB/s]
Downloading data: 100%|██████████| 50.3M/50.3M [00:07<00:00, 6.73MB/s]
Downloading data: 100%|██████████| 50.0M/50.0M [00:05<00:00, 9.86MB/s]
Generating train split: 100%|██████████| 220887/220887 [00:12<00:00, 17652.64 examples/s]
Generating validation split: 100%|██████████| 11394/11394 [00:00<00:00, 16784.54 examples/s]
Generating test split: 100%|██████████| 10701/10701 [00:00<00:00, 17135.01 examples/s]
Map: 100%|██████████| 11394/11394 [00:01<00:00, 10121.87 examples/s]
Map: 100%|██████████| 10701/10701 [00:00<00:00, 12863.89 examples/s]
Downloading data: 100%|██████████| 1.69G/1.69G [01:26<00:00, 19.5MB/s]
Downloading data: 100%|██████████| 81.8M/81.8M [00:05<00:00, 14.9MB/s]
Downloading data: 100%|██████████| 80.8M/80.8M [00:07<00:00, 11.5MB/s]
Generating train split: 100%|██████████| 392902/392902 [00:24<00:00, 15981.81 examples/s]
Generating validation split: 100%|██████████| 16059/16059 [00:01<00:00, 14387.65 examples/s]
Generating test split: 100%|██████████| 15828/15828 [00:01<00:00, 14447.19 examples/s]
Map: 100%|██████████| 16059/16059 [00:01<00:00, 11564.95 examples/s]
Map: 100%|██████████| 15828/15828 [00:01<00:00, 9738.31 examples/s]
Downloading data: 100%|██████████| 714M/714M [00:45<00:00, 15.8MB/s]
Downloading data: 100%|██████████| 25.3M/25.3M [00:02<00:00, 9.95MB/s]
Downloading data: 100%|██████████| 26.8M/26.8M [00:04<00:00, 6.27MB/s]
Generating train split: 100%|██████████| 25556/25556 [00:04<00:00, 5554.02 examples/s]
Generating validation split: 100%|██████████| 750/750 [00:00<00:00, 4160.37 examples/s]
Generating test split: 100%|██████████| 757/757 [00:00<00:00, 4876.21 examples/s]
Map: 100%|██████████| 750/750 [00:00<00:00, 5663.21 examples/s]
Map: 100%|██████████| 757/757 [00:00<00:00, 6580.97 examples/s]
Downloading data: 100%|██████████| 1.32G/1.32G [01:27<00:00, 15.2MB/s]
Downloading data: 100%|██████████| 55.1M/55.1M [00:06<00:00, 8.45MB/s]
Downloading data: 100%|██████████| 77.5M/77.5M [00:08<00:00, 8.98MB/s]
Generating train split: 100%|██████████| 266367/266367 [00:17<00:00, 14969.93 examples/s]
Generating validation split: 100%|██████████| 10358/10358 [00:00<00:00, 12266.27 examples/s]
Generating test split: 100%|██████████| 13920/13920 [00:01<00:00, 13378.31 examples/s]
Map: 100%|██████████| 10358/10358 [00:00<00:00, 11415.53 examples/s]
Map: 100%|██████████| 13920/13920 [00:01<00:00, 11853.92 examples/s]
INFO:mteb.abstasks.AbsTask:
Task: MLSUMClusteringP2P, split: validation, subset: de. Running...
Clustering: 0%| | 0/10 [00:00
```

Maybe @imenelydiaker or @KennethEnevoldsen have seen this error before?

imenelydiaker commented 1 day ago

I've never seen this error before. It looks like an internet issue, but it could be anything related to the network you're using. As @isaac-chung mentioned, MLSUM is quite a big dataset and takes some time to download.
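
One workaround that might be worth trying (a sketch I haven't verified; if I remember correctly, aiohttp's default client timeout is around 5 minutes, which a multi-GB file on a slow link can easily exceed): pre-download the dataset with `datasets.load_dataset` and pass a larger HTTP timeout through `storage_options`, since the `FSTimeoutError` above is aiohttp's client timeout surfaced by fsspec:

```python
import aiohttp
from datasets import load_dataset

# Sketch: download one MLSUM subset with a much larger HTTP timeout. The config name
# "fr" and the one-hour value are examples; adjust to the subsets and time you need.
storage_options = {"client_kwargs": {"timeout": aiohttp.ClientTimeout(total=3600)}}
ds = load_dataset(
    "reciTAL/mlsum",
    "fr",
    trust_remote_code=True,
    storage_options=storage_options,
)
```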

bourdoiscatie commented 1 day ago

Thank you for your feedback @isaac-chung @imenelydiaker! For my evaluation, I'm using an A100 on a remote server to which I was given access for this purpose. Unfortunately, I don't have control over the server's internet connection, so I'll probably download this dataset on my side and then upload it to the server. Is it enough to put it in HF's cache, or does it need to go in a particular place so that the MTEB library can find it later?
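
To make the question concrete, what I have in mind is roughly this (a sketch; I'm assuming the default `~/.cache/huggingface/datasets` cache is what the later mteb run will look in, and that these four language subsets are the ones the task uses):

```python
from datasets import load_dataset

# Warm the Hugging Face datasets cache on a machine with a good connection, so that
# the later mteb run finds the already-prepared dataset instead of re-downloading it.
for lang in ["de", "es", "fr", "ru"]:
    load_dataset("reciTAL/mlsum", lang, trust_remote_code=True)
```

If the evaluation then runs on another machine, I suppose copying the cache directory over (or pointing `HF_DATASETS_CACHE` at the copied location) would be enough?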

isaac-chung commented 1 day ago

@bourdoiscatie for reference I'm using an A10 on a remote server, and the dataset was downloaded into the default HF cache location.

bourdoiscatie commented 1 day ago

Thanks for the information, I should be able to manage with all that 🤗 I'll close the issue.

bourdoiscatie commented 1 day ago

For those who run into the same problem: it seems to be caused by the datasets library since version 3.x (see https://github.com/huggingface/datasets/issues/7175). Downgrading the library seems to be a temporary workaround.

lhoestq commented 1 day ago

Hi! I'm Quentin from HF :)

Unfortunately we had to limit our support of script-based datasets for obvious security reasons, and apparently that made some issues related to relying on bad hosts resurface :/ Have you considered uploading the data to HF instead (ideally in Parquet, to avoid using a dataset script)?

bourdoiscatie commented 1 day ago

Looking at the code, I realize that the massive train split is not even used in practice: https://github.com/embeddings-benchmark/mteb/blob/bac8bd7212a90fb814d5c92e4d39ee12e92e5fe7/mteb/tasks/Clustering/multilingual/MLSUMClusteringP2P.py#L80 Wouldn't it be more appropriate to load only the validation and test splits to speed things up, and, as Quentin suggests, possibly host these two splits on the Hub?
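
Roughly what I mean, as a sketch of `load_data` (assuming the loader can skip the train files when only the other splits are requested, which I believe is the case once the data is hosted as plain Parquet on the Hub rather than behind a script):

```python
import datasets

def load_data(self, **kwargs):
    """Sketch: only load the splits that are actually evaluated."""
    if self.data_loaded:
        return
    self.dataset = {}
    for lang in self.hf_subsets:
        self.dataset[lang] = datasets.DatasetDict(
            {
                split: datasets.load_dataset(
                    name=lang,
                    split=split,
                    **self.metadata_dict["dataset"],
                )
                for split in ["validation", "test"]  # the massive train split is never used
            }
        )
        self.dataset_transform(lang)
    self.data_loaded = True
```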

imenelydiaker commented 22 hours ago

Thanks @lhoestq and @bourdoiscatie for pointing this out.

The best solution (imo) is to re-upload the dataset to HF in Parquet format; the validation and test splits are also generated by a script, so if we want to avoid this error again, it's better to re-upload them in a supported format.
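
For reference, the re-upload itself could be roughly as simple as this (a sketch; the target repo name is a placeholder, and `push_to_hub` writes Parquet under the hood):

```python
from datasets import DatasetDict, load_dataset

# Sketch: convert the script-based MLSUM subsets into plain Parquet configs on the Hub,
# keeping only the splits that MTEB actually evaluates. "mteb/mlsum" is a placeholder.
for lang in ["de", "es", "fr", "ru"]:
    ds = load_dataset("reciTAL/mlsum", lang, trust_remote_code=True)
    DatasetDict({"validation": ds["validation"], "test": ds["test"]}).push_to_hub(
        "mteb/mlsum", config_name=lang
    )
```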

We're working on it and will let you know when it's fixed, thank you. 🙏