ExpressAI / DataLab

The unified platform for data-related resources.
https://expressai.github.io/DataLab/
Apache License 2.0

Incompatibility with huggingface-datasets #425

Open helpmefindaname opened 1 year ago

helpmefindaname commented 1 year ago

Hello, I am currently working on a project where both DataLab and datasets are subdependencies. I noticed that I cannot import both libraries in the same process: each one registers its FileSystems in fsspec and expects those protocols not to have been registered already.

Versions

datalabs==0.4.15
datasets==2.12.0

Replication

import datasets
import datalabs

Error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Bened\anaconda3\envs\ner-eval-dashboard2\lib\site-packages\datalabs\__init__.py", line 28, in <module>
    from datalabs.arrow_dataset import concatenate_datasets, Dataset
  File "C:\Users\Bened\anaconda3\envs\ner-eval-dashboard2\lib\site-packages\datalabs\arrow_dataset.py", line 60, in <module>
    from datalabs.arrow_writer import ArrowWriter, OptimizedTypedSequence
  File "C:\Users\Bened\anaconda3\envs\ner-eval-dashboard2\lib\site-packages\datalabs\arrow_writer.py", line 28, in <module>
    from datalabs.features import (
  File "C:\Users\Bened\anaconda3\envs\ner-eval-dashboard2\lib\site-packages\datalabs\features\__init__.py", line 2, in <module>
    from datalabs.features.audio import Audio
  File "C:\Users\Bened\anaconda3\envs\ner-eval-dashboard2\lib\site-packages\datalabs\features\audio.py", line 21, in <module>
    from datalabs.utils.streaming_download_manager import xopen
  File "C:\Users\Bened\anaconda3\envs\ner-eval-dashboard2\lib\site-packages\datalabs\utils\streaming_download_manager.py", line 16, in <module>
    from datalabs.filesystems import COMPRESSION_FILESYSTEMS
  File "C:\Users\Bened\anaconda3\envs\ner-eval-dashboard2\lib\site-packages\datalabs\filesystems\__init__.py", line 37, in <module>
    fsspec.register_implementation(fs_class.protocol, fs_class)
  File "C:\Users\Bened\anaconda3\envs\ner-eval-dashboard2\lib\site-packages\fsspec\registry.py", line 51, in register_implementation
    raise ValueError(
ValueError: Name (bz2) already in the registry and clobber is False

Possible Solution

I think a simple solution would be to just set clobber=True in https://github.com/ExpressAI/DataLab/blob/main/datalabs/filesystems/__init__.py#L37. This allows the registry to overwrite previous registrations. This should work, as the datalabs FileSystems are copies of the datasets FileSystems. However, I don't know whether it is guaranteed to be compatible with other libraries that might use the same protocols.
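To illustrate what clobber=True changes, here is a minimal sketch that reproduces the conflict without importing either library. The protocol name and the two FileSystem classes are hypothetical stand-ins for the datasets and datalabs implementations; only fsspec.register_implementation and fsspec.get_filesystem_class are real fsspec calls.

```python
import fsspec
from fsspec import AbstractFileSystem

# Hypothetical protocol name, chosen so it cannot collide with a real one.
PROTOCOL = "demo-clobber-example"


class FirstFS(AbstractFileSystem):
    """Stand-in for the FileSystem registered by the library imported first."""

    protocol = PROTOCOL


class SecondFS(AbstractFileSystem):
    """Stand-in for the FileSystem registered by the library imported second."""

    protocol = PROTOCOL


# First registration succeeds because the name is not taken yet.
fsspec.register_implementation(PROTOCOL, FirstFS)

# A second registration under the same name fails with the default clobber=False.
try:
    fsspec.register_implementation(PROTOCOL, SecondFS)
    conflict_raised = False
except ValueError:
    conflict_raised = True

# With clobber=True the earlier entry is replaced instead of raising.
fsspec.register_implementation(PROTOCOL, SecondFS, clobber=True)
winner = fsspec.get_filesystem_class(PROTOCOL)
```

In this sketch the last registration wins, so with clobber=True on both sides the active implementation would depend on import order.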

I am linking the symmetric issue on datasets, since ideally both libraries would solve this the same way. Otherwise, behavior could differ depending on which library gets imported first.