Open ayushdg opened 7 months ago
Quick update: there is an fsspec-compatible HfFileSystem:
https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system
It should be possible to use it to read Hugging Face datasets in as Dask dataframes via the hf://
protocol prefix.
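A rough sketch of what that could look like, untested against Curator itself. It assumes `huggingface_hub` is installed (which registers the hf:// protocol with fsspec) and that the dataset repo holds Parquet files. The helper names `hf_dataset_path` and `read_hf_parquet` and the repo id in the usage comment are hypothetical, made up for illustration.

```python
from typing import Optional


def hf_dataset_path(repo_id: str, path_in_repo: str, revision: Optional[str] = None) -> str:
    """Build an hf:// URL for a file or glob inside a Hub dataset repo.

    Follows the HfFileSystem layout: hf://datasets/<repo_id>[@<revision>]/<path>.
    (Hypothetical helper, not part of huggingface_hub or NeMo Curator.)
    """
    repo = repo_id if revision is None else f"{repo_id}@{revision}"
    return f"hf://datasets/{repo}/{path_in_repo.lstrip('/')}"


def read_hf_parquet(repo_id: str, path_in_repo: str, revision: Optional[str] = None, **kwargs):
    """Lazily read Parquet files from the Hub as a Dask dataframe.

    Assumption: huggingface_hub is installed, so fsspec can resolve hf:// paths.
    """
    # Imported here so the path helper above works without dask installed.
    import dask.dataframe as dd

    return dd.read_parquet(hf_dataset_path(repo_id, path_in_repo, revision), **kwargs)


# Usage (needs network access; the repo id is a placeholder):
# ddf = read_hf_parquet("my-username/my-dataset-repo", "data/*.parquet")
# print(ddf.head())
```

The resulting Dask dataframe could then presumably be wrapped in Curator's document dataset type, though that last step is not shown here.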
Is your feature request related to a problem? Please describe.
NeMo Curator currently represents document datasets as dataframes and includes some helpers for reading them from JSON/Parquet files.

Describe the solution you'd like
Support for reading in and working with Hugging Face datasets.

Describe alternatives you've considered
Dumping Hugging Face datasets to JSON/Parquet before reading them with Curator.

Additional context
N/A