NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0
617 stars 83 forks

[FEA] Add support for huggingface datasets #28

Open ayushdg opened 7 months ago

ayushdg commented 7 months ago

Is your feature request related to a problem? Please describe.
NeMo Curator currently represents document datasets as dataframes and includes some helpers to read them from JSON/Parquet files.

Describe the solution you'd like
Support for reading in and working with Hugging Face datasets.

Describe alternatives you've considered
Dumping Hugging Face datasets to JSON/Parquet before reading them with Curator.

Additional context
N/A
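
For reference, the "dump first" alternative could look roughly like the sketch below. It assumes the `datasets` package is installed; `export_path` and `dump_hf_dataset` are hypothetical helper names, not part of Curator or `datasets`, and the file naming scheme is made up for illustration.

```python
from pathlib import Path


def export_path(output_dir, dataset_name, split, fmt="parquet"):
    """Build the output file path for a dumped split (hypothetical naming scheme)."""
    extensions = {"parquet": "parquet", "jsonl": "jsonl"}
    if fmt not in extensions:
        raise ValueError(f"unsupported format: {fmt!r}")
    # Hub dataset names may contain "/", which is not valid inside a file name.
    safe_name = dataset_name.replace("/", "__")
    return Path(output_dir) / f"{safe_name}-{split}.{extensions[fmt]}"


def dump_hf_dataset(dataset_name, split, output_dir, fmt="parquet"):
    """Download one split of a Hugging Face dataset and write it to disk."""
    # Imported lazily so the path helper above works without `datasets` installed.
    from datasets import load_dataset

    out = export_path(output_dir, dataset_name, split, fmt)
    out.parent.mkdir(parents=True, exist_ok=True)
    ds = load_dataset(dataset_name, split=split)
    if fmt == "parquet":
        ds.to_parquet(str(out))
    else:
        ds.to_json(str(out))  # Dataset.to_json writes JSON Lines by default
    return out
```

The resulting file can then be read with Curator's existing JSON/Parquet helpers. The downside is an extra full copy of the data on disk, which is what this request aims to avoid.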

ayushdg commented 5 months ago

Quick update: huggingface_hub provides HfFileSystem, which is fsspec-compatible. https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system

It should be possible to use this to read Hugging Face datasets as Dask dataframes by passing paths with the hf:// protocol prefix.
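
A minimal, untested sketch of that idea, assuming `dask` and `huggingface_hub` are installed and the dataset repository stores Parquet files. `hf_dataset_url` and `read_hf_parquet` are hypothetical helper names, and the repository and file names in the usage note are placeholders.

```python
def hf_dataset_url(repo_id, path_in_repo):
    """Build an hf:// URL for a file (or glob) inside a dataset repository."""
    # HfFileSystem addresses dataset repositories as hf://datasets/<repo_id>/<path>.
    return f"hf://datasets/{repo_id.strip('/')}/{path_in_repo.lstrip('/')}"


def read_hf_parquet(repo_id, path_in_repo):
    """Lazily read Parquet files from the Hugging Face Hub as a Dask dataframe."""
    # Imported lazily so the URL helper works without these packages installed.
    import dask.dataframe as dd
    import huggingface_hub  # noqa: F401  (provides the fsspec hf:// filesystem)

    return dd.read_parquet(hf_dataset_url(repo_id, path_in_repo))
```

For example, `read_hf_parquet("my-username/my-dataset-repo", "data/*.parquet")` would return a lazy Dask dataframe. Whether that dataframe can be handed to Curator directly (for instance by wrapping it in a document dataset) is an assumption that still needs to be verified.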