Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.28k stars 157 forks

Load from Hugging Face ? #2841

Open lhoestq opened 1 month ago

lhoestq commented 1 month ago

Hi ! I'm Quentin from Hugging Face :)

Congrats on this project, this has the potential to help the community so much ! Especially with large scale and multimodal datasets.

I was wondering if you had any plan to support loading datasets from Hugging Face ? E.g. using the HF paths syntax hf://datasets/repo_id/path/in/repo

It would be useful for many datasets and use cases imo

jaychia commented 1 month ago

Indeeeeed.... @universalmind303 already built some basic support: https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/huggingface.html
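
For readers following along, here is a minimal sketch of what reading through that existing support might look like. The helper names `hf_dataset_path` and `read_hf_parquet` are hypothetical, and `username/dataset_name` is a placeholder repo id. That `daft.read_parquet` accepts `hf://` paths is taken from the linked integration page, so check it against your installed Daft version.

```python
def hf_dataset_path(repo_id: str, path_in_repo: str = "") -> str:
    """Build an hf://datasets/... path from a repo id and an optional path in the repo."""
    base = f"hf://datasets/{repo_id.strip('/')}"
    suffix = path_in_repo.strip("/")
    return f"{base}/{suffix}" if suffix else base


def read_hf_parquet(repo_id: str, path_in_repo: str = ""):
    """Lazily read Parquet data from a Hugging Face dataset repo with Daft."""
    # Imported here so the path helper above stays usable without Daft installed.
    import daft

    # Assumption: the dataset repo stores its data as Parquet files.
    return daft.read_parquet(hf_dataset_path(repo_id, path_in_repo))
```

Calling `read_hf_parquet("username/dataset_name")` should return a lazy Daft DataFrame; nothing is downloaded until the DataFrame is collected or shown.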

I think there are still many things we can work on here (e.g. if HF can expose an S3 protocol instead, Daft will then really be able to zoom through these datasets). Would love to chat more about it.

jaychia commented 1 month ago

Other workloads we're actively working on include Hugging Face's Datatrove https://github.com/huggingface/datatrove -- we've been running some really large deduplication/batch inference pipelines with Daft

Could be really cool getting that working e2e with HF datasets

lhoestq commented 1 month ago

Nice ! Looking forward to playing with it then :)

What is missing for the next steps of HF support ? I feel like being able to write back to HF would be quite useful. Let me know if we can help with anything (also adding @wauplin @guipenedo for viz)

jaychia commented 1 month ago

Some open questions on our end which we can discuss:

  1. Can HF expose S3 protocols? That would greatly improve performance, potentially unlock writes, and also leverage a lot of the really good machinery we've already built/tuned for AWS S3 and Parquet reads.

  2. We should also make sure the multimodal support in Daft works well with HF-provided URLs. I'm curious how well this holds up if we start hammering HF's CDNs with something like df = df.with_column("images", df["image_urls"].url.download()) from a distributed setting. Note also that our Parquet reads tend to be much more aggressive than other Parquet readers (they are tuned against AWS S3).

  3. Lastly, we should do more benchmarking. I'd love for Daft to be the de facto way to do ETL and dataloading to/from HF datasets, and we should be able to back that up with numbers!
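
To make point 2 concrete, below is a small sketch of the download expression from that item wrapped in a helper that caps concurrency. The helper name `add_downloaded_images` and the default values are hypothetical. The `max_connections` and `on_error` keyword names are what I recall from Daft's `url.download`, so treat them as assumptions and verify them against your Daft version.

```python
def add_downloaded_images(df, url_column="image_urls", max_connections=8, on_error="null"):
    """Add an "images" column holding the bytes downloaded from `url_column`.

    Assumptions: `df` is a Daft DataFrame, and `url.download` accepts the
    `max_connections` and `on_error` keywords. The default of 8 connections
    is an arbitrary, conservative value, not a tuned number.
    """
    downloaded = df[url_column].url.download(
        max_connections=max_connections,  # fewer parallel requests per partition
        on_error=on_error,                # "null" keeps going when a URL fails
    )
    return df.with_column("images", downloaded)
```

Lowering `max_connections` only limits each partition's parallelism. In a distributed run the total request rate against the CDN also scales with the number of partitions processed at once.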

Excited for a collab :) -- let me know if there's a good way for us to sync up offline and perhaps we can set up a chat to discuss potential projects here

Wauplin commented 1 month ago

Hi there :wave:

> Can HF expose S3 protocols?

Not yet, no. We are in the process of changing how file management works in the backend (see blog post). This is a big piece of work but once that's done we can reassess :)

kevinzwang commented 1 month ago

Hey @Wauplin @lhoestq ! Glad to see interest from the Hugging Face team on this project!

> What is missing for the next steps of HF support ?

This is actually a question we would like to direct back to you! Since we do already have support for downloading datasets via the hf://datasets/repo_id/path/in/repo syntax, are there other features that you think would be valuable to users who want to use Daft with Hugging Face?

Also, wondering how we can get on this list 😄

[image]

lhoestq commented 1 month ago

We should add the ability to write back to HF. This will let people iterate more easily.

There is some code in the Spark docs to upload data in a distributed manner that we can reuse:

  1. _preupload
  2. _commit
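
The two steps above could be sketched roughly as follows. This is not the code from the Spark docs. It is a minimal illustration that assumes the steps map onto `HfApi.preupload_lfs_files` and `HfApi.create_commit` from `huggingface_hub`. The function names `preupload_shard` and `commit_shards` and the commit message are hypothetical.

```python
def preupload_shard(api, repo_id, local_path, path_in_repo):
    """Step 1 (run on each worker): upload one shard's bytes without committing.

    `api` is assumed to be a `huggingface_hub.HfApi` instance.
    """
    # Lazy import so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import CommitOperationAdd

    addition = CommitOperationAdd(path_in_repo=path_in_repo, path_or_fileobj=local_path)
    api.preupload_lfs_files(repo_id=repo_id, additions=[addition], repo_type="dataset")
    # The addition is returned so the driver can collect it for the final commit.
    return addition


def commit_shards(api, repo_id, additions, commit_message="Upload dataset shards"):
    """Step 2 (run once on the driver): create a single commit listing every shard."""
    return api.create_commit(
        repo_id=repo_id,
        operations=list(additions),
        commit_message=commit_message,
        repo_type="dataset",
    )
```

The idea is that workers upload shards in parallel and the driver makes one commit at the end, so the repo history gets a single commit rather than one per shard.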