Lightning-AI / litdata

Streamline data pipelines for AI. Process datasets across 1000s of machines, and optimize data for blazing fast model training.
Apache License 2.0
250 stars 24 forks

Question: is there a plan to support streaming from GCS? #101

Closed dnnspark closed 4 weeks ago

dnnspark commented 2 months ago

🚀 Feature

Motivation

Pitch

Alternatives

Additional context

github-actions[bot] commented 2 months ago

Hi! Thanks for your contribution! Great first issue!

tchaton commented 2 months ago

Hey @dnnspark,

Yes, this is quite simple to add. We just need to add the downloader.
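
For readers wondering what "adding the downloader" might involve, here is a minimal sketch of a scheme-based downloader registry. It is not litdata's actual code: the class names, the `DOWNLOADERS` mapping, and `get_downloader` are hypothetical, and the download bodies are placeholders where a real implementation would call a cloud client.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class Downloader(ABC):
    """Hypothetical base class: fetch one remote file to a local path."""

    @abstractmethod
    def download_file(self, remote_url: str, local_path: str) -> None:
        ...


class S3Downloader(Downloader):
    def download_file(self, remote_url: str, local_path: str) -> None:
        # Placeholder: a real implementation would call an S3 client here.
        raise NotImplementedError


class GCSDownloader(Downloader):
    def download_file(self, remote_url: str, local_path: str) -> None:
        # Placeholder: a real implementation would call a GCS client here.
        raise NotImplementedError


# Registry keyed by URL scheme. Supporting a new store means adding one entry.
DOWNLOADERS = {"s3": S3Downloader, "gs": GCSDownloader}


def get_downloader(remote_url: str) -> Downloader:
    """Pick a downloader from the URL scheme, e.g. 'gs' for gs://bucket/key."""
    scheme = urlparse(remote_url).scheme
    if scheme not in DOWNLOADERS:
        raise ValueError(f"No downloader registered for scheme {scheme!r}")
    return DOWNLOADERS[scheme]()
```

With a layout like this, the streaming code never needs to know which cloud it is talking to; it only asks the registry for a downloader that matches the URL.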

dnnspark commented 2 months ago

Thanks @tchaton, do you have an idea of when it's going to land (even a very rough estimate)?

tchaton commented 2 months ago

Hey @dnnspark,

If you are willing to give it a try, I can look into it this week.

dnnspark commented 2 months ago

Sorry for the late reply, @tchaton.

I'm willing to try! But it's not blocking at the moment, so I will stay tuned for GCS support (it would be very helpful if you could ping this thread once it's ready).

One thing I noticed is that the optimize() function assumes the data is stored on local disk (at least in the example). In my case, the raw data is in GCS (because it's too large). Is there a way to transform data that is stored in the cloud and save the transformed data back to the cloud without having to download the entire dataset?

tchaton commented 2 months ago

Yes, that's why this library was built :) But I would need to add GCS support for it ;) I will try to prioritize it.

tchaton commented 4 weeks ago

@dnnspark GCS support was merged.
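
For anyone landing here with the same question, below is a rough sketch of the cloud-to-cloud workflow discussed above: each worker fetches one raw object from GCS, transforms it, and optimize() writes the resulting chunks to the output location. Several things are assumptions, not confirmed in this thread: that output_dir accepts a gs:// path, the bucket names and object layout (hypothetical), and the use of the google-cloud-storage client to read raw objects. Check the litdata README for the currently supported options.

```python
from typing import Tuple


def split_gs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/obj' into ('bucket', 'path/to/obj')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"expected a gs:// URI, got {uri!r}")
    bucket, _, blob = uri[len("gs://"):].partition("/")
    return bucket, blob


def process_object(uri: str) -> dict:
    """Fetch ONE raw object from GCS and return the sample to store."""
    # Imported lazily; requires the google-cloud-storage package.
    from google.cloud import storage

    bucket, blob = split_gs_uri(uri)
    raw = storage.Client().bucket(bucket).blob(blob).download_as_bytes()
    # Replace this with the real transformation (decode, resize, tokenize, ...).
    return {"uri": uri, "num_bytes": len(raw), "data": raw}


def main() -> None:
    # Imported lazily; requires litdata to be installed.
    from litdata import optimize

    # Hypothetical object URIs; in practice, list them from your bucket.
    inputs = [f"gs://my-raw-bucket/raw/{i:05d}.bin" for i in range(1000)]

    optimize(
        fn=process_object,
        inputs=inputs,
        # Assumption: a gs:// output path is accepted now that GCS support is merged.
        output_dir="gs://my-optimized-bucket/dataset",
        chunk_bytes="64MB",
        num_workers=4,
    )
```

Call main() to run the job. Because process_object handles one URI at a time, only the objects currently being processed are held locally, so the full raw dataset never has to be downloaded up front.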