Lightning-AI / litdata

Streamline data pipelines for AI. Process datasets across 1000s of machines, and optimize data for blazing fast model training.
Apache License 2.0

Subsample StreamingDataset #135

Closed by yhl48 2 weeks ago

yhl48 commented 1 month ago

🚀 Feature

StreamingDataset enables fast data reading, which is amazing when we have a large dataset. However, it currently does not support reading just a fraction of the data, and simple approaches such as slicing the dataset do not work with StreamingDataset.
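To make the request concrete, here is a minimal sketch of the kind of workaround one might use in the meantime, assuming the dataset can only be iterated and not sliced. It is plain Python, not litdata's API: `take_fraction` and `fake_stream` are hypothetical names introduced for illustration, and `fake_stream` only stands in for a streaming dataset.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def take_fraction(dataset: Iterable[T], total: int, fraction: float) -> Iterator[T]:
    """Return an iterator over roughly the first `fraction` of `total` items."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # An iterable-only dataset cannot be sliced with dataset[:n],
    # so we stop the iteration early instead.
    n_items = max(1, int(total * fraction))
    return islice(iter(dataset), n_items)


def fake_stream(n: int) -> Iterator[int]:
    """Hypothetical stand-in for a streaming dataset: iterable, not sliceable."""
    yield from range(n)


# Keep about 10% of a 1000-item stream.
subset = list(take_fraction(fake_stream(1000), total=1000, fraction=0.1))
```

Note that this only keeps the leading items in iteration order. It is not a random sample, and it would not let each worker or node skip data it does not need, which is why built-in support would be preferable.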

Could we have a feature to support this? Thanks.

cc: @tchaton

abysmalocean commented 2 weeks ago

I am also interested in this feature. Any ideas?

tchaton commented 2 weeks ago

Hey @abysmalocean, the feature is being worked on right now: https://github.com/Lightning-AI/litdata/pull/161. Feel free to review it and share your ideas ;)