Lightning-AI / litdata

Streamline data pipelines for AI. Process datasets across 1000s of machines, and optimize data for blazing fast model training.
Apache License 2.0

Add feature to slice, subsample and split dataset #161

Closed deependujha closed 2 weeks ago

deependujha commented 3 weeks ago
Before submitting

- [x] Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
- [x] Did you read the [contributor guideline](https://github.com/Lightning-AI/lit-data/blob/main/.github/CONTRIBUTING.md), Pull Request section?
- [x] Did you make sure to update the docs?
- [x] Did you write any new necessary tests?

What does this PR do?

Fixes #135 & fixes #145.

  1. Adds support for slicing a StreamingDataset. (Screenshot from 2024-06-05 13-20-06)

  2. Adds support for subsampling a StreamingDataset. (Screenshot from 2024-06-08 01-29-54)

  3. Adds support for train_test_split on a StreamingDataset. (Screenshot from 2024-06-08 01-29-18)
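Since the screenshots are not reproduced here, the following is only a conceptual sketch of what subsampling and splitting mean at the index level. It works on plain Python index lists, not on a StreamingDataset; the helper names `subsample_indices` and `split_indices` and their parameters are hypothetical and are not part of the litdata API or of this PR's implementation.

```python
import random


def subsample_indices(n_items, fraction, seed=42):
    """Return a sorted random subset holding int(n_items * fraction) indices."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # private generator (hypothetical design choice)
    return sorted(rng.sample(range(n_items), int(n_items * fraction)))


def split_indices(n_items, splits, seed=42):
    """Shuffle all indices once, then cut them into consecutive parts.

    Each part holds int(n_items * fraction) indices; parts never overlap.
    """
    if any(s <= 0 for s in splits) or sum(splits) > 1 + 1e-9:
        raise ValueError("splits must be positive and sum to at most 1")
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    parts, start = [], 0
    for fraction in splits:
        size = int(n_items * fraction)
        parts.append(indices[start:start + size])
        start += size
    return parts


# Slicing is shown here as ordinary Python slicing over item indices.
first_ten = list(range(100))[:10]

half = subsample_indices(100, 0.5)
train, val, test = split_indices(100, [0.5, 0.25, 0.25])
```

In this sketch the three parts are disjoint because they are cut from a single shuffled list, which is the property a train/test split needs regardless of how the real implementation distributes chunks.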

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 94.09091% with 13 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@b51b597). Learn more about missing BASE report.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main    #161   +/-   ##
=====================================
  Coverage        ?     78%
=====================================
  Files           ?      33
  Lines           ?    4324
  Branches        ?       0
=====================================
  Hits            ?    3363
  Misses          ?     961
  Partials        ?       0
```

deependujha commented 2 weeks ago

If `subsample` is 1 (the default, or passed explicitly), the expensive tabulation step (already optimized in its own way) is skipped. Sampling also now uses a local seeded random generator, so the user's global random seed is left unchanged.
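The point about not touching the user's seed can be illustrated with the standard library alone. This is a minimal sketch that assumes Python's `random` module; the PR may use a different generator (for example NumPy's), and the function name `local_sample` is hypothetical.

```python
import random


def local_sample(population_size, k, seed):
    """Draw k distinct indices using a private, seeded generator."""
    rng = random.Random(seed)  # independent of the module-level generator
    return rng.sample(range(population_size), k)


# The user's own seeding of the global generator...
random.seed(123)
state_before = random.getstate()

picked = local_sample(1000, 10, seed=42)

# ...is unaffected by the sampling above.
state_unchanged = random.getstate() == state_before
```

Calling `random.seed(...)` inside the library instead would silently reset the sequence of random numbers in the user's own code, which is what the local generator avoids.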

The only remaining issue is the wrong chunk size with dim.

Also, the `test_s3_streaming_dataset` test passes in Lightning Studio but fails in CI.

cc: @tchaton