Add feature to slice, subsample and split dataset

deependujha commented 3 weeks ago

Before submitting

- [x] Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
- [x] Did you read the [contributor guideline](https://github.com/Lightning-AI/lit-data/blob/main/.github/CONTRIBUTING.md), Pull Request section?
- [x] Did you make sure to update the docs?
- [x] Did you write any new necessary tests?

What does this PR do?

Fixes #135 & fixes #145.

- Adds support for slicing a StreamingDataset
- Adds support for subsampling a StreamingDataset
- Adds support for train_test_split on a StreamingDataset
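
To make the intended semantics of slicing and splitting concrete, here is a minimal sketch over a plain list of item indices. It does not use litdata at all: the helper name `split_items` and its `fractions` parameter are hypothetical and are not taken from this PR's implementation.

```python
from typing import List, Sequence


def split_items(items: Sequence[int], fractions: Sequence[float]) -> List[List[int]]:
    """Partition items into consecutive, non-overlapping parts sized by fractions.

    Hypothetical helper for illustration only; not the litdata API.
    """
    if any(f <= 0 for f in fractions) or sum(fractions) > 1.0 + 1e-9:
        raise ValueError("fractions must be positive and sum to at most 1")
    parts: List[List[int]] = []
    start = 0
    for f in fractions:
        count = int(len(items) * f)  # round down so parts never overlap
        parts.append(list(items[start:start + count]))
        start += count
    return parts


items = list(range(10))           # stand-in for dataset item indices
first_three = items[:3]           # slicing: a contiguous sub-range
train, test = split_items(items, [0.8, 0.2])
```

Rounding each part down means a few trailing items can be left unassigned when the fractions do not divide the dataset evenly; a real implementation may choose to handle that remainder differently.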

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 94.09091% with 13 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@b51b597). Learn more about missing BASE report.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main    #161   +/-   ##
=====================================
  Coverage        ?     78%
=====================================
  Files           ?      33
  Lines           ?    4324
  Branches        ?       0
=====================================
  Hits            ?    3363
  Misses          ?     961
  Partials        ?       0
```

deependujha commented 2 weeks ago

If subsample is 1 (whether by default or passed explicitly), the expensive tabulation (already optimized in its own way) is skipped. Sampling also uses a local random seed sampler, so the user's own seed is left unchanged.
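
The two points above (skipping the work when subsample is 1, and sampling with a local generator) can be sketched roughly as follows. This is an illustration using only the standard library: the function name `subsample_items` and its `seed` parameter are assumptions, not the code in this PR.

```python
import random
from typing import List, Sequence


def subsample_items(items: Sequence[int], fraction: float = 1.0, seed: int = 42) -> List[int]:
    """Return a random subset of items without touching the global RNG state.

    Hypothetical helper for illustration only; not the litdata API.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if fraction == 1.0:
        # Fast path: nothing to sample, so skip the expensive work entirely.
        return list(items)
    rng = random.Random(seed)  # local generator; the global seed is untouched
    count = int(len(items) * fraction)
    return rng.sample(list(items), count)


random.seed(0)
state_before = random.getstate()
half = subsample_items(range(100), fraction=0.5)
state_after = random.getstate()   # same as state_before
```

Using a dedicated `random.Random` instance keeps the result reproducible for a given seed while leaving any seed the user set globally intact.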

The only remaining issue is: wrong chunk size with dim

Also, the test_s3_streaming_dataset test passes in Lightning Studio but fails in CI.

cc: @tchaton
