Lightning-AI / litdata

Streamline data pipelines for AI. Process datasets across 1000s of machines, and optimize data for blazing fast model training.
Apache License 2.0

add support for slicing, subsampling and splitting StreamingDataset #156

Closed deependujha closed 2 weeks ago

deependujha commented 3 weeks ago
Before submitting

- [ ] Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
- [ ] Did you read the [contributor guideline](https://github.com/Lightning-AI/lit-data/blob/main/.github/CONTRIBUTING.md), Pull Request section?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

What does this PR do?

Fixes #135 & fixes #145.

  1. Adds support for slicing a `StreamingDataset` *(screenshot)*

  2. Adds support for subsampling a `StreamingDataset` *(screenshot)*

  3. Adds support for `train_test_split` on a `StreamingDataset` *(screenshot)*
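For readers skimming the thread, the three features can be modeled on a toy in-memory dataset. This is a hypothetical sketch of the intended semantics only, not the litdata implementation; `ToyDataset`, its `subsample` method, and this `train_test_split` helper are illustrative stand-ins:

```python
class ToyDataset:
    """Hypothetical stand-in for StreamingDataset, illustrating the three features."""

    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, idx):
        # 1. Slicing: ds[2:5] returns the selected items.
        if isinstance(idx, slice):
            return self.items[idx]
        return self.items[idx]

    def subsample(self, fraction):
        # 2. Subsampling: keep only `fraction` of the items.
        keep = int(len(self.items) * fraction)
        return ToyDataset(self.items[:keep])


def train_test_split(dataset, splits):
    # 3. Splitting: partition the dataset according to a list of fractions.
    out, start = [], 0
    n = len(dataset.items)
    for frac in splits:
        end = start + int(n * frac)
        out.append(ToyDataset(dataset.items[start:end]))
        start = end
    return out
```

For example, with `ds = ToyDataset(range(10))`, `ds[2:5]` yields `[2, 3, 4]`, `ds.subsample(0.5)` keeps the first five items, and `train_test_split(ds, [0.8, 0.2])` produces an 8-item and a 2-item split.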

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 20.00000% with 4 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@935eef5). Learn more about missing BASE report.

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##             main     #156   +/-   ##
=======================================
  Coverage        ?      77%
=======================================
  Files           ?       30
  Lines           ?     4129
  Branches        ?        0
=======================================
  Hits            ?     3185
  Misses          ?      944
  Partials        ?        0
```
tchaton commented 3 weeks ago

Hey @deependujha, how is it going?

deependujha commented 3 weeks ago

I modified the `item_loader` intervals: instead of returning a list of `[start_chunk_idx, end_chunk_idx]`, it now returns `[start_chunk_idx, my_chunk_start, my_chunk_end, end_chunk_idx]`.

Refer to this line: streaming/item_loader#192.

`my_chunk_start` and `my_chunk_end` denote the range of indices within the current chunk that this streaming dataset is allowed to read.

If we are to read the whole chunk, the interval is simply `[start_chunk_idx, start_chunk_idx, end_chunk_idx, end_chunk_idx]`.
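The four-field interval described above can be sketched in a few lines of plain Python. The field names follow the comment; the helper functions themselves are hypothetical, for illustration only:

```python
def full_chunk_interval(start_chunk_idx, end_chunk_idx):
    # Reading the whole chunk: the readable region spans the entire chunk,
    # so the inner pair duplicates the outer pair.
    return [start_chunk_idx, start_chunk_idx, end_chunk_idx, end_chunk_idx]


def readable_indices(interval):
    # Only indices in [my_chunk_start, my_chunk_end) may be read
    # by this streaming dataset.
    _, my_chunk_start, my_chunk_end, _ = interval
    return range(my_chunk_start, my_chunk_end)
```

For example, `readable_indices([0, 2, 5, 10])` yields indices 2, 3, 4 even though the chunk itself spans 0 to 10.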

The logic for subsampling: *(screenshot)*

The logic for `train_test_split`: *(screenshot)*


Each chunk is used as the unit for subsampling and `train_test_split`.
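Applying the per-chunk idea to the four-field intervals, the two operations might look roughly like this. This is a sketch under the assumption that both operations shrink each chunk's readable region proportionally; the function names and exact rounding are hypothetical:

```python
def subsample_intervals(intervals, fraction):
    # Shrink each chunk's readable region to `fraction` of its items,
    # so every chunk contributes proportionally to the subsample.
    out = []
    for start, _, _, end in intervals:
        keep = int((end - start) * fraction)
        out.append([start, start, start + keep, end])
    return out


def train_test_split_intervals(intervals, train_fraction):
    # Split each chunk's readable range into a train part and a test part,
    # using the chunk (not the whole dataset) as the unit of splitting.
    train, test = [], []
    for start, _, _, end in intervals:
        cut = start + int((end - start) * train_fraction)
        train.append([start, start, cut, end])
        test.append([start, cut, end, end])
    return train, test
```

With two full chunks `[[0, 0, 10, 10], [10, 10, 20, 20]]`, a 0.5 subsample keeps the first half of each chunk, and a 0.8 split cuts each chunk at 80% of its length.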

tchaton commented 3 weeks ago

I recommend always starting by writing tests when developing new features. This helps ensure the changes you are making are correct.

tchaton commented 2 weeks ago

Closing in favor of #161