Lightning-AI / litdata

Streamline data pipelines for AI. Process datasets across 1000s of machines, and optimize data for blazing fast model training.
Apache License 2.0

add support for slicing, subsampling and splitting StreamingDataset #156

Closed deependujha closed 2 weeks ago

deependujha commented 3 weeks ago
Before submitting

- [ ] Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
- [ ] Did you read the [contributor guideline](https://github.com/Lightning-AI/lit-data/blob/main/.github/CONTRIBUTING.md), Pull Request section?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

What does this PR do?

Fixes #135 & fixes #145.

  1. Adds support for slicing a `StreamingDataset` *(screenshot)*

  2. Adds support for subsampling a `StreamingDataset` *(screenshot)*

  3. Adds support for `train_test_split` on a `StreamingDataset` *(screenshot)*
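For readers skimming the thread, the three features can be modeled on a toy in-memory dataset. This is a hypothetical sketch of the intended semantics only, not the litdata implementation; `ToyDataset`, its `subsample` method, and this `train_test_split` helper are illustrative stand-ins:

```python
class ToyDataset:
    """Hypothetical stand-in for StreamingDataset, illustrating the three features."""

    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, idx):
        # 1. Slicing: ds[2:5] returns the selected items.
        if isinstance(idx, slice):
            return self.items[idx]
        return self.items[idx]

    def subsample(self, fraction):
        # 2. Subsampling: keep only `fraction` of the items.
        keep = int(len(self.items) * fraction)
        return ToyDataset(self.items[:keep])


def train_test_split(dataset, splits):
    # 3. Splitting: partition the dataset according to a list of fractions.
    out, start = [], 0
    n = len(dataset.items)
    for frac in splits:
        end = start + int(n * frac)
        out.append(ToyDataset(dataset.items[start:end]))
        start = end
    return out
```

For example, with `ds = ToyDataset(range(10))`, `ds[2:5]` yields `[2, 3, 4]`, `ds.subsample(0.5)` keeps the first five items, and `train_test_split(ds, [0.8, 0.2])` produces an 8-item and a 2-item split.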

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 20.00000% with 4 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@935eef5). Learn more about missing BASE report.

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##             main     #156   +/-   ##
=======================================
  Coverage        ?      77%
=======================================
  Files           ?       30
  Lines           ?     4129
  Branches        ?        0
=======================================
  Hits            ?     3185
  Misses          ?      944
  Partials        ?        0
```
tchaton commented 3 weeks ago

Hey @deependujha, how is it going?

deependujha commented 3 weeks ago

I modified the `item_loader` intervals: instead of returning a list of `[start_chunk_idx, end_chunk_idx]`, it now returns `[start_chunk_idx, my_chunk_start, my_chunk_end, end_chunk_idx]`.

Refer to this line: streaming/item_loader#192.

`my_chunk_start` and `my_chunk_end` denote the range of indices within the current chunk that this streaming dataset is allowed to read.

If we are to read the whole chunk, the interval is simply `[start_chunk_idx, start_chunk_idx, end_chunk_idx, end_chunk_idx]`.
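The four-field interval described above can be sketched in a few lines of plain Python. The field names follow the comment; the helper functions themselves are hypothetical, for illustration only:

```python
def full_chunk_interval(start_chunk_idx, end_chunk_idx):
    # Reading the whole chunk: the readable region spans the entire chunk,
    # so the inner pair duplicates the outer pair.
    return [start_chunk_idx, start_chunk_idx, end_chunk_idx, end_chunk_idx]


def readable_indices(interval):
    # Only indices in [my_chunk_start, my_chunk_end) may be read
    # by this streaming dataset.
    _, my_chunk_start, my_chunk_end, _ = interval
    return range(my_chunk_start, my_chunk_end)
```

For example, `readable_indices([0, 2, 5, 10])` yields indices 2, 3, 4 even though the chunk itself spans 0 to 10.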

The logic for subsampling: *(screenshot)*

The logic for `train_test_split`: *(screenshot)*


Each chunk is used as the unit for subsampling and `train_test_split`.
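Applying the per-chunk idea to the four-field intervals, the two operations might look roughly like this. This is a sketch under the assumption that both operations shrink each chunk's readable region proportionally; the function names and exact rounding are hypothetical:

```python
def subsample_intervals(intervals, fraction):
    # Shrink each chunk's readable region to `fraction` of its items,
    # so every chunk contributes proportionally to the subsample.
    out = []
    for start, _, _, end in intervals:
        keep = int((end - start) * fraction)
        out.append([start, start, start + keep, end])
    return out


def train_test_split_intervals(intervals, train_fraction):
    # Split each chunk's readable range into a train part and a test part,
    # using the chunk (not the whole dataset) as the unit of splitting.
    train, test = [], []
    for start, _, _, end in intervals:
        cut = start + int((end - start) * train_fraction)
        train.append([start, start, cut, end])
        test.append([start, cut, end, end])
    return train, test
```

With two full chunks `[[0, 0, 10, 10], [10, 10, 20, 20]]`, a 0.5 subsample keeps the first half of each chunk, and a 0.8 split cuts each chunk at 80% of its length.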

tchaton commented 3 weeks ago

I recommend always starting by writing tests when developing new features. This helps ensure the changes you are making are correct.

tchaton commented 2 weeks ago

Closing in favor of #161