Feat: add support for custom cache dir in Streaming Dataset

Lightning-AI / litdata

Transform datasets at scale. Optimize datasets for fast AI model training.

Apache License 2.0


Before submitting

- [x] Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
- [x] Did you read the [contributor guideline](https://github.com/Lightning-AI/lit-data/blob/main/.github/CONTRIBUTING.md), Pull Request section?
- [ ] Did you make sure to update the docs?
- [x] Did you write any new necessary tests?

What does this PR do?

Fixes #398.

Adds support for a custom cache directory in `StreamingDataset` through a `cache_dir` argument.

Usage

import litdata as ld

dataset = ld.StreamingDataset(input_dir='s3://my-bucket/fast_data', cache_dir='local_cache_dir', shuffle=True, drop_last=True)
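The usage above only shows the call site. As a rough illustration of what a `cache_dir` option usually implies, here is a minimal sketch using only the standard library. It is not litdata's implementation: `resolve_cache_dir` is a hypothetical helper, and the hashed fallback location under the system temp directory is an assumption made for the example.

```python
import hashlib
import os
import tempfile
from typing import Optional


def resolve_cache_dir(input_dir: str, cache_dir: Optional[str] = None) -> str:
    """Hypothetical helper: choose the local folder used to cache streamed data.

    If the caller passes `cache_dir`, it is used as-is. Otherwise an assumed
    default is derived from a hash of `input_dir`, so that two different
    remote datasets do not share one cache folder.
    """
    if cache_dir is None:
        # Assumed fallback layout: <system temp dir>/chunks/<sha256 of input_dir>
        digest = hashlib.sha256(input_dir.encode("utf-8")).hexdigest()
        cache_dir = os.path.join(tempfile.gettempdir(), "chunks", digest)

    # Create the folder if it does not exist yet and return an absolute path.
    os.makedirs(cache_dir, exist_ok=True)
    return os.path.abspath(cache_dir)
```

With this sketch, `resolve_cache_dir('s3://my-bucket/fast_data', 'local_cache_dir')` returns the absolute path of `local_cache_dir`, while omitting the second argument falls back to the hashed default. An explicit directory is useful when the default location sits on a small or slow disk.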

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 78%. Comparing base (62907b3) to head (20be9e1). The report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@         Coverage Diff         @@
##           main   #399   +/-   ##
===================================
  Coverage    78%    78%
===================================
  Files        34     34
  Lines      5042   5045    +3
===================================
+ Hits       3948   3954    +6
+ Misses     1094   1091    -3
```
