Lightning-AI / litdata

Transform datasets at scale. Optimize datasets for fast AI model training.
Apache License 2.0

Fix infinite sleep when loading local compressed dataset. #127

Closed wzf03 closed 4 months ago

wzf03 commented 4 months ago
Before submitting

- [ ] Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
- [ ] Did you read the [contributor guideline](https://github.com/Lightning-AI/lit-data/blob/main/.github/CONTRIBUTING.md), Pull Request section?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

What does this PR do?

When using StreamingDataset with a zstd-compressed dataset stored in a local directory, the StreamingDataset sleeps forever after reading several samples. The bug is caused by a delete request that is never sent to the PrepareChunksThread.
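The failure mode can be sketched with stdlib primitives. This is a minimal, hypothetical model of the reader/PrepareChunksThread interaction, not litdata's actual code; all names below are invented for illustration:

```python
# Minimal, hypothetical model of the hang: a background thread (standing in
# for PrepareChunksThread) frees cached chunks only when it receives a delete
# request. If the reader never sends that request, the wait-for-space loop
# below would spin forever; a timeout is used here so the sketch terminates.
import queue
import threading
import time

class ChunkEvictor(threading.Thread):
    """Discards chunks from the local cache on request."""
    def __init__(self, live_chunks):
        super().__init__(daemon=True)
        self.requests = queue.Queue()
        self.live_chunks = live_chunks  # chunk indexes currently on disk

    def run(self):
        while True:
            index = self.requests.get()
            if index is None:  # shutdown sentinel
                return
            self.live_chunks.discard(index)

def read_chunks(n_chunks, max_live, send_delete, timeout=0.5):
    """Returns how many chunks were read before stalling (or all of them)."""
    live = set()
    evictor = ChunkEvictor(live)
    evictor.start()
    read = 0
    try:
        for i in range(n_chunks):
            if len(live) >= max_live:
                if send_delete:
                    evictor.requests.put(min(live))  # evict the oldest chunk
                deadline = time.monotonic() + timeout
                while len(live) >= max_live:  # wait for cache space
                    if time.monotonic() > deadline:
                        return read  # in the real bug, this wait never ended
                    time.sleep(0.001)
            live.add(i)
            read += 1
        return read
    finally:
        evictor.requests.put(None)

print(read_chunks(10, max_live=3, send_delete=True))   # 10 -- all chunks read
print(read_chunks(10, max_live=3, send_delete=False))  # 3 -- reader stalls once the cache fills
```

With the delete request sent, the reader makes progress through all chunks; without it, the reader blocks as soon as the local cache is full, which matches the infinite sleep described above.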

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

tchaton commented 4 months ago

Hey @wzf03, thanks for the contribution. Would you mind adding a test if possible?

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

:exclamation: No coverage uploaded for pull request base (main@7fe1c26). Click here to learn what that means.

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##             main     #127   +/-   ##
=======================================
  Coverage        ?      77%
=======================================
  Files           ?       30
  Lines           ?     3976
  Branches        ?        0
=======================================
  Hits            ?     3044
  Misses          ?      932
  Partials        ?        0
```
wzf03 commented 4 months ago

> Hey @wzf03, thanks for the contribution. Would you mind adding a test if possible?

Sure!

wzf03 commented 4 months ago

The reason the test cases with zstd compression run so long is the large sleep time in https://github.com/Lightning-AI/litdata/blob/7fe1c2674d14f190fabf089fe51c9b6e91dacef8/src/litdata/streaming/item_loader.py#L113 compared to the small chunk_size used in the tests. Since the sleep is long enough for the compressed file to be decompressed, the whole test takes approximately len(dataset) / chunk_size * sleep_time. Don't worry about those tests timing out.
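As a back-of-envelope check of that estimate (all numbers below are made up for illustration, not litdata's actual test values):

```python
# Rough estimate of the zstd test runtime: one decompression sleep is paid
# per chunk, so total time scales with the number of chunks.
# dataset_len, chunk_size and sleep_time here are hypothetical values.
def estimated_test_time(dataset_len, chunk_size, sleep_time):
    return dataset_len / chunk_size * sleep_time

print(estimated_test_time(dataset_len=100, chunk_size=10, sleep_time=0.1))  # 1.0 (seconds)
```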

wzf03 commented 4 months ago

Oops, the CI has failed. I need to check further...

wzf03 commented 4 months ago

@tchaton Thanks for your help! I found that the newly added zstd test cases, and some older tests that require zstd such as processing/test_data_processor.py::test_load_torch_audio, are skipped because the zstd pip package is missing from the testing environment. This was also mentioned in https://github.com/Lightning-AI/litdata/issues/97#issuecomment-2049085490 . Should this package be added to the requirements?

tchaton commented 4 months ago

> @tchaton Thanks for your help! I found that the newly added zstd test cases, and some older tests that require zstd such as processing/test_data_processor.py::test_load_torch_audio, are skipped because the zstd pip package is missing from the testing environment. This was also mentioned in #97 (comment). Should this package be added to the requirements?

Hey @wzf03, let's add zstd to the test requirements, so it stays optional but is still tested. Here: https://github.com/Lightning-AI/litdata/blob/main/requirements/test.txt
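For reference, the change could be as small as appending one line to that file (assuming the PyPI distribution is named `zstd`, as the skipped tests suggest):

```text
# requirements/test.txt -- enables the zstd compression tests
zstd
```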

tchaton commented 4 months ago

BTW @wzf03, what's your experience with litdata so far? What's your use case?

wzf03 commented 4 months ago

> BTW @wzf03, what's your experience with litdata so far? What's your use case?

litdata is really light and fast for training with large datasets. Thanks for your great work!

I used to use mosaicml-streaming in my project. However, my project is built on pytorch and lightning, and lightning is not supported first-class by mosaicml-streaming. At the same time, mosaicml-streaming uses fixed cache paths in /tmp/streaming and /dev/shm, which causes inconvenience on multi-user servers because of tmp-file conflicts. So I decided to try litdata. Though litdata is not as feature-rich as mosaicml-streaming, it provides a really smooth experience with lightning.

However, there is still a problem blocking us from using litdata further. The S3Client seems to be designed for AWS, but we are using an S3-compatible OSS. I found that even though I set S3_ENDPOINT_URL and configured the auth information in ~/.aws/config, I still cannot use the OSS with litdata as I did with mosaicml-streaming. I will open an issue for this later.

wzf03 commented 4 months ago

> The S3Client seems to be designed for AWS, but we are using an S3-compatible OSS. I found that even though I set S3_ENDPOINT_URL and configured the auth information in ~/.aws/config, I still cannot use the OSS with litdata as I did with mosaicml-streaming.

Oh, sorry, I found it actually works. As described in the AWS documentation, I only needed to set the endpoint URL in .aws/config, and boto3 can properly reach my OSS! Maybe this could be added to the documentation.

tchaton commented 4 months ago

> The S3Client seems to be designed for AWS, but we are using an S3-compatible OSS. I found that even though I set S3_ENDPOINT_URL and configured the auth information in ~/.aws/config, I still cannot use the OSS with litdata as I did with mosaicml-streaming.
>
> Oh, sorry, I found it actually works. As described in the AWS documentation, I only needed to set the endpoint URL in .aws/config, and boto3 can properly reach my OSS! Maybe this could be added to the documentation.

Do you mind adding a note in the README for this? This is a great find. I am going to start writing the docs next week. More people are asking for them and I haven't had time yet ;)

tchaton commented 4 months ago

> BTW @wzf03, what's your experience with litdata so far? What's your use case?
>
> litdata is really light and fast for training with large datasets. Thanks for your great work!
>
> I used to use mosaicml-streaming in my project. However, my project is built on pytorch and lightning, and lightning is not supported first-class by mosaicml-streaming. At the same time, mosaicml-streaming uses fixed cache paths in /tmp/streaming and /dev/shm, which causes inconvenience on multi-user servers because of tmp-file conflicts. So I decided to try litdata. Though litdata is not as feature-rich as mosaicml-streaming, it provides a really smooth experience with lightning.
>
> However, there is still a problem blocking us from using litdata further. The S3Client seems to be designed for AWS, but we are using an S3-compatible OSS. I found that even though I set S3_ENDPOINT_URL and configured the auth information in ~/.aws/config, I still cannot use the OSS with litdata as I did with mosaicml-streaming. I will open an issue for this later.

I am happy you are seeing value in litdata ;) Feel free to keep contributing if you have the bandwidth. Also, I am curious: how does litdata's speed compare to mosaicml-streaming for your use case?

wzf03 commented 4 months ago

> The S3Client seems to be designed for AWS, but we are using an S3-compatible OSS. I found that even though I set S3_ENDPOINT_URL and configured the auth information in ~/.aws/config, I still cannot use the OSS with litdata as I did with mosaicml-streaming.
>
> Oh, sorry, I found it actually works. As described in the AWS documentation, I only needed to set the endpoint URL in .aws/config, and boto3 can properly reach my OSS! Maybe this could be added to the documentation.
>
> Do you mind adding a note in the README for this? This is a great find. I am going to start writing the docs next week. More people are asking for them and I haven't had time yet ;)

I found a useful link describing .aws/config: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-a-configuration-file . However, the endpoint_url setting mentioned in the former link is not covered there. So I suppose you could put both URLs in the README to guide users in writing their own configuration.
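For the README note, a minimal example config might look like this (the endpoint URL and credential values below are placeholders, to be adapted to one's own provider):

```ini
# ~/.aws/config -- endpoint_url points boto3 (and thus litdata's S3Client)
# at an S3-compatible service; the URL below is a placeholder.
[default]
region = us-east-1
endpoint_url = https://oss.example.com

# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```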

wzf03 commented 4 months ago

> BTW @wzf03, what's your experience with litdata so far? What's your use case?
>
> litdata is really light and fast for training with large datasets. Thanks for your great work! I used to use mosaicml-streaming in my project. However, my project is built on pytorch and lightning, and lightning is not supported first-class by mosaicml-streaming. At the same time, mosaicml-streaming uses fixed cache paths in /tmp/streaming and /dev/shm, which causes inconvenience on multi-user servers because of tmp-file conflicts. So I decided to try litdata. Though litdata is not as feature-rich as mosaicml-streaming, it provides a really smooth experience with lightning. However, there is still a problem blocking us from using litdata further. The S3Client seems to be designed for AWS, but we are using an S3-compatible OSS. I found that even though I set S3_ENDPOINT_URL and configured the auth information in ~/.aws/config, I still cannot use the OSS with litdata as I did with mosaicml-streaming. I will open an issue for this later.
>
> I am happy you are seeing value in litdata ;) Feel free to keep contributing if you have the bandwidth. Also, I am curious: how does litdata's speed compare to mosaicml-streaming for your use case?

Both are fast enough for reading large datasets. When optimizing a dataset, mosaicml-streaming is much faster in a single-process setting. However, litdata can easily utilize multiple cores, so when launching multiple workers the speed is more than adequate.

tchaton commented 4 months ago

Hey @wzf03. Thanks. To be fair, litdata optimize really shines when optimizing data from S3 to S3, as it pre-downloads the data in the background while optimizing it and uploads the chunks back to S3, also in the background.

When I benchmarked it against streaming, it was 8x faster in general.