Closed wzf03 closed 4 months ago
Hey @wzf03, thanks for the contribution. Would you mind adding a test if possible?
All modified and coverable lines are covered by tests :white_check_mark:
:exclamation: No coverage uploaded for pull request base (main@7fe1c26). Click here to learn what that means.
> Hey @wzf03, thanks for the contribution. Would you mind adding a test if possible?
Sure!
The reason for the long running time of the test cases with zstd compression is the long sleep time in https://github.com/Lightning-AI/litdata/blob/7fe1c2674d14f190fabf089fe51c9b6e91dacef8/src/litdata/streaming/item_loader.py#L113 compared to the small chunk_size in the test cases. Since the sleep time is long enough for the compressed file to be decompressed, the total test process takes approximately len(dataset) / chunk_size * sleep_time. Don't worry about those tests timing out.
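To make the estimate concrete, here is a quick back-of-the-envelope sketch; dataset_len, chunk_size, and sleep_time are made-up illustrative numbers, not the actual values used in the litdata test suite:

```python
# Rough estimate of the zstd test duration described above.
# All numbers are hypothetical stand-ins for the real test values.
dataset_len = 100   # assumed number of samples in the test dataset
chunk_size = 10     # assumed (small) chunk_size used by the tests
sleep_time = 0.1    # assumed sleep interval while waiting for decompression

# Roughly one full sleep interval is spent per chunk, so:
num_chunks = dataset_len // chunk_size
approx_total_seconds = num_chunks * sleep_time
print(approx_total_seconds)  # ~1 second for these assumed values
```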
Oops, the CI has failed. I need to check further...
@tchaton Thanks for your help! I found that the newly added zstd test cases, as well as some old test cases that require zstd such as processing/test_data_processor.py::test_load_torch_audio, are skipped because the zstd pip package is missing from the testing environment. This was also mentioned in https://github.com/Lightning-AI/litdata/issues/97#issuecomment-2049085490. Should this package be added to the requirements?
Hey @wzf03, let's add zstd to the test requirements, so it stays optional but is still tested. Here: https://github.com/Lightning-AI/litdata/blob/main/requirements/test.txt
BTW @wzf03, what's your experience with litdata so far? What's your use case?
litdata is really light and fast for training with large datasets. Thanks for your great work!
I used to use mosaicml-streaming in my project. However, my project is built on PyTorch and Lightning, and Lightning is not supported first-class by mosaicml-streaming. At the same time, mosaicml-streaming uses fixed cache paths in /tmp/streaming and /dev/shm, which caused inconvenience on multi-user servers because of temp-file conflicts. So I tried switching to litdata. Though litdata is not as feature-rich as mosaicml-streaming, it provides a really smooth experience with Lightning.
However, there is still a problem blocking us from using litdata further. The S3Client seems to be designed for AWS, but we are using an S3-compatible OSS. I found that even though I set S3_ENDPOINT_URL and configure the auth information in ~/.aws/config, I still cannot use the OSS with litdata as I used to with mosaicml-streaming. I will open an issue for this later.
Oh, sorry. I found it actually works. As described in the AWS documentation, I only need to set the endpoint URL in .aws/config, and boto3 can read my OSS properly! Maybe this could be added to the documentation.
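For reference, something along these lines worked for me; the endpoint URL and region below are placeholder values for illustration, and the top-level endpoint_url key needs a reasonably recent boto3/botocore:

```ini
# ~/.aws/config — sketch with placeholder values
[default]
region = us-east-1
# Point all service requests at the S3-compatible OSS endpoint
# (the URL here is illustrative, not a real endpoint).
endpoint_url = https://oss.example.com
```

The credentials themselves still go in ~/.aws/credentials as usual.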
Do you mind adding a note in the README for this? This is a great find. I am going to start writing the docs next week. More people are asking for them and I haven't had time yet ;)
I am happy you are seeing value in litdata ;) Feel free to keep contributing if you have bandwidth. Also, I am curious: how does litdata's speed compare to mosaicml-streaming for your use case?
I found a useful link describing .aws/config: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-a-configuration-file. But the endpoint_url option mentioned earlier is not presented there. So I suppose you can put these two URLs in the README to guide users in making their own configuration.
Both are fast enough for reading large datasets. When optimizing a dataset, mosaicml-streaming is much faster in a single-process setting. However, litdata can easily utilize multiple cores, so if you launch multiple workers, the speed is good enough.
Hey @wzf03. Thanks. To be fair, litdata optimize really shines when optimizing data from S3 to S3, since it pre-downloads the data in the background while optimizing and uploads the chunks back to S3 in the background.
When I benchmarked it against streaming, it was 8x faster in general.
Before submitting
- [ ] Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
- [ ] Did you read the [contributor guideline](https://github.com/Lightning-AI/lit-data/blob/main/.github/CONTRIBUTING.md), Pull Request section?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

What does this PR do?
When using StreamingDataset with a zstd-compressed dataset stored in a local directory, the StreamingDataset will sleep forever after reading several samples. This bug is caused by an unsent delete request to the PrepareChunksThread.

PR review
Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃