Lightning-AI / litdata

Transform datasets at scale. Optimize datasets for fast AI model training.
Apache License 2.0

Existing cache files lead to permanent DataLoader hang #398

Closed lilavocado closed 1 month ago

lilavocado commented 1 month ago

๐Ÿ› Bug

I'm currently using litdata completely locally: I first convert the dataset with 'optimize' and then use StreamingDataset to stream the records from a local directory to train my model. I want to train multiple models (on the same dataset) in parallel, but the cache files created by previous runs end up blocking the StreamingDataset of later runs (probably due to locking?). It took me quite a while to figure out that the freeze was caused by the cache files.
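For reference, a minimal sketch of the local-only workflow described above. The directory names and the toy `record` function are placeholders, not taken from the original report:

```python
# Sketch of the optimize -> StreamingDataset workflow (placeholder names).
import litdata as ld

def record(index):
    # Toy sample generator: one small dict per input index.
    return {"index": index, "value": index ** 2}

if __name__ == "__main__":
    # 1) Convert the raw inputs into optimized chunks (done once, locally).
    ld.optimize(
        fn=record,
        inputs=list(range(1000)),
        output_dir="my_optimized_dataset",
        chunk_bytes="64MB",
        num_workers=4,
    )

    # 2) Stream the optimized records from the local directory during training.
    dataset = ld.StreamingDataset("my_optimized_dataset")
    dataloader = ld.StreamingDataLoader(dataset, batch_size=32, num_workers=2)
    for batch in dataloader:
        pass  # training step goes here
```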

My workaround for now is to create a new caching directory for each run, following the documentation, using the 'Dir' class from resolver.py. This was a bit confusing at first because 'Dir' takes the arguments 'url' and 'path', which makes it seem like it only works when the data is in the cloud (url). It would have made more sense if the arguments were 'path' (either a URL or a local directory) and 'cache_dir' (the directory to store the cache). A rough sketch of the workaround is shown below.
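A hedged sketch of that per-run cache workaround, assuming `Dir(path=..., url=...)` is used the way the report describes (cache location in `path`, dataset location in `url`); the exact field semantics may differ between litdata versions:

```python
# Sketch only: give each run its own cache directory via the Dir container.
import tempfile
from litdata import StreamingDataset
from litdata.streaming.resolver import Dir

# Fresh cache directory per run so concurrent runs do not share cache/lock files.
run_cache_dir = tempfile.mkdtemp(prefix="litdata_cache_")

dataset = StreamingDataset(
    Dir(path=run_cache_dir, url="my_optimized_dataset"),  # local dataset dir passed as "url"
)
```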

The question I had was: why does it have to cache data when all the data is already available locally?

It would be great if StreamingDataset directly took an argument like cache_dir.

Thanks.

github-actions[bot] commented 1 month ago

Hi! Thanks for your contribution, great first issue!

tchaton commented 1 month ago

Hey @lilavocado. This library is designed to work with cloud data first. I haven't faced this issue before.

Could you provide a simple reproducible script?

emileclastres commented 1 month ago

Hi, I have experienced a similar problem when using a StreamingDataLoader on two concurrent studios with multiple workers on the Lightning platform. After a while, the multiprocessed dataloader would just stop iterating batches without throwing an error, and GPU utilization would drop to 0 on one or a few GPUs. It looks to me like the /cache/ folder is shared across the multiple runs and is the root of the issue (maybe hash collisions?).

Simply exposing a cache_dir argument to StreamingDataset could be a convenient fix.

tchaton commented 1 month ago

Hey @emileclastres. Would you like to make a PR to expose it?

bhimrazy commented 1 month ago

Hi @lilavocado, @emileclastres,

We've added support for passing the cache_dir parameter to StreamingDataset. Feel free to try it out and share your feedback!

You can use it directly from the main branch, and this feature will be officially included in the upcoming release.
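A minimal usage sketch of the new cache_dir parameter, assuming the directory names below as placeholders, so each concurrent run gets its own cache location:

```python
# Sketch: point each run's StreamingDataset at its own cache directory.
import tempfile
from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset(
    "my_optimized_dataset",                               # local optimized dataset
    cache_dir=tempfile.mkdtemp(prefix="litdata_cache_"),  # per-run cache directory
)
dataloader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)
```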

Thank you! 😊