Lightning-AI / litdata

Transform datasets at scale. Optimize datasets for fast AI model training.
Apache License 2.0

StreamingDataset intermittently fails due to lack of index.json #337

Open plra opened 2 months ago

plra commented 2 months ago

🐛 Bug

My training job intermittently fails with:

File ".../litdata/streaming/dataset.py", line 89, in __init__
    self.subsampled_files, self.region_of_interest = subsample_streaming_dataset(
File ".../litdata/utilities/dataset_utilities.py", line 60, in subsample_streaming_dataset
    raise ValueError(
ValueError: The provided dataset `/root/.lightning/chunks/<hash>/<hash>` doesn't contain any index.json file. HINT: Did you successfully optimize a dataset to the provided `input_dir`?

I see this occasionally when training in standard single-node DDP setups, but now that I've started using dual-node DDP I'm seeing it much more often: at least one node fails in roughly 80% of runs, which seems much higher than I would expect if the node-level failures were independent. My usual solution has been to just re-run the job, but this is now very impractical.
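As a stopgap, the "just re-run the job" workaround could be narrowed to retrying only the dataset construction. The following is a minimal sketch under assumptions, not a fix for the underlying cause: `retry_on_missing_index`, `attempts`, `delay_s`, and `sleep` are hypothetical names, and it is not known whether a retry within the same process actually clears the failure. It relies only on the fact, visible in the traceback, that the error is a `ValueError` whose message mentions `index.json`.

```python
import time


def retry_on_missing_index(factory, attempts=3, delay_s=5.0, sleep=time.sleep):
    """Call `factory()` and retry only when it raises the index.json ValueError.

    `factory` is any zero-argument callable that builds the dataset.
    All names here are hypothetical; this is not part of litdata.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return factory()
        except ValueError as err:
            # Re-raise unrelated ValueErrors immediately.
            if "index.json" not in str(err):
                raise
            last_error = err
            # Wait before the next attempt, but not after the final one.
            if attempt < attempts:
                sleep(delay_s)
    # Every attempt hit the missing-index error; surface the last one.
    raise last_error
```

In use, the dataset constructor would be wrapped in a lambda, for example `retry_on_missing_index(lambda: StreamingDataset(input_dir=...))`. If the failure persists across retries, that would suggest the cause is not a short-lived race.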

Code sample

My setup is very standard AFAIK. I have a LightningDataModule with StreamingDataLoaders on top of StreamingDatasets pointing to a collection of files in an S3 bucket.
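For readers unfamiliar with this layout, a setup of the kind described might look roughly like the sketch below. This is an illustrative assumption, not the reporter's actual code: `build_datamodule`, `StreamingDataModule`, the batch size, and the worker count are hypothetical, and `input_dir` stands in for an `s3://` URI pointing at an optimized dataset.

```python
def build_datamodule(input_dir, batch_size=32, num_workers=4):
    """Return a LightningDataModule that streams from `input_dir`.

    Hypothetical helper for illustration; `input_dir` would be an s3:// URI.
    """
    # Imports are deferred so this file loads without the third-party
    # packages; both lightning and litdata are required to call this.
    import lightning as L
    from litdata import StreamingDataLoader, StreamingDataset

    class StreamingDataModule(L.LightningDataModule):
        def setup(self, stage=None):
            # Per the traceback above, the missing index.json error is
            # raised from StreamingDataset.__init__, i.e. at this call.
            self.train_dataset = StreamingDataset(input_dir=input_dir, shuffle=True)

        def train_dataloader(self):
            return StreamingDataLoader(
                self.train_dataset,
                batch_size=batch_size,
                num_workers=num_workers,
            )

    return StreamingDataModule()
```

Under DDP, every rank runs `setup`, so each process on each node constructs its own `StreamingDataset` against the local chunk cache.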

Environment

github-actions[bot] commented 2 months ago

Hi! Thanks for your contribution! Great first issue!

Borda commented 2 months ago

> My setup is very standard AFAIK. I have a LightningDataModule with StreamingDataLoaders on top of StreamingDatasets pointing to a collection of files in an S3 bucket.

Thank you for sharing this with us. Could we kindly ask you to share a full reproducible example?