StreamingDataset intermittently fails due to lack of index.json

plra commented 2 months ago

🐛 Bug

My training job intermittently fails with

File ".../litdata/streaming/dataset.py", line 89, in __init__
    self.subsampled_files, self.region_of_interest = subsample_streaming_dataset(
File ".../litdata/utilities/dataset_utilities.py", line 60, in subsample_streaming_dataset
    raise ValueError(
ValueError: The provided dataset `/root/.lightning/chunks/<hash>/<hash>` doesn't contain any index.json file. HINT: Did you successfully optimize a dataset to the provided `input_dir`?

I see this occasionally when attempting to train in standard single-node DDP setups, but now that I've started using dual-node DDP I'm seeing it much more often (at least one node will fail ~80% of the time -- I want to say this is much higher than would be the case if the node-level failures were independent). My usual solution has been to just re-run the job, but this is now very impractical.
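
For illustration only, here is a minimal sketch of automating that re-run at the dataset-construction level instead of restarting the whole job. It assumes the failure is transient, which is not confirmed, and it does not address the root cause. The helper `build_with_retry` and its parameters are hypothetical names, not part of litdata. It retries only when the error message mentions `index.json`, as in the traceback above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def build_with_retry(
    factory: Callable[[], T],
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `factory` again while it raises the missing-index ValueError.

    Hypothetical helper: `factory` is any zero-argument callable that
    builds the dataset. `sleep` is injectable so the wait can be skipped
    in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return factory()
        except ValueError as err:
            # Retry only the failure seen in the traceback; any other
            # ValueError, or the final failed attempt, is re-raised.
            if "index.json" not in str(err) or attempt == attempts:
                raise
            sleep(delay_s * attempt)  # linear backoff between attempts
    raise AssertionError("unreachable")
```

Usage would look roughly like `dataset = build_with_retry(lambda: StreamingDataset(input_dir=...))`, with the real `input_dir` filled in. In a multi-node DDP job, a rank that retries while the others proceed may fall out of step with them, so this is a stopgap for gathering evidence rather than a fix.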

Code sample

My setup is very standard AFAIK. I have a LightningDataModule with StreamingDataLoaders on top of StreamingDatasets pointing to a collection of files in an S3 bucket.

Environment

PyTorch Version (e.g., 1.0): 2.4.0
OS (e.g., Linux): Linux
How you installed PyTorch (conda, pip, source): pip
Build command you used (if compiling from source):
Python version: 3.10
CUDA/cuDNN version: 12.1.0
GPU models and configuration: 8x H100 x 2 nodes (+ various other configurations)

Borda commented 2 months ago

> My setup is very standard AFAIK. I have a LightningDataModule with StreamingDataLoaders on top of StreamingDatasets pointing to a collection of files in an S3 bucket.

Thank you for sharing this with us. Could we kindly ask you to share a full reproducible example?
