Open plra opened 2 months ago
Hi! Thanks for your contribution! Great first issue!
Thank you for sharing this with us. Could we kindly ask you to share a full reproducible example?
🐛 Bug
My training job intermittently fails with
I see this occasionally when attempting to train in standard single-node DDP setups, but now that I've started using dual-node DDP I'm seeing it much more often (at least one node will fail ~80% of the time -- I want to say this is much higher than would be the case if the node-level failures were independent). My usual solution has been to just re-run the job, but this is now very impractical.
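To make the independence comparison concrete, here is a minimal sketch of the arithmetic. If each node failed independently with probability p, a job on n nodes would lose at least one node with probability 1 - (1 - p)^n. The function names and the 10% single-node rate are illustrative assumptions, not measurements from this setup; only the ~80% two-node figure comes from the report above.

```python
def p_any_node_fails(p_single: float, num_nodes: int = 2) -> float:
    """Probability that at least one node fails, assuming independent failures."""
    return 1.0 - (1.0 - p_single) ** num_nodes


def implied_single_node_rate(p_any: float, num_nodes: int = 2) -> float:
    """Per-node failure rate that independence would require to observe p_any."""
    return 1.0 - (1.0 - p_any) ** (1.0 / num_nodes)


if __name__ == "__main__":
    # Hypothetical single-node failure rate (an assumption for illustration).
    hypothetical_single = 0.10
    print(f"two-node failure rate if independent: "
          f"{p_any_node_fails(hypothetical_single):.2f}")

    # Observed (approximate) two-node failure rate from the report.
    observed_two_node = 0.80
    print(f"single-node rate implied by independence: "
          f"{implied_single_node_rate(observed_two_node):.2f}")
```

Under independence, an ~80% two-node failure rate would require each node to fail roughly 55% of the time on its own, so a noticeably lower single-node failure rate would point to correlated failures across nodes.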
Code sample
My setup is very standard AFAIK. I have a LightningDataModule with StreamingDataLoaders on top of StreamingDatasets pointing to a collection of files in an S3 bucket.
Environment
conda, pip, source): pip