When using `litdata.StreamingDataset` to read data directly from AWS S3 with Distributed Data Parallel (DDP), the following warning is emitted:
```
lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:122 Your `IterableDataset` has `__len__` defined. In combination with multi-process data loading (when num_workers > 1), `__len__` could be inaccurate if each worker is not configured independently to avoid having duplicate data.
```
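For context, this appears to be a generic check: Lightning warns for any `IterableDataset` that defines `__len__`, because with multi-process loading every worker re-iterates the full stream unless it shards itself (litdata's `StreamingDataset` is designed to shard per worker, which is why the warning seems spurious here). A torch-free sketch of the underlying concern, using purely hypothetical names (`FakeIterableDataset`, `simulate_workers`):

```python
# Torch-free sketch of why Lightning warns: with multi-process loading,
# each worker iterates the *whole* iterable dataset unless it shards
# itself, so the effective sample count no longer matches __len__.
# All names here are illustrative, not part of litdata or Lightning.

class FakeIterableDataset:
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # Naive implementation: ignores which worker is iterating.
        return iter(range(self.n))

    def __len__(self):
        return self.n


def simulate_workers(dataset, num_workers):
    """Collect what num_workers unsharded workers would jointly yield."""
    samples = []
    for _ in range(num_workers):
        samples.extend(dataset)  # every worker replays the full dataset
    return samples


ds = FakeIterableDataset(8)
seen = simulate_workers(ds, num_workers=4)
# len(ds) reports 8, but 4 unsharded workers together yield 32 samples,
# so __len__ is "inaccurate" unless each worker shards its slice.
```

Since `StreamingDataset` does configure each worker to read a distinct shard, the condition the warning guards against should not actually occur in this setup.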
To Reproduce
Steps to reproduce the behavior:

1. Create a `litdata.StreamingDataset`
2. Create a DataLoader using `litdata.StreamingDataLoader` or `torch.utils.data.DataLoader`
3. Set `batch_size > 1`
4. Train using DDP
Code sample
Datamodule

```python
import lightning.pytorch as pl
from litdata import StreamingDataset, StreamingDataLoader


def collate_fn(samples):
    # some data modifications
    return samples


class MyDataModule(pl.LightningDataModule):
    def __init__(self, data_path, **kwargs):
        super().__init__()
        self.data_path = data_path

    def setup(self, stage):
        if "s3://" in self.data_path:
            self.dataset = StreamingDataset(self.data_path, shuffle=True)

    def train_dataloader(self):
        return StreamingDataLoader(
            self.dataset,
            batch_size=16,
            shuffle=True,
            num_workers=4,
            collate_fn=collate_fn,
            drop_last=True,
        )
```
Expected behavior
No warning message should be displayed during training.
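Until the check is relaxed upstream, one possible stop-gap (a sketch using only the standard-library `warnings` module, not a litdata or Lightning API) is to filter this specific message before building the trainer:

```python
import warnings


def silence_len_warning():
    # Stop-gap: suppress this specific, apparently spurious warning by
    # message pattern. The regex matches the text Lightning emits;
    # adjust it if the upstream wording changes.
    warnings.filterwarnings(
        "ignore",
        message=r".*Your `IterableDataset` has `__len__` defined.*",
    )


# Demonstration that only the matching warning is swallowed:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    silence_len_warning()  # prepends an "ignore" filter on top
    warnings.warn("Your `IterableDataset` has `__len__` defined. ...")
    warnings.warn("some unrelated warning")

messages = [str(w.message) for w in caught]
# Only the unrelated warning is recorded; the __len__ one is filtered.
```

In a real training script, `silence_len_warning()` would be called once before instantiating the `Trainer`; this hides the symptom but does not change loader behavior.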
Environment
- How you installed PyTorch (conda, pip, source): pip

Additional context