Lightning-AI / litdata

Streamline data pipelines for AI. Process datasets across 1000s of machines, and optimize data for blazing fast model training.
Apache License 2.0

Warning Message When Using StreamingDataset with DDP #172

Open taemincho opened 2 weeks ago

taemincho commented 2 weeks ago

šŸ› Bug

When using `StreamingDataset` to read data directly from AWS S3 with Distributed Data Parallel (DDP), the following warning is displayed:

lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:122 Your `IterableDataset` has `__len__` defined. In combination with multi-process data loading (when num_workers > 1), `__len__` could be inaccurate if each worker is not configured independently to avoid having duplicate data.
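For context, the situation this warning describes can be illustrated without PyTorch. The sketch below is only a simulation: `CountingStream`, `iter_worker`, and `total_yielded` are hypothetical names and are not part of litdata or Lightning. It shows that when every worker iterates the full stream, the number of items produced is a multiple of `__len__`, whereas giving each worker its own slice keeps the total equal to `__len__`.

```python
from itertools import islice


class CountingStream:
    """Hypothetical stand-in for an iterable dataset that also defines __len__."""

    def __init__(self, n_items):
        self.n_items = n_items

    def __len__(self):
        return self.n_items

    def iter_worker(self, worker_id, num_workers, shard):
        """Return the items one simulated worker would yield."""
        items = range(self.n_items)
        if shard:
            # Each worker takes every num_workers-th item, starting at its own id.
            return islice(items, worker_id, None, num_workers)
        # No per-worker configuration: every worker replays the whole stream.
        return iter(items)


def total_yielded(stream, num_workers, shard):
    """Count the items produced across all simulated workers."""
    return sum(
        len(list(stream.iter_worker(worker_id, num_workers, shard)))
        for worker_id in range(num_workers)
    )


stream = CountingStream(100)
unsharded_total = total_yielded(stream, num_workers=4, shard=False)
sharded_total = total_yielded(stream, num_workers=4, shard=True)

print(len(stream), unsharded_total, sharded_total)
```

In the unsharded case the four simulated workers together produce four times as many items as `len(stream)` reports, which is the kind of mismatch the warning text refers to.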

To Reproduce

Steps to reproduce the behavior:

  1. Create a litdata.StreamingDataset
  2. Create a data loader using litdata.StreamingDataLoader or torch.utils.data.DataLoader
  3. Set batch_size > 1
  4. Train using DDP

Code sample

Datamodule

import lightning.pytorch as pl
from litdata import StreamingDataset, StreamingDataLoader

def collate_fn(samples):
    # some data modifications
    return samples

class MyDataModule(pl.LightningDataModule):
    def __init__(self, data_path, **kwargs):
        super().__init__()
        self.data_path = data_path

    def setup(self, stage):
        if "s3://" in self.data_path:
            self.dataset = StreamingDataset(self.data_path, shuffle=True)

    def train_dataloader(self):
        return StreamingDataLoader(
            self.dataset,
            batch_size=16,
            shuffle=True,
            num_workers=4,
            collate_fn=collate_fn,
            drop_last=True,
        )

Training

datamodule = MyDataModule("s3://my_bucket")

trainer = pl.Trainer(
    logger=False,
    max_epochs=100000,
    precision="16-mixed",
)

trainer.fit(model, datamodule=datamodule, ckpt_path="last")

Expected behavior

No warning message should be displayed during training.

github-actions[bot] commented 2 weeks ago

Hi! Thanks for your contribution! Great first issue!

tchaton commented 2 weeks ago

Yes, this is normal. All good, @taemincho.
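If the message is only noise in the training logs, one possible way to hide it is a message-based filter from Python's standard `warnings` module. This is a minimal sketch, not something suggested in the thread: it assumes the message is emitted through the standard `warnings` machinery, and `ignore_iterable_len_warning` is a hypothetical helper name. The demonstration emits a warning with the same opening text rather than running a real training job.

```python
import warnings

# Opening text of the warning quoted in the report. `filterwarnings` treats
# this as a regular expression matched against the start of the message.
LEN_WARNING_PATTERN = r"Your `IterableDataset` has `__len__` defined"


def ignore_iterable_len_warning():
    """Hypothetical helper: install a filter that hides only this message."""
    warnings.filterwarnings("ignore", message=LEN_WARNING_PATTERN)


# Demonstration with locally emitted warnings (assumption: the real message
# goes through `warnings.warn` in the same way).
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ignore_iterable_len_warning()
    warnings.warn(
        "Your `IterableDataset` has `__len__` defined. In combination with "
        "multi-process data loading (when num_workers > 1), `__len__` could "
        "be inaccurate.",
        UserWarning,
    )
    warnings.warn("some unrelated warning", UserWarning)

caught_messages = [str(w.message) for w in caught]
print(caught_messages)
```

In a training script, the helper would be called once before `trainer.fit(...)`. Matching on the message text rather than on a warning category keeps other warnings visible.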