Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Validation loader is seeded the same during epoch in distributed mode #17843

Open · bzwartsenberg opened this issue 1 year ago

bzwartsenberg commented 1 year ago

Bug description

The validation dataloader is seeded identically for every validation run within an epoch in distributed mode. This may be fine when the whole validation set is consumed in each validation run, but it leads to unexpected results when using the limit_val_batches option.

This does not seem to be the expected result when setting limit_val_batches; it should give a different subset of batches every time the validation loop is executed. The behavior with devices=1 confirms this: there, the dataloader provides a new set of batches on every validation loop, even when the validation loop runs multiple times per epoch.

The issue seems to arise from PyTorch's distributed sampler (torch.utils.data.distributed.DistributedSampler, see https://pytorch.org/docs/stable/_modules/torch/utils/data/distributed.html#DistributedSampler), which shuffles based on a fixed seed plus the epoch number. As a result, the exact same shuffling is used as long as the epoch number has not been updated. When validation is limited to a small subset of the total validation set, exactly the same batches are therefore used for every evaluation over the course of one epoch, then a new fixed set in the second epoch, and so on.
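
The sampler behavior can be seen in isolation with a few lines of plain PyTorch (a minimal sketch, independent of Lightning; num_replicas and rank are passed explicitly so no process group is needed):

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(100))
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=42)

sampler.set_epoch(0)
first = list(sampler)[:8]
second = list(sampler)[:8]   # same epoch -> identical shuffle
print(first == second)       # True

sampler.set_epoch(1)
third = list(sampler)[:8]    # new epoch -> new shuffle
print(first == third)        # False (with overwhelming probability)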

Below is a reproducing example that prints the batch indices at every validation step. It is easy to see that the output is identical throughout the first epoch. Setting the number of devices to 1 gives a random set of batches at every validation step (the expected behavior).

Summary: the combination of devices > 1, limit_val_batches < (len(dataset) // batch_size), and shuffle=True gives the same set of validation batches at every validation step.

What version are you seeing the problem on?

v2.0

How to reproduce the bug

from torch.utils.data import DataLoader
import torch
import torch.nn as nn
import torch.optim as optim
import pytorch_lightning as pl
import time

# Define the LightningModule
class MyModel(pl.LightningModule):
    def __init__(self):
        super(MyModel, self).__init__()
        self.model = nn.parameter.Parameter(torch.ones(1,), requires_grad=True)

    # Nothing important here
    def training_step(self, batch, batch_idx):
        loss = (self.model - batch).square().mean()
        self.log('train_loss', loss)
        return loss

    def train_dataloader(self):
        dataset = torch.range(0, 100000).float() / 10000
        train_dataloader = DataLoader(
            dataset, batch_size=2)
        return train_dataloader

    def val_dataloader(self):
        val_dataset = torch.range(0, 10000).float()
        val_batch_size=8

        val_dataloader = DataLoader(
            val_dataset, batch_size=val_batch_size, drop_last=False, num_workers=0,
            pin_memory=False, collate_fn=collate_fn, shuffle=True
        )
        return val_dataloader

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-4)
        return optimizer

    def validation_step(self, batch, batch_idx):
        print(f'Reporting from local rank {self.trainer.local_rank}, index {batch_idx}, batch {batch}')

collate_fn = torch.stack

if __name__ == '__main__':
    devices= 2
    accelerator = 'cpu'
    # Create a PyTorch Lightning trainer
    model = MyModel()

    trainer = pl.Trainer(
        max_epochs=100,
        val_check_interval=0.2,
        limit_val_batches=2,
        devices=devices,
        accelerator=accelerator,
    )

    # Train the model using the Lightning trainer
    trainer.fit(model)

Error messages and logs

[ ... ] 
Epoch 0:  20%|███████████████████████▌                                                                                              | 5000/25001 [00:06<00:24, 811.98it/s, v_num=0]
Reporting from local rank 1, index 0, batch tensor([8240., 9125., 2912., 3170., 9948., 9427., 6728., 1879.])
Reporting from local rank 1, index 1, batch tensor([6557., 5502., 8580., 1543., 6386., 1750., 4385., 3793.])
Validation: 0it [00:00, ?it/s]                                                                                                                                                    
Reporting from local rank 0, index 0, batch tensor([ 354., 2137., 5853., 6143., 8100., 3042., 2407., 1548.])                                                  | 0/2 [00:00<?, ?it/s]

Reporting from local rank 0, index 1, batch tensor([7737., 3251., 6256., 6769., 7153.,  720., 3561., 5881.])                                        | 1/2 [00:00<00:00, 2004.93it/s]
Epoch 0:  40%|██████████████████████████████████████████████▊                                                                      | 10000/25001 [00:12<00:18, 806.40it/s, v_num=0]
Reporting from local rank 1, index 0, batch tensor([8240., 9125., 2912., 3170., 9948., 9427., 6728., 1879.])                                                                       
Reporting from local rank 1, index 1, batch tensor([6557., 5502., 8580., 1543., 6386., 1750., 4385., 3793.])

Reporting from local rank 0, index 0, batch tensor([ 354., 2137., 5853., 6143., 8100., 3042., 2407., 1548.])                                                  | 0/2 [00:00<?, ?it/s]

Reporting from local rank 0, index 1, batch tensor([7737., 3251., 6256., 6769., 7153.,  720., 3561., 5881.])                                        | 1/2 [00:00<00:00, 1972.86it/s]
Epoch 0:  60%|██████████████████████████████████████████████████████████████████████▏                                              | 15000/25001 [00:18<00:12, 799.65it/s, v_num=0]
Reporting from local rank 1, index 0, batch tensor([8240., 9125., 2912., 3170., 9948., 9427., 6728., 1879.])                                                                       
Reporting from local rank 1, index 1, batch tensor([6557., 5502., 8580., 1543., 6386., 1750., 4385., 3793.])

Reporting from local rank 0, index 0, batch tensor([ 354., 2137., 5853., 6143., 8100., 3042., 2407., 1548.])                                                  | 0/2 [00:00<?, ?it/s]

Reporting from local rank 0, index 1, batch tensor([7737., 3251., 6256., 6769., 7153.,  720., 3561., 5881.])                                        | 1/2 [00:00<00:00, 1897.88it/s]
[ ... ] 

Environment

Current environment

* CUDA:
  - GPU: NVIDIA GeForce RTX 3080 Laptop GPU
  - available: True
  - version: 11.7
* Lightning:
  - lightning-utilities: 0.8.0
  - pytorch-lightning: 2.0.3
  - torch: 2.0.1
  - torchaudio: 2.0.2
  - torchmetrics: 0.11.4
  - torchvision: 0.15.2
* Packages: aiohttp: 3.8.4, aiosignal: 1.3.1, appdirs: 1.4.4, async-timeout: 4.0.2, attrs: 23.1.0, certifi: 2023.5.7, charset-normalizer: 3.1.0, click: 8.1.3, cmake: 3.26.4, docker-pycreds: 0.4.0, filelock: 3.12.2, frozenlist: 1.3.3, fsspec: 2023.6.0, gitdb: 4.0.10, gitpython: 3.1.31, idna: 3.4, jinja2: 3.1.2, lightning-utilities: 0.8.0, lit: 16.0.6, markupsafe: 2.1.3, mpmath: 1.3.0, multidict: 6.0.4, networkx: 3.1, numpy: 1.24.3, nvidia-cublas-cu11: 11.10.3.66, nvidia-cuda-cupti-cu11: 11.7.101, nvidia-cuda-nvrtc-cu11: 11.7.99, nvidia-cuda-runtime-cu11: 11.7.99, nvidia-cudnn-cu11: 8.5.0.96, nvidia-cufft-cu11: 10.9.0.58, nvidia-curand-cu11: 10.2.10.91, nvidia-cusolver-cu11: 11.4.0.1, nvidia-cusparse-cu11: 11.7.4.91, nvidia-nccl-cu11: 2.14.3, nvidia-nvtx-cu11: 11.7.91, packaging: 23.1, pathtools: 0.1.2, pillow: 9.5.0, pip: 22.0.2, protobuf: 4.23.3, psutil: 5.9.5, pytorch-lightning: 2.0.3, pyyaml: 6.0, requests: 2.31.0, sentry-sdk: 1.25.1, setproctitle: 1.3.2, setuptools: 59.6.0, six: 1.16.0, smmap: 5.0.0, sympy: 1.12, torch: 2.0.1, torchaudio: 2.0.2, torchmetrics: 0.11.4, torchvision: 0.15.2, tqdm: 4.65.0, triton: 2.0.0, typing-extensions: 4.6.3, urllib3: 2.0.3, wandb: 0.15.4, wheel: 0.40.0, yarl: 1.9.2
* System:
  - OS: Linux
  - architecture: 64bit, ELF
  - processor: x86_64
  - python: 3.10.4
  - release: 5.17.15-76051715-generic
  - version: #202206141358~1655919116~22.04~1db9e34 SMP PREEMPT Wed Jun 22 19

More info

No response

cc @justusschock @awaelchli

justusschock commented 1 year ago

@bzwartsenberg thanks for opening the issue. Before changing the behavior, I'd like to have a short discussion on this.

The reason is that I somewhat disagree: this is exactly the behavior I'd expect for validation loaders. If you set limit_val_batches and get different batches/samples on every validation call, how would you track your model's progress? Scores obtained in different validation runs would not be comparable to each other, since they were computed on different samples. This would make things like early stopping, checkpointing, and LR scheduling based on validation scores impossible.
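
For comparison, a fixed, comparable validation subset can also be made explicit on the user side instead of relying on the sampler seed. A minimal sketch with illustrative sizes, using torch.utils.data.Subset and shuffle=False:

import torch
from torch.utils.data import DataLoader, Subset

val_dataset = torch.arange(10_000).float()
# Pick the subset once, with a fixed generator, so every validation run scores the same samples.
fixed_indices = torch.randperm(len(val_dataset), generator=torch.Generator().manual_seed(0))[:16]
fixed_val_loader = DataLoader(Subset(val_dataset, fixed_indices.tolist()), batch_size=8, shuffle=False)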

Thoughts @awaelchli @carmocca ?

bzwartsenberg commented 1 year ago

I can think of two reasons why I would expect the shuffle to change on every validation run, assuming the user has set shuffle=True in the dataloader:

  1. After every epoch, the random validation batches do change anyway.
  2. In non-distributed training, the random shuffle changes on every round of validation.

At the very least, 1 and 2 are inconsistent with each other. 1 seems particularly strange, since it keeps the same batches throughout the epoch but changes them at the end. Certainly 1 would be the worst scenario for early stopping, checkpointing, and LR scheduling.

Taking a new random sample from the validation set every validation round certainly has more variance, but it removes the bias associated with a fixed random sample. Variance can be reduced with a moving average or some other smoothing; bias cannot.

Consider this: training the same model twice with different random seeds could produce arbitrarily different validation results, depending on which validation batches happen to be chosen. Even worse, going from one epoch to the next could yield completely different results.
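
A toy simulation of that bias-vs-variance point (numbers are made up for illustration): averaging repeated evaluations over fresh random subsets approaches the full-set mean, while averaging over a single fixed subset keeps whatever offset that subset happens to have.

import torch

torch.manual_seed(0)
full_val = torch.randn(10_000) + 5.0           # hypothetical per-sample validation losses
true_mean = full_val.mean()

subset_size, n_evals = 16, 50

# Fixed subset: the same 16 samples are scored at every validation call.
fixed_idx = torch.randperm(len(full_val))[:subset_size]
fixed_avg = torch.stack([full_val[fixed_idx].mean() for _ in range(n_evals)]).mean()

# Fresh subset: a new random 16 samples for every validation call.
fresh_avg = torch.stack([
    full_val[torch.randperm(len(full_val))[:subset_size]].mean() for _ in range(n_evals)
]).mean()

print(f"true mean: {true_mean:.3f}, fixed-subset avg: {fixed_avg:.3f}, fresh-subset avg: {fresh_avg:.3f}")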

carmocca commented 1 year ago

If there's an inconsistency between distributed and non-distributed, we should address it. I agree with shuffling on every validation call during the epoch if that's what devices=1 does.

awaelchli commented 1 year ago

I agree with making it consistent. Since the user explicitly asked for shuffling, this should be OK IMO. Implementing it would, however, require a decision. The change can be made at this line: https://github.com/Lightning-AI/lightning/blob/bc85c5fd14005965faae4010db428820e422e2d3/src/lightning/pytorch/loops/evaluation_loop.py#L225

The proposal here would make that number increase with every validation loop call. Right now it is constant across an epoch (it represents the epoch index). However, afaik we don't explicitly track a counter of how many times the validation loop has been invoked; that would have to be added. Alternatively, we could set it to the training batch progress:

_set_sampler_epoch(dl, trainer.fit_loop.epoch_loop.batch_progress.current.processed)

But this isn't a very clean solution.
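
For reference, a standalone sketch of the counter idea (not Lightning internals; names here are hypothetical): feed a count of validation-loop invocations, rather than the epoch index, into the sampler's set_epoch() so that every limited validation run sees a fresh shuffle.

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10_000))
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
val_loader = DataLoader(dataset, batch_size=8, sampler=sampler)

val_run_count = 0  # a counter the evaluation loop would have to track

def run_limited_validation(loader, limit_val_batches=2):
    global val_run_count
    loader.sampler.set_epoch(val_run_count)  # counter instead of the epoch index
    val_run_count += 1
    for i, (batch,) in enumerate(loader):
        if i >= limit_val_batches:
            break
        print(i, batch.tolist())

run_limited_validation(val_loader)  # first call: one shuffle
run_limited_validation(val_loader)  # second call: a different shuffle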

carmocca commented 1 year ago

I think changing it is okay too. The number used was an implementation detail rather than an explicit part of the original feature request: https://github.com/Lightning-AI/lightning/issues/10342