bzwartsenberg opened this issue 1 year ago
@bzwartsenberg thanks for opening the issue. Before changing the behavior, I'd like to have a short discussion on this.
The reason is that I somewhat disagree: this is exactly the behavior I'd expect for validation loaders.
If you set `limit_val_batches` and every validation call yields different batches/samples, how would you track your model's progress? Scores from different validation runs wouldn't be comparable to each other, since they were computed on different samples. That would make early stopping, checkpointing, and LR scheduling based on validation scores impossible.
Thoughts @awaelchli @carmocca ?
I can think of two reasons why I would assume that is the expected behavior. Assuming here that the user has set `shuffle=True` in the dataloader.
At the very least, 1. and 2. are inconsistent. 1. seems particularly strange, since it keeps the same batches throughout the epoch but changes them at the end. Certainly 1. would be the worst scenario for early stopping, checkpointing, and LR scheduling.
Taking a new random sample from the validation set every validation round certainly has more variance, but at least it removes the bias associated with having a fixed random sample. Variance can be reduced with a moving average or some other form of smoothing; bias cannot.
Consider this: training the same model twice with a different random seed would produce arbitrarily different results, depending on which validation batch is chosen. Even worse: going from one epoch to the next could yield completely different results.
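For illustration, here is a minimal sketch of the smoothing argument above (not from the original discussion; the `ema` helper, the `alpha` value, and the numbers are made up for this example):

```python
# Minimal sketch: smoothing noisy validation scores with an exponential
# moving average. Values and the alpha parameter are illustrative only.
def ema(values, alpha=0.3):
    """Return the exponentially smoothed version of `values`."""
    smoothed, current = [], None
    for v in values:
        current = v if current is None else alpha * v + (1 - alpha) * current
        smoothed.append(current)
    return smoothed

# Scores from validation runs on different random subsets fluctuate, but the
# smoothed trend can still drive early stopping or checkpointing decisions.
noisy_val_losses = [0.92, 0.85, 0.90, 0.78, 0.81, 0.74, 0.76, 0.70]
print(ema(noisy_val_losses))
```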
If there's an inconsistency between distributed and non-distributed, we should address it. I agree with shuffling on every validation call during the epoch if that's what `devices=1` does.
I agree with making it consistent. Since the user explicitly asked for shuffling, this should be OK IMO. Implementing it would, however, require a decision. The change can be made in this line: https://github.com/Lightning-AI/lightning/blob/bc85c5fd14005965faae4010db428820e422e2d3/src/lightning/pytorch/loops/evaluation_loop.py#L225
The change proposed here would make that number increase on every validation loop call. Right now it is constant across an epoch (it represents the epoch index). However, AFAIK we don't explicitly track a counter for how many times the validation loop has been invoked, so one would have to be added. Alternatively, we could set it to the training batch progress:
_set_sampler_epoch(dl, trainer.fit_loop.epoch_loop.batch_progress.current.processed)
But this isn't a very clean solution.
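For reference, a standalone sketch of the mechanism under discussion (plain PyTorch, not Lightning code; `num_replicas` and `rank` are set manually only so the snippet runs without a process group): `DistributedSampler` reshuffles only when the value passed to `set_epoch()` changes, so an ever-increasing counter would reshuffle on every validation call, while the epoch index keeps the order fixed within an epoch.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(16))
# num_replicas/rank set explicitly so no process group is needed for the demo.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)

sampler.set_epoch(0)
first = list(sampler)
sampler.set_epoch(0)
second = list(sampler)  # same epoch value -> identical permutation
sampler.set_epoch(1)
third = list(sampler)   # new epoch value -> new permutation

print(first == second, first == third)  # True False
```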
I think changing it is okay too. The number used was an implementation detail rather than an explicit part of the original feature request: https://github.com/Lightning-AI/lightning/issues/10342
Bug description
In distributed mode, the validation dataloader is seeded identically for each validation run within an epoch. While this may be fine when the whole validation set is consumed in every validation run, it leads to unexpected results when using the `limit_val_batches` option. It seems to me that this is not the expected result when setting `limit_val_batches`; it should give a different subset of batches every time the validation loop is executed. The behavior for `devices=1` confirms this: in that case the dataloader provides a new set of batches on every validation loop, even when there are multiple validation loops per epoch.

The issue seems to arise from the distributed sampler in PyTorch (`torch.utils.data.distributed.DistributedSampler`, see https://pytorch.org/docs/stable/_modules/torch/utils/data/distributed.html#DistributedSampler), which shuffles based on a seed and the epoch number. The effect is that exactly the same shuffling is used as long as the epoch number has not been updated. Consequently, when limiting validation to a small subset of the total validation set, exactly the same batches are used for evaluation at every validation run over the course of one epoch, then a new fixed set in the second epoch, and so on.

Below is a reproducing example that prints the batch indices at every validation run. It is easy to see that the output is identical throughout the first epoch. Setting the number of devices to 1 gives a random set of batches at every validation run (the expected behavior).
Summary: the combination of `devices > 1`, `limit_val_batches < (len(dataset) // batch_size)`, and `shuffle=True` gives the same set of validation batches for each validation run within an epoch.

What version are you seeing the problem on?
v2.0
How to reproduce the bug
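The reporter's original snippet is not preserved here; below is a minimal sketch along the lines described in the bug description. The dataset, model, and Trainer settings (`IndexDataset`, `val_check_interval=0.25`, etc.) are assumptions chosen to make the behavior easy to observe, not the original code.

```python
import torch
from torch.utils.data import DataLoader, Dataset
import lightning.pytorch as pl


class IndexDataset(Dataset):
    """Returns its own index so that validation batches can be identified."""

    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        return torch.tensor([float(idx)])


class PrintBatchModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).mean()

    def validation_step(self, batch, batch_idx):
        # With devices>1, shuffle=True, and limit_val_batches set, these
        # indices repeat for every validation run within the same epoch.
        print(f"rank={self.global_rank} val batch indices: {batch.flatten().tolist()}")

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    model = PrintBatchModel()
    train_loader = DataLoader(IndexDataset(), batch_size=32)
    val_loader = DataLoader(IndexDataset(), batch_size=32, shuffle=True)
    trainer = pl.Trainer(
        accelerator="cpu",
        devices=2,                # distributed: a DistributedSampler is injected
        strategy="ddp",
        limit_val_batches=2,      # only a small random subset per validation run
        val_check_interval=0.25,  # validate several times per epoch
        max_epochs=2,
        logger=False,
        enable_checkpointing=False,
    )
    trainer.fit(model, train_loader, val_loader)
```

With `devices=2`, the printed indices repeat for every validation run within an epoch; with `devices=1` they change on every run.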
Error messages and logs
Environment
Current environment
* CUDA:
  - GPU:
    - NVIDIA GeForce RTX 3080 Laptop GPU
  - available: True
  - version: 11.7
* Lightning:
  - lightning-utilities: 0.8.0
  - pytorch-lightning: 2.0.3
  - torch: 2.0.1
  - torchaudio: 2.0.2
  - torchmetrics: 0.11.4
  - torchvision: 0.15.2
* Packages:
  - aiohttp: 3.8.4
  - aiosignal: 1.3.1
  - appdirs: 1.4.4
  - async-timeout: 4.0.2
  - attrs: 23.1.0
  - certifi: 2023.5.7
  - charset-normalizer: 3.1.0
  - click: 8.1.3
  - cmake: 3.26.4
  - docker-pycreds: 0.4.0
  - filelock: 3.12.2
  - frozenlist: 1.3.3
  - fsspec: 2023.6.0
  - gitdb: 4.0.10
  - gitpython: 3.1.31
  - idna: 3.4
  - jinja2: 3.1.2
  - lightning-utilities: 0.8.0
  - lit: 16.0.6
  - markupsafe: 2.1.3
  - mpmath: 1.3.0
  - multidict: 6.0.4
  - networkx: 3.1
  - numpy: 1.24.3
  - nvidia-cublas-cu11: 11.10.3.66
  - nvidia-cuda-cupti-cu11: 11.7.101
  - nvidia-cuda-nvrtc-cu11: 11.7.99
  - nvidia-cuda-runtime-cu11: 11.7.99
  - nvidia-cudnn-cu11: 8.5.0.96
  - nvidia-cufft-cu11: 10.9.0.58
  - nvidia-curand-cu11: 10.2.10.91
  - nvidia-cusolver-cu11: 11.4.0.1
  - nvidia-cusparse-cu11: 11.7.4.91
  - nvidia-nccl-cu11: 2.14.3
  - nvidia-nvtx-cu11: 11.7.91
  - packaging: 23.1
  - pathtools: 0.1.2
  - pillow: 9.5.0
  - pip: 22.0.2
  - protobuf: 4.23.3
  - psutil: 5.9.5
  - pytorch-lightning: 2.0.3
  - pyyaml: 6.0
  - requests: 2.31.0
  - sentry-sdk: 1.25.1
  - setproctitle: 1.3.2
  - setuptools: 59.6.0
  - six: 1.16.0
  - smmap: 5.0.0
  - sympy: 1.12
  - torch: 2.0.1
  - torchaudio: 2.0.2
  - torchmetrics: 0.11.4
  - torchvision: 0.15.2
  - tqdm: 4.65.0
  - triton: 2.0.0
  - typing-extensions: 4.6.3
  - urllib3: 2.0.3
  - wandb: 0.15.4
  - wheel: 0.40.0
  - yarl: 1.9.2
* System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.10.4
  - release: 5.17.15-76051715-generic
  - version: #202206141358~1655919116~22.04~1db9e34 SMP PREEMPT Wed Jun 22 19

More info
No response
cc @justusschock @awaelchli