Open · jakub-dyno opened this issue 3 months ago
@jakub-dyno If the number of batches in an epoch is not evenly divisible by the accumulation size, the optimizer steps anyway on the final, partial window. The partial accumulation can't be kept around in memory and continued in the next epoch.
The loss is divided by the configured accumulation window size regardless of whether the final window is full or partial: https://github.com/Lightning-AI/pytorch-lightning/blob/e330da5870fae34339170b942095a2600fa7a95e/src/lightning/pytorch/loops/optimization/automatic.py#L327
If you'd like to change this behavior, a PR would be welcome, but it could be a bit involved.
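Roughly, the behaviour described above looks like this in plain PyTorch (a sketch for illustration only, not Lightning's actual loop code; the dataset sizes and model are made up):

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulate_grad_batches = 4
# 10 batches per epoch: two full windows of 4, then a trailing partial window of 2.
batches = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(10)]

for i, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y)
    # The loss is scaled by the full window size even in the final partial window.
    (loss / accumulate_grad_batches).backward()
    is_last = i == len(batches) - 1
    if (i + 1) % accumulate_grad_batches == 0 or is_last:
        optimizer.step()       # steps after batches 4, 8, and 10 (partial window)
        optimizer.zero_grad()
```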
Bug description
At the end of an epoch with accumulate_grad_batches > 1, the dataloader may run out of data before the required number of accumulation batches has been reached. The Lightning docs do not say what happens. It could (1) skip the optimizer step and discard the partial gradients, (2) carry the partial accumulation over into the next epoch, or (3) step the optimizer on the partial accumulation.
My experiments suggest it's option 3, but I'm happy to be wrong.
What version are you seeing the problem on?
v2.0, v2.2
How to reproduce the bug
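A minimal sketch that should show the behaviour in question (the toy module, dataset size of 10, and logging hook are illustrative assumptions, not taken from the original report):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl


class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def on_before_optimizer_step(self, optimizer):
        # Print each time the optimizer actually steps, to observe the final partial window.
        print(f"optimizer step, global_step={self.trainer.global_step}")

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# 10 batches per epoch with accumulate_grad_batches=4 leaves a partial window of 2 at the end.
dataset = TensorDataset(torch.randn(10, 4), torch.randn(10, 1))
loader = DataLoader(dataset, batch_size=1)
trainer = pl.Trainer(
    max_epochs=1,
    accumulate_grad_batches=4,
    accelerator="cpu",
    logger=False,
    enable_checkpointing=False,
    enable_progress_bar=False,
)
trainer.fit(ToyModule(), loader)
```

With this setup, three optimizer steps are observed for the 10-batch epoch, the last one taken on only 2 accumulated batches.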
Error messages and logs
Environment
Current environment
```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```
More info
No response
cc @borda