Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
Apache License 2.0

PyTorch Lightning produces different loss when resuming from ckpt vs training without interruption #18098

Closed dinhanhx closed 1 year ago

dinhanhx commented 1 year ago

Bug description

As the title says, the loss values are different when resuming from ckpt vs training without interruption.

First, run the code below without interruption. Second, run the code again, wait until a certain step, and kill it. Finally, run the code again with the ckpt.

What version are you seeing the problem on?


How to reproduce the bug

from typing import Any, Union

import lightning.pytorch as pl
import torch
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.demos import Transformer, WikiText2
from lightning.pytorch.loggers import CSVLogger, TensorBoardLogger
from lightning.pytorch.utilities.types import STEP_OUTPUT
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers.optimization import get_cosine_schedule_with_warmup

class BoringTransformer(pl.LightningModule):
    def __init__(
        self, vocab_size: int, learning_rate: float = 5e-5, warmup_ratio: float = 0.1
    ) -> None:
        # Required before assigning submodules or attributes on a LightningModule.
        super().__init__()
        self.learning_rate = learning_rate
        self.warmup_ratio = warmup_ratio
        self.transformer = Transformer(vocab_size=vocab_size)

    def training_step(self, batch) -> STEP_OUTPUT:
        x, y = batch
        z = self.transformer(x, y)
        loss = torch.nn.functional.nll_loss(z, y.view(-1))
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self) -> Any:
        opt = AdamW(self.parameters(), self.learning_rate)
        opt_list = [opt]
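        lrs = {
            "scheduler": get_cosine_schedule_with_warmup(
                opt,
                # Number of warmup steps, as a fraction of the total step count.
                self.trainer.estimated_stepping_batches * self.warmup_ratio,
                # Total number of training steps.
                self.trainer.estimated_stepping_batches,
            ),
            "interval": "step",
            "frequency": 1,
        }
        lrs_list = [lrs]
        return opt_list, lrs_list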

dataset = WikiText2()
dataloader = DataLoader(dataset, batch_size=16)
transformer = BoringTransformer(dataset.vocab_size)
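trainer = pl.Trainer(
    # Other Trainer arguments from the original post are missing here.
    callbacks=[
        ModelCheckpoint(every_n_train_steps=16, save_last=True)
    ],
)

# Set to None for the uninterrupted run; set to a saved checkpoint to resume.
ckpt_path: Union[
    str, None
] = "boring_logs/lightning_logs/version_1/checkpoints/epoch=0-step=288.ckpt"
trainer.fit(transformer, dataloader, ckpt_path=ckpt_path)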

Error messages and logs



More info

I have provided a Kaggle notebook to reproduce this quickly.

I also searched for this behaviour and found a related post:

dinhanhx commented 1 year ago

With larger scale (bigger model, bigger dataset, bigger batch size), the differences are noticeable.

In the following pictures, the purple line is the resumed version of the blue line. And the green line is the training without interruption. As we can see, the blue line and the green line are the same until the interruption. After the interruption, the purple line is very different from the green line.

[Two images: loss curves for the runs described above]

awaelchli commented 1 year ago

@dinhanhx This is expected, because after resuming, the random state of the program (e.g. in torch) is different than it was when it stopped. What matters is that the training loss converges to the same result in the end, this seems to be the case in your experiments (the curves are not identical, but their average value is the same). Please let me know if I should explain it in more detail.

Restoring the random state to exactly the way it was when stopped is highly non-trivial. We investigated this in the past but found it too complex, while at the same time it is rarely needed in practice.
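To illustrate the point about random state, here is a minimal sketch that uses Python's standard `random` module as a stand-in for the generators a real training run depends on. The helper `draw_random_values`, the seeds, and the step counts are made up for illustration; this is not how Lightning saves or loads checkpoints.

```python
import random


def draw_random_values(n_steps: int) -> list[int]:
    """Stand-in for anything random in training (shuffling, dropout masks)."""
    return [random.randrange(1000) for _ in range(n_steps)]


# Uninterrupted run: 8 steps in one go.
random.seed(0)
uninterrupted = draw_random_values(8)

# Interrupted run: 4 steps, then capture the RNG state alongside the "checkpoint".
random.seed(0)
first_half = draw_random_values(4)
saved_rng_state = random.getstate()

# Resume WITHOUT restoring the RNG state. The arbitrary reseed models a
# fresh process whose generator is no longer where the old one stopped.
random.seed(12345)
resumed_naive = first_half + draw_random_values(4)

# Resume WITH the saved RNG state: the sequence continues where it left off.
random.setstate(saved_rng_state)
resumed_exact = first_half + draw_random_values(4)

print(resumed_exact == uninterrupted)  # True
print(resumed_naive == uninterrupted)
```

The second comparison fails unless the two random streams happen to coincide. In a real run the same idea would have to cover every generator involved (for example `torch.get_rng_state()` and `torch.set_rng_state()` for the CPU generator, plus CUDA and dataloader worker state), which is part of why exact restoration is non-trivial.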

aweinmann commented 1 year ago

#17543 seems related

dinhanhx commented 1 year ago

> What matters is that the training loss converges to the same result in the end, this seems to be the case in your experiments (the curves are not identical, but their average value is the same).

I have noticed the similar pattern with my other training experiments other than this issue. I guess the difference would not affect the final accuracy. Thanks @awaelchli for explaining.

awaelchli commented 1 year ago

> I have noticed the similar pattern with my other training experiments other than this issue

Just to clarify, I'm not sure what you are saying. Is it that in general you agree with me and that training experiments from the past have shown this behavior, but you are saying this particular experiment you showed in this issue is not following that? However, the loss curves you posted here follow the same trend and the only difference is the variation in the loss bumps.
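One way to check "same trend, different bumps" numerically is to smooth both logged loss series and compare them. Below is a minimal sketch on synthetic data: the curves are generated, not taken from the runs in this issue, and `moving_average`, `synthetic_loss`, and the window size are illustrative choices.

```python
import math
import random


def moving_average(values: list[float], window: int) -> list[float]:
    """Trailing moving average; early points use however many values exist."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out


def synthetic_loss(seed: int, n_steps: int = 2000) -> list[float]:
    """Same decaying trend for every seed, with seed-dependent noise on top."""
    rng = random.Random(seed)
    return [
        2.0 * math.exp(-i / 500) + 1.0 + rng.gauss(0.0, 0.1)
        for i in range(n_steps)
    ]


WINDOW = 200
run_a = synthetic_loss(seed=1)  # plays the role of the uninterrupted run
run_b = synthetic_loss(seed=2)  # plays the role of the resumed run

# Largest step-by-step gap between the raw curves.
raw_gap = max(abs(a - b) for a, b in zip(run_a, run_b))

# Largest gap between the smoothed curves, skipping the warm-up points
# where the average covers fewer than WINDOW values.
smooth_a = moving_average(run_a, WINDOW)[WINDOW - 1:]
smooth_b = moving_average(run_b, WINDOW)[WINDOW - 1:]
smooth_gap = max(abs(a - b) for a, b in zip(smooth_a, smooth_b))

print(f"raw gap: {raw_gap:.3f}, smoothed gap: {smooth_gap:.3f}")
```

If the smoothed gap is much smaller than the raw gap, the two runs differ mainly in step-to-step noise rather than in their overall trend.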

awaelchli commented 1 year ago

Hey again @dinhanhx I just want to make sure I have all the information here. So is the explanation consistent with all your experiments and the concerns resolved? Or is there still something unclear? Please let me know so we can either close the ticket or look into it if something is still not working.

dinhanhx commented 1 year ago

> So is the explanation consistent with all your experiments and the concerns resolved?

Yes. Everything is clear, I suppose.