Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Updating lightning from v1.7.7 to 1.8.0 significantly affects results #15682

Open cjsg opened 1 year ago

cjsg commented 1 year ago

Bug description

The training curves and the final performance of the models are significantly affected by the update from lightning v1.7.7 to v1.8.0 when training with the ddp strategy. Here are the curves that I obtained with the code below.

[Screenshot 2022-11-14 at 15:42:29 — accuracy and loss curves for the 8 runs described below]

The 4 curves with higher accuracy (lower loss) were obtained with 1.7.7, and the 4 curves with lower accuracy (higher loss) with 1.8.0; both sets were trained on a single node with 8 GPUs.

Each curve uses a different seed.

I do not know whether a similar issue occurs with single-GPU training or with other strategies (I didn't test those).

The code is essentially taken from the pytorch-cifar10-94%-baseline tutorial. (Note that with other code/experiments of mine, the training curves end up being even further apart than in this example.)
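For reference, one way to overlay curves like the ones above from the CSVLogger output (a minimal sketch only: the run paths are hypothetical placeholders, and it assumes the default metrics.csv layout written by CSVLogger, with one row per logging call and empty cells for metrics not logged in that row):

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical mapping from run label to its CSVLogger metrics file.
runs = {
    "1.7.7, seed 7": "run_177/logs/csv/version_3/metrics.csv",
    "1.8.0, seed 7": "run_180/logs/csv/version_3/metrics.csv",
}

for label, path in runs.items():
    df = pd.read_csv(path)
    val = df.dropna(subset=["val_acc"])  # keep only rows where val_acc was logged
    plt.plot(val["step"], val["val_acc"], label=label)

plt.xlabel("step")
plt.ylabel("val_acc")
plt.legend()
plt.show()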

How to reproduce the bug

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from pl_bolts.datamodules import CIFAR10DataModule
from pl_bolts.transforms.dataset_normalizations import cifar10_normalization
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import CSVLogger, TensorBoardLogger
from torch.optim.lr_scheduler import OneCycleLR
from torchmetrics.functional import accuracy

seed_everything(7)

PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
BATCH_SIZE = 256 if torch.cuda.is_available() else 64
NUM_WORKERS = int(os.cpu_count() / 2)

train_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.RandomCrop(32, padding=4),
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

test_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

cifar10_dm = CIFAR10DataModule(
    data_dir=PATH_DATASETS,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    train_transforms=train_transforms,
    test_transforms=test_transforms,
    val_transforms=test_transforms,
)

def create_model():
    model = torchvision.models.resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model

class LitResnet(LightningModule):
    def __init__(self, lr=0.05):
        super().__init__()

        self.save_hyperparameters()
        self.model = create_model()

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("train_loss", loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f"{stage}_loss", loss, prog_bar=True)
            self.log(f"{stage}_acc", acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, "val")

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, "test")

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.lr,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE
        scheduler_dict = {
            "scheduler": OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            "interval": "step",
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}

if __name__ == "__main__":
    model = LitResnet(lr=0.05)

    trainer = Trainer(
        max_epochs=100,
        accelerator="auto",
        gpus=list(range(torch.cuda.device_count())),
        strategy="ddp",
        logger=[
            TensorBoardLogger(save_dir="logs/", name="tensorboard"),
            CSVLogger(save_dir="logs/", name="csv", version=3),
        ],
        callbacks=[LearningRateMonitor(logging_interval="step")],
        enable_progress_bar=False,
    )

    trainer.fit(model, cifar10_dm)
    trainer.test(model, datamodule=cifar10_dm)

Environment

With lightning 1.7.7

* CUDA:
    - GPU:
        - Tesla T4
    - available:         True
    - version:           11.7
* Lightning:
    - lightning-bolts:   0.6.0.post1
    - lightning-utilities: 0.4.1
    - pytorch-lightning: 1.7.7
    - torch:             1.13.0
    - torchmetrics:      0.10.2
    - torchvision:       0.14.0
* Packages:
    - absl-py:           1.3.0
    - aiohttp:           3.8.3
    - aiosignal:         1.3.1
    - async-timeout:     4.0.2
    - attrs:             22.1.0
    - cachetools:        5.2.0
    - certifi:           2022.9.24
    - charset-normalizer: 2.1.1
    - frozenlist:        1.3.3
    - fsspec:            2022.11.0
    - google-auth:       2.14.1
    - google-auth-oauthlib: 0.4.6
    - grpcio:            1.50.0
    - idna:              3.4
    - importlib-metadata: 5.0.0
    - lightning-bolts:   0.6.0.post1
    - lightning-utilities: 0.4.1
    - markdown:          3.4.1
    - markupsafe:        2.1.1
    - multidict:         6.0.2
    - numpy:             1.23.4
    - nvidia-cublas-cu11: 11.10.3.66
    - nvidia-cuda-nvrtc-cu11: 11.7.99
    - nvidia-cuda-runtime-cu11: 11.7.99
    - nvidia-cudnn-cu11: 8.5.0.96
    - oauthlib:          3.2.2
    - packaging:         21.3
    - pillow:            9.3.0
    - pip:               22.3.1
    - protobuf:          3.20.3
    - pyasn1:            0.4.8
    - pyasn1-modules:    0.2.8
    - pydeprecate:       0.3.2
    - pyparsing:         3.0.9
    - pytorch-lightning: 1.7.7
    - pyyaml:            6.0
    - requests:          2.28.1
    - requests-oauthlib: 1.3.1
    - rsa:               4.9
    - setuptools:        65.3.0
    - six:               1.16.0
    - tensorboard:       2.11.0
    - tensorboard-data-server: 0.6.1
    - tensorboard-plugin-wit: 1.8.1
    - torch:             1.13.0
    - torchmetrics:      0.10.2
    - torchvision:       0.14.0
    - tqdm:              4.64.1
    - typing-extensions: 4.4.0
    - urllib3:           1.26.12
    - werkzeug:          2.2.2
    - wheel:             0.37.1
    - yarl:              1.8.1
    - zipp:              3.10.0
* System:
    - OS:                Linux
    - architecture:
        - 64bit
        - ELF
    - processor:         x86_64
    - python:            3.8.13
    - version:           #1 SMP Wed Jun 29 23:49:26 UTC 2022

With lightning 1.8.0:

* CUDA:
    - GPU:
        - Tesla T4
    - available:         True
    - version:           11.7
* Lightning:
    - lightning-bolts:   0.6.0.post1
    - lightning-lite:    1.8.0
    - lightning-utilities: 0.3.0
    - pytorch-lightning: 1.8.0
    - torch:             1.13.0
    - torchmetrics:      0.10.2
    - torchvision:       0.14.0
* Packages:
    - absl-py:           1.3.0
    - aiohttp:           3.8.3
    - aiosignal:         1.3.1
    - async-timeout:     4.0.2
    - attrs:             22.1.0
    - cachetools:        5.2.0
    - certifi:           2022.9.24
    - charset-normalizer: 2.1.1
    - fire:              0.4.0
    - frozenlist:        1.3.3
    - fsspec:            2022.11.0
    - google-auth:       2.14.1
    - google-auth-oauthlib: 0.4.6
    - grpcio:            1.50.0
    - idna:              3.4
    - importlib-metadata: 5.0.0
    - lightning-bolts:   0.6.0.post1
    - lightning-lite:    1.8.0
    - lightning-utilities: 0.3.0
    - markdown:          3.4.1
    - markupsafe:        2.1.1
    - multidict:         6.0.2
    - numpy:             1.23.4
    - nvidia-cublas-cu11: 11.10.3.66
    - nvidia-cuda-nvrtc-cu11: 11.7.99
    - nvidia-cuda-runtime-cu11: 11.7.99
    - nvidia-cudnn-cu11: 8.5.0.96
    - oauthlib:          3.2.2
    - packaging:         21.3
    - pillow:            9.3.0
    - pip:               22.3.1
    - protobuf:          3.20.3
    - pyasn1:            0.4.8
    - pyasn1-modules:    0.2.8
    - pydeprecate:       0.3.2
    - pyparsing:         3.0.9
    - pytorch-lightning: 1.8.0
    - pyyaml:            6.0
    - requests:          2.28.1
    - requests-oauthlib: 1.3.1
    - rsa:               4.9
    - setuptools:        65.3.0
    - six:               1.16.0
    - tensorboard:       2.11.0
    - tensorboard-data-server: 0.6.1
    - tensorboard-plugin-wit: 1.8.1
    - termcolor:         2.1.0
    - torch:             1.13.0
    - torchmetrics:      0.10.2
    - torchvision:       0.14.0
    - tqdm:              4.64.1
    - typing-extensions: 4.4.0
    - urllib3:           1.26.12
    - werkzeug:          2.2.2
    - wheel:             0.37.1
    - yarl:              1.8.1
    - zipp:              3.10.0
* System:
    - OS:                Linux
    - architecture:
        - 64bit
        - ELF
    - processor:         x86_64
    - python:            3.8.13
    - version:           #1 SMP Wed Jun 29 23:49:26 UTC 2022

Differences:

diff env_details_177.txt env_details_180.txt
8,9c8,10
<   - lightning-utilities: 0.4.1
<   - pytorch-lightning: 1.7.7
---
>   - lightning-lite:    1.8.0
>   - lightning-utilities: 0.3.0
>   - pytorch-lightning: 1.8.0
21a23
>   - fire:              0.4.0
30c32,33
<   - lightning-utilities: 0.4.1
---
>   - lightning-lite:    1.8.0
>   - lightning-utilities: 0.3.0
48c51
<   - pytorch-lightning: 1.7.7
---
>   - pytorch-lightning: 1.8.0
57a61
>   - termcolor:         2.1.0

cc @tchaton @justusschock @awaelchli @akihironitta @borda

cjsg commented 1 year ago

Just checking: has anyone from the Lightning team started working on this? I think this should be top priority, since it can significantly (and unfavourably) affect all training pipelines that use distributed data parallel.

Borda commented 1 year ago

@carmocca @awaelchli any ideas about what could go wrong? :rabbit:

awaelchli commented 1 year ago

I ran the provided code on 1.7.7 and master (commit a86584d6dd4d50388c7dcef4f3854b0e8355b346). I get similar loss curves. After setting deterministic=True and rerunning both versions, I get identical results (8 GPUs being used).

Note that there are some multiprocessing issues with the provided code, since not all of the code is guarded by if __name__ == "__main__". This does not affect the results but should be fixed.

@cjsg Could you give me the raw printout of your pip freeze command so that I can install the same environment? Thanks.

For reference, here is the complete modified code I ran to make results deterministic:

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from pl_bolts.datamodules import CIFAR10DataModule
from pl_bolts.transforms.dataset_normalizations import cifar10_normalization
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import CSVLogger, TensorBoardLogger
from torch.optim.lr_scheduler import OneCycleLR
from torchmetrics.functional import accuracy

def create_model():
    model = torchvision.models.resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model

class LitResnet(LightningModule):
    def __init__(self, lr=0.05):
        super().__init__()

        self.save_hyperparameters()
        self.model = create_model()

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("train_loss", loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f"{stage}_loss", loss, prog_bar=True)
            self.log(f"{stage}_acc", acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, "val")

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, "test")

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.lr,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE
        scheduler_dict = {
            "scheduler": OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            "interval": "step",
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}

if __name__ == "__main__":
    seed_everything(7)

    PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
    BATCH_SIZE = 256 if torch.cuda.is_available() else 64
    NUM_WORKERS = int(os.cpu_count() / 2)

    train_transforms = torchvision.transforms.Compose(
        [
            torchvision.transforms.RandomCrop(32, padding=4),
            torchvision.transforms.RandomHorizontalFlip(),
            torchvision.transforms.ToTensor(),
            cifar10_normalization(),
        ]
    )

    test_transforms = torchvision.transforms.Compose(
        [
            torchvision.transforms.ToTensor(),
            cifar10_normalization(),
        ]
    )

    cifar10_dm = CIFAR10DataModule(
        data_dir=PATH_DATASETS,
        batch_size=BATCH_SIZE,
        num_workers=NUM_WORKERS,
        train_transforms=train_transforms,
        test_transforms=test_transforms,
        val_transforms=test_transforms,
    )

    model = LitResnet(lr=0.05)

    trainer = Trainer(
        max_epochs=100,
        accelerator="auto",
        gpus=list(range(torch.cuda.device_count())),
        strategy="ddp",
        logger=[
            TensorBoardLogger(save_dir="logs/", name="tensorboard"),
            CSVLogger(save_dir="logs/", name="csv", version=3),
        ],
        callbacks=[LearningRateMonitor(logging_interval="step")],
        enable_progress_bar=False,
        deterministic=True,
    )

    trainer.fit(model, cifar10_dm)
    trainer.test(model, datamodule=cifar10_dm)

cjsg commented 1 year ago

Hi @awaelchli @Borda, thanks for your replies and tests! Here is my output of pip freeze:

Environment with lightning 1.7.7:

absl-py==1.3.0
aiohttp==3.8.3
aiosignal==1.3.1
async-timeout==4.0.2
attrs==22.1.0
cachetools==5.2.0
certifi==2022.9.24
charset-normalizer==2.1.1
frozenlist==1.3.3
fsspec==2022.11.0
google-auth==2.15.0
google-auth-oauthlib==0.4.6
grpcio==1.51.1
idna==3.4
importlib-metadata==5.1.0
lightning-bolts==0.6.0.post1
lightning-utilities==0.4.2
Markdown==3.4.1
MarkupSafe==2.1.1
multidict==6.0.3
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.2
packaging==21.3
Pillow==9.3.0
protobuf==3.20.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyDeprecate==0.3.2
pyparsing==3.0.9
pytorch-lightning==1.7.7
PyYAML==6.0
requests==2.28.1
requests-oauthlib==1.3.1
rsa==4.9
six==1.16.0
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
torch==1.13.0
torchmetrics==0.10.2
torchvision==0.14.0
tqdm==4.64.1
typing_extensions==4.4.0
urllib3==1.26.13
Werkzeug==2.2.2
yarl==1.8.2
zipp==3.11.0

Environment with lightning 1.8.0:

absl-py==1.3.0
aiohttp==3.8.3
aiosignal==1.3.1
async-timeout==4.0.2
attrs==22.1.0
cachetools==5.2.0
certifi==2022.9.24
charset-normalizer==2.1.1
fire==0.4.0
frozenlist==1.3.3
fsspec==2022.11.0
google-auth==2.15.0
google-auth-oauthlib==0.4.6
grpcio==1.51.1
idna==3.4
importlib-metadata==5.1.0
lightning-bolts==0.6.0.post1
lightning-lite==1.8.0
lightning-utilities==0.3.0
Markdown==3.4.1
MarkupSafe==2.1.1
multidict==6.0.3
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.2
packaging==21.3
pi==0.1.2
Pillow==9.3.0
protobuf==3.20.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==3.0.9
pytorch-lightning==1.8.0
PyYAML==6.0
requests==2.28.1
requests-oauthlib==1.3.1
rsa==4.9
six==1.16.0
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
termcolor==2.1.1
torch==1.13.0
torchmetrics==0.10.2
torchvision==0.14.0
tqdm==4.64.1
typing_extensions==4.4.0
urllib3==1.26.13
Werkzeug==2.2.2
yarl==1.8.2
zipp==3.11.0

(I can't guarantee that these environments completely agree with those from my first message, since I re-created them... But I checked that I get the same kind of curves as above with these 2 new environments.)

cjsg commented 1 year ago

I ran the provided code on 1.7.7 and master (commit https://github.com/Lightning-AI/lightning/commit/a86584d6dd4d50388c7dcef4f3854b0e8355b346). I get similar loss curves. After setting deterministic=True and rerunning both versions, I get identical results (8 GPUs being used).

By "identical results", do you mean: (a) 1.7.7. (resp. master) with determinstic=False yields the same result than 1.7.7 (resp. master) with deterministic=True; or (b) 1.7.7 and master yield the same results when setting deterministic=True ?

For (a) I agree (with master=1.8.0): setting deterministic=True doesn't affect the training curves significantly. For (b) I get different curves (obviously, since (a) is true for me and the 1.7.7 and 1.8.0 curves do not agree).

UPDATE: But not sure what this tells us, since I don't think that the seed is the problem. (See comment below.)

[BTW: I used your modified code for those new tests.]

cjsg commented 1 year ago

Another remark: it's not about the seed. You can change the seed at every run, and the curves from 1.7.7 and 1.8.0 will still cluster into 2 separate sets, similar to the plots shown above. (Actually, I think I did use a different seed for every curve in those plots.)
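(To illustrate "a different seed at every run": one hypothetical way to vary it from the command line, instead of editing the script, is to read it from an environment variable of your own choosing, e.g. a SEED variable that is not part of the original code:)

import os
from pytorch_lightning import seed_everything

# SEED is a hypothetical variable, purely for illustration:
#   SEED=1 python repro.py; SEED=2 python repro.py; ...
seed_everything(int(os.environ.get("SEED", 7)))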

awaelchli commented 1 year ago

@cjsg master means the development branch (the latest commit on the repo). master > 1.8

Setting the seed is sometimes not enough, especially with CNNs, where a lot of tuning happens in the cuDNN backend, and one has to set the deterministic flags in torch to enforce deterministic algorithms in the backend. That's why I set Trainer(deterministic=True) in addition to the seed. I did this because I couldn't reproduce your observations, but saw non-deterministic behavior.
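(Roughly speaking, and only as a sketch, the extra flags that deterministic=True is meant to take care of look like this in plain PyTorch; the exact calls Lightning makes can differ between versions:)

import os
import torch

torch.backends.cudnn.benchmark = False     # disable cuDNN autotuning
torch.use_deterministic_algorithms(True)   # error out on non-deterministic ops
# Some CUDA ops also need a fixed cuBLAS workspace to be deterministic:
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"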

I think neither a) nor b) describes it. Let me make it clearer:

I did

git checkout master
python repro.py
git checkout tags/1.7.7
python repro.py
tensorboard --logdir lightning_logs

with your posted code and compared the two curves. They are similar, so I investigated whether the results are deterministic. They are not, so I set deterministic=True (the code I posted) and reran the exact same commands. The resulting two curves are identical.

I can run another experiment against 1.8.0 with your frozen requirements.

cjsg commented 1 year ago

@awaelchli I see. Thanks for your quick answer. So maybe it's already resolved in the latest master commit. I'll try it too and let you know what I get.
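(As a quick sanity check before re-running, to confirm which version each environment actually picks up:)

import pytorch_lightning as pl

print(pl.__version__)  # e.g. 1.7.7, 1.8.0, or a dev version for the master checkout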

cjsg commented 1 year ago

@awaelchli Just finished testing with the lightning git repo installs and your commands from above (master@05dbf48ad). The new curves essentially overlap with the ones I got using the pip installs of lightning 1.7.7 / 1.8.0. See the plots below.

[Screenshot 2022-12-06 at 22:17:44 — training curves from the pip installs and the git checkouts, clustering into two separate beams]

There are 2 clear "beams" of curves, one for lightning 1.7.7 and one for lightning >= 1.8.0. In both cases, the beams contain curves generated with pip installs (1.7.7 and 1.8.0 respectively) and with git checkout installs (tags/1.7.7 and master respectively), with and without the deterministic=True option.