Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

DDP with 2 GPUs doesn't give same results as 1 GPU with the same effective batch size #6789

Open mees opened 3 years ago

mees commented 3 years ago

🐛 Bug

Training a network with DDP on 2 GPUs with a batch size of N/2 should give the same result as training on a single GPU with batch size N. I tested this using the CIFAR VAE from lightning-bolts. The configurations I tried are: single GPU with the default batch size of 256, Data Parallel on 2 GPUs (each GPU then gets a batch of 128), and DDP on 2 GPUs (manually setting the batch size to 128). Although all three experiments have the same effective batch size, DDP doesn't show the same performance as single-GPU training and DP, especially with respect to the KL loss. The experiments use the default settings, without fancy stuff like 16-bit precision or sharded training. I used a VAE network to analyze this problem as it is close in spirit to the networks I am using in my current research. @SeanNaren @awaelchli

[plot: training curves for the single-GPU, DP, and DDP runs; the DDP KL loss curve differs from the other two]

To Reproduce

import os
from argparse import ArgumentParser

import pytorch_lightning as pl
import torch
from torch import nn as nn
from torch.nn import functional as F

from pl_bolts import _HTTPS_AWS_HUB
from pl_bolts.models.autoencoders.components import (
    resnet18_decoder,
    resnet18_encoder,
    resnet50_decoder,
    resnet50_encoder,
)

class VAE(pl.LightningModule):
    """
    Standard VAE with Gaussian Prior and approx posterior.

    Model is available pretrained on different datasets:

    Example::

        # not pretrained
        vae = VAE()

        # pretrained on cifar10
        vae = VAE(input_height=32).from_pretrained('cifar10-resnet18')

        # pretrained on stl10
        vae = VAE(input_height=32).from_pretrained('stl10-resnet18')
    """

    pretrained_urls = {
        'cifar10-resnet18': os.path.join(_HTTPS_AWS_HUB, 'vae/vae-cifar10/checkpoints/epoch%3D89.ckpt'),
        'stl10-resnet18': os.path.join(_HTTPS_AWS_HUB, 'vae/vae-stl10/checkpoints/epoch%3D89.ckpt'),
    }

    def __init__(
        self,
        input_height: int,
        enc_type: str = 'resnet18',
        first_conv: bool = False,
        maxpool1: bool = False,
        enc_out_dim: int = 512,
        kl_coeff: float = 0.1,
        latent_dim: int = 256,
        lr: float = 1e-4,
        **kwargs
    ):
        """
        Args:
            input_height: height of the images
            enc_type: option between resnet18 or resnet50
            first_conv: use standard kernel_size 7, stride 2 at start or
                replace it with kernel_size 3, stride 1 conv
            maxpool1: use standard maxpool to reduce spatial dim of feat by a factor of 2
            enc_out_dim: set according to the out_channel count of
                encoder used (512 for resnet18, 2048 for resnet50)
            kl_coeff: coefficient for kl term of the loss
            latent_dim: dim of latent space
            lr: learning rate for Adam
        """

        super(VAE, self).__init__()

        self.save_hyperparameters()

        self.lr = lr
        self.kl_coeff = kl_coeff
        self.enc_out_dim = enc_out_dim
        self.latent_dim = latent_dim
        self.input_height = input_height

        valid_encoders = {
            'resnet18': {
                'enc': resnet18_encoder,
                'dec': resnet18_decoder,
            },
            'resnet50': {
                'enc': resnet50_encoder,
                'dec': resnet50_decoder,
            },
        }

        if enc_type not in valid_encoders:
            self.encoder = resnet18_encoder(first_conv, maxpool1)
            self.decoder = resnet18_decoder(self.latent_dim, self.input_height, first_conv, maxpool1)
        else:
            self.encoder = valid_encoders[enc_type]['enc'](first_conv, maxpool1)
            self.decoder = valid_encoders[enc_type]['dec'](self.latent_dim, self.input_height, first_conv, maxpool1)

        self.fc_mu = nn.Linear(self.enc_out_dim, self.latent_dim)
        self.fc_var = nn.Linear(self.enc_out_dim, self.latent_dim)

    @staticmethod
    def pretrained_weights_available():
        return list(VAE.pretrained_urls.keys())

    def from_pretrained(self, checkpoint_name):
        if checkpoint_name not in VAE.pretrained_urls:
            raise KeyError(str(checkpoint_name) + ' not present in pretrained weights.')

        return self.load_from_checkpoint(VAE.pretrained_urls[checkpoint_name], strict=False)

    def forward(self, x):
        x = self.encoder(x)
        mu = self.fc_mu(x)
        log_var = self.fc_var(x)
        p, q, z = self.sample(mu, log_var)
        return self.decoder(z)

    def _run_step(self, x):
        x = self.encoder(x)
        mu = self.fc_mu(x)
        log_var = self.fc_var(x)
        p, q, z = self.sample(mu, log_var)
        return z, self.decoder(z), p, q

    def sample(self, mu, log_var):
        std = torch.exp(log_var / 2)
        p = torch.distributions.Normal(torch.zeros_like(mu), torch.ones_like(std))
        q = torch.distributions.Normal(mu, std)
        z = q.rsample()
        return p, q, z

    def step(self, batch, batch_idx):
        x, y = batch
        z, x_hat, p, q = self._run_step(x)

        recon_loss = F.mse_loss(x_hat, x, reduction='mean')

        log_qz = q.log_prob(z)
        log_pz = p.log_prob(z)

        kl = log_qz - log_pz
        kl = kl.mean()
        kl *= self.kl_coeff

        loss = kl + recon_loss

        logs = {
            "recon_loss": recon_loss,
            "kl": kl,
            "loss": loss,
        }
        return loss, logs

    def training_step(self, batch, batch_idx):
        loss, logs = self.step(batch, batch_idx)
        self.log_dict({f"train_{k}": v for k, v in logs.items()}, on_step=True, on_epoch=False)
        return loss

    def validation_step(self, batch, batch_idx):
        loss, logs = self.step(batch, batch_idx)
        self.log_dict({f"val_{k}": v for k, v in logs.items()})
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

    @staticmethod
    def add_model_specific_args(parent_parser):
        parser = ArgumentParser(parents=[parent_parser], add_help=False)

        parser.add_argument("--enc_type", type=str, default='resnet18', help="resnet18/resnet50")
        parser.add_argument("--first_conv", action='store_true')
        parser.add_argument("--maxpool1", action='store_true')
        parser.add_argument("--lr", type=float, default=1e-4)

        parser.add_argument(
            "--enc_out_dim",
            type=int,
            default=512,
            help="512 for resnet18, 2048 for bigger resnets, adjust for wider resnets"
        )
        parser.add_argument("--kl_coeff", type=float, default=0.1)
        parser.add_argument("--latent_dim", type=int, default=256)
        parser.add_argument("--batch_size", type=int, default=256)
        parser.add_argument("--num_workers", type=int, default=8)
        parser.add_argument("--data_dir", type=str, default=".")

        return parser

def cli_main(args=None):
    from pl_bolts.datamodules import CIFAR10DataModule, ImagenetDataModule, STL10DataModule

    pl.seed_everything(1234)

    parser = ArgumentParser()
    parser.add_argument("--dataset", default="cifar10", type=str, choices=["cifar10", "stl10", "imagenet"])
    script_args, _ = parser.parse_known_args(args)

    if script_args.dataset == "cifar10":
        dm_cls = CIFAR10DataModule
    elif script_args.dataset == "stl10":
        dm_cls = STL10DataModule
    elif script_args.dataset == "imagenet":
        dm_cls = ImagenetDataModule
    else:
        raise ValueError(f"undefined dataset {script_args.dataset}")

    parser = VAE.add_model_specific_args(parser)
    parser = pl.Trainer.add_argparse_args(parser)
    args = parser.parse_args(args)

    dm = dm_cls.from_argparse_args(args)
    args.input_height = dm.size()[-1]

    if args.max_steps == -1:
        args.max_steps = None

    model = VAE(**vars(args))

    trainer = pl.Trainer.from_argparse_args(args)
    trainer.fit(model, datamodule=dm)
    return dm, model, trainer

if __name__ == "__main__":
    dm, model, trainer = cli_main()

  python vae.py --dataset=cifar10 --batch_size=256 # single gpu training
  python vae.py --dataset=cifar10 --batch_size=128 --gpus=2 --accelerator=ddp
  python vae.py --dataset=cifar10 --batch_size=256 --gpus=2 --accelerator=dp

Expected behavior

Since the effective batch size and the rest of the hyperparameters are the same, the training results should be very close.

Environment

cc @tchaton @rohitgr7 @akihironitta @justusschock @kaushikb11 @awaelchli

zhiruiluo commented 3 years ago

You may get some inspiration from the discussion #3706.

mees commented 3 years ago

Hey @lzrpotato, thanks for the link, but that discussion doesn't apply here, as we have the same effective batch size in all three cases.

awaelchli commented 3 years ago

Hey, if there is batch norm hidden in there somewhere or any other operation that takes batch statistics into account for backprop, this could explain it.
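
A minimal sketch of that batch-statistics point (illustrative toy model, not the VAE from this issue): under DDP each process normalizes with the statistics of its local batch of N/2, which is not what a single GPU sees for the full batch of N. Converting the model to synchronized batch norm makes the statistics global:

import torch

# toy model with a BatchNorm layer; each DDP process would otherwise normalize
# with its local per-rank batch statistics
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8))
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

In Lightning the same conversion can be requested with Trainer(sync_batchnorm=True).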

tchaton commented 3 years ago

Dear @mees,

we just merged this PR: https://github.com/PyTorchLightning/pytorch-lightning/pull/683. I think SyncBN wasn't properly configured, and it could maybe explain the disparity observed. Would you mind trying out master?

Best, T.C

edenlightning commented 3 years ago

@mees friendly ping :)

ManiadisG commented 3 years ago

Dear @tchaton,

I have a similar problem, about which I just raised an issue. I also thought the cause was SyncBN, but the problem persisted with models that didn't use BN at all. It may be related.

mees commented 3 years ago

Hi all, sorry for the late reply. I just re-ran the tests on PL master and the results make more sense now :)

[plot: updated training curves on PL master]

awaelchli commented 3 years ago

I did some sanity testing yesterday and will share the plots later today. It looks like something is still not right. Going off of your example, @mees, I can see it has nothing to do with multiple optimizers or log syncing. Will get back to you.

mees commented 3 years ago

Ah okay, yes, please let me know, and feel free to reopen the issue.

awaelchli commented 3 years ago

Here are my findings. Four experiments conducted, dummy model with dummy loss, effective batch size always 8:

Pure PyTorch DDP implementation:

  1. --gpus=2 --batch_size=4
  2. --gpus=1 --batch_size=8

Lightning DDP implementation:

  1. --gpus=2 --batch_size=4
  2. --gpus=1 --batch_size=8

Without learning rate scaling, I get two different curves, but we maintain parity between pure PyTorch and Lightning. The plots are accessible here in WandB:

We find that if we scale the learning rate by a factor of 4

--gpus=2 --batch_size=4 --lr_scale 4

we can perfectly replicate the loss plot of the single gpu experiment.

Here are the two equivalent (to the best of my ability) implementations:

PyTorch:

from argparse import ArgumentParser

import torch
import torch.distributed
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, DistributedSampler

from pl_examples.repro import BoringModel, RandomDataset
from pytorch_lightning import seed_everything
import wandb

def train():
    parser = ArgumentParser()
    parser.add_argument("--gpus", type=int, default=2)
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--local_rank", type=int)
    parser.add_argument("--lr_scale", type=float, default=1.0)
    parser.add_argument("--name", type=str, default="debug")
    args = parser.parse_args()
    device_ids = list(range(args.gpus))
    device = torch.device("cuda", args.local_rank)
    torch.cuda.set_device(args.local_rank)

    torch.distributed.init_process_group(backend='nccl', init_method='env://')

    if args.local_rank == 0:
        wandb.init(project="ddp-parity-1.3.0", name=args.name)
        wandb.config.update(vars(args))

    model = BoringModel(**vars(args)).to(device)
    opt = model.configure_optimizers()

    ddp_model = DistributedDataParallel(model, device_ids=[args.local_rank], output_device=args.local_rank)
    dataset = RandomDataset(32, 6400)
    train_data = DataLoader(
        dataset,
        batch_size=args.batch_size,
        sampler=DistributedSampler(dataset, num_replicas=len(device_ids), rank=args.local_rank)
    )

    global_step = 0
    for epoch in range(5):
        for i, batch in enumerate(train_data):
            batch = batch.to(device)
            opt.zero_grad()
            loss = ddp_model(batch).sum()
            loss.backward()
            opt.step()

            if args.local_rank == 0:
                print(f"{i:04d} / {len(train_data)}")
                wandb.log({"train_loss": loss, "trainer/global_step": global_step})

            global_step += 1

if __name__ == "__main__":
    seed_everything(0)
    train()

# run command:
# python -m torch.distributed.launch --nproc_per_node=2 pl_examples/repro_pt.py --batch_size 4 --gpus 2 --name pt-ddp

Lightning:


import os
from argparse import ArgumentParser

import torch
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import Dataset, DataLoader
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.plugins import DDPPlugin

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class BoringModel(LightningModule):

    def __init__(self, **kwargs):
        super().__init__()
        self.save_hyperparameters(kwargs)
        self.lr_scale = kwargs.get("lr_scale")
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss, sync_dist=False)
        assert isinstance(self.trainer.model, DistributedDataParallel)
        assert isinstance(self.trainer.training_type_plugin, DDPPlugin)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss, sync_dist=False)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss, sync_dist=False)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=(0.1 * self.lr_scale))

def run():
    parser = ArgumentParser()
    parser = Trainer.add_argparse_args(parser)
    parser.add_argument("--name", type=str, default="debug")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--lr_scale", type=float, default=1.0)
    parser.set_defaults(
        max_epochs=5,
        log_every_n_steps=1,
        limit_val_batches=0,
    )
    args = parser.parse_args()
    logger = WandbLogger(project="ddp-parity-1.3.0", name=args.name)
    trainer = Trainer.from_argparse_args(args, logger=logger, plugins=[DDPPlugin(find_unused_parameters=False)])
    model = BoringModel(**vars(args))
    train_data = DataLoader(RandomDataset(32, 6400), batch_size=args.batch_size)
    val_data = DataLoader(RandomDataset(32, 6400), batch_size=args.batch_size)
    test_data = DataLoader(RandomDataset(32, 6400), batch_size=args.batch_size)
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)

if __name__ == '__main__':
    seed_everything(0, workers=True)
    run()

Environment details for reproduction:

ManiadisG commented 3 years ago

@awaelchli I may be wrong but I think in your example this may be due to the sum() in the loss. If the loss is not averaged over the batch then the magnitude of the gradients increases as the batch size increases, so the 2-GPU model has a learning signal half as strong as the single GPU one. The reported loss is also always 2 times smaller than the loss for the same model with double the batch. Therefore, in order to get the same reported loss, the 2-GPU model needs to move twice as fast. These two factors together get you the 4X learning rate scaling requirement.
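
To make the gradient half of this argument concrete, here is a small single-process sketch (toy linear layer, with DDP's gradient averaging simulated by hand, not the code from this thread): with a sum-reduced loss, the averaged gradient of two half-batches is half the gradient of the full batch.

import torch

torch.manual_seed(0)
layer = torch.nn.Linear(32, 2)
x = torch.randn(8, 32)

# single process, full batch of 8, sum-reduced loss
layer.zero_grad()
layer(x).sum().backward()
g_full = layer.weight.grad.clone()

# two simulated "ranks" with half-batches of 4; DDP averages gradients across ranks
layer.zero_grad()
layer(x[:4]).sum().backward()
g_rank0 = layer.weight.grad.clone()
layer.zero_grad()
layer(x[4:]).sum().backward()
g_rank1 = layer.weight.grad.clone()
g_ddp = (g_rank0 + g_rank1) / 2

print(torch.allclose(g_full, 2 * g_ddp))  # True: the DDP-averaged gradient is half as large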

awaelchli commented 3 years ago

No, you are right, and it was careless of me not to notice. It should be the mean! I will update the results later today, and I will check again whether 1.2.6 gave different results, as you reported initially. Thanks for the help. We need to be sure everything is right before our 1.3 release.

awaelchli commented 3 years ago

I reran my toy example and the curves match. But the initial code posted by @mees shows a difference, and I was able to confirm that on my GPUs. @mees, did you change anything when you reran your experiments here: https://github.com/PyTorchLightning/pytorch-lightning/issues/6789#issuecomment-830198938 ?

ManiadisG commented 3 years ago

@mees @awaelchli I believe I might have found the issue that I had and likely what is going on in this case as well.

I re-implemented my code using standard PyTorch DDP and found that, although it didn't throw errors, it didn't really synchronize gradients unless the parameters in question were used in the model's forward function. Apparently (and although I couldn't find that requirement in any PyTorch docs), DDP registers the parameters to be synced via the forward function of the model wrapped in DistributedDataParallel. Therefore, and likely also in PTL's case, all training step functions must use the main model's forward to work correctly, even if the forward function includes multiple cases (e.g. via ifs) to accommodate more complex models.

I notice that mees' code also doesn't use the forward function in the training step, so I'm guessing that is the issue. The way to confirm this is to save the weights of the models on all devices. If what I suggest holds and they don't sync, the models will be different. If that is the case, I would strongly suggest an edit to the docs to highlight this requirement.
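
For anyone wanting to poke at this pattern themselves, here is a self-contained sketch (a single-process gloo "world", illustrative code rather than the model from this issue): only calls that go through the DDP wrapper's own forward() exercise the machinery that reduces gradients during backward.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# a one-process group just to make the example runnable on its own
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(32, 2)
ddp_model = DistributedDataParallel(model)
x = torch.randn(4, 32)

# goes through DDP's forward(), so backward triggers gradient reduction
ddp_model(x).sum().backward()

# bypasses the wrapper entirely; this is the pattern reported above to leave
# each rank with purely local, unsynchronized gradients
ddp_model.module(x).sum().backward()

dist.destroy_process_group()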

ananthsub commented 3 years ago

Therefore, and likely also in PTL's case, all training step functions must use the main model's forward to work correctly, even if the forward function includes multiple cases (e.g. via ifs) to accomodate more complex models.

By default, Lightning redirects the forward function for DistributedDataParallel to the correct *_step function inside the lightning module. So implementing forward on the lightning module is not a hard requirement: https://github.com/PyTorchLightning/pytorch-lightning/blob/e0c64f0ef639a8b9f46e0d8e32e5e0a6b7532cff/pytorch_lightning/overrides/base.py#L22-L63
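
In other words (a rough paraphrase of the linked override, not the actual class), the module Lightning hands to DistributedDataParallel looks something like the sketch below, so DDP's forward hooks still fire even when the user never defines forward():

import torch

class _StepRedirect(torch.nn.Module):  # illustrative name, not Lightning's
    def __init__(self, pl_module):
        super().__init__()
        self.module = pl_module

    def forward(self, *args, **kwargs):
        # DDP wraps this module, so calling ddp_model(batch, batch_idx) runs
        # the matching *_step through DDP's forward, arming its gradient hooks
        if self.module.training:
            return self.module.training_step(*args, **kwargs)
        return self.module.validation_step(*args, **kwargs)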

Additionally, find_unused_parameters is set to True by default in Lightning, which does differ from the PyTorch default for DDP. @awaelchli this is a difference between your Lightning code and your PyTorch code. For the BoringModel this shouldn't matter, but for more complex models it's easy to get different benchmarking results between native PyTorch and Lightning. This is a big reason why I'm in favor of changing the default back to match upstream PyTorch.

ManiadisG commented 3 years ago

I see, thank you. I would note that in DDP with PyTorch I had set find_unused_parameters to True and still parameters didn't sync unless used by forward. Although I don't think it is important in this case given the overriding you mentioned.

awaelchli commented 3 years ago

I agree with @ananthsub that this should not be an issue. I ran a quick experiment where I converted the step() to forward() in @mees's code, and I see no change. A repro for his reported numbers/plots is here: https://wandb.ai/awaelchli/ddp-parity

EDIT: I compared the weights with and without "proper" forward use and they are the same.

awaelchli commented 3 years ago

Did you have a model where there actually were unused parameters? Because, to my knowledge, find_unused_parameters=True only adds runtime overhead if there are no unused params.

If you want to know what's up, carefully backtrack your changes. I bet there was a subtle change when you did the refactor with forward that made the difference. I would be very interested in finding what that is!

ManiadisG commented 3 years ago

In my case I had a "parent" model class with 2 sub-models, each updated individually (with different optimizers). Pure PyTorch DDP and PTL with automatic optimization required that I set find_unused_parameters=True, otherwise an error was thrown. PTL with manual optimization (which also gave very different results for 1 and 2 GPUs) did not require that.

I am fairly certain that the switch to forward did not change anything because I used the same functions as before, only they were under forward and I used ifs to choose which path I wanted forward to take. However, that only applies to pure PyTorch. I have not yet solved the problem in my PTL implementation.

awaelchli commented 3 years ago

Yes I agree that in pure PyTorch, not calling the forward method properly will not activate the hooks DDP needs for sync. However I would be interested how you are using Lightning in a way that you still see this issue? Are you able to reproduce it in a minimal example? Let's track the manual optimization case in a separate issue, as it seems very different than the report here.

ManiadisG commented 3 years ago

Thank you, I will try to create an example in the coming days and get back to you with more information. The model I am using is fairly complex so it may take a while to find a way to reproduce the issue in a simpler example.

mees commented 3 years ago

Reran my toy example and the curves match. But the initial code posted by @mees shows a difference and I was able to confirm that on my gpus. @mees did you change anything when you reran your experiments here #6789 (comment) ?

Hi @awaelchli no I didn't change anything, just the lightning version. Do you have any suspicion of what might be going on?

senarvi commented 3 years ago

I noticed a similar thing when training a YOLO object detection model using DDP on 4, 8, or 16 GPUs. (There is a pull request for the YOLO model in Lightning Bolts.) I'm using the same batch size in every case and accumulating gradients from 4, 2, or 1 steps to compensate for the number of GPUs.

There's not much difference going from 4 to 8 GPUs, but going to 16 GPUs there's a radical drop in average precision:

[plot: average precision for the 4-, 8-, and 16-GPU runs; the 16-GPU run drops sharply]

The model is using batch normalization. Shouldn't make a difference, since I'm using the same batch size and I'm not using synchronized batch normalization, right? (I tried SyncBN too and it doesn't make much difference either.) The losses are calculated as an average over the batch size. But I don't see how any problem in the model could matter, as the model always processes batches of the same size. In one case the gradients are just accumulated from two steps, and in another case they are computed in parallel on twice as many GPUs.

I was repeating the experiment on the public COCO dataset, and there was not much difference between 8 and 16 GPUs in the model performance, but I did experience some nan/inf values with 16 GPUs.

I thought the difference might come from more GPU processes using more workers for data loading. If the workers, which also perform data augmentation, use the same random seed (which turned out to be the case quite often), that can have a negative effect. I actually noticed that Lightning was not seeding the workers correctly when using DDP, so I thought that might be the culprit. Unfortunately, fixing the random seed didn't help, and neither did reducing the number of workers.

If I disable data augmentation altogether, the results are a lot worse, but there's not that big of a difference between 8 and 16 GPUs anymore. It doesn't necessarily mean that there's something wrong with data augmentation, though.

[plot: average precision with data augmentation disabled; the 8- and 16-GPU runs are now similar, but overall worse]

I'm using gradient clipping and another thought was that maybe the clipping is applied differently when using gradient accumulation. However, looking at the code, it seems that clipping is applied only after accumulating the gradients for one model update, so it should work identically whether one is using gradient accumulation or not.
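
As a sanity check on that ordering, here is a plain-PyTorch sketch (a toy model, not Lightning's internals) of gradient accumulation with clipping applied once per optimizer update:

import torch

model = torch.nn.Linear(32, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accumulate = 4

for step, batch in enumerate(torch.randn(16, 8, 32)):  # 16 micro-batches of 8 samples
    # scale so the accumulated gradient matches one large averaged batch
    loss = model(batch).pow(2).mean() / accumulate
    loss.backward()
    if (step + 1) % accumulate == 0:
        # clip once, only after all micro-batch gradients have been accumulated
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        opt.step()
        opt.zero_grad()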

I did try using a lower threshold for gradient clipping, and it reduced the difference:

[plot: average precision with a lower gradient clipping threshold; the gap between GPU counts is reduced]

To me the most likely explanation is that something is making training less stable when using more GPUs, and this can be mitigated using stronger gradient clipping.

senarvi commented 3 years ago

I'm not seeing the regression with 16 GPUs anymore. I fixed a couple of errors in the loss function; essentially, the loss was scaled incorrectly. But I don't see how those could have had a different effect with a different number of GPUs when the batch size is kept constant. Then I noticed that another regression, and in some cases an error message, was caused by using an incompatible version of Torchvision. I'm installing Lightning and Torchvision in a Docker container; the base image from NVIDIA includes Torch and Torchvision, and installing the latest Torchvision manually with pip caused these strange problems. To me this sounds like the most likely cause of the differences in average precision. It could be related to the NMS that is performed by Torchvision. In that case this has nothing to do with the bug that was reported in this issue.

mhamilton723 commented 3 years ago

I am encountering this issue in 1.4.4. Any idea what to do to fix this? @senarvi @edenlightning

ohayonguy commented 3 years ago

Any update on this? I am facing the exact same issue. @awaelchli

tchaton commented 3 years ago

Dear @senarvi,

Are you using precision=16 while training?

Best, T.C

senarvi commented 3 years ago

@tchaton I was using 32-bit precision [Edit: not 16] all the time. I don't have a clear idea what caused the problem. After fixing the Torchvision version, I don't see the problem anymore. I'm using Torchvision for performing non-maximum suppression before scoring, so it could be related to that. Sorry that I cannot help more.

awaelchli commented 3 years ago

In case you are using gradient clipping + 16 bit precision then there is a bug here: #9330 which could lead to the difference you are seeing.

senarvi commented 3 years ago

@awaelchli sorry, I wrote incorrectly above. I was using gradient clipping, but 32-bit precision. Good to know anyway!

SeanNaren commented 3 years ago

After speaking to @awaelchli @tchaton we should introduce some form of correctness test between 2 GPUs vs 1 GPU for model loss/convergence.

This may also allow us to easily test correctness for our training system, and showcase any changes needed to maintain parity.

tchaton commented 3 years ago

Hey @SeanNaren,

Any progress there?

tshu-w commented 2 years ago

Any update? I'm facing the same issue.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

tshu-w commented 2 years ago

active

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

tshu-w commented 2 years ago

active?

amorehead commented 2 years ago

Hi, all. Have there been any updates in this regard? I am using PyTorch Lightning 1.5.4 for training my models with 1 vs. 2 vs. 4 GPUs. When ensuring the same effective batch size (e.g., 8), I am still seeing different results for my three model training runs even though I am using the same random seed with Lightning's seed_everything().

JinLi711 commented 2 years ago

I think one issue with DDP using the same seed across the per-GPU processes is that there is less diversity of randomness compared to DP.

For example, say your batch size is 2, the number of GPUs is two, and you're rotating images by a random angle. With DP, PyTorch will rotate each of the two images by a different angle. However, with DDP, each GPU gets one sample and each image is rotated the same way across GPUs, because each process gets the same seed. This doesn't just apply to rotations; it applies to all operations that involve randomness (like dropout).

The more GPUs you use while keeping the effective batch size the same, the lower the diversity of randomness. This lower diversity in randomness can lead to poorer results, which may explain the difference between DDP and DP even when the effective batch size is the same.

The solution seems to be to just set a different seed for each GPU process, and then this is no longer an issue. However, one big problem with that solution is that it makes batch loading inconsistent. For example, if each GPU process has the same seed, then the batch loader of each GPU will load indices [5, 9, 4, 7], and the DistributedSampler from PyTorch will then subselect from that batch based on the GPU rank (so GPU 0 gets [5, 4] and GPU 1 gets [9, 7]). If you set a different seed for each GPU process, then each GPU will load a different set of indices: for example, GPU process 0 will load [3, 8, 9, 6] and GPU process 1 will load [5, 3, 4, 8].
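
A tiny single-process illustration of the duplicated-augmentation point (two "ranks" simulated by hand; the angle function is just a stand-in for whatever random augmentation the dataset applies):

import torch

def augmentation_angle(seed):
    # stand-in for a random augmentation drawn inside the dataloader
    g = torch.Generator().manual_seed(seed)
    return torch.rand(1, generator=g).item() * 360

print(augmentation_angle(1234))      # "rank 0"
print(augmentation_angle(1234))      # "rank 1": same seed, identical augmentation
print(augmentation_angle(1234 + 1))  # a per-rank seed would restore diversity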

@awaelchli Does this look right?

awaelchli commented 2 years ago

Yes, I think your explanation checks out when no extra steps are taken to diversify the augmentations, in general. In Lightning, if you use seed_everything(workers=True), we take care of this by seeding every dataloader worker differently, across all ranks (across multi-node too), and this will then lead to diversified augmentations, provided that they are applied in the dataset/worker and num_workers > 0. In other words, the seed in each worker is different but deterministically related to the main seed. The default is seed_everything(workers=False) however, so this needs to be turned on by the user explicitly.
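
For completeness, the usage being described (Lightning 1.x API, as also used in the Lightning script earlier in this thread):

from pytorch_lightning import seed_everything

# Seeds Python, NumPy and torch, and derives a distinct but deterministic seed
# for every dataloader worker on every rank, so augmentations applied inside
# the dataset diverge across workers/ranks while staying reproducible.
seed_everything(42, workers=True)  # the default is workers=False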

I don't recommend setting a different seed per rank though unless the user knows what they are doing and what implications this has (under Lightning and/or PyTorch). This is difficult to reason about and difficult to debug. I strongly recommend to let the distributed sampler fully handle the shuffling to avoid duplicated or missing samples across the ranks.

Two other thoughts regarding the discussion on the thread here:

  1. Note that if you go from 1 to 2 GPUs while keeping the effective batch size the same (and the seed), you can't expect the results to be exactly the same numerically because the order in which the samples get batched and hence used to update the model weights will be different. When we talk about achieving "same results" we mean stochastically, after convergence. I hope this makes sense. I wasn't sure maybe this was already mentioned.
  2. Consider turning on sync-batch norm, if it applies to your network. If there are other terms in your network or loss function that take batch statistics into account, one has to account for that when switching to ddp.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!