Lightning-AI / pytorch-lightning


DDP - Worse performance with 2 GPUs compared to 1. #7233

Closed ManiadisG closed 3 years ago

ManiadisG commented 3 years ago

🐛 Bug

I am running a model with multiple optimizers using DDP and automatic optimization. When I run it on two GPUs (with the same effective batch size), the model performs consistently worse than it does on 1 GPU (the loss decreases at a significantly slower rate).

I have ruled out errors in my code as it runs without errors and, in both cases, the model performs as expected. I have also tested simplified versions of my models (e.g. no batchnorm layers) to rule out issues related to the architectures. When the model runs on 2 GPUs both are utilized, and I can tell by the number of steps per epoch and by in-code asserts that the effective batch size is the same in both cases (256 in the single-GPU case, 128/128 when 2 GPUs are used). The problem persists for both 32-bit and 16-bit precision, and also when I tested it with manual optimization.

To Reproduce

My training step is formatted as follows:

def training_step(self, batch, batch_idx, optimizer_idx):
    x, y = batch

    if optimizer_idx == 0:
        output1 = self.model1(x)
        loss1 = self.LossFunction1(output1)
        self.log('loss1', loss1, sync_dist=True, prog_bar=False, on_step=False, on_epoch=True)
        return loss1
    else:
        output2 = self.model2(x)
        loss2 = self.LossFunction2(output2)
        self.log('loss2', loss2, sync_dist=True, prog_bar=False, on_step=False, on_epoch=True)
        return loss2

The main function that runs the model is:

def main():
    pl.seed_everything(seed)
    data_module = get_datasets()
    model = MyModel()
    trainer = pl.Trainer(max_epochs=epochs, gpus=gpus, precision=precision,
                         default_root_dir=experiment_dir, accelerator='ddp', sync_batchnorm=True)
    trainer.fit(model, data_module)

Expected behavior

I would expect the traces of loss1 and loss2 to be similar (if not identical) in both settings. Instead, the losses decrease considerably faster when the model is run on 1 GPU than on 2, as can be seen in the screenshot below (the higher curve coming from the 2-GPU run).

[Screenshot: loss curves for the 1-GPU and 2-GPU runs]

Environment

awaelchli commented 3 years ago

in-code asserts that the dataloader correctly splits the batch between the devices.

It does not split the batch in DDP.

How did you set the batch size for these two experiments? And did you scale the LR when increasing the batch size? How does the performance change if you double the batch size (on one GPU)?

ManiadisG commented 3 years ago

I'm sorry, this thing with the dataloaders and the batch sizes is always hard to communicate, and saying that the dataloader "splits" the batches was definitely poor phrasing; I have edited it out.

What I meant is that when I trained the model on a single GPU I had a batch size of 256. When I trained it on two GPUs the dataloaders had a batch size of 128. So the effective batch size for both cases is the same (and therefore the LR is not changed).
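
For illustration (a hedged sketch of the arithmetic, not anyone's actual code): under DDP each process builds its own batches, so the DataLoader batch size is per process and the effective batch size is that value times the number of processes.

# effective batch size under DDP: the DataLoader batch size is per process
n_gpus = 2
per_gpu_batch_size = 128
effective_batch_size = per_gpu_batch_size * n_gpus  # 256, matching the single-GPU run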

tchaton commented 3 years ago

Hey @ManiadisG,

Could you provide a reproducible example with an open-source model / dataset?

Best, T.C

ManiadisG commented 3 years ago

Thank you @tchaton !

I think the following script is close to what I am doing. My actual model is similar to MoCo (memory bank, momentum-updated models), but I have isolated and removed each of these components and none of them caused the issue.

The results I got with this script weren't as extreme (which makes sense given that classifying STL10 is straightforward), but even with a fixed seed and deterministic=True there were notable differences in the losses between some steps.

Results

[Screenshots: loss curves from the reproduction script]

Code

import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from pytorch_lightning import LightningDataModule
import torchvision
from torchvision.datasets import STL10
from torchvision import transforms
import argparse

parser = argparse.ArgumentParser(conflict_handler='resolve')
parser.add_argument("--working_dir", default="./", type=str)
parser.add_argument("--dataset_dir", default="./", type=str)
parser.add_argument("--gpus", default=[0], type=int, nargs="+")
parser.add_argument("--batch_size", default=256, type=int)
parser.add_argument("--num_workers", default=8, type=int)
parser.add_argument("--epochs", default=10, type=int)

args = parser.parse_known_args()[0]

class DataModule(LightningDataModule):
    def __init__(self, train_dataset):
        super().__init__()
        self.train_dataset = train_dataset
        # per-process batch size: the global batch size is divided across the GPUs
        self.batch_size = args.batch_size // len(args.gpus)
        self.num_workers = args.num_workers // 2

    def train_dataloader(self):
        return DataLoader(dataset=self.train_dataset, batch_size=self.batch_size, shuffle=True,
                          num_workers=self.num_workers,
                          pin_memory=True, drop_last=True)

def get_datamodule(dataset_dir):
    normalize = transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))
    augmentation = [transforms.RandomResizedCrop(96, scale=(0.2, 1.)),
                    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
                    transforms.RandomGrayscale(p=0.2), transforms.RandomHorizontalFlip(),
                    transforms.ToTensor(), normalize]
    dataset = STL10(dataset_dir, 'test', download=True, transform=transforms.Compose(augmentation))
    return DataModule(dataset)

class ToyModel(pl.LightningModule):

    def __init__(self):
        super().__init__()

        self.resnet1 = torchvision.models.resnet18()
        self.resnet1.fc = nn.Linear(512, 10)
        self.resnet2 = torchvision.models.resnet18()
        self.resnet2.fc = nn.Linear(512, 10)
        self.Loss1 = nn.CrossEntropyLoss()
        self.Loss2 = nn.CrossEntropyLoss()

    def configure_optimizers(self):

        prms1 = [{'params': self.resnet1.parameters(), 'lr': 0.0001, 'weight_decay': 0.0001, 'momentum': 0.9}]
        optimizer_1 = torch.optim.SGD(prms1)

        prms2 = [{'params': self.resnet2.parameters(), 'lr': 0.0001, 'weight_decay': 0.0001, 'momentum': 0.9}]
        optimizer_2 = torch.optim.SGD(prms2)

        return [optimizer_1, optimizer_2]

    def forward(self, x, isfirst=True):
        if isfirst:
            return self.resnet1(x)
        else:
            return self.resnet2(x)

    def training_step(self, batch, batch_idx, optimizer_idx):

        x, y = batch

        if optimizer_idx == 0:
            pr1 = self(x, True)
            loss1 = self.Loss1(pr1, y)
            self.log('loss1', loss1, sync_dist=True, prog_bar=True, on_step=True, on_epoch=True)
            return loss1
        else:
            pr2 = self(x, False)
            loss2 = self.Loss2(pr2, y)
            self.log('loss2', loss2, sync_dist=True, prog_bar=True, on_step=True, on_epoch=True)
            return loss2

def main():
    pl.seed_everything(0)
    data_module = get_datamodule(args.dataset_dir)

    model = ToyModel()

    trainer = pl.Trainer(max_epochs=args.epochs, gpus=args.gpus, precision=16, deterministic=True,
                         default_root_dir=args.working_dir, accelerator='ddp', sync_batchnorm=True)

    trainer.fit(model, data_module)

    return model.float()

if __name__ == '__main__':
    model = main()

ManiadisG commented 3 years ago

I also tried without random augmentations (only normalization, resizing and center crop) to rule out the different dataloader instances having an impact. There are still (small) differences in how the models train despite the fact that, as I understand it, they should be identical.

ushahid commented 3 years ago

I am facing the same issue: there is a significant difference in F1 scores for classification models. Did you find a solution?

awaelchli commented 3 years ago

@ushahid when you compute the F1 score, or any metric for that matter, make sure you take into account the values from all processes and that they are correctly reduced across processes. You can consider using torchmetrics for this; here is F1: https://torchmetrics.readthedocs.io/en/latest/references/modules.html#f1

Correct metric usage and logging in Lightning: https://torchmetrics.readthedocs.io/en/latest/pages/lightning.html
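
For reference, a minimal sketch of that pattern (the metric class name and the task argument assume a recent torchmetrics release; older versions expose it as torchmetrics.F1, and the backbone/num_classes names here are just placeholders):

import pytorch_lightning as pl
import torchmetrics


class LitClassifier(pl.LightningModule):
    def __init__(self, backbone, num_classes):
        super().__init__()
        self.backbone = backbone
        # registering the metric as a submodule gives it per-process state that torchmetrics can sync
        self.val_f1 = torchmetrics.F1Score(task="multiclass", num_classes=num_classes)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = self.backbone(x).argmax(dim=-1)
        self.val_f1.update(preds, y)
        # log the metric object itself (not a tensor): Lightning then calls compute()
        # once per epoch with a correct reduction across all DDP processes
        self.log("val_f1", self.val_f1, on_step=False, on_epoch=True)

The point is that the reduction happens inside the metric, so the epoch-level value should be the same whether you run on one process or several.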

ushahid commented 3 years ago

Hi @awaelchli, thanks for your quick response. I am currently using a torchmetrics-based accumulator which essentially collects predictions from all the processes and evaluates on the master process using a callback-based method. I have already verified that it is working correctly in both cases. I also have a torchmetrics-based F1 evaluating at epoch level, and the results are consistent for both.

Another thing I would like to highlight is that the difference for me is between DDP- and DP-based training, which is a bit different from the original question. The only thing that I change is the accelerator parameter of the trainer, which results in a drop of 4-5% Macro F1 in the case of DDP; that is quite significant for this problem. I set the seed at the start of the script (seed_everything, workers=True) but not within the LightningModule, and I am wondering if that has something to do with it. A few more facts about the model:

  1. It is an ELECTRA backbone based classifier
  2. The classifier layer has LayerNorm and Dropout along with Linear and LeakyReLU layers
  3. I am using 8 GPUs with a batch size of 64 on DP and 8 on DDP, which as per my understanding is the same effective batch size and also the same batch size per GPU
  4. The learning rate is the same for both methods
  5. Another thing I noticed is that there are a lot more than 8 messages printed saying "Global seed set to X", which seems strange

Any help is greatly appreciated! Thanks.

awaelchli commented 3 years ago

but not within the LightningModule and I am wondering if that has something to do with it

as long as seed_everything() runs before the initialization of your LightningModule, it's all good.

  1. don't know it, most likely irrelevant to the issue
  2. with the seed set, should dropout be the same on all DDP processes? not sure if it can hurt in any way
  3. correct, seems reasonable! 👍
  4. reasonable, the LR scales with the effective batch size (see the sketch after this list) 👍
  5. yep, I noticed that too. there are probably 16, right? Lightning resets the seed to the initial value after the "sanity check" validation loop, and that's why you see the message a second time. printing it on every rank is supposed to give the user assurance that it is done properly, but yes, it results in many outputs, which can be unpleasant!
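
On point 4, the linear-scaling heuristic usually quoted (e.g. Goyal et al., "Accurate, Large Minibatch SGD") is sketched below; Lightning does not apply this for you, and the numbers are purely illustrative:

# linear LR scaling rule of thumb: scale the LR with the effective (global) batch size
base_lr = 0.1            # LR tuned for the base effective batch size
base_batch_size = 256
new_batch_size = 512     # e.g. more GPUs with the same per-GPU batch size
scaled_lr = base_lr * new_batch_size / base_batch_size  # -> 0.2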

should we investigate 2), and generally all places where randomness is involved? I have no better idea at the moment. for these debugging purposes I would also suggest deactivating as many trainer features as possible.

there is also a small chance of a bug in the F1 metric of torchmetrics. are you logging it via self.log(metric)?

ushahid commented 3 years ago

I have the torchmetrics F1 (yes, logged with self.log) and an F1 based on my accumulator derived from the Metric class, both giving the same results, so if it's something related to the calculations then it is at least consistent. I have also verified that all test records are accounted for in the accumulator and nothing is missing or repeated.

I think randomness is worth exploring, although I am not sure exactly how to approach it. I am calling seed_everything before initialization, but should I move it to model setup? Which other sources of randomness could there be that are not accounted for by seeding at the start? I also have the deterministic flag set to True in the Trainer. I agree that dropout in theory should be the same, provided that the seed is consistent.

Also, which trainer flags would you recommend disabling, and how does that affect the DDP vs DP performance difference?

ushahid commented 3 years ago

@awaelchli I have noticed that the batch indices passed to the 8 GPUs are completely different for DP and DDP for the first training step (I am printing them in training_step on all GPUs). I have shuffling enabled for training; isn't PyTorch Lightning ensuring that for every step the same indices get passed to the GPUs? If yes, then what could possibly be wrong in my setup?

awaelchli commented 3 years ago

Hmm, isn't this expected, considering the way DP handles data vs DDP is quite different? DP splits a single batch into N pieces, then sends each piece to a GPU. DDP, on the other hand, partitions all samples beforehand evenly across the processes, so each GPU/process sees a split of the dataset and then forms its own batches.

We cannot directly compare the sampling process here.
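
To make that concrete, a small sketch using torch.utils.data directly (nothing Lightning-specific):

import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(12))

# DDP: each rank owns a disjoint shard of the dataset and builds its own batches from it
for rank in range(2):
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=False)
    print("rank", rank, list(sampler))
# rank 0 -> [0, 2, 4, 6, 8, 10], rank 1 -> [1, 3, 5, 7, 9, 11]

# DP: a single process builds one large batch and torch.nn.DataParallel chunks it
# across the GPUs inside forward(), so the per-GPU sample composition is different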

ushahid commented 3 years ago

@awaelchli that makes sense. I was wondering if that would affect the results when used in combination with dropout. I also tried removing dropout from the model, and removed layer norm as well, but the results are unfortunately still inconsistent.

ushahid commented 3 years ago

@awaelchli Another thing I noticed is that I get multiple loss values in DP mode in training_step_end. I don't see a mention of that in the documentation, but it is understandable why it happens. I then average them in the case of DP as follows:

if outputs['loss'].numel() > 1:
    outputs['loss'] = outputs['loss'].mean()

I am assuming backward is called on this average; is this expected and normal?

I then log the loss as following:

self.log('loss/train', outputs['loss'], on_step=True, on_epoch=False)

In validation_step_end, I average again and then log it as:

self.log('loss/val_epoch', outputs['loss'], on_epoch=True, sync_dist=True, sync_dist_op='mean')

This is important because I am picking the best model based on validation loss. I am using the same code for DP and DDP, are these two equivalent for validation epoch loss?

awaelchli commented 3 years ago

I was wondering if that would affect the results when used in combination with dropout. I also tried removing dropout from the model, and removed layer norm as well, but the results are unfortunately still inconsistent.

As far as I know, the dropout would not affect this. Each GPU will use the same starting seed, so all dropout masks will be the same across GPUs within a single training step, in both DDP and DP. I am not 100% sure, but I will try to verify it.

I am assuming backward is called on this average, is this expected and normal?

Yes, expected and normal. See torch.nn.DataParallel. You can also just always call outputs["loss"].mean(), since it makes no difference when there is only one element. There are as many elements as there were GPUs involved in the loss computation.
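
In other words, a hedged sketch of a training_step_end that works unchanged under both modes:

def training_step_end(self, outputs):
    # under DP outputs["loss"] holds one value per GPU; under DDP it is a scalar;
    # .mean() is a no-op for a single element, so the same code covers both cases
    outputs["loss"] = outputs["loss"].mean()
    return outputs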

This is important because I am picking the best model based on validation loss. I am using the same code for DP and DDP, are these two equivalent for validation epoch loss?

It looks OK. The DP version will always be correct in the sense that you get all batches in validation, as there is only one main process, and it works even when the dataset is not evenly divisible by the number of GPUs and the batch size. With DDP it's slightly more complicated, and it can be incorrect when the dataset size is not divisible by the number of GPUs and the batch size. In that case the DistributedSampler will repeat some dataset samples so that all processes see the same amount of data. It's an implementation detail that often gets forgotten (I only remembered it now that you asked about it).
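
To make the divisibility caveat concrete, a small sketch with plain PyTorch (default drop_last=False):

import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))  # 10 samples across 3 processes: not evenly divisible
for rank in range(3):
    sampler = DistributedSampler(dataset, num_replicas=3, rank=rank, shuffle=False)
    print("rank", rank, list(sampler))
# rank 0 -> [0, 3, 6, 9], rank 1 -> [1, 4, 7, 0], rank 2 -> [2, 5, 8, 1]
# samples 0 and 1 are seen twice, which can slightly distort epoch-level validation metrics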

tigert1998 commented 2 years ago

Is there any conclusion?

fedorsc commented 1 year ago

If anyone comes across this and is using SLURM: I encountered an issue because I didn't specify the number of tasks in my SLURM script (see https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html).

VincentPelletier1 commented 5 months ago

I still cannot make my instance work, but for those trying to train across multiple machines:

I had put shuffle=True in the DataLoader; it should be set to False if you want to train across multiple machines, in order to mitigate an overlap in data samples between your machines.
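
For what it's worth, in plain PyTorch DDP the usual pattern is to keep shuffle off on the DataLoader and let the DistributedSampler do the shuffling, so each process draws from its own shard every epoch (a sketch with a placeholder train_dataset; Lightning normally injects an equivalent sampler for you):

from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(train_dataset, shuffle=True)  # shuffling handled by the sampler
loader = DataLoader(train_dataset, batch_size=128, sampler=sampler, shuffle=False)

# call sampler.set_epoch(epoch) at the start of each epoch so all processes reshuffle consistently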