Hey @lzrpotato, thanks for the link, but that discussion doesn't apply here, as we have the same effective batch size in all three cases.
Hey, if there is batch norm hidden in there somewhere or any other operation that takes batch statistics into account for backprop, this could explain it.
Dear @mees,
we just merged this PR https://github.com/PyTorchLightning/pytorch-lightning/pull/683. I think SyncBN wasn't properly configured, which could maybe explain the observed disparity. Would you mind trying out master?
Best, T.C
@mees friendly ping :)
Dear @tchaton,
I have a similar problem about which I just raised an issue. I also thought the problem was SyncBN but the problem persisted when I tried models that didn't use BN at all. It may be related.
Hi all, sorry for the late reply. I just re-ran the tests on PL master and the results make more sense now :)
I did some sanity testing yesterday and will share the plots later today. It looks like something is still not right. Going off your example, @mees, I can see it has nothing to do with multiple optimizers or log syncing. Will get back to you.
Ah okay, yes, please let me know, and feel free to reopen the issue.
Here are my findings. Four experiments conducted, dummy model with dummy loss, effective batch size always 8:
Pure PyTorch DDP implementation:
Lightning DDP implementation:
Without learning rate scaling, I get two different curves, but we maintain parity between pure PyTorch and Lightning. The plots are accessible here in WandB:
We find that if we scale the learning rate by a factor of 4 (`--gpus=2 --batch_size=4 --lr_scale 4`), we can perfectly replicate the loss plot of the single-GPU experiment.
Here are the two equivalent (to the best of my ability) implementations:
PyTorch:
```python
from argparse import ArgumentParser

import torch
import torch.distributed
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, DistributedSampler

from pl_examples.repro import BoringModel, RandomDataset
from pytorch_lightning import seed_everything
import wandb


def train():
    parser = ArgumentParser()
    parser.add_argument("--gpus", type=int, default=2)
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--local_rank", type=int)
    parser.add_argument("--lr_scale", type=float, default=1.0)
    parser.add_argument("--name", type=str, default="debug")
    args = parser.parse_args()

    device_ids = list(range(args.gpus))
    device = torch.device("cuda", args.local_rank)
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend='nccl', init_method='env://')

    if args.local_rank == 0:
        wandb.init(project="ddp-parity-1.3.0", name=args.name)
        wandb.config.update(vars(args))

    model = BoringModel(**vars(args)).to(device)
    opt = model.configure_optimizers()
    ddp_model = DistributedDataParallel(model, device_ids=[args.local_rank], output_device=args.local_rank)

    dataset = RandomDataset(32, 6400)
    train_data = DataLoader(
        dataset,
        batch_size=args.batch_size,
        sampler=DistributedSampler(dataset, num_replicas=len(device_ids), rank=args.local_rank),
    )

    global_step = 0
    for epoch in range(5):
        for i, batch in enumerate(train_data):
            batch = batch.to(device)
            opt.zero_grad()
            loss = ddp_model(batch).sum()
            loss.backward()
            opt.step()
            if args.local_rank == 0:
                print(f"{i:04d} / {len(train_data)}")
                wandb.log({"train_loss": loss, "trainer/global_step": global_step})
            global_step += 1


if __name__ == "__main__":
    seed_everything(0)
    train()

# run command:
# python -m torch.distributed.launch --nproc_per_node=2 pl_examples/repro_pt.py --batch_size 4 --gpus 2 --name pt-ddp
```
Lightning:
```python
import os
from argparse import ArgumentParser

import torch
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import Dataset, DataLoader

from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.plugins import DDPPlugin


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self, **kwargs):
        super().__init__()
        self.save_hyperparameters(kwargs)
        self.lr_scale = kwargs.get("lr_scale")
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss, sync_dist=False)
        assert isinstance(self.trainer.model, DistributedDataParallel)
        assert isinstance(self.trainer.training_type_plugin, DDPPlugin)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss, sync_dist=False)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss, sync_dist=False)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=(0.1 * self.lr_scale))


def run():
    parser = ArgumentParser()
    parser = Trainer.add_argparse_args(parser)
    parser.add_argument("--name", type=str, default="debug")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--lr_scale", type=float, default=1.0)
    parser.set_defaults(
        max_epochs=5,
        log_every_n_steps=1,
        limit_val_batches=0,
    )
    args = parser.parse_args()

    logger = WandbLogger(project="ddp-parity-1.3.0", name=args.name)
    trainer = Trainer.from_argparse_args(args, logger=logger, plugins=[DDPPlugin(find_unused_parameters=False)])

    model = BoringModel(**vars(args))
    train_data = DataLoader(RandomDataset(32, 6400), batch_size=args.batch_size)
    val_data = DataLoader(RandomDataset(32, 6400), batch_size=args.batch_size)
    test_data = DataLoader(RandomDataset(32, 6400), batch_size=args.batch_size)
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)


if __name__ == '__main__':
    seed_everything(0, workers=True)
    run()
```
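The Lightning script does not need `torch.distributed.launch`. The exact command was not given here, so the following invocation is only a guess based on the flags defined above and the path implied by the `pl_examples.repro` import in the PyTorch script:

```python
# Hypothetical run command for the Lightning script; --gpus and --accelerator come from
# Trainer.add_argparse_args, the rest are the custom arguments defined in run():
# python pl_examples/repro.py --gpus 2 --accelerator ddp --batch_size 4 --name pl-ddp
```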
Environment details for reproduction:
@awaelchli I may be wrong, but I think in your example this may be due to the sum() in the loss. If the loss is not averaged over the batch, the magnitude of the gradients grows with the batch size, so after DDP averages the per-rank gradients the 2-GPU model has a learning signal half as strong as the single-GPU one. The reported loss is also always two times smaller than the loss of the same model with double the batch size, so in order to reach the same reported loss the 2-GPU model needs to move twice as fast. These two factors together give you the 4x learning rate scaling requirement.
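To make the gradient factor of 2 concrete, here is a minimal CPU-only sketch (toy linear layer and random data, not the model above) that emulates DDP's gradient averaging on two half-batches:

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(32, 2)
batch = torch.randn(8, 32)  # the "effective" batch of 8


def weight_grad(loss_fn, data):
    """Weight gradient of one backward pass with the given reduction."""
    layer.zero_grad()
    loss_fn(layer(data)).backward()
    return layer.weight.grad.clone()


# Single GPU, batch 8, sum-reduced loss:
g_single = weight_grad(torch.sum, batch)

# Emulated 2-GPU DDP, batch 4 per rank, sum-reduced loss: each rank's gradient is the
# sum over its half, and DDP averages the two ranks, which is exactly half of g_single.
g_ddp = (weight_grad(torch.sum, batch[:4]) + weight_grad(torch.sum, batch[4:])) / 2
print(torch.allclose(g_ddp, g_single / 2))  # True: the learning signal is halved

# With a mean-reduced loss the same emulation reproduces the single-GPU gradient exactly.
g_single_mean = weight_grad(torch.mean, batch)
g_ddp_mean = (weight_grad(torch.mean, batch[:4]) + weight_grad(torch.mean, batch[4:])) / 2
print(torch.allclose(g_ddp_mean, g_single_mean))  # True
```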
No, you are right, and I should have caught that: it should be the mean! I will update the results later today, and I will check again whether 1.2.6 gave different results as you initially reported. Thanks for the help. We need to be sure everything is right before our 1.3 release.
Reran my toy example and the curves match. But the initial code posted by @mees shows a difference, and I was able to confirm that on my GPUs. @mees did you change anything when you reran your experiments here https://github.com/PyTorchLightning/pytorch-lightning/issues/6789#issuecomment-830198938 ?
@mees @awaelchli I believe I might have found the issue that I had and likely what is going on in this case as well.
I re-implemented my code using standard PyTorch DDP and found that, although it didn't throw errors, it didn't really synchronize gradients unless the parameters in question were used in the model's forward function. Apparently (although I couldn't find that requirement in any PyTorch docs) DDP registers the parameters to be synced via the forward function of the model wrapped in DistributedDataParallel. Therefore, and likely also in PTL's case, all training step functions must use the main model's forward to work correctly, even if the forward function includes multiple cases (e.g. via ifs) to accommodate more complex models.
I notice that @mees' code also doesn't use the forward function in the training step, so I'm guessing that is the issue. The way to confirm this is to save the weights of the models on all devices. If what I suggest holds and they don't sync, the models will be different. If that is the case, I would strongly suggest an edit to the docs to highlight this requirement.
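A minimal sketch of that check, assuming each rank dumps its weights to a file named after its rank (the file names and the `model`/`rank` variables are placeholders, not taken from the code above):

```python
import torch

# Inside the training script, each rank would save its own copy at some point, e.g.:
#   torch.save(model.state_dict(), f"weights_rank{rank}.pt")

# Offline, compare the two dumps; if DDP synced gradients correctly the weights match.
w0 = torch.load("weights_rank0.pt")
w1 = torch.load("weights_rank1.pt")
diverged = [k for k in w0 if not torch.equal(w0[k], w1[k])]
print("parameters that diverged across ranks:", diverged or "none")
```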
> Therefore, and likely also in PTL's case, all training step functions must use the main model's forward to work correctly, even if the forward function includes multiple cases (e.g. via ifs) to accommodate more complex models.
By default, Lightning redirects the forward function for DistributedDataParallel to the correct `*_step` function inside the LightningModule, so implementing `forward` on the LightningModule is not a hard requirement: https://github.com/PyTorchLightning/pytorch-lightning/blob/e0c64f0ef639a8b9f46e0d8e32e5e0a6b7532cff/pytorch_lightning/overrides/base.py#L22-L63
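Roughly, the pattern in the linked file looks like this (a simplified sketch, not the actual implementation):

```python
from torch import nn


class _StepForwardWrapper(nn.Module):
    """Simplified sketch of the override: DDP wraps this object, and its forward()
    dispatches to the LightningModule's *_step method, so DDP's gradient hooks still
    fire even though training_step never calls pl_module.forward() itself."""

    def __init__(self, pl_module):
        super().__init__()
        self.module = pl_module

    def forward(self, *args, **kwargs):
        if self.module.training:
            return self.module.training_step(*args, **kwargs)
        return self.module.validation_step(*args, **kwargs)
```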
Additionally, `find_unused_parameters` is set to True by default in Lightning, which does differ from the PyTorch default for DDP. @awaelchli this is a difference between your Lightning code and PyTorch code. For the BoringModel this shouldn't matter, but for more complex models it's easy to get different benchmarking results between native PyTorch and Lightning. This is a big reason why I'm in favor of changing the default back again to match upstream PyTorch.
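Spelled out, the two defaults look like this (a sketch using the 1.3-era API already shown in this thread; `model` and `rank` are placeholders):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin
from torch.nn.parallel import DistributedDataParallel

# Raw PyTorch: find_unused_parameters defaults to False.
ddp_model = DistributedDataParallel(model, device_ids=[rank])

# Lightning: the plugin defaults to find_unused_parameters=True, so matching the
# PyTorch default means opting out explicitly, as the benchmark script above does.
trainer = Trainer(gpus=2, plugins=[DDPPlugin(find_unused_parameters=False)])
```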
I see, thank you. I would note that in DDP with PyTorch I had set `find_unused_parameters` to True and parameters still didn't sync unless they were used by forward. Although I don't think that is important in this case, given the overriding you mentioned.
I agree with @ananthsub that this should not be an issue. I ran a quick experiment where I converted the step() to forward() in @mees' code and I see no change. A repro for his reported numbers/plots is here: https://wandb.ai/awaelchli/ddp-parity
EDIT: I compared the weights with and without "proper" forward use and they are the same.
Did you have a model where there were actually unused parameters? Because, to my knowledge, if there are no unused params, find_unused_parameters=True only adds runtime overhead.
If you want to know what's up, carefully backtrack your changes. I bet there was a subtle change when you did the refactor with forward that made the difference. I would be very interested in finding what that is!
In my case I had a "parent" model class with two sub-models, each updated individually (different optimizers). Pure PyTorch DDP and PTL with automatic optimization required that I set find_unused_parameters=True, otherwise an error was thrown. PTL with manual optimization (which also gave very different results for 1 and 2 GPUs) did not require that.
I am fairly certain that the switch to forward did not change anything, because I used the same functions as before; they were just moved under forward, and I used ifs to choose which path I wanted forward to take. However, that only applies to pure PyTorch. I have not yet solved the problem in my PTL implementation.
Yes, I agree that in pure PyTorch, not calling the forward method properly will not activate the hooks DDP needs for syncing. However, I would be interested in how you are using Lightning in a way that you still see this issue. Are you able to reproduce it in a minimal example? Let's track the manual optimization case in a separate issue, as it seems very different from the report here.
Thank you, I will try to create an example in the coming days and get back to you with more information. The model I am using is fairly complex so it may take a while to find a way to reproduce the issue in a simpler example.
> Reran my toy example and the curves match. But the initial code posted by @mees shows a difference, and I was able to confirm that on my GPUs. @mees did you change anything when you reran your experiments here #6789 (comment)?
Hi @awaelchli no I didn't change anything, just the lightning version. Do you have any suspicion of what might be going on?
I noticed a similar thing when training a YOLO object detection model using DDP on 4, 8, or 16 GPUs. (There is a pull request for the YOLO model in Lightning Bolts.) I'm using the same batch size in every case and accumulating gradients from 4, 2, or 1 steps to compensate for the number of GPUs.
There's not much difference going from 4 to 8 GPUs, but going to 16 GPUs there's a radical drop in average precision:
The model is using batch normalization. That shouldn't make a difference, since I'm using the same batch size and I'm not using synchronized batch normalization, right? (I tried SyncBN too and it doesn't make much difference either.) The losses are calculated as an average over the batch size. But I don't see how any problem in the model could matter, as the model always processes batches of the same size. In one case the gradients are just accumulated from two steps, and in the other case they are computed in parallel on twice as many GPUs.
I was repeating the experiment on the public COCO dataset, and there was not much difference between 8 and 16 GPUs in the model performance, but I did experience some nan/inf values with 16 GPUs.
I thought the difference might come from more GPU processes using more workers for data loading. If the workers (which also perform data augmentation) use the same random seed, which turned out to be the case in many setups, that can have a negative effect. I actually noticed that Lightning was not seeding the workers correctly when using DDP, so I thought that might be the culprit. Unfortunately fixing the random seed didn't help, and neither did reducing the number of workers.
If I disable data augmentation altogether, the results are a lot worse, but there's not that big of a difference between 8 and 16 GPUs anymore. It doesn't necessarily mean that there's something wrong with data augmentation, though.
I'm using gradient clipping and another thought was that maybe the clipping is applied differently when using gradient accumulation. However, looking at the code, it seems that clipping is applied only after accumulating the gradients for one model update, so it should work identically whether one is using gradient accumulation or not.
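For clarity, the ordering being described, written as a bare training-loop sketch (toy model and hypothetical `accumulate_grad_batches`, not the YOLO code):

```python
import torch

model = torch.nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulate_grad_batches = 2
data = [torch.randn(4, 32) for _ in range(8)]

for step, batch in enumerate(data):
    # Scale the loss so k accumulated sub-batches match one step on the full batch.
    loss = model(batch).mean() / accumulate_grad_batches
    loss.backward()
    if (step + 1) % accumulate_grad_batches == 0:
        # Clip once per optimizer step, after all sub-batch gradients have been summed,
        # so clipping behaves the same with and without accumulation.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
```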
I did try using a lower threshold for gradient clipping, and it reduced the difference:
To me the most likely explanation is that something is making training less stable when using more GPUs, and this can be mitigated using stronger gradient clipping.
I'm not seeing the regression with 16 GPUs anymore. I fixed a couple of errors in the loss function; essentially the loss was scaled incorrectly. But I don't see how these could have had a different effect with a different number of GPUs when the batch size is kept constant. Then I noticed that another regression, and in some cases an error message, was caused by using an incompatible version of Torchvision. I'm installing Lightning and Torchvision in a Docker container. The base image from NVIDIA includes Torch and Torchvision, and installing the latest Torchvision manually using pip caused these strange problems. To me this sounds like the most likely cause of the differences in average precision. It could be related to the NMS that is performed by Torchvision. In that case this has nothing to do with the bug that was reported in this issue.
I am encountering this issue in 1.4.4. Any idea how to fix this? @senarvi @edenlightning
Any update on this? I am facing the exact same issue. @awaelchli
Dear @senarvi,
Are you using precision=16 while training?
Best, T.C
@tchaton I was using 32-bit precision [Edit: not 16] all the time. I don't have a clear idea what caused the problem. After fixing the Torchvision version, I don't see the problem anymore. I'm using Torchvision for performing non-maximum suppression before scoring, so it could be related to that. Sorry that I cannot help more.
In case you are using gradient clipping + 16-bit precision, there is a bug here: #9330, which could lead to the difference you are seeing.
@awaelchli sorry, I wrote incorrectly above. I was using gradient clipping, but 32-bit precision. Good to know anyway!
After speaking to @awaelchli @tchaton we should introduce some form of correctness test between 2 GPUs vs 1 GPU for model loss/convergence.
This may also allow us to easily test correctness for our training system, and showcase any changes needed to maintain parity.
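A CPU-only sketch of what such a parity test could assert, emulating DDP's gradient averaging in a single process (the model, data, and optimizer are arbitrary toy choices, not Lightning's actual test harness):

```python
import torch


def make_model():
    torch.manual_seed(0)  # identical initialization for both runs
    return torch.nn.Linear(32, 2)


torch.manual_seed(1)
data = torch.randn(64, 32)

# "1 GPU": plain SGD on full batches of 8 with a mean-reduced loss.
model_single = make_model()
opt_single = torch.optim.SGD(model_single.parameters(), lr=0.1)
for batch in data.split(8):
    opt_single.zero_grad()
    model_single(batch).mean().backward()
    opt_single.step()

# "2 GPUs": each rank sees half of every batch; gradients are averaged across ranks
# (what DDP does) before an identical optimizer step.
model_ddp = make_model()
opt_ddp = torch.optim.SGD(model_ddp.parameters(), lr=0.1)
for batch in data.split(8):
    grads = []
    for shard in batch.chunk(2):  # one shard per emulated rank
        model_ddp.zero_grad()
        model_ddp(shard).mean().backward()
        grads.append([p.grad.clone() for p in model_ddp.parameters()])
    for p, g0, g1 in zip(model_ddp.parameters(), *grads):
        p.grad = (g0 + g1) / 2  # DDP's all-reduce mean
    opt_ddp.step()

for p_a, p_b in zip(model_single.parameters(), model_ddp.parameters()):
    assert torch.allclose(p_a, p_b, atol=1e-6), "1-GPU vs 2-GPU parity broken"
print("parity holds")
```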
Hey @SeanNaren,
Any progress there ?
Any update? I'm facing the same issue.
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
active
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
active?
Hi, all. Have there been any updates in this regard? I am using PyTorch Lightning 1.5.4 for training my models with 1 vs. 2 vs. 4 GPUs. When ensuring the same effective batch size (e.g., 8), I am still seeing different results for my three training runs, even though I am using the same random seed with Lightning's seed_everything().
I think one issue with DDP using the same seed across the different per-GPU processes is that there is lower diversity of randomness compared to DP.
For example, say your batch size is 2, you have two GPUs, and you're rotating images by a random angle. With DP, PyTorch will rotate each of the two images by a different angle. With DDP, however, each GPU gets one sample, and each image will be rotated the same way across GPUs because each process gets the same seed. This doesn't just apply to rotations; it applies to all operations that have some sort of randomness (like dropout).
The more GPUs you use while keeping the effective batch size the same, the lower the diversity of randomness. This lower diversity can lead to poorer results, which may explain the difference between DDP and DP even when the effective batch size is the same.
The solution seems to be to just set a different seed for each GPU process, and then this is no longer an issue. However, one big problem with this solution is that it makes batch loading inconsistent. For example, if each GPU process has the same seed, then the batch loader on each GPU will load indices [5, 9, 4, 7], and the DistributedSampler from PyTorch will subselect that batch based on the GPU rank (so GPU 0 will get [5, 4] and GPU 1 will get [9, 7]). If you set a different seed for each GPU process, then each GPU will load a different set of indices: for example, GPU process 0 will load [3, 8, 9, 6] and GPU process 1 will load [5, 3, 4, 8].
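A tiny self-contained illustration of that subselection (toy dataset of 8 samples, unrelated to the indices above):

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(8))

# With the same seed and epoch on every process, each rank sees the same global
# shuffle and simply takes its own slice of it.
for rank in range(2):
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=True, seed=0)
    sampler.set_epoch(0)
    print(f"rank {rank} gets indices:", list(sampler))
```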
@awaelchli Does this look right?
Yes, I think your explanation checks out when no extra steps are taken to diversify the augmentations, in general. In Lightning, if you use seed_everything(workers=True), we take care of this by seeding every dataloader worker differently, across all ranks (and across multi-node too), and this then leads to diversified augmentations, provided that they are applied in the dataset/worker and num_workers > 0. In other words, the seed in each worker is different but deterministically related to the main seed. The default is seed_everything(workers=False) however, so this needs to be turned on explicitly by the user.
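Outside of Lightning, the usual recipe for per-worker seeding looks roughly like this (a sketch, not Lightning's actual implementation; the toy dataset is made up to expose the worker RNG):

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class NoisyDataset(Dataset):
    """Toy dataset whose 'augmentation' is just a random number."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return idx, random.random()


def worker_init_fn(worker_id):
    # PyTorch already gives each worker a distinct torch seed derived from the base seed;
    # propagate it to the other RNGs so augmentations differ per worker yet stay reproducible.
    worker_seed = torch.initial_seed() % 2 ** 32
    random.seed(worker_seed)
    np.random.seed(worker_seed)


if __name__ == "__main__":
    loader = DataLoader(NoisyDataset(), batch_size=2, num_workers=2, worker_init_fn=worker_init_fn)
    for idx, noise in loader:
        print(idx.tolist(), [round(n, 4) for n in noise.tolist()])
```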
I don't recommend setting a different seed per rank, though, unless the user knows what they are doing and what implications this has (under Lightning and/or PyTorch). It is difficult to reason about and difficult to debug. I strongly recommend letting the distributed sampler fully handle the shuffling to avoid duplicated or missing samples across the ranks.
Two other thoughts regarding the discussion on the thread here:
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
🐛 Bug
Training a network with DDP on 2 GPUs with a batch size of N/2 should give the same result as training on a single GPU with batch size N. I tested this using the CIFAR VAE from lightning-bolts. The configurations I tried are: single GPU with the default batch size of 256, DataParallel on 2 GPUs (each GPU then gets a batch of 128), and DDP on 2 GPUs (manually setting the batch size to 128). Although all three experiments have the same effective batch size, DDP does not show the same performance as the single-GPU training and DP, especially with respect to the KL loss. The experiments use the default settings, without fancy stuff like 16-bit precision or sharded training. I used a VAE network to analyze this problem as it is close in spirit to the networks I am using in my current research. @SeanNaren @awaelchli
To Reproduce
Expected behavior
Since the effective batch size and the rest of the hyperparameters are the same, the training should be very close.
Environment
cc @tchaton @rohitgr7 @akihironitta @justusschock @kaushikb11 @awaelchli