Open pranavrao-qure opened 1 week ago
Set strategy="ddp_find_unused_parameters_true", as the error log suggests.
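For reference, a minimal sketch of where that setting goes. The strategy string is the one quoted from the error log; `accelerator` and `devices=4` are assumptions based on the four GPUs listed in the environment, not the Trainer arguments from the original report.

```python
# Keyword arguments intended for lightning.pytorch.Trainer.
# "accelerator" and "devices" are assumed values, not taken from the report.
trainer_kwargs = {
    "accelerator": "gpu",
    "devices": 4,
    # Strategy alias named in the error log: DDP with find_unused_parameters=True.
    "strategy": "ddp_find_unused_parameters_true",
}

# Usage (requires lightning to be installed):
# import lightning as L
# trainer = L.Trainer(**trainer_kwargs)
```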
Doesn't setting strategy="ddp_find_unused_parameters_true" make an extra forward pass, using more computation and time? As far as I understand the tutorial, there shouldn't be any parameters with requires_grad=True during the computation of d_loss that end up with grad=None, because the call self.toggle_optimizer(optimizer_d) sets requires_grad to False for all parameters other than the ones being optimised by optimizer_d.
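The expectation described above can be checked outside Lightning with a small sketch. It does not call `toggle_optimizer`; it only mimics the behaviour described in the question by setting `requires_grad` to False on the generator by hand. The two `nn.Linear` networks, the tensor shapes, and the use of `BCEWithLogitsLoss` are hypothetical stand-ins for the tutorial's models and loss.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the tutorial's generator and discriminator.
generator = nn.Linear(4, 8)
discriminator = nn.Linear(8, 1)

# Mimic the described effect of toggle_optimizer(optimizer_d):
# freeze every parameter that the discriminator optimizer does not own.
for p in generator.parameters():
    p.requires_grad_(False)

z = torch.randn(16, 4)
real = torch.randn(16, 8)
fake_imgs = generator(z).detach()

loss_fn = nn.BCEWithLogitsLoss()
real_loss = loss_fn(discriminator(real), torch.ones(16, 1))
fake_loss = loss_fn(discriminator(fake_imgs), torch.zeros(16, 1))
d_loss = (real_loss + fake_loss) / 2
d_loss.backward()

# Frozen generator parameters never receive a gradient;
# the discriminator parameters do.
generator_grads = [p.grad for p in generator.parameters()]
discriminator_grads = [p.grad for p in discriminator.parameters()]
```

In this sketch the generator's gradients stay None and the discriminator's are populated, which matches the reasoning in the question. Whether DDP still waits for gradients from parameters that were frozen after the model was wrapped is a separate matter that this single-process example does not exercise.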
I checked the code you provided and looked for the "unused params" following https://discuss.pytorch.org/t/how-to-find-the-unused-parameters-in-network/63948/5. It looks like the main reason is that the discriminator and the generator compute their losses separately, but the LightningModule wraps them as a single model. Using the debug method mentioned above:
# adversarial loss is binary cross-entropy
g_loss = self.adversarial_loss(self.discriminator(self.generated_imgs), valid)
self.log("g_loss", g_loss, prog_bar=True)
self.manual_backward(g_loss)
for name, param in self.named_parameters():
    if param.grad is None:
        print(name)
optimizer_g.step()
optimizer_g.zero_grad()
self.untoggle_optimizer(optimizer_g)
# train discriminator
# Measure discriminator's ability to classify real from generated samples
self.toggle_optimizer(optimizer_d)
# how well can it label as real?
valid = torch.ones(imgs.size(0), 1)
valid = valid.type_as(imgs)
real_loss = self.adversarial_loss(self.discriminator(imgs), valid)
# how well can it label as fake?
fake = torch.zeros(imgs.size(0), 1)
fake = fake.type_as(imgs)
fake_loss = self.adversarial_loss(self.discriminator(self.generated_imgs.detach()), fake)
# discriminator loss is the average of these
d_loss = (real_loss + fake_loss) / 2
self.log("d_loss", d_loss, prog_bar=True)
self.manual_backward(d_loss)
for name, param in self.named_parameters():
    if param.grad is None:
        print(name)
optimizer_d.step()
optimizer_d.zero_grad()
self.untoggle_optimizer(optimizer_d)
I got this output (with strategy="ddp_find_unused_parameters_true"):
If you instead call backward once on the sum of both losses:
self.manual_backward(d_loss + g_loss)
self.toggle_optimizer(optimizer_d)
optimizer_d.step()
optimizer_d.zero_grad()
self.untoggle_optimizer(optimizer_d)
self.toggle_optimizer(optimizer_g)
optimizer_g.step()
optimizer_g.zero_grad()
self.untoggle_optimizer(optimizer_g)
then the plain "ddp" setting works correctly.
Bug description
I am trying to train a GAN model on multiple GPUs using DDP. I followed the tutorial at https://lightning.ai/docs/pytorch/stable/notebooks/lightning_examples/basic-gan.html, changing the arguments to Trainer to
Running the script raises a RuntimeError as follows:
What version are you seeing the problem on?
v2.4
How to reproduce the bug
Error messages and logs
Environment
Current environment
* CUDA:
    - GPU:
        - NVIDIA L40S
        - NVIDIA L40S
        - NVIDIA L40S
        - NVIDIA L40S
    - available: True
    - version: 12.1
* Lightning:
    - lightning: 2.4.0
    - lightning-utilities: 0.11.7
    - pytorch-lightning: 2.4.0
    - torch: 2.4.1
    - torchmetrics: 1.4.2
    - torchvision: 0.19.1
* Packages:
    - aiohappyeyeballs: 2.4.3
    - aiohttp: 3.10.9
    - aiosignal: 1.3.1
    - async-timeout: 4.0.3
    - attrs: 24.2.0
    - autocommand: 2.2.2
    - backports.tarfile: 1.2.0
    - cxr-training: 0.1.0
    - filelock: 3.16.1
    - frozenlist: 1.4.1
    - fsspec: 2024.9.0
    - idna: 3.10
    - importlib-metadata: 8.0.0
    - importlib-resources: 6.4.0
    - inflect: 7.3.1
    - jaraco.collections: 5.1.0
    - jaraco.context: 5.3.0
    - jaraco.functools: 4.0.1
    - jaraco.text: 3.12.1
    - jinja2: 3.1.4
    - lightning: 2.4.0
    - lightning-utilities: 0.11.7
    - markupsafe: 3.0.1
    - more-itertools: 10.3.0
    - mpmath: 1.3.0
    - multidict: 6.1.0
    - networkx: 3.3
    - numpy: 2.1.2
    - nvidia-cublas-cu12: 12.1.3.1
    - nvidia-cuda-cupti-cu12: 12.1.105
    - nvidia-cuda-nvrtc-cu12: 12.1.105
    - nvidia-cuda-runtime-cu12: 12.1.105
    - nvidia-cudnn-cu12: 9.1.0.70
    - nvidia-cufft-cu12: 11.0.2.54
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-nccl-cu12: 2.20.5
    - nvidia-nvjitlink-cu12: 12.6.77
    - nvidia-nvtx-cu12: 12.1.105
    - packaging: 24.1
    - pillow: 10.4.0
    - pip: 24.2
    - platformdirs: 4.2.2
    - propcache: 0.2.0
    - pytorch-lightning: 2.4.0
    - pyyaml: 6.0.2
    - setuptools: 75.1.0
    - sympy: 1.13.3
    - tomli: 2.0.1
    - torch: 2.4.1
    - torchmetrics: 1.4.2
    - torchvision: 0.19.1
    - tqdm: 4.66.5
    - triton: 3.0.0
    - typeguard: 4.3.0
    - typing-extensions: 4.12.2
    - wheel: 0.44.0
    - yarl: 1.14.0
    - zipp: 3.19.2
* System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.10.0
    - release: 5.15.0-1063-nvidia
    - version: #64-Ubuntu SMP Fri Aug 9 17:13:45 UTC 2024
More info
No response