Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Backward error when using mixed precision in manual optimization #17949

Closed · kuviki closed 12 months ago

kuviki commented 1 year ago

Bug description

After disabling automatic optimization, the Trainer behaves inconsistently between precision='32' and precision='16-mixed': the same training_step runs fine with precision='32', but manual_backward raises a RuntimeError with precision='16-mixed'.

What version are you seeing the problem on?

v2.0

How to reproduce the bug

import lightning.pytorch as pl
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, x):
        return self.l1(x)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def forward(self, x):
        return self.l1(x)

class LitAutoEncoder(pl.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.automatic_optimization = False
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()

        x, y = batch
        x = x.view(x.size(0), -1)
        with torch.no_grad():
            z = self.encoder(x)
            x_target = self.decoder(z)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x_target)

        self.manual_backward(loss)
        print([(n, p.grad is not None) for n, p in self.named_parameters()])
        opt.step()

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

dataset = MNIST('data', download=True, transform=transforms.ToTensor())
train_loader = DataLoader(dataset)

# model
autoencoder = LitAutoEncoder(Encoder(), Decoder())

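# first run: full precision ('32'); manual_backward succeeds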
trainer = pl.Trainer(
    logger=False,
    enable_progress_bar=False,
    accelerator='gpu',
    precision='32',
    max_steps=1,
    enable_model_summary=False,
    enable_checkpointing=False,
)
trainer.fit(model=autoencoder, train_dataloaders=train_loader)

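# second run: mixed precision ('16-mixed'); manual_backward raises the RuntimeError below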
trainer = pl.Trainer(
    logger=False,
    enable_progress_bar=False,
    accelerator='gpu',
    precision='16-mixed',
    max_steps=1,
    enable_model_summary=False,
    enable_checkpointing=False,
)
trainer.fit(model=autoencoder, train_dataloaders=train_loader)

Error messages and logs

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:432: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
[('encoder.l1.0.weight', True), ('encoder.l1.0.bias', True), ('encoder.l1.2.weight', True), ('encoder.l1.2.bias', True), ('decoder.l1.0.weight', True), ('decoder.l1.0.bias', True), ('decoder.l1.2.weight', True), ('decoder.l1.2.bias', True)]
`Trainer.fit` stopped: `max_steps=1` reached.
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Traceback (most recent call last):
  File "/mnt/c/Users/kuviki/PycharmProjects/thinking-ml/testbed.py", line 83, in <module>
    trainer.fit(model=autoencoder, train_dataloaders=train_loader)
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 531, in fit
    call._call_and_handle_interrupt(
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 570, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 975, in _run
    results = self._run_stage()
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1018, in _run_stage
    self.fit_loop.run()
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 201, in run
    self.advance()
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 354, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 220, in advance
    batch_output = self.manual_optimization.run(kwargs)
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/manual.py", line 90, in run
    self.advance(kwargs)
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/manual.py", line 109, in advance
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 287, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 367, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/mnt/c/Users/kuviki/PycharmProjects/thinking-ml/testbed.py", line 48, in training_step
    self.manual_backward(loss)
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 1028, in manual_backward
    self.trainer.strategy.backward(loss, None, *args, **kwargs)
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 200, in backward
    self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, *args, **kwargs)
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/precision_plugin.py", line 67, in backward
    model.backward(tensor, *args, **kwargs)
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 1046, in backward
    loss.backward(*args, **kwargs)
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/kuviki/miniconda3/envs/pytorch/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Environment

Current environment

* CUDA:
  - GPU: NVIDIA GeForce RTX 3090
  - available: True
  - version: 11.8
* Lightning:
  - flash-pytorch: 0.1.7
  - lightning: 2.0.4
  - lightning-cloud: 0.5.37
  - lightning-utilities: 0.8.0
  - lion-pytorch: 0.0.7
  - pytorch-lightning: 2.0.4
  - reformer-pytorch: 1.4.4
  - rotary-embedding-torch: 0.2.1
  - torch: 2.0.1
  - torchaudio: 2.0.2
  - torchmetrics: 0.11.4
  - torchvision: 0.15.2
* Packages: absl-py: 1.4.0 - aiohttp: 3.8.4 - aiosignal: 1.3.1 - antialiased-cnns: 0.3 - anyio: 3.7.0 - appdirs: 1.4.4 - arrow: 1.2.3 - async-timeout: 4.0.2 - attrs: 22.2.0 - axial-positional-embedding: 0.2.1 - beautifulsoup4: 4.12.2 - blessed: 1.20.0 - brotlipy: 0.7.0 - cachetools: 5.3.0 - certifi: 2022.12.7 - cffi: 1.15.1 - charset-normalizer: 2.0.4 - click: 8.1.3 - contourpy: 1.0.7 - croniter: 1.3.15 - cryptography: 39.0.1 - cycler: 0.11.0 - dateutils: 0.6.12 - deepdiff: 6.3.0 - einops: 0.6.0 - exceptiongroup: 1.1.1 - fastapi: 0.98.0 - filelock: 3.9.0 - flash-attn: 1.0.2 - flash-pytorch: 0.1.7 - flit-core: 3.6.0 - fonttools: 4.39.0 - frozenlist: 1.3.3 - fsspec: 2023.3.0 - gmpy2: 2.1.2 - google-auth: 2.16.2 - google-auth-oauthlib: 0.4.6 - grpcio: 1.51.3 - h11: 0.14.0 - huggingface-hub: 0.13.2 - idna: 3.4 - inquirer: 3.1.3 - itsdangerous: 2.1.2 - jinja2: 3.1.2 - kiwisolver: 1.4.4 - lightning: 2.0.4 - lightning-cloud: 0.5.37 - lightning-utilities: 0.8.0 - lion-pytorch: 0.0.7 - local-attention: 1.8.4 - markdown: 3.4.1 - markdown-it-py: 3.0.0 - markupsafe: 2.1.1 - matplotlib: 3.7.1 - mdurl: 0.1.2 - mkl-fft: 1.3.1 - mkl-random: 1.2.2 - mkl-service: 2.4.0 - mpmath: 1.2.1 - multidict: 6.0.4 - networkx: 2.8.4 - numpy: 1.23.5 - oauthlib: 3.2.2 - ordered-set: 4.1.0 - packaging: 23.0 - pillow: 9.5.0 - pip: 23.0.1 - pooch: 1.4.0 - product-key-memory: 0.1.10 - protobuf: 4.22.1 - psutil: 5.9.5 - pyasn1: 0.4.8 - pyasn1-modules: 0.2.8 - pycparser: 2.21 - pydantic: 1.10.9 - pygments: 2.15.1 - pyjwt: 2.7.0 - pyopenssl: 23.0.0 - pyparsing: 3.0.9 - pysocks: 1.7.1 - python-dateutil: 2.8.2 - python-editor: 1.0.4 - python-multipart: 0.0.6 - pytorch-lightning: 2.0.4 - pytz: 2023.3 - pyyaml: 6.0 - readchar: 4.0.5 - reformer-pytorch: 1.4.4 - requests: 2.28.1 - requests-oauthlib: 1.3.1 - rich: 13.4.2 - rotary-embedding-torch: 0.2.1 - rsa: 4.9 - safetensors: 0.3.0 - scipy: 1.10.1 - setuptools: 65.6.3 - six: 1.16.0 - sniffio: 1.3.0 - soupsieve: 2.4.1 - starlette: 0.27.0 - starsessions: 1.3.0 - sympy: 1.11.1 - tensorboard: 2.12.0 - tensorboard-data-server: 0.7.0 - tensorboard-plugin-wit: 1.8.1 - timm: 0.6.13 - torch: 2.0.1 - torchaudio: 2.0.2 - torchmetrics: 0.11.4 - torchvision: 0.15.2 - tornado: 6.2 - tqdm: 4.65.0 - traitlets: 5.9.0 - triton: 2.0.0 - typing-extensions: 4.4.0 - urllib3: 1.26.14 - uvicorn: 0.22.0 - wcwidth: 0.2.6 - websocket-client: 1.6.1 - websockets: 11.0.3 - werkzeug: 2.2.3 - wheel: 0.38.4 - yarl: 1.8.2
* System:
  - OS: Linux
  - architecture: 64bit, ELF
  - processor: x86_64
  - python: 3.10.9
  - release: 5.15.90.1-microsoft-standard-WSL2
  - version: #1 SMP Fri Jan 27 02:56:13 UTC 2023

More info

No response

rjarun8 commented 1 year ago

The issue is caused by the torch.no_grad() context in the training_step method. Inside that context, PyTorch does not record operations for gradient computation. For pure model evaluation this would be fine, but here a backward pass then tries to compute gradients through parameters whose operations were never tracked because of the torch.no_grad().

The fix involved removing the torch.no_grad() context from the training_step method. Here's the corrected method:


def training_step(self, batch, batch_idx):
    opt = self.optimizers()
    opt.zero_grad()

    x, y = batch
    x = x.view(x.size(0), -1)
    z = self.encoder(x)
    x_hat = self.decoder(z)
    loss = F.mse_loss(x_hat, x)

    self.manual_backward(loss)
    print([(n, p.grad is not None) for n, p in self.named_parameters()])
    opt.step()

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=1` reached.
INFO:pytorch_lightning.utilities.rank_zero:Using 16bit Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=1` reached.
[('encoder.l1.0.weight', True), ('encoder.l1.0.bias', True), ('encoder.l1.2.weight', True), ('encoder.l1.2.bias', True), ('decoder.l1.0.weight', True), ('decoder.l1.0.bias', True), ('decoder.l1.2.weight', True), ('decoder.l1.2.bias', True)]
[('encoder.l1.0.weight', True), ('encoder.l1.0.bias', True), ('encoder.l1.2.weight', True), ('encoder.l1.2.bias', True), ('decoder.l1.0.weight', True), ('decoder.l1.0.bias', True), ('decoder.l1.2.weight', True), ('decoder.l1.2.bias', True)]
kuviki commented 1 year ago

Thank you for your response.

I appreciate your explanation regarding the issue with torch.no_grad(). However, in my actual project, the situation is more complex, and simply removing torch.no_grad() is not a feasible solution.

Currently, I have implemented a workaround by creating a copy of the model and running it within the torch.no_grad() context. After each step, I synchronize the parameters.
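
Roughly, the workaround looks like this in the toy example above (an illustrative sketch with made-up attribute names, reusing the imports from the repro; not my actual project code):

import copy

class LitAutoEncoder(pl.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.automatic_optimization = False
        self.encoder = encoder
        self.decoder = decoder
        # frozen copies, used only for the no_grad forward pass
        self.target_encoder = copy.deepcopy(encoder).requires_grad_(False)
        self.target_decoder = copy.deepcopy(decoder).requires_grad_(False)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()

        x, y = batch
        x = x.view(x.size(0), -1)

        # the copies compute the target, so the trained parameters are never
        # cast and cached inside no_grad
        with torch.no_grad():
            x_target = self.target_decoder(self.target_encoder(x))

        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x_target)

        self.manual_backward(loss)
        opt.step()

        # after each step, synchronize the copies with the updated parameters
        self.target_encoder.load_state_dict(self.encoder.state_dict())
        self.target_decoder.load_state_dict(self.decoder.state_dict())

(The frozen copies also show up in self.parameters(); their grads stay None, so the optimizer simply skips them, but they could also be filtered out in configure_optimizers.)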

In my specific project, this bug manifests itself not as an exception or error message, but rather as certain parameters in the model having None gradients. I was able to identify this issue only after careful investigation.

rjarun8 commented 1 year ago

OK, interesting. So the workaround uses two versions of the model: the original, which is used for both forward and backward propagation, and a copy, which is used solely for forward propagation with gradient tracking disabled (using no_grad()). In this setup, the copy performs the forward pass without tracking gradients, which can improve computational efficiency. After this forward pass, the parameters of the copied model are synchronized with the original model. Then the original model performs backward propagation. This approach circumvents the issue of encountering None gradients during the backward pass.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

awaelchli commented 12 months ago

This is an old PyTorch quirk and unrelated to Lightning. Here is a good answer by the PyTorch devs: https://discuss.pytorch.org/t/autocast-and-torch-no-grad-unexpected-behaviour/93475/3
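
The gist of that answer: inside an autocast region the half-precision copies of the weights are cached, and if the first cast happens under torch.no_grad() the cached copies carry no autograd history, so a later forward reuses them and the output has no grad_fn. A minimal plain-PyTorch sketch of the quirk (no Lightning involved, assuming a CUDA device):

import torch
from torch import nn

layer = nn.Linear(4, 4).cuda()
x = torch.randn(2, 4, device="cuda")

with torch.autocast("cuda"):
    with torch.no_grad():
        _ = layer(x)  # fp16 copy of the weight is created and cached without grad history
    out = layer(x)    # the cached, grad-less fp16 copy is reused

# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
out.sum().backward()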

I took your example and verified the workaround given there. I disabled autocast for the first forward pass and enabled it again for the second:

def training_step(self, batch, batch_idx):
    opt = self.optimizers()
    opt.zero_grad()

    x, y = batch
    x = x.view(x.size(0), -1)

    with torch.autocast("cuda", enabled=False):  # <--------- HERE False
        with torch.no_grad():
            z = self.encoder(x)
            x_target = self.decoder(z)

    with torch.autocast("cuda", enabled=True):  # <--------- HERE True
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x_target)

    self.manual_backward(loss)
    print([(n, p.grad is not None) for n, p in self.named_parameters()])
    opt.step()

Based on this, I'm closing the issue. I hope the answer doesn't come too late. Cheers!