Closed: kuviki closed this issue 12 months ago.
The issue was caused by the use of a `torch.no_grad()` context in the `training_step` method. This context tells PyTorch not to record any operations for gradient computation, which is fine for pure model evaluation. In this code, however, there is a backward pass that tries to compute gradients for parameters whose operations were never recorded because of the `torch.no_grad()` context.

The fix is to remove the `torch.no_grad()` context from the `training_step` method. Here's the corrected method:
```python
def training_step(self, batch, batch_idx):
    opt = self.optimizers()
    opt.zero_grad()
    x, y = batch
    x = x.view(x.size(0), -1)
    z = self.encoder(x)
    x_hat = self.decoder(z)
    loss = F.mse_loss(x_hat, x)
    self.manual_backward(loss)
    print([(n, p.grad is not None) for n, p in self.named_parameters()])
    opt.step()
```
```
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=1` reached.
INFO:pytorch_lightning.utilities.rank_zero:Using 16bit Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=1` reached.
[('encoder.l1.0.weight', True), ('encoder.l1.0.bias', True), ('encoder.l1.2.weight', True), ('encoder.l1.2.bias', True), ('decoder.l1.0.weight', True), ('decoder.l1.0.bias', True), ('decoder.l1.2.weight', True), ('decoder.l1.2.bias', True)]
[('encoder.l1.0.weight', True), ('encoder.l1.0.bias', True), ('encoder.l1.2.weight', True), ('encoder.l1.2.bias', True), ('decoder.l1.0.weight', True), ('decoder.l1.0.bias', True), ('decoder.l1.2.weight', True), ('decoder.l1.2.bias', True)]
```
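For reference, the underlying behaviour is plain PyTorch rather than anything Lightning-specific. A minimal illustration (not the original code from this issue) of why a backward pass cannot use operations recorded under `torch.no_grad()`:

```python
import torch

lin = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)

with torch.no_grad():
    out = lin(x)  # operations inside no_grad are not recorded in the autograd graph

print(out.requires_grad)  # False: there is nothing to backpropagate through
# Calling out.sum().backward() here would raise
# "element 0 of tensors does not require grad and does not have a grad_fn".
```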
Thank you for your response.

I appreciate your explanation regarding the issue with `torch.no_grad()`. However, in my actual project the situation is more complex, and simply removing `torch.no_grad()` is not a feasible solution.

Currently, I have implemented a workaround by creating a copy of the model and running it within the `torch.no_grad()` context. After each step, I synchronize the parameters.

In my specific project, this bug manifests itself not as an exception or error message, but rather as certain parameters in the model having `None` gradients. I was able to identify this issue only after careful investigation.
OK, interesting! So the workaround involves using two versions of the model: the original model, which is used for both the forward and backward passes, and a copy of the model, which is used solely for forward passes with gradient tracking disabled (using `no_grad()`). In this setup, the copy performs the forward pass without tracking gradients, which can improve computational efficiency. After this forward pass, the parameters of the copied model are synchronized with the original model, and the original model then performs the backward pass. This approach circumvents the issue of encountering `None` gradients during the backward pass.
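For illustration, here is a minimal sketch of that two-model pattern in plain PyTorch. The model, optimizer, and `step` function are hypothetical stand-ins, not the reporter's actual code: a frozen copy runs the gradient-free forward pass, the original model runs the tracked forward and backward passes, and the copy is re-synchronized after each optimizer step.

```python
import copy
import torch
import torch.nn.functional as F

# Illustrative model; the real project's architecture is not shown in this issue.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 8))
frozen = copy.deepcopy(model)  # copy used only for gradient-free forward passes
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def step(x):
    opt.zero_grad()
    with torch.no_grad():
        target = frozen(x)        # forward pass without gradient tracking
    pred = model(x)               # forward pass that builds the autograd graph
    loss = F.mse_loss(pred, target)
    loss.backward()               # gradients flow only through `model`
    opt.step()
    # keep the frozen copy in sync with the freshly updated parameters
    frozen.load_state_dict(model.state_dict())
    return loss.item()

print(step(torch.randn(4, 8)))
```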
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
This is an old PyTorch quirk and unrelated to Lightning. Here is a good answer by the PyTorch devs: https://discuss.pytorch.org/t/autocast-and-torch-no-grad-unexpected-behaviour/93475/3
I took your example and verified the workaround given there: I disabled torch autocast for the first forward pass and re-enabled it for the second:
```python
def training_step(self, batch, batch_idx):
    opt = self.optimizers()
    opt.zero_grad()
    x, y = batch
    x = x.view(x.size(0), -1)

    with torch.autocast("cuda", enabled=False):  # <--------- HERE False
        with torch.no_grad():
            z = self.encoder(x)
            x_target = self.decoder(z)

    with torch.autocast("cuda", enabled=True):  # <--------- HERE True
        z = self.encoder(x)
        x_hat = self.decoder(z)

    loss = F.mse_loss(x_hat, x_target)
    self.manual_backward(loss)
    print([(n, p.grad is not None) for n, p in self.named_parameters()])
    opt.step()
```
Based on this, I'm closing the issue. I hope the answer doesn't come too late. Cheers!
Bug description

After disabling automatic optimization, the Trainer behaves inconsistently between `precision='32'` and `precision='16-mixed'`.

What version are you seeing the problem on?

v2.0
How to reproduce the bug
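The original reproduction script is not included in the report. Based on the discussion and the corrected `training_step` shown earlier in the thread, a minimal sketch of the failing setup would look roughly like this; the encoder/decoder sizes, the random dataset, and the Trainer flags are illustrative assumptions rather than the reporter's actual code:

```python
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # manual optimization, as in the report
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()
        x, y = batch
        x = x.view(x.size(0), -1)
        # gradient-free forward pass to build a target; with precision='16-mixed'
        # some parameters end up with None grads, with precision='32' they do not
        with torch.no_grad():
            z = self.encoder(x)
            x_target = self.decoder(z)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x_target)
        self.manual_backward(loss)
        print([(n, p.grad is not None) for n, p in self.named_parameters()])
        opt.step()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

data = TensorDataset(torch.randn(32, 1, 28, 28), torch.zeros(32, dtype=torch.long))
loader = DataLoader(data, batch_size=8)
for precision in ("32", "16-mixed"):
    trainer = pl.Trainer(accelerator="gpu", devices=1, max_steps=1, precision=precision)
    trainer.fit(LitAutoEncoder(), loader)
```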
Error messages and logs
Environment
Current environment
* CUDA:
  - GPU:
    - NVIDIA GeForce RTX 3090
  - available: True
  - version: 11.8
* Lightning:
  - flash-pytorch: 0.1.7
  - lightning: 2.0.4
  - lightning-cloud: 0.5.37
  - lightning-utilities: 0.8.0
  - lion-pytorch: 0.0.7
  - pytorch-lightning: 2.0.4
  - reformer-pytorch: 1.4.4
  - rotary-embedding-torch: 0.2.1
  - torch: 2.0.1
  - torchaudio: 2.0.2
  - torchmetrics: 0.11.4
  - torchvision: 0.15.2
* Packages:
  - absl-py: 1.4.0
  - aiohttp: 3.8.4
  - aiosignal: 1.3.1
  - antialiased-cnns: 0.3
  - anyio: 3.7.0
  - appdirs: 1.4.4
  - arrow: 1.2.3
  - async-timeout: 4.0.2
  - attrs: 22.2.0
  - axial-positional-embedding: 0.2.1
  - beautifulsoup4: 4.12.2
  - blessed: 1.20.0
  - brotlipy: 0.7.0
  - cachetools: 5.3.0
  - certifi: 2022.12.7
  - cffi: 1.15.1
  - charset-normalizer: 2.0.4
  - click: 8.1.3
  - contourpy: 1.0.7
  - croniter: 1.3.15
  - cryptography: 39.0.1
  - cycler: 0.11.0
  - dateutils: 0.6.12
  - deepdiff: 6.3.0
  - einops: 0.6.0
  - exceptiongroup: 1.1.1
  - fastapi: 0.98.0
  - filelock: 3.9.0
  - flash-attn: 1.0.2
  - flash-pytorch: 0.1.7
  - flit-core: 3.6.0
  - fonttools: 4.39.0
  - frozenlist: 1.3.3
  - fsspec: 2023.3.0
  - gmpy2: 2.1.2
  - google-auth: 2.16.2
  - google-auth-oauthlib: 0.4.6
  - grpcio: 1.51.3
  - h11: 0.14.0
  - huggingface-hub: 0.13.2
  - idna: 3.4
  - inquirer: 3.1.3
  - itsdangerous: 2.1.2
  - jinja2: 3.1.2
  - kiwisolver: 1.4.4
  - lightning: 2.0.4
  - lightning-cloud: 0.5.37
  - lightning-utilities: 0.8.0
  - lion-pytorch: 0.0.7
  - local-attention: 1.8.4
  - markdown: 3.4.1
  - markdown-it-py: 3.0.0
  - markupsafe: 2.1.1
  - matplotlib: 3.7.1
  - mdurl: 0.1.2
  - mkl-fft: 1.3.1
  - mkl-random: 1.2.2
  - mkl-service: 2.4.0
  - mpmath: 1.2.1
  - multidict: 6.0.4
  - networkx: 2.8.4
  - numpy: 1.23.5
  - oauthlib: 3.2.2
  - ordered-set: 4.1.0
  - packaging: 23.0
  - pillow: 9.5.0
  - pip: 23.0.1
  - pooch: 1.4.0
  - product-key-memory: 0.1.10
  - protobuf: 4.22.1
  - psutil: 5.9.5
  - pyasn1: 0.4.8
  - pyasn1-modules: 0.2.8
  - pycparser: 2.21
  - pydantic: 1.10.9
  - pygments: 2.15.1
  - pyjwt: 2.7.0
  - pyopenssl: 23.0.0
  - pyparsing: 3.0.9
  - pysocks: 1.7.1
  - python-dateutil: 2.8.2
  - python-editor: 1.0.4
  - python-multipart: 0.0.6
  - pytorch-lightning: 2.0.4
  - pytz: 2023.3
  - pyyaml: 6.0
  - readchar: 4.0.5
  - reformer-pytorch: 1.4.4
  - requests: 2.28.1
  - requests-oauthlib: 1.3.1
  - rich: 13.4.2
  - rotary-embedding-torch: 0.2.1
  - rsa: 4.9
  - safetensors: 0.3.0
  - scipy: 1.10.1
  - setuptools: 65.6.3
  - six: 1.16.0
  - sniffio: 1.3.0
  - soupsieve: 2.4.1
  - starlette: 0.27.0
  - starsessions: 1.3.0
  - sympy: 1.11.1
  - tensorboard: 2.12.0
  - tensorboard-data-server: 0.7.0
  - tensorboard-plugin-wit: 1.8.1
  - timm: 0.6.13
  - torch: 2.0.1
  - torchaudio: 2.0.2
  - torchmetrics: 0.11.4
  - torchvision: 0.15.2
  - tornado: 6.2
  - tqdm: 4.65.0
  - traitlets: 5.9.0
  - triton: 2.0.0
  - typing-extensions: 4.4.0
  - urllib3: 1.26.14
  - uvicorn: 0.22.0
  - wcwidth: 0.2.6
  - websocket-client: 1.6.1
  - websockets: 11.0.3
  - werkzeug: 2.2.3
  - wheel: 0.38.4
  - yarl: 1.8.2
* System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.10.9
  - release: 5.15.90.1-microsoft-standard-WSL2
  - version: #1 SMP Fri Jan 27 02:56:13 UTC 2023

More info
No response