more details/analysis at https://pytorch-lightning.slack.com/archives/CRBLFHY79/p1604523271185900
I think it is a similar problem to the one we face with the Tuner...
@Borda I believe the issue is that optimizer_step (and the optimizer in general) still gets called when training_step returned None - the check that skips the optimizer (https://github.com/PyTorchLightning/pytorch-lightning/blob/e81707ba0242f12f47d742e86a982f529a7ae65b/pytorch_lightning/trainer/training_loop.py#L716) only runs after optimizer_step has already been called.
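For reference, a minimal sketch of the skipping pattern that hits this (illustrative two-optimizer setup, not the reporter's actual code; should_skip_disc and compute_loss are placeholder helpers):

```python
# inside a LightningModule configured with two optimizers
def training_step(self, batch, batch_idx, optimizer_idx):
    if optimizer_idx == 1 and self.should_skip_disc(batch_idx):  # placeholder condition
        return None  # returning None is supposed to skip this optimizer's step entirely
    return self.compute_loss(batch, optimizer_idx)  # placeholder loss helper
```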
please @ me if you find a solution, I probably need to hotfix it to resume my research.
is this the fix? I have little time to debug, so sorry for making wild guesses:
```python
# LightningModule
def optimizer_step(
    self,
    epoch: int,
    batch_idx: int,
    optimizer: Optimizer,
    optimizer_idx: int,
    optimizer_closure: Optional[Callable],
    on_tpu: bool,
    using_native_amp: bool,
    using_lbfgs: bool,
) -> None:
    if on_tpu:
        xm.optimizer_step(optimizer, optimizer_args={'closure': optimizer_closure})
    elif self.trainer.amp_backend == AMPType.NATIVE:
        # native amp does not yet support closures.
        # TODO: pass the closure to the step ASAP
        # FIX --------------------------------------------------
        result = optimizer_closure()
        if result is not None:  # <--- I added this line
            self.trainer.scaler.step(optimizer)
        # FIX --------------------------------------------------
    elif self.trainer.amp_backend == AMPType.APEX:
        # apex amp does not yet support closures.
        # TODO: pass the closure to the step ASAP
        optimizer_closure()
        optimizer.step()
    else:
        optimizer.step(closure=optimizer_closure)
```
@quinor if my fix works, you can hotfix this right now by overriding optimizer_step in your LightningModule and putting the None check there (sketch below).
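A sketch of what that override could look like for the native-AMP-on-GPU path only (hypothetical module name; if you use TPU or Apex you would need to copy the corresponding branches from the snippet above as well):

```python
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                       optimizer_closure, on_tpu, using_native_amp, using_lbfgs):
        # native AMP on GPU only: run the closure and only step if it returned a loss
        result = optimizer_closure()
        if result is not None:
            self.trainer.scaler.step(optimizer)
```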
@awaelchli the fix almost works: after applying it, a similar issue happens at https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/accelerators/accelerator.py#L134 (stacktrace below), and I don't see a way of knowing what the closure returned there. I applied a VERY dirty hotfix for myself, so I'm just waiting for someone to fix it properly.
```
Traceback (most recent call last):
  File "./train_stage1.py", line 365, in <module>
    trainer.fit(model)
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 68, in train_or_test
    results = self.trainer.train()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 485, in train
    self.train_loop.run_training_epoch()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 544, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 713, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 453, in optimizer_step
    optimizer, batch_idx, opt_idx, train_step_and_backward_closure
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 127, in optimizer_step
    self.trainer.scaler.update()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/torch/cuda/amp/grad_scaler.py", line 366, in update
    assert len(found_infs) > 0, "No inf checks were recorded prior to update."
AssertionError: No inf checks were recorded prior to update.
```
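For context, the assertion comes from the GradScaler contract: update() may only run after step() has recorded inf checks in the same iteration, so a skipped optimizer has to skip both calls. A minimal sketch of that pairing in plain PyTorch (outside Lightning):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def run_step(optimizer, loss):
    if loss is None:
        # optimizer skipped: neither scaler.step() nor scaler.update() may run,
        # otherwise update() raises "No inf checks were recorded prior to update."
        return
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # records the inf checks that update() asserts on
    scaler.update()
```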
Closing this. See the comments in "Native AMP effectively broken when rewriting the optimizer_step function" #4572
@edenlightning It doesn't feel to me like this issue is resolved, why are you closing it? Is the recommended solution (for now or in general) to use the manual optimization route when those specific conditions are met? It still isn't very clear to me how to train things like GANs with AMP in Lightning now.
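For what it's worth, a hedged sketch of the manual optimization route for a GAN (API as in recent Lightning releases, so double-check against your version; the loss helpers are placeholders):

```python
import pytorch_lightning as pl

class GAN(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # take over optimizer handling ourselves

    def training_step(self, batch, batch_idx):
        opt_g, opt_d = self.optimizers()

        # discriminator step
        d_loss = self.discriminator_loss(batch)  # placeholder helper
        opt_d.zero_grad()
        self.manual_backward(d_loss)  # lets Lightning handle AMP scaling
        opt_d.step()

        # "skipping" the generator is just an if; no None-return protocol is needed
        if batch_idx % 2 == 0:
            g_loss = self.generator_loss(batch)  # placeholder helper
            opt_g.zero_grad()
            self.manual_backward(g_loss)
            opt_g.step()
```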
(For future reference, the reproduction is here: https://colab.research.google.com/drive/1Fi-r5twOjag0YKrIy2SxqMAAMf3zX5xj#scrollTo=Flyi--SpvsJN)
Hi @quinor
I'll try to debug this soon, hopefully we can find a fix for it. I am not an expert on AMP internals but I think it should work.
@edenlightning can you re-open this?
@carmocca is this still happening in 1.1?
@edenlightning still there unfortunately, checked the reproduction against Lightning 1.1 (the Colab linked above).
@edenlightning yes, unfortunately I can't debug it this week.
Can we check if this is still the case on master?
I just ran the notebook with current master (pip install git+https://github.com/PytorchLightning/pytorch-lightning.git@release/1.2-dev --upgrade) and the issue is still there.
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
I can try to send out a fix for this.
Awesome @yifuwang! Go for it.
🐛 Bug

When running my Lightning code with native AMP enabled and a training_step that returns None for one of the optimizers, I'm getting the following stacktrace:
To Reproduce

(I'm hoping those are all the conditions that have to be met.) Run a Lightning model with native AMP enabled and a training_step that returns None for one of the optimizers.

Expected behavior

The code should skip this optimizer.
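A minimal illustrative setup for the conditions above (not my actual model; assumes two optimizers and native AMP via precision=16, using the Trainer arguments of the Lightning versions discussed in this thread):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class TwoOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Linear(4, 4)
        self.b = torch.nn.Linear(4, 4)

    def training_step(self, batch, batch_idx, optimizer_idx):
        (x,) = batch
        if optimizer_idx == 1:
            return None  # skip the second optimizer; this is what triggers the crash
        return self.a(x).sum()

    def configure_optimizers(self):
        return [torch.optim.SGD(self.a.parameters(), lr=0.1),
                torch.optim.SGD(self.b.parameters(), lr=0.1)]

model = TwoOptModel()
train_dl = DataLoader(TensorDataset(torch.randn(8, 4)), batch_size=4)
trainer = pl.Trainer(gpus=1, precision=16, max_steps=2)
trainer.fit(model, train_dl)
```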
Environment
Additional context