Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Error when disabling an optimizer with native AMP turned on #4524

Closed · quinor closed this issue 3 years ago

quinor commented 3 years ago

πŸ› Bug

When running my Lightning code with native AMP enabled (precision=16), multiple optimizers, and a training_step that returns None to skip one of the optimizers, I'm getting the following stacktrace:

Traceback (most recent call last):
  File "./train_stage1.py", line 353, in <module>
    trainer.fit(model)
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 68, in train_or_test
    results = self.trainer.train()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 485, in train
    self.train_loop.run_training_epoch()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 544, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 713, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 453, in optimizer_step
    optimizer, batch_idx, opt_idx, train_step_and_backward_closure
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 122, in optimizer_step
    using_lbfgs=is_lbfgs
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1209, in optimizer_step
    self.trainer.scaler.step(optimizer)
  File "/home/wj359634/venv/lib64/python3.6/site-packages/torch/cuda/amp/grad_scaler.py", line 318, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1086, in __del__
  File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1293, in close
  File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1471, in display
  File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1089, in __repr__
  File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1433, in format_dict
TypeError: 'NoneType' object is not iterable

To Reproduce

Run a Lightning model with native AMP enabled (precision=16), multiple optimizers, and a training_step that returns None for one of the optimizers (I'm hoping those are all the conditions that have to be met).
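
A minimal sketch of such a setup (a reconstruction based on the discussion below, not the original snippet; the model, layer sizes, and dummy data are made up):

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TwoOptimizerModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx, optimizer_idx):
        if optimizer_idx == 1:
            return None  # "disable" the second optimizer for this batch
        x = batch[0]
        return self.layer(x).sum()

    def configure_optimizers(self):
        # two optimizers, e.g. as in a GAN-style setup
        return [torch.optim.SGD(self.parameters(), lr=0.1),
                torch.optim.SGD(self.parameters(), lr=0.1)]


train_loader = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)
trainer = pl.Trainer(gpus=1, precision=16, max_epochs=1)  # native AMP on GPU
trainer.fit(TwoOptimizerModel(), train_loader)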

Expected behavior

The code should skip the step for this optimizer instead of crashing.

Environment

* CUDA:
        - GPU:
                - Tesla V100-PCIE-32GB
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.18.4
        - pyTorch_debug:     True
        - pyTorch_version:   1.7.0
        - pytorch-lightning: 1.0.4
        - tqdm:              4.46.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.6.8
        - version:           #1 SMP Tue Aug 25 17:23:54 UTC 2020

Additional context

quinor commented 3 years ago

more details/analysis at https://pytorch-lightning.slack.com/archives/CRBLFHY79/p1604523271185900

Borda commented 3 years ago

I think it is a problem similar to the one we face with the Tuner...

quinor commented 3 years ago

@Borda I believe the issue is with

https://github.com/PyTorchLightning/pytorch-lightning/blob/e81707ba0242f12f47d742e86a982f529a7ae65b/pytorch_lightning/core/lightning.py#L1229

being called (and the optimizer being stepped at all) when training_step returned None - the check that skips the optimizer (https://github.com/PyTorchLightning/pytorch-lightning/blob/e81707ba0242f12f47d742e86a982f529a7ae65b/pytorch_lightning/trainer/training_loop.py#L716) only happens after optimizer_step has already been called.
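
For reference, the underlying GradScaler assertion can be triggered outside Lightning as well (a standalone sketch, assuming a CUDA device is available):

import torch

model = torch.nn.Linear(2, 2).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

# no scaler.scale(loss).backward() ran for this optimizer, so all grads are None
# and no inf checks were recorded for it
scaler.step(opt)  # AssertionError: No inf checks were recorded for this optimizer.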

Please @ me if you find a solution; I probably need to hotfix it to resume my research.

awaelchli commented 3 years ago

Is this the fix? I have little time to debug, so sorry for making wild guesses:


# LightningModule

   def optimizer_step(
        self,
        epoch: int,
        batch_idx: int,
        optimizer: Optimizer,
        optimizer_idx: int,
        optimizer_closure: Optional[Callable],
        on_tpu: bool,
        using_native_amp: bool,
        using_lbfgs: bool,
    ) -> None:
        if on_tpu:
            xm.optimizer_step(optimizer, optimizer_args={'closure': optimizer_closure})
        elif self.trainer.amp_backend == AMPType.NATIVE:
            # native amp does not yet support closures.
            # TODO: pass the closure to the step ASAP

# FIX --------------------------------------------------

            result = optimizer_closure()
            if result is not None:                        # <--- I added this line
                self.trainer.scaler.step(optimizer)

# FIX --------------------------------------------------

        elif self.trainer.amp_backend == AMPType.APEX:
            # apex amp does not yet support closures.
            # TODO: pass the closure to the step ASAP
            optimizer_closure()
            optimizer.step()
        else:
            optimizer.step(closure=optimizer_closure)

awaelchli commented 3 years ago

@quinor if my fix works, override optimizer_step in your LightningModule and put the None check there if you need this hotfix right now.
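
A rough sketch of what that override could look like in the user's own LightningModule (untested; assumes native AMP on a single GPU and the Lightning 1.0.x optimizer_step signature shown above):

import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    # ... training_step / configure_optimizers as before ...

    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                       optimizer_closure, on_tpu, using_native_amp, using_lbfgs):
        if using_native_amp:
            # run training_step + backward through the closure first
            result = optimizer_closure()
            # only step the scaler if training_step did not return None
            if result is not None:
                self.trainer.scaler.step(optimizer)
        else:
            # default behaviour for the non-AMP, non-TPU case
            optimizer.step(closure=optimizer_closure)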

quinor commented 3 years ago

@awaelchli the fix almost works: after applying it, a similar issue happens at https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/accelerators/accelerator.py#L134 (stacktrace below), and I don't see a way of knowing what the closure returned there. I applied a VERY dirty hotfix for myself, so I'm just waiting for someone to fix it properly.

Traceback (most recent call last):
  File "./train_stage1.py", line 365, in <module>
    trainer.fit(model)
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 68, in train_or_test
    results = self.trainer.train()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 485, in train
    self.train_loop.run_training_epoch()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 544, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 713, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 453, in optimizer_step
    optimizer, batch_idx, opt_idx, train_step_and_backward_closure
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 127, in optimizer_step
    self.trainer.scaler.update()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/torch/cuda/amp/grad_scaler.py", line 366, in update
    assert len(found_infs) > 0, "No inf checks were recorded prior to update."
AssertionError: No inf checks were recorded prior to update.

edenlightning commented 3 years ago

Closing this. See comments in Native AMP effectively broken when rewriting the optimizer_step function #4572

quinor commented 3 years ago

@edenlightning It doesn't feel to me like this issue is resolved; why are you closing it? Is the recommended solution (for now or in general) to use the manual optimization route when those specific conditions are met? It still isn't very clear to me how to train things like GANs with AMP in Lightning now.
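
For reference, the manual optimization route mentioned above would look roughly like this (a sketch only; the manual optimization API changed between Lightning releases, and the generator/discriminator modules and losses here are stand-ins):

import torch
import pytorch_lightning as pl


class GAN(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False          # opt into manual optimization
        self.generator = torch.nn.Linear(16, 32)     # stand-in submodules
        self.discriminator = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        opt_g, opt_d = self.optimizers()
        x = batch[0]

        # discriminator step on every batch
        d_loss = self.discriminator(self.generator(x)).mean()
        opt_d.zero_grad()
        self.manual_backward(d_loss)   # Lightning handles AMP scaling here
        opt_d.step()

        # generator step only on every second batch, i.e. freely "disabled"
        if batch_idx % 2 == 0:
            g_loss = -self.discriminator(self.generator(x)).mean()
            opt_g.zero_grad()
            self.manual_backward(g_loss)
            opt_g.step()

    def configure_optimizers(self):
        return [torch.optim.Adam(self.generator.parameters(), lr=2e-4),
                torch.optim.Adam(self.discriminator.parameters(), lr=2e-4)]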

quinor commented 3 years ago

(For future reference, the reproduction is here: https://colab.research.google.com/drive/1Fi-r5twOjag0YKrIy2SxqMAAMf3zX5xj#scrollTo=Flyi--SpvsJN)

carmocca commented 3 years ago

Hi @quinor

I'll try to debug this soon; hopefully we can find a fix for it. I am not an expert on AMP internals, but I think it should work.

@edenlightning can you re-open this?

edenlightning commented 3 years ago

@carmocca is this still happening in 1.1?

quinor commented 3 years ago

@edenlightning Still there unfortunately; I checked the reproduction against Lightning 1.1 (the Colab linked above).

carmocca commented 3 years ago

@edenlightning yes, unfortunately I can't debug it this week.

edenlightning commented 3 years ago

Can we check if this is still the case on master?

quinor commented 3 years ago

I just ran the notebook with current master (pip install git+https://github.com/PytorchLightning/pytorch-lightning.git@release/1.2-dev --upgrade) and the issue is still there.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

yifuwang commented 3 years ago

I can try to send out a fix for this.

kaushikb11 commented 3 years ago

Awesome @yifuwang! Go for it.