manifoldhiker opened this issue 3 years ago
Hi! Thanks for your contribution, great first issue!
Looking at the warning message, it seems that this is a problem related to precision. As explained in the documentation, if 16-bit precision is used, optimization is automatically managed by PyTorch Lightning. Starting from PyTorch versions >= 1.1.0, the warning `Detected call of lr_scheduler.step() before optimizer.step()` is raised. I do not know how to follow the trace on Colab; when I figure it out, I will search for the origin of this call. It seems that for 16-bit precision, the order of the calls is different in this scheduler creation procedure.
I'm getting the same warning when `ddp_sharded` is turned on. My optimizer is defined similarly to `configure_optimizers_1`.
Same issue.
I am still getting the same issue as well.
@griff4692 @sanxing-chen Hi, thank you for your report. Which version are you using? Could you try with the latest version of pytorch-lightning? `pip install pytorch-lightning -U`
@akihironitta I have run the Colab from the beginning of the issue again and the warning problem is still there. This is the environment printed by the collecting script:
* CUDA:
- GPU:
- Tesla P100-PCIE-16GB
- available: True
- version: 10.1
* Packages:
- numpy: 1.19.5
- pyTorch_debug: False
- pyTorch_version: 1.8.1+cu101
- pytorch-lightning: 1.3.1
- tqdm: 4.41.1
* System:
- OS: Linux
- architecture:
- 64bit
-
- processor: x86_64
- python: 3.7.10
- version: #1 SMP Tue Apr 20 19:55:43 PDT 2021
@javierlorenzod Thanks a lot for your report! Let me look into it.
I am using pytorch-lightning==1.3.3, and the problem seems to exist there as well...
As another datapoint, I'm finding this issue with pytorch-lightning==1.3.4
Same issue here with pytorch-lightning==1.3.1.
Same issue with pytorch-lightning==1.4.1.
Any updates on this?
Hi @BttMA @aleSuglia The fix is still a work in progress in #9923.
This issue only happens when `Trainer(precision=16)` is used AND `lr_scheduler.step()` runs every few steps (not epochs), i.e.
```python
def configure_optimizers(self):
    optimizer = ...
    scheduler = {
        "scheduler": ...,
        "interval": "step",
        "frequency": 1,  # other small numbers may also cause this issue.
    }
    return {"optimizer": optimizer, "lr_scheduler": scheduler}
```
What's happening is that `scaler.step(optimizer)` (which gets called when using native amp) is likely to skip `optimizer.step()` for the first few iterations, and thus `lr_scheduler.step()` gets called before any call of `optimizer.step()`.
As a side note, you'll get the same behaviour in pure PyTorch, too, as reported in "`optimizer.step()` before `lr_scheduler.step()` error using GradScaler".
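To make the mechanism concrete, here is a minimal pure-PyTorch amp loop (the model, data and hyperparameters are made up for illustration, and it needs a CUDA device). `scaler.step(optimizer)` silently skips the real `optimizer.step()` whenever it finds inf/NaN gradients, so an unconditional `scheduler.step()` right after it can end up running before the first real optimizer step, which is exactly when PyTorch emits the warning:

```python
import torch

model = torch.nn.Linear(4, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 4, device="cuda")).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # may silently skip optimizer.step() on inf/NaN grads
    scaler.update()
    scheduler.step()        # if the step above was skipped, PyTorch warns here
```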
Hi @akihironitta, thanks for your mention/comment. I am not sure how I can adapt this to my code, but I can show you what my configure_optimizers looks like if it will help you fix the issue :)
```python
def configure_optimizers(self):
    optimizer = AdamW(self.parameters(), lr=self.lr, eps=self.eps)
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_training_steps=self.n_training_steps,
                                                num_warmup_steps=self.n_warmup_steps)
    return dict(lr_scheduler=dict(scheduler=scheduler, interval="step"),
                optimizer=optimizer)
```
@BttMA I'm sorry for the inconvenience. I'm not sure if there's a workaround for this issue at the moment... I'll try to have this issue resolved asap within this week and keep you updated.
@akihironitta do we have to switch them? I feel it's just a warning when the first few iterations get skipped, and the impact to overall accuracy is very minor. In fact, this is desired behavior, right?
@four4fish

> I feel it's just a warning when the first few iterations get skipped,

PyTorch raises the warning ONLY IF `scheduler.step()` is called before any call of `optimizer.step()`, but in practice, `optimizer.step()` can be skipped in later iterations, too, when amp is used.

> but the impact to overall accuracy is very minor. In fact, this is desired behavior, right?
This is a known but not desired behaviour for sure, because users would expect the following call order:
```python
optimizer.step()    # 1st call
scheduler.step()    # 1st call
# optimizer.step()  # skipped by the amp scaler
# scheduler.step()  # skipped because no optimizer.step() has been called in the iteration
optimizer.step()    # 2nd call
scheduler.step()    # 2nd call
...
```
but currently, when using Lightning with native amp, the order can be:
```python
optimizer.step()    # 1st call
scheduler.step()    # 1st call
# optimizer.step()  # skipped by the amp scaler
scheduler.step()    # 2nd call  # should be skipped
optimizer.step()    # 2nd call
scheduler.step()    # 3rd call
...
```
As you can see from the above example, it can lead to excessive calls of `scheduler.step()`.
I agree that its impact might not be significant, but it has to be fixed for reproducibility IMO.
My 2 cents: Users should never get a warning when they aren't doing anything wrong and/or there is no way for them to do something correctly. Specifically, unless this bug is fixed there is no way to run `CyclicLR` or `OneCycleLR` correctly without getting this warning.
Same issue here. I saw a workaround in the implementation of sentence-transformers or SBERT.
```python
[...]
scale_before_step = scaler.get_scale()
scaler.scale(loss_value).backward()
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_grad_norm)
scaler.step(optimizer)
scaler.update()
skip_scheduler = scaler.get_scale() != scale_before_step
[...]
if not skip_scheduler:
    scheduler.step()
```
Same issue.
I think this issue is with PyTorch instead of PyTorch Lightning.
I'm using pytorch-lightning==1.6.4 but still see the same issue.
A quick workaround is to override `LightningModule.lr_scheduler_step()` (only with PL 1.6.0 or later) so that it skips `lr_scheduler.step()` whenever the scaler skips `optimizer.step()`. For multiple optimizers, it needs some changes, but for a single optimizer, the following should work:
```python
class YourLightningModule(LightningModule):
    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, **kwargs):
        self.should_skip_lr_scheduler_step = False
        scaler = getattr(self.trainer.strategy.precision_plugin, "scaler", None)
        if scaler:
            scale_before_step = scaler.get_scale()
        optimizer.step(closure=optimizer_closure)
        if scaler:
            scale_after_step = scaler.get_scale()
            self.should_skip_lr_scheduler_step = scale_before_step > scale_after_step

    def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
        if self.should_skip_lr_scheduler_step:
            return
        scheduler.step()
```
See here for a complete script using BoringModel: https://github.com/akihironitta/gist/blob/repro/5558-amp-scheduler-workaround/pl_boring_model/main.py
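Not tested here, but for the multi-optimizer case mentioned above, a sketch under the same assumptions (one skip flag per `optimizer_idx`; the class name is hypothetical) could look like:

```python
from pytorch_lightning import LightningModule


class MultiOptimizerModule(LightningModule):
    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, **kwargs):
        # Track one "skip the scheduler" flag per optimizer index.
        if not hasattr(self, "_skip_scheduler_step"):
            self._skip_scheduler_step = {}
        self._skip_scheduler_step[optimizer_idx] = False
        scaler = getattr(self.trainer.strategy.precision_plugin, "scaler", None)
        scale_before_step = scaler.get_scale() if scaler else None
        optimizer.step(closure=optimizer_closure)
        if scaler:
            # The scale only decreases when the scaler skipped optimizer.step().
            self._skip_scheduler_step[optimizer_idx] = scale_before_step > scaler.get_scale()

    def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
        if self._skip_scheduler_step.get(optimizer_idx, False):
            return
        scheduler.step()
```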
I'm not using PTL right now but I'm interested in the "right" solution here. The issue has nothing to do with PTL like other people have said.
@akihironitta A couple of comments / questions:

- `scaler.get_scale()` will simply return `None` if `optimizer.step()` was never called due to NaN/inf (see here).
- My inclination would be to just call `scheduler.step()` every time and just (try to) catch/squash the warning. Maybe that doesn't work well for PTL, but if I'm using a LR schedule I expect it to be followed regardless of whether or not 16-bit precision errors are inhibiting grad updates for a few iterations. I think I'd rather just stick to the schedule and update it every time. It's not like I'm re-doing the batch if the scaling produced NaNs, I'm just moving onto the next batch. Again, I'm torn as to the "right" approach, but in the end it probably doesn't matter in terms of the final trained weights.

Cheers, -Collin
Edit: I couldn't successfully suppress the warning, so I ended up comparing to `None` and skipping.
Edit 2: Testing for `None` as a return value doesn't work for all optimizers, e.g. AdamW without a closure will return `None` even when stepped. So testing the scale before and after seems like the best way.
Hi @collinmccarthy, thank you for your comment.
> The issue has nothing to do with PTL like other people have said.
Yes, as I commented a while ago https://github.com/Lightning-AI/lightning/issues/5558#issuecomment-968751947, this issue stems from how amp is implemented.
> if I'm using a LR schedule I expect it to be followed regardless of whether or not 16-bit precision errors are inhibiting grad updates a few iterations. I think I'd rather just stick to the schedule and update it every time. It's not like I'm re-doing the batch if the scaling produced NaNs, I'm just moving onto the next batch
That's totally fine if you're fine with it. However, some people might still prefer to use the hack above to avoid excessive `lr_scheduler.step()` calls, and that's why I left the code snippet above. If you're fine with calling `lr_scheduler.step()` excessively, you can just ignore the warning. If you find it too noisy, you can suppress the warning with:
```python
import warnings

warnings.filterwarnings("ignore", "Detected call of", UserWarning)
```
https://docs.python.org/3/library/warnings.html#warnings.filterwarnings
@akihironitta Hi! I'm using the YourLightningModule code, but at some epoch I get this error:

ValueError: Tried to step 42552 times. The specified number of total steps is 42550
```
  self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/data/asr_proj/stt/RNNTransducer/.venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/data/asr_proj/stt/RNNTransducer/.venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 220, in advance
    self.update_lr_schedulers("step", update_plateau_schedulers=False)
  File "/data/asr_proj/stt/RNNTransducer/.venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 397, in update_lr_schedulers
    self._update_learning_rates(
  File "/data/asr_proj/stt/RNNTransducer/.venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 458, in _update_learning_rates
    self.trainer._call_lightning_module_hook(
  File "/data/asr_proj/stt/RNNTransducer/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1305, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/data/asr_proj/stt/RNNTransducer/model.py", line 200, in lr_scheduler_step
    scheduler.step()
  File "/data/asr_proj/stt/RNNTransducer/.venv/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 161, in step
    values = self.get_lr()
  File "/data/asr_proj/stt/RNNTransducer/.venv/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 1686, in get_lr
    raise ValueError("Tried to step {} times. The specified number of total steps is {}"
ValueError: Tried to step 42552 times. The specified number of total steps is 42550
```
My code looks like this:
```python
def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, **kwargs):
    self.should_skip_lr_scheduler_step = False
    scaler = getattr(self.trainer.strategy.precision_plugin, "scaler", None)
    if scaler:
        scale_before_step = scaler.get_scale()
    optimizer.step(closure=optimizer_closure)
    if scaler:
        scale_after_step = scaler.get_scale()
        self.should_skip_lr_scheduler_step = scale_before_step > scale_after_step

def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
    if self.should_skip_lr_scheduler_step:
        return
    scheduler.step()

def configure_optimizers(self):
    optimizer = torch.optim.AdamW(
        [{"params": [p for p in self.parameters()], "name": "OneCycleLR"}],
        lr=self.args.learning_rate,
        weight_decay=self.args.weight_decay,
    )
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=self.args.max_lr,
        steps_per_epoch=self.steps_per_epoch,
        epochs=self.trainer.max_epochs,
        pct_start=0.05,
    )
    lr_scheduler = {"interval": "step", "scheduler": scheduler, "name": "AdamW"}
    return [optimizer], [lr_scheduler]
```
@collinmccarthy Totally agree... I agree with you, and I will test fp32 OneCycleLR and fp16 OneCycleLR and just ignore the warning. When I managed the learning rate myself, I ran into many more errors and side effects 😂. Optimization is so hard for me 😣
In my case, the warning is not important. I logged loss and lr; the 3 cases give slightly different values from each other, but the differences are very, very small, so I don't mind the warning being printed for now. cuda 11.4, python 3.9, pytorch-lightning 1.8.1, torch 1.13.0
I propose a fix in https://github.com/Lightning-AI/lightning/pull/16229. The issue is not on the PyT side, it's on the PTL side.
When using an LR scheduler stepped per step together with AMP, the PyT user (PTL) should check that the optimizer step wasn't skipped by the grad scaler before stepping the scheduler.
In this PR I use the same check that PyT uses to generate that warning: `optimizer._step_count`.
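Conceptually, and assuming a plain amp training loop where `optimizer`, `scaler` and `scheduler` are already set up, the check can be sketched like this (an illustration of the idea only, not the actual diff in the PR):

```python
# Step the scheduler only if the grad scaler actually ran optimizer.step()
# in this iteration, detected via the optimizer's internal _step_count.
step_count_before = optimizer._step_count
scaler.step(optimizer)  # may skip optimizer.step() on inf/NaN grads
scaler.update()
if optimizer._step_count > step_count_before:
    scheduler.step()
```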
I think we should implement https://github.com/pytorch/pytorch/issues/67590 (PyTorch). Any additions in Lightning would always be workarounds.
Following and waiting.
Any update?
pytorch==2.1.0
pytorch-lightning==2.1.0
In my case, the warning is raised during the first four steps, while an epoch consists of 500+ steps. Since the warning occurs in the first step, I also receive the "scheduler called before optimizer is called" warning. I'd like to address these warnings not only because they are annoying and can lead others in the project to assume there is a significant problem, but also because there is no guarantee that the skipped optimizer steps will always be limited.
I have noticed that my optimizer (AdamW) has `_step_count` in it. After debugging, I observed that the count is not increased during skipped steps. Therefore, another possible workaround would be:
```python
...
self.scheduler_step_counter = 0
...

def lr_scheduler_step(self, scheduler, metric):
    if self.scheduler_step_counter < scheduler.optimizer._step_count:
        super().lr_scheduler_step(scheduler, metric)
        self.scheduler_step_counter += 1
        assert (
            self.scheduler_step_counter == scheduler.optimizer._step_count
        ), "scheduler_step_counter should be equal to optimizer._step_count"
```
pytorch==2.2.2
lightning==2.2.1
I'm having the same warning `UserWarning: Detected call of lr_scheduler.step() before optimizer.step()` when I set `precision='16'`.
But the warning disappears when I set `precision='16-mixed'`.
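For reference, a minimal sketch of the two settings (assuming the unified `lightning` 2.x package; all other `Trainer` arguments omitted):

```python
import lightning.pytorch as pl

# This setting still produces the warning in my runs:
trainer = pl.Trainer(precision="16")

# With this setting the warning no longer appears:
trainer = pl.Trainer(precision="16-mixed")
```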
🐛 Bug
When using mixed-precision training, the scheduler and optimizer are called in the wrong order, and a warning is generated.
Please reproduce using the BoringModel
https://colab.research.google.com/drive/1G7pk6E9XUYq-pS41DXKhqM9Srx8sikiP?usp=sharing
There are four tests. Three of them don't raise the warning, while one test case raises the warning.
To Reproduce
1. Define `configure_optimizers` in the dictionary style used in the Colab above
2. Set `precision=16` in a `Trainer`
Note
When the scheduler is defined in another way, the issue does not seem to occur.
Expected behavior
No warning
Environment
cc @tchaton @rohitgr7 @carmocca @justusschock @awaelchli @akihironitta