jkyl opened this issue 1 year ago
I corrected a mistake in the replication script (`self.manual_backward` versus `loss.backward`). The output is the same in both cases.

It should also be noted that the `max_steps=1000` argument to the trainer depends on `global_step`, which you can tell by the fact that the script terminates after 500 calls to `training_step`, even though `max_steps` was set to 1000. This is contrary to the definition of step used by `log_every_n_steps` and `val_check_interval`.
The source of this behavior starts with the fact that `trainer.global_step` refers to the `global_step` property of `training_epoch_loop`. In turn, that property derives its result from the `optim_step_progress` attribute of the `_ManualOptimization` loop object, whose `total.completed` attribute is incremented in `_ManualOptimization._on_after_step`.

Ultimately, `_ManualOptimization._on_after_step` is called via all of the `LightningOptimizer`s created by the lightning module here. All optimizers are injected with the method here.

One possible fix would be to inject only one of the optimizers with the `total.completed`-incrementing behavior, rather than all.
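A minimal sketch of what that might look like, assuming we changed the `on_run_start` hook quoted later in this thread (this is an illustration of the idea, not the actual Lightning source): only the first optimizer keeps the counting hooks, while the others get profiler-only stand-ins.

def on_run_start(self) -> None:
    # Sketch of the proposed fix: inject the step-counting hooks into only the
    # first optimizer; the remaining optimizers still get profiler start/stop
    # calls but never touch optim_step_progress.total.completed.
    optimizers = self.trainer.strategy._lightning_optimizers
    for i, lightning_optimizer in enumerate(optimizers):
        if i == 0:
            lightning_optimizer._on_before_step = self._on_before_step
            lightning_optimizer._on_after_step = self._on_after_step
        else:
            lightning_optimizer._on_before_step = lambda: self.trainer.profiler.start("optimizer_step")
            lightning_optimizer._on_after_step = lambda: self.trainer.profiler.stop("optimizer_step")

This assumes the first optimizer steps exactly once per `training_step`; if it does not, `global_step` would still drift, which is part of why a general fix needs more thought.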
Why this matters:
- Training will terminate before the `max_steps` that users specify.
- `global_step` will be out-of-sync with the true current iteration.
- `global_step` and the true current iteration will not be easily correctible, i.e. n_critic would need to be propagated into any callbacks or stop criteria.

+1, met the same thing
Thought I'd provide a little more detail on my use case since other people have encountered this.
I'm training a GAN with multiple discriminator steps per generator step. My training step looks like this:
def training_step(self, batch, batch_idx):
    if batch_idx % self.n_critic == 0:
        self.update_generator_and_discriminator(batch)
    else:
        self.update_discriminator_only(batch)
This is more efficient than only updating one of the networks each iteration, because it allows one to re-use the generator outputs for the discriminator update. But, it also means that `update_generator_and_discriminator` makes two calls to `optimizer.step`.
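For context, those helpers might look roughly like the following under manual optimization (a sketch only: the loss helpers and the `self.generator` attribute are placeholder assumptions, not the poster's actual code):

def update_generator_and_discriminator(self, batch):
    g_opt, d_opt = self.optimizers()
    fake = self.generator(batch)                  # generator outputs reused below
    g_loss = self.generator_loss(fake)            # hypothetical loss helper
    g_opt.zero_grad()
    self.manual_backward(g_loss)
    g_opt.step()                                  # first optimizer.step() this iteration
    d_loss = self.discriminator_loss(fake.detach(), batch)  # hypothetical loss helper
    d_opt.zero_grad()
    self.manual_backward(d_loss)
    d_opt.step()                                  # second optimizer.step() this iteration

def update_discriminator_only(self, batch):
    _, d_opt = self.optimizers()
    d_loss = self.discriminator_loss(self.generator(batch).detach(), batch)
    d_opt.zero_grad()
    self.manual_backward(d_loss)
    d_opt.step()                                  # only optimizer.step() this iteration

So every n_critic-th iteration advances `global_step` by 2 and every other iteration advances it by 1.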
As a workaround to this bug, I subclassed the trainer, like this:
class MyTrainer(pl.Trainer):
    def __init__(self, *, n_critic: int, **kwargs):
        super().__init__(**kwargs)
        self.n_critic = n_critic

    @property
    def global_step(self) -> int:
        return convert_global_step_to_current_iter(super().global_step, self.n_critic)
And I also implemented the following method:
def convert_global_step_to_current_iter(step: int, nc: int) -> int:
    return int(step * nc / (nc + 1))
This lets my callbacks run at the correct frequency, but it is not a general solution. It only applies to the case where, every `n_critic` steps, `global_step` is incremented by 2, and on every other step it is incremented by 1.
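As a quick sanity check of the conversion (hypothetical numbers): with n_critic = 4, each block of 4 training steps performs 5 optimizer steps (the combined update contributes 2), so a raw counter of 5 maps back to int(5 * 4 / 5) = 4 true iterations. Usage is ordinary; `model`, `dm`, and the callback below are placeholders:

# Hypothetical usage of the subclass above
trainer = MyTrainer(
    n_critic=4,
    max_epochs=100,
    callbacks=[my_checkpoint_callback],  # callbacks reading trainer.global_step now see the corrected value
)
trainer.fit(model, datamodule=dm)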
This is a common issue for GANs and we should take a look.
Thanks for the response! I'm happy to help with a PR, if you or anyone else has guidance for a way forward.
hello, any triage or advice for this?
Hello, I have just tried to use Lightning (2.0.6), and I observed that my global_step is also out of sync with the actual number of steps, which is also reflected on TensorBoard, leaving the learning rate unchanged as training progresses in my case:
trainer = pl.Trainer(
    devices=2,
    accelerator='gpu',
    strategy='ddp',
    max_epochs=EPOCHS,
    logger=True,
    log_every_n_steps=50,
    check_val_every_n_epoch=1,
    callbacks=checkpoint_callback,
    accumulate_grad_batches=16,
)

These trainer settings have overwritten my NeMo config, which follows:

trainer:
  devices: -1 # number of GPUs, -1 would use all available GPUs
  num_nodes: 1
  max_epochs: 1000
  max_steps: 200000 # computed at runtime if not set
  val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
  accelerator: auto
  strategy: ddp
  accumulate_grad_batches: 1
  gradient_clip_val: 0.0
  precision: bf16 # 16, 32, or bf16
  log_every_n_steps: 100 # Interval of logging.
  enable_progress_bar: True
  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
  num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
  check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
  sync_batchnorm: true
  enable_checkpointing: False # Provided by exp_manager
  logger: false # Provided by exp_manager
  benchmark: false # needs to be false for models with variable-length speech input as it slows down training

So far, my training progress looks like:

Epoch 254: 23%|██▎ | 201/883 [02:12<07:29, 1.52it/s, v_num=44, loss_step=49.80, loss_epoch=51.10]

From this, I think I have already run 253*883 steps. However, this is what my TensorBoard is displaying:
It tells me only 14k global steps have been run, which is obviously wrong.
This is annoying, since Lightning changes the learning rate according to global_step, and now global steps are mis-calculated. Besides, training does not stop normally: I set max_steps to 200000, the actual number of steps run is already over 223,399, and training has not stopped as expected.
When I set self.automatic_optimization = False, I got the same issue. It is caused by optimizer.step() incrementing self.global_step by 1 each time it is called.
My observation is as follows:
When I use 2 optimizers, I get a global step 2 times larger than the actual step count. When I use 3 optimizers, I get a global step 3 times larger than the actual step count.
So, in this case, we need to figure out how to handle the global_step increase when optimizer.step() is called, for proper training.
Maybe consider changing the definition of global step to the number of times training_step is called. But this would be a breaking change... Adding a flag to opt in would be better.
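Until something like that exists, one way to keep a counter with the "number of training_step calls" semantics is a small callback. This is just a sketch (the class name and attribute are made up, not an existing Lightning feature):

import pytorch_lightning as pl

class TrueStepCounter(pl.Callback):
    """Counts dataloader iterations, independent of how often optimizer.step() is called."""

    def __init__(self):
        self.true_step = 0

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        self.true_step += 1

Other callbacks or stopping logic could read this counter instead of trainer.global_step, though it does not fix anything inside Lightning that reads global_step internally.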
I've encountered the same problem and solved it as described below. However, I'm not sure this method doesn't cause another problem. If someone finds a possible edge case in my logic, please comment below.
[Background]
- The trainer's global step is an alias of `trainer.fit_loop.epoch_loop.manual_optimization.optim_step_progress.total.completed`.
- When you call `trainer.fit` with manual optimization, the actual training logic (your `LightningModule.training_step` implementation) is executed in `trainer.fit_loop.epoch_loop.manual_optimization.run()`. (@jkyl mentioned the same thing above.) In `run()`, three methods are called:
  - `trainer.fit_loop.epoch_loop.manual_optimization.on_run_start`
  - `trainer.fit_loop.epoch_loop.manual_optimization.advance`
  - `trainer.fit_loop.epoch_loop.manual_optimization.on_run_end`
- In `on_run_start`, every optimizer's `_on_before_step` and `_on_after_step` are overridden so that each optimizer's step increases `trainer.fit_loop.epoch_loop.manual_optimization.optim_step_progress.total.completed` by one.
- Below is the manual optimization class. The `self.optim_step_progress.increment_completed()` method increases `trainer.fit_loop.epoch_loop.manual_optimization.optim_step_progress.total.completed`.

class _ManualOptimization(_Loop):
"""A special loop implementing what is known in Lightning as Manual Optimization where the optimization happens
entirely in the :meth:`~lightning.pytorch.core.module.LightningModule.training_step` and therefore the user is
responsible for back-propagating gradients and making calls to the optimizers.
This loop is a trivial case because it performs only a single iteration (calling directly into the module's
:meth:`~lightning.pytorch.core.module.LightningModule.training_step`) and passing through the output(s).
"""
output_result_cls = ManualResult
def __init__(self, trainer: "pl.Trainer") -> None:
super().__init__(trainer)
# since manual optimization does not track lr scheduler or optimizer frequencies, we use a simpler progress than
# `_OptimizationProgress`
self.optim_step_progress = _Progress.from_defaults(_ReadyCompletedTracker)
self._output: _OUTPUTS_TYPE = {}
def run(self, kwargs: OrderedDict) -> _OUTPUTS_TYPE:
self.on_run_start()
with suppress(StopIteration): # no loop to break at this level
self.advance(kwargs)
self._restarting = False
return self.on_run_end()
def on_run_start(self) -> None:
# inject logic around the optimizer step
for lightning_optimizer in self.trainer.strategy._lightning_optimizers:
lightning_optimizer._on_before_step = self._on_before_step
lightning_optimizer._on_after_step = self._on_after_step
def advance(self, kwargs: OrderedDict) -> None:
"""Performs the training step for manual optimization.
Args:
kwargs: The kwargs passed down to the hooks.
"""
trainer = self.trainer
# manually capture logged metrics
training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
del kwargs # release the batch from memory
self.trainer.strategy.post_training_step()
result = self.output_result_cls.from_training_step_output(training_step_output)
self._output = result.asdict()
def on_run_end(self) -> _OUTPUTS_TYPE:
"""Returns the result of this loop, i.e., the post-processed outputs from the training step."""
output, self._output = self._output, {} # free memory
# reset logic around the optimizer step
for lightning_optimizer in self.trainer.strategy._lightning_optimizers:
lightning_optimizer._on_before_step = do_nothing_closure
lightning_optimizer._on_after_step = do_nothing_closure
return output
def _on_before_step(self) -> None:
self.optim_step_progress.increment_ready()
self.trainer.profiler.start("optimizer_step")
def _on_after_step(self) -> None:
self.trainer.profiler.stop("optimizer_step")
self.optim_step_progress.increment_completed()
[Solution]
- Since the manual optimization logic overrides the optimizers' hooks before `training_step` is called, we can re-override the hooks at the top of `training_step`.
- Example:

...
def training_step(self, batch, batch_idx):
    gamma_opt, beta_opt = self.optimizers()
    beta_opt._on_before_step = lambda: self.trainer.profiler.start("optimizer_step")
    beta_opt._on_after_step = lambda: self.trainer.profiler.stop("optimizer_step")
    ...
[Suggestion for PyTorch Lightning]
If this method seems safe, we could contribute it via a PR in two ways:
- Add this method as a guide in the PyTorch Lightning documentation (somewhere like the PyTorch Lightning basic GAN tutorial).
- Or, make the LightningModule's `configure_optimizers` interface support options like this:

def configure_optimizers():
    opt1 = Adam(...)
    opt2 = Adam(...)
    return (
        {"optimizer": opt1},
        {"optimizer": opt2, "do_not_count_global_step": True},
    )
Thanks for your solution, it works! But there is a typo in your code: the second `beta_opt._on_before_step = lambda: self.trainer.profiler.stop("optimizer_step")` should assign `_on_after_step`.
@yzslab Great! Also, thanks for your comment. Typo fixed :)
Thanks @Fitree for the neat fix! Do we need to update beta_opt._on_before_step and beta_opt._on_after_step at each step, or only the first step? Thanks.
Just ran into this as well, thanks @yzslab for the quick fix. Considering this is still a problem and this GitHub issue looks like it's going stale, I'll have a stab at getting a PR in.
@askerlee yes, the `_on_before_step` and `_on_after_step` functions get reassigned for each step, so you'll have to overwrite them in each step.
Separately, in the meantime, if anybody needs a quick fix for any number of optimizers, update to this:

for i, opt in enumerate(self.optimizers()):
    opt.zero_grad()
    # suppress step counting on every optimizer except the last one
    if i + 1 < len(self.optimizers()):
        opt._on_before_step = lambda: self.trainer.profiler.start("optimizer_step")
        opt._on_after_step = lambda: self.trainer.profiler.stop("optimizer_step")
(@ repo owners feel free to assign the issue to me)
Thank you for your PR @Anner-deJong! Hope it is merged soon
Bug description
Hello,
I encountered a bug when training with `automatic_optimization = False` and two optimizers.

In summary: the `global_step` attribute of the trainer and the lightning module is tracking the total number of calls to `optimizer.step()` (in my case, two per `training_step`), rather than the total number of iterations of the dataloader.

This conflicts with the notion of step in arguments like `log_every_n_steps` and `val_check_interval` in the trainer. Case in point: if we log `self.global_step` inside `training_step`, with `CSVLogger`, `log_every_n_steps=10`, and two `optimizer.step()` calls per `training_step`, the CSV logs show that `global_step` conflicts with `step`, and in fact is twice the expected value, since we have two optimizers.

I have attached a complete code example that replicates the issue.
What version are you seeing the problem on?
v2.0
How to reproduce the bug
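(The attached script is not reproduced here. The following is only a minimal sketch of the setup described above: manual optimization, two optimizers, two `optimizer.step()` calls per `training_step`. The module internals and data are placeholder assumptions, not the original example.)

import torch
import pytorch_lightning as pl
from pytorch_lightning.loggers import CSVLogger
from torch.utils.data import DataLoader, TensorDataset

class TwoOptModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False
        self.a = torch.nn.Linear(4, 1)
        self.b = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        opt_a, opt_b = self.optimizers()
        x = batch[0]
        for opt, net in ((opt_a, self.a), (opt_b, self.b)):
            loss = net(x).mean()
            opt.zero_grad()
            self.manual_backward(loss)
            opt.step()  # each call advances trainer.global_step by one
        self.log("global_step", float(self.global_step))

    def configure_optimizers(self):
        return (
            torch.optim.SGD(self.a.parameters(), lr=0.1),
            torch.optim.SGD(self.b.parameters(), lr=0.1),
        )

if __name__ == "__main__":
    data = DataLoader(TensorDataset(torch.randn(64, 4)), batch_size=8)
    trainer = pl.Trainer(max_steps=1000, log_every_n_steps=10, logger=CSVLogger("logs"))
    trainer.fit(TwoOptModule(), data)
    # Expected per the discussion above: the run stops after ~500 training_step calls,
    # because global_step (optimizer steps) reaches max_steps twice as fast as the
    # dataloader iterates.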
Error messages and logs
Environment
Current environment
* CUDA:
  - GPU: None
  - available: False
  - version: None
* Lightning:
  - lightning-utilities: 0.9.0
  - pytorch-lightning: 2.0.4
  - torch: 2.0.1
  - torchmetrics: 0.11.4
* Packages:
  - aiohttp: 3.8.4
  - aiosignal: 1.3.1
  - async-timeout: 4.0.2
  - attrs: 23.1.0
  - certifi: 2023.5.7
  - charset-normalizer: 3.1.0
  - filelock: 3.12.2
  - frozenlist: 1.3.3
  - fsspec: 2023.6.0
  - idna: 3.4
  - jinja2: 3.1.2
  - lightning-utilities: 0.9.0
  - markupsafe: 2.1.3
  - mpmath: 1.3.0
  - multidict: 6.0.4
  - networkx: 3.1
  - numpy: 1.25.0
  - packaging: 23.1
  - pip: 23.0.1
  - pytorch-lightning: 2.0.4
  - pyyaml: 6.0
  - requests: 2.31.0
  - setuptools: 67.6.0
  - sympy: 1.12
  - torch: 2.0.1
  - torchmetrics: 0.11.4
  - tqdm: 4.65.0
  - typing-extensions: 4.7.0
  - urllib3: 2.0.3
  - wheel: 0.38.4
  - yarl: 1.9.2
* System:
  - OS: Darwin
  - architecture: 64bit
  - processor: i386
  - python: 3.10.11
  - release: 20.6.0
  - version: Darwin Kernel Version 20.6.0: Thu Mar 9 20:39:26 PST 2023; root:xnu-7195.141.49.700.6~1/RELEASE_X86_64

More info
If this is the intended behavior, it should be reconciled with the trainer's notion of step. Arguments like `log_every_n_steps` and `val_check_interval` use a different definition of step.