Closed · levhaikin closed this issue 3 years ago
@levhaikin Thanks for the proposal. I think it should be fine to use. However, I am not sure we want to have this as an option within Lightning.
Thoughts @tchaton @Borda ?
In the case of invalid losses, you can return `None` in your `training_step` to skip it.
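For reference, a minimal sketch of what that can look like (assuming automatic optimization; `self.loss_fn` and the batch unpacking are just placeholders for your own code):

def training_step(self, batch, batch_idx):
    x, y = batch
    loss = self.loss_fn(self(x), y)  # placeholder loss computation
    if not torch.isfinite(loss):
        # returning None tells Lightning to skip the optimization step for this batch
        return None
    return loss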
Thanks @carmocca, this is definitely a much simpler way!
- Is it expected to work with DDP?
I'm not sure. We don't have a test for it using DDP. I'll try it and report back.
If it doesn't, this could be fixed with https://github.com/PyTorchLightning/pytorch-lightning/issues/3325 cc @rohan-varma
- Is it documented somewhere? If not, I guess it would be nice to have that documented explicitly, to save some effort for people like me :)
See the return values of `training_step`:
https://pytorch-lightning.readthedocs.io/en/stable/lightning_module.html#training-step
Do you still want to support skipping invalid gradients or is skipping losses enough?
If it works with DDP then I guess it should be enough.
Thanks for pointing to the docs. I didn't notice that on my own.
I'll probably test it as well once my current training finishes (don't want to interrupt it).
> In the case of invalid losses, you can return `None` in your `training_step` to skip it.
Is there a way to change the result of `training_step` from `NaN` to `None` in the `Callback`s?
No. The optimization procedure is completely managed by the loops calling the `LightningModule` hooks, and `Callback`s have no access to it.
It seems returning `None` in `training_step` is a bad idea when using AMP. I am using the native backend and for me it causes gradient scaler issues. Interestingly, it fails only on GPU; on CPU everything works just fine.
Should I submit a bug report?
No. Returning `None` from `training_step` is not supported with AMP.
Hi @carmocca, what happens to the batch size when `__getitem__` returns `None`? Will the batch be smaller?
@mfoglio It depends on your collate function, but assuming the `batch_size` is 1, the loop will skip that batch:
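As a hedged sketch of one common pattern (not Lightning API), a custom collate function can drop the `None` samples; this assumes PyTorch >= 1.11, where `default_collate` is importable from `torch.utils.data`:

from torch.utils.data import default_collate

def collate_skip_none(batch):
    # drop samples for which __getitem__ returned None
    batch = [sample for sample in batch if sample is not None]
    if len(batch) == 0:
        # with batch_size=1 the whole batch becomes None and the loop skips it
        return None
    return default_collate(batch)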
Note that this is a different topic than returning `None` from `training_step`, as this issue discusses.
The code presented in the first comment does not work for me. I'm using mixed precision with pytorch_lightning 1.5.5. The `self.zero_grad()` call in the `on_after_backward` function written above is somehow causing the loss to explode. When I comment `self.zero_grad()` out, the training converges even though I get `inf` gradients. I know the gradients are `inf` since I've added the print statement there.
def on_after_backward(self) -> None:
    """
    Skipping updates in case of unstable gradients
    https://github.com/Lightning-AI/lightning/issues/4956
    """
    valid_gradients = True
    for name, param in self.named_parameters():
        if param.grad is not None:
            valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
            if not valid_gradients:
                break
    if not valid_gradients:
        print(f'detected inf or nan values in gradients. not updating model parameters')
        # self.zero_grad()
However, with `self.zero_grad()` present, the loss diverges to larger values. My guess is that with the float16 data type, relatively low gradient values also get counted as `inf`; however, when the weights are updated, the actual gradient value is used, so the `inf` looks symbolic in that sense. Please correct me if I'm wrong anywhere.
I'm using `gradient_clip_val` in `pl.Trainer` to stabilize the training.
FWIW, I encountered a similar problem and it seems to have been resolved by switching from `Trainer(precision=16)` to `Trainer(precision="bf16")`, if you have a suitable device for that floating-point type.
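As a rough sketch of the two configurations mentioned above (argument names follow the 1.x API used in this thread; newer releases spell the precision flags differently, e.g. "16-mixed" / "bf16-mixed"):

import pytorch_lightning as pl

# fp16 mixed precision with gradient clipping, as described above
trainer = pl.Trainer(precision=16, gradient_clip_val=1.0)

# bf16 variant, if the hardware supports that floating-point type
trainer = pl.Trainer(precision="bf16", gradient_clip_val=1.0)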
@ashesh-0 I use it like this:
def optimizer_step(
    self,
    epoch,
    batch_idx,
    optimizer,
    optimizer_idx,
    optimizer_closure,
    on_tpu=False,
    using_lbfgs=False,
):
    """
    Skipping updates in case of unstable gradients
    https://github.com/Lightning-AI/lightning/issues/4956
    """
    valid_gradients = True
    for name, param in self.named_parameters():
        if param.grad is not None:
            # valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
            valid_gradients = not (torch.isnan(param.grad).any())
            if not valid_gradients:
                break
    if not valid_gradients:
        print("detected inf or nan values in gradients. not updating model parameters")
        self.zero_grad()
    optimizer.step(closure=optimizer_closure)
I use precision=16 and gradient_clip_val=1.0. I only have to handle the NaN-gradient problem, because gradient clipping is done before `optimizer_step` (so I think the `inf` problem cannot happen). What do you think of this?
@YooSungHyun can I put this code in `training_step`, or do I need to override `optimizer_step` so it runs after `training_step`?
@unlugi I just override `optimizer_step`, and it is called on the global step in the training loop. Please check this: https://github.com/YooSungHyun/lightning-U2/blob/main/models/u2/lightningmodule.py#L135
Hi, @YooSungHyun! When you mentioned:
> I only have to handle the NaN-gradient problem, because gradient clipping is done before `optimizer_step` (so I think the `inf` problem cannot happen). What do you think of this?
do you mean that it is still possible that `nan` gradients are not handled by the modified `optimizer_step` that you shared? Sorry, I'm unable to get the case that you are trying to describe; could you please explain?
FYI, the suggested `optimizer_step` override throws `TypeError: optimizer_step() missing 1 required positional argument: 'optimizer_closure'` for me in Lightning 2.0. Removing `optimizer_idx` from the args works for me. Ref: #16539 and the docs.
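A hedged sketch of what the override might look like on Lightning 2.x, where `optimizer_idx`, `on_tpu`, and `using_lbfgs` were dropped from the hook signature (please double-check the signature against the docs for your installed version):

def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_closure=None):
    # assumes torch is imported in this module
    valid_gradients = all(
        not torch.isnan(param.grad).any()
        for _, param in self.named_parameters()
        if param.grad is not None
    )
    if not valid_gradients:
        print("detected nan values in gradients. not updating model parameters")
        self.zero_grad()
    optimizer.step(closure=optimizer_closure)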
I had "MisconfigurationException: When optimizer.step(closure)
is called, the closure should be callable"
To avoid any problems with not "reimplementing" the method the correct way, I did:
def optimizer_step(self, *args, **kwargs):
    """
    Skipping updates in case of unstable gradients
    https://github.com/Lightning-AI/lightning/issues/4956
    """
    valid_gradients = True
    for name, param in self.named_parameters():
        if param.grad is not None:
            # valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
            valid_gradients = not (torch.isnan(param.grad).any())
            if not valid_gradients:
                break
    if not valid_gradients:
        print("detected inf or nan values in gradients. not updating model parameters")
        self.zero_grad()
    pl.LightningModule.optimizer_step(self, *args, **kwargs)
I'm not sure if this is enough when using gradient accumulation + ddp + amp.
tl;dr
Does the approach in the code snippet below look OK, or is there a better alternative for automatically skipping the few "bad" samples in the data that cause inf/nan gradients/losses? (Is it a good practice altogether?)
details
Sometimes there is a small percentage (but annoyingly large in absolute numbers) of "dirty" samples in the data that cause the loss to be NaN, even though the neural-network architecture itself is fine in terms of numerical stability.
One approach is to automatically stop training (using `terminate_on_nan`) and then somehow isolate all these samples and remove them from the data permanently. But sometimes we simply want to automatically skip these samples as if they never existed (perhaps with a warning) and continue training.
I couldn't find any documentation about how to do that, nor anyone who had asked this question, so I decided to ask and offer a solution I found, for others who might need it as well.
In the end, I came up with the following approach: override the `on_after_backward` method in my LightningModule with the code below.
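A sketch of that override, reconstructed from the version quoted earlier in the thread (it assumes `torch` is imported in the module):

def on_after_backward(self) -> None:
    """Skip the parameter update when any gradient contains inf or nan values."""
    valid_gradients = True
    for name, param in self.named_parameters():
        if param.grad is not None:
            valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
            if not valid_gradients:
                break
    if not valid_gradients:
        # drop the bad gradients so they are not applied in the optimizer step
        self.zero_grad()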
pros
cons
final question
Is it worth having such functionality integrated into Lightning as a simple switch/parameter?