Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

how to properly skip samples that cause inf/nan gradients/loss #4956

Closed · levhaikin closed 3 years ago

levhaikin commented 3 years ago

tl;dr

Does the approach in the code snippet below look OK, or is there a better alternative for automatically skipping a few "bad" samples in the data that cause inf/nan gradients/loss? (Is this a good practice altogether?)

details

Sometimes there is a small percentage (but an annoyingly large absolute number) of "dirty" samples in the data that cause the loss to become nan, even though the neural-network architecture itself is numerically stable.
One approach is to automatically stop training (using terminate_on_nan), then somehow isolate all of these samples and remove them from the data permanently. But...
Sometimes we simply want to skip these samples automatically, as if they never existed (perhaps with a warning), and continue training.
I couldn't find any documentation on how to do that, nor anyone who had asked this question, so I decided to ask and offer a solution I found, for others who might need it as well.
In the end, I came up with the following approach: override the on_after_backward method in my LightningModule with the following code:

code

    def on_after_backward(self) -> None:
        valid_gradients = True
        for name, param in self.named_parameters():
            if param.grad is not None:
                valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
                if not valid_gradients:
                    break

        if not valid_gradients:
            log.warning(f'detected inf or nan values in gradients. not updating model parameters')
            self.zero_grad()


final question

Is it worth having such functionality integrated into Lightning as a simple command-line switch/parameter?

github-actions[bot] commented 3 years ago

Hi! Thanks for your contribution, great first issue!

justusschock commented 3 years ago

@levhaikin Thanks for the proposal. I think it should be fine to use. However, I am not sure we want to have this as an option within Lightning.

Thoughts @tchaton @Borda ?

carmocca commented 3 years ago

In the case of invalid losses, you can return None in your training_step to skip it.
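
For reference, a minimal sketch of what that can look like (the batch unpacking and MSE loss below are just placeholders):

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        # returning None tells Lightning to skip the optimization step for this batch
        if not torch.isfinite(loss):
            self.print(f"skipping batch {batch_idx} due to a non-finite loss")
            return None
        return loss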

levhaikin commented 3 years ago

thanks @carmocca, this is definitely a much simpler way!

carmocca commented 3 years ago
  • is it expected to work with ddp?

I'm not sure. We don't have a test for it using DDP. I'll try it and report back.

If it doesn't, this could be fixed with https://github.com/PyTorchLightning/pytorch-lightning/issues/3325 cc @rohan-varma

  • is it documented somewhere? if not, I guess it would be nice to have that documented explicitly, to save some effort for people like me :)

See the returns of training_step https://pytorch-lightning.readthedocs.io/en/stable/lightning_module.html#training-step

Do you still want to support skipping invalid gradients or is skipping losses enough?

levhaikin commented 3 years ago

If it works with DDP, then I guess it should be enough.
Thanks for pointing to the docs; I didn't notice that on my own.
I'll probably test it as well once my current training finishes (I don't want to interrupt it).

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

jiaruipeng1994 commented 2 years ago

In the case of invalid losses, you can return None in your training_step to skip it.

Is there a way to change the result of training_step from NaN to None in a Callback?

carmocca commented 2 years ago

No. The optimization procedure is completely managed by the loops calling the LightningModule hooks, and Callbacks have no access to it.

yozhikoff commented 2 years ago

It seems that returning None in training_step is a bad idea when using AMP. I am using the native backend, and for me it causes gradient-scaler issues. Interestingly, it fails only on GPU; on CPU everything works just fine.

Should I submit a bug report?

carmocca commented 2 years ago

No. Returning None from training_step is not supported with AMP.

mfoglio commented 2 years ago

Hi @carmocca , what happens to the batch size when __getitem__ returns None? Will the batch be smaller?

carmocca commented 2 years ago

@mfoglio It depends on your collate function, but assuming batch_size is 1, the loop will skip that batch:

https://github.com/PyTorchLightning/pytorch-lightning/blob/3f78c4ca7a89fed8f79d9015e9510f0a0dcc46b5/pytorch_lightning/loops/epoch/training_epoch_loop.py#L191-L192

Note that this is a different topic than returning None from training_step as this issue discusses.
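
For anyone who needs a batch size larger than 1, one option (sketched here, not an official Lightning API) is a custom collate_fn that drops None samples before the default collation; my_dataset and the batch size below are placeholders:

    from torch.utils.data import DataLoader
    from torch.utils.data.dataloader import default_collate

    def skip_none_collate(batch):
        # drop samples for which the dataset's __getitem__ returned None
        batch = [sample for sample in batch if sample is not None]
        if len(batch) == 0:
            return None  # the whole batch was invalid; the loop can skip it
        return default_collate(batch)

    loader = DataLoader(my_dataset, batch_size=32, collate_fn=skip_none_collate)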

ashesh-0 commented 2 years ago

The code presented in the first comment does not work for me. I'm using mixed precision with pytorch_lightning 1.5.5. The self.zero_grad() call in the on_after_backward function shown above is somehow causing the loss to explode. When I comment self.zero_grad() out, the training converges even though I get inf gradients. I know the gradients are inf because I've added a print statement there.

    def on_after_backward(self) -> None:
        """
        Skipping updates in case of unstable gradients
        https://github.com/Lightning-AI/lightning/issues/4956
        """
        valid_gradients = True
        for name, param in self.named_parameters():
            if param.grad is not None:
                valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
                if not valid_gradients:
                    break
        if not valid_gradients:
            print(f'detected inf or nan values in gradients. not updating model parameters')
            # self.zero_grad()

However, with self.zero_grad() present, the loss diverges to larger values.

My guess is that with the float16 datatype, even moderately small gradient values can end up registered as inf. However, when the weights are actually updated, the real gradient values are used, so the inf looks symbolic in that sense. Please correct me if I'm wrong anywhere.

I'm using gradient_clip_val in pl.Trainer to stabilize the training.
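
For reference, a minimal sketch of that setup; the clip value 1.0 below is just an illustrative assumption:

    import pytorch_lightning as pl

    # clip the gradient norm to stabilize fp16 training (the value is illustrative)
    trainer = pl.Trainer(precision=16, gradient_clip_val=1.0)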

arlofaria commented 1 year ago

FWIW, I encountered a similar problem, and it seems to have been resolved by switching from Trainer(precision=16) to Trainer(precision="bf16"), if you have a device that supports that floating-point type.
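
For example, assuming your hardware supports bfloat16 (which has the same exponent range as fp32, so no gradient scaler is involved), the change is just:

    import pytorch_lightning as pl

    # bf16 mixed precision avoids the fp16 gradient scaler and its inf/nan-related skips
    trainer = pl.Trainer(precision="bf16")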

YooSungHyun commented 1 year ago

@ashesh-0 I use it like this:

def optimizer_step(
    self,
    epoch,
    batch_idx,
    optimizer,
    optimizer_idx,
    optimizer_closure,
    on_tpu=False,
    using_lbfgs=False,
):
    """
    Skipping updates in case of unstable gradients
    https://github.com/Lightning-AI/lightning/issues/4956
    """
    valid_gradients = True
    for name, param in self.named_parameters():
        if param.grad is not None:
            # valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
            valid_gradients = not (torch.isnan(param.grad).any())
            if not valid_gradients:
                break
    if not valid_gradients:
        print("detected inf or nan values in gradients. not updating model parameters")
        self.zero_grad()
    optimizer.step(closure=optimizer_closure)

with precision=16 and gradient_clip_val=1.0.

I think I can only run into a NaN-gradient problem, because gradient clipping is done before optimizer_step (so I don't think an inf problem can happen). What do you think about this?

unlugi commented 1 year ago

@YooSungHyun can I put this code in training_step, or do I need to override optimizer_step so that it runs after training_step?

YooSungHyun commented 1 year ago

@unlugi I just override optimizer_step; it is called at each global step in the training loop. Please check this: https://github.com/YooSungHyun/lightning-U2/blob/main/models/u2/lightningmodule.py#L135

DanTremonti commented 1 year ago

Hi, @YooSungHyun! When you mentioned:

I think I can only run into a NaN-gradient problem, because gradient clipping is done before optimizer_step (so I don't think an inf problem can happen). What do you think about this?

do you mean that it is still possible for NaN gradients not to be handled by the modified optimizer_step that you shared? Sorry, I'm unable to follow the case you are trying to describe; could you please explain?

DanTremonti commented 1 year ago

FYI, the suggested optimizer_step override throws

TypeError: optimizer_step() missing 1 required positional argument: 'optimizer_closure'

for me in Lightning 2.0. Removing optimizer_idx from the arguments works for me. Ref: #16539 and the docs.
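
For Lightning 2.x, a rough sketch of the same override with the slimmer hook signature (this just mirrors the earlier snippets; whether skipping updates like this is safe with AMP or gradient accumulation is still an open question in this thread):

    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_closure=None):
        """
        Skipping updates in case of unstable gradients (Lightning 2.x hook signature)
        https://github.com/Lightning-AI/lightning/issues/4956
        """
        valid_gradients = True
        for name, param in self.named_parameters():
            if param.grad is not None:
                valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
                if not valid_gradients:
                    break
        if not valid_gradients:
            self.print("detected inf or nan values in gradients; not updating model parameters")
            self.zero_grad()
        optimizer.step(closure=optimizer_closure)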

znb899 commented 1 year ago

I had "MisconfigurationException: When optimizer.step(closure) is called, the closure should be callable" To avoid any problems with not "reimplementing" the method the correct way, I did:

def optimizer_step(
        self,
        *args, **kwargs
    ):
        """
        Skipping updates in case of unstable gradients
        https://github.com/Lightning-AI/lightning/issues/4956
        """
        valid_gradients = True
        for name, param in self.named_parameters():
            if param.grad is not None:
                # valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
                valid_gradients = not (torch.isnan(param.grad).any())
                if not valid_gradients:
                    break
        if not valid_gradients:
            print("detected inf or nan values in gradients. not updating model parameters")
            self.zero_grad()

        pl.LightningModule.optimizer_step(self, *args, **kwargs)

I'm not sure if this is enough when using gradient accumulation + DDP + AMP.