Lightning-AI / pytorch-lightning

Trainer validates `gradient_clip_algorithm` although `configure_gradient_clipping` is defined #10919

Open · hal-314 opened this issue 2 years ago

hal-314 commented 2 years ago

🐛 Bug

I can't pass a custom gradient clipping algorithm, even though I implemented the configure_gradient_clipping hook. The documentation and release notes hint that you can use configure_gradient_clipping to implement your own custom gradient clipping algorithm (release notes: "... This means you can now implement state-of-the-art clipping algorithms with Lightning! ...").

Please allow passing custom algorithm names in gradient_clip_algorithm when the model implements configure_gradient_clipping.

To Reproduce

import os
from typing import Optional, Union

import torch
from torch.optim import Optimizer

from pytorch_lightning import LightningModule, Trainer

class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

    def configure_gradient_clipping(
        self,
        optimizer: Optimizer,
        optimizer_idx: int,
        gradient_clip_val: Optional[Union[int, float]] = None,
        gradient_clip_algorithm: Optional[str] = None,
    ):
        # `my_custom_clipping` is a user-defined clipping function (not shown here)
        if gradient_clip_algorithm == "my_custom_clipping_algorithm":
            my_custom_clipping(optimizer, gradient_clip_val, gradient_clip_algorithm)
        else:  # Lightning will handle the gradient clipping
            self.clip_gradients(
                optimizer,
                gradient_clip_val=gradient_clip_val,
                gradient_clip_algorithm=gradient_clip_algorithm
            )

# Fails during Trainer initialization: "my_custom_clipping_algorithm" is not a
# built-in gradient clipping algorithm, so the value is rejected up front.
trainer = Trainer(
    default_root_dir=os.getcwd(),
    limit_train_batches=1,
    limit_val_batches=1,
    limit_test_batches=1,
    num_sanity_val_steps=0,
    max_epochs=1,
    enable_model_summary=False,
    gradient_clip_algorithm="my_custom_clipping_algorithm",
)

Additional context

I want to train NFNets with Adaptive Gradient Clipping (AGC) and compare it against standard L2-norm gradient clipping.
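
For reference, a minimal sketch of what my_custom_clipping could look like for a simplified, per-parameter variant of Adaptive Gradient Clipping (the NFNets paper clips unit-wise rather than per-parameter). The signature matches the call in the snippet above; the default clip factor and eps are illustrative and not part of any Lightning API:

import torch
from torch.optim import Optimizer

def my_custom_clipping(
    optimizer: Optimizer,
    gradient_clip_val=None,
    gradient_clip_algorithm=None,
    eps: float = 1e-3,
) -> None:
    # Simplified per-parameter AGC: rescale each gradient so that
    # ||grad|| / max(||param||, eps) <= clip_factor.
    clip_factor = 0.01 if gradient_clip_val is None else gradient_clip_val  # illustrative default
    for group in optimizer.param_groups:
        for p in group["params"]:
            if p.grad is None:
                continue
            param_norm = p.detach().norm(2).clamp(min=eps)
            grad_norm = p.grad.detach().norm(2)
            max_norm = param_norm * clip_factor
            if grad_norm > max_norm:
                p.grad.detach().mul_(max_norm / (grad_norm + 1e-6))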

rohitgr7 commented 2 years ago
    gradient_clip_algorithm="my_custom_clipping_algorithm"

The gradient clipping arguments on the Trainer are meant to be used only with the built-in methods already implemented within Lightning. If you want an identifier, you can use some sort of hyperparameter or store it within your module state:

class MyModel(LightningModule):
    def __init__(self, clip_algo, **kwargs):
        super().__init__()
        self.clip_algo = clip_algo

    def configure_gradient_clipping(
        self, optimizer, optimizer_idx, gradient_clip_val=None, gradient_clip_algorithm=None
    ):
        if self.clip_algo == "my_custom_clipping_algorithm":
            my_custom_clipping(optimizer, gradient_clip_val)
        else:  # fall back to Lightning's built-in clipping
            self.clip_gradients(
                optimizer,
                gradient_clip_val=gradient_clip_val,
                gradient_clip_algorithm=gradient_clip_algorithm,
            )

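A hypothetical usage of this workaround (MyModel is the module above; the dataloader is assumed to exist elsewhere): only built-in values, or none at all, are passed to the Trainer, so its early validation of gradient_clip_algorithm passes while the custom algorithm is still selected through the module.

model = MyModel(clip_algo="my_custom_clipping_algorithm")
trainer = Trainer(gradient_clip_val=0.5)  # forwarded to configure_gradient_clipping
trainer.fit(model, train_dataloader)  # train_dataloader assumed to be defined elsewhere
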
justusschock commented 2 years ago

@rohitgr7 I agree that we could improve on this and then just error out within self.clip_gradients

rohitgr7 commented 2 years ago

@justusschock well, some of the plugins don't support custom gradient clipping, so checking this value as early as possible during init seems more reliable to me. Also, those flags are only meant for cases where someone wants to use the built-in algorithms.

justusschock commented 2 years ago

@rohitgr7 We can check during init and then error out if it is not supported by the plugin (like we currently do), and for plugins that don't care we could do lazy checking.

As far as I know, they are/were meant to switch between built-in algorithms, but this doesn't seem intuitive to me.

hal-314 commented 2 years ago

I find it quite confusing that these flags are meant to be used only by the built-in algorithms. If someone implements a custom gradient clipping algorithm, the user would have two options:

Another option is to forbid the Trainer flags gradient_clip_algorithm and gradient_clip_val when the module implements configure_gradient_clipping. This would allow easy discoverability (as said in #10528), reduce backward-compatibility concerns, avoid undefined behavior, and leave only one way to configure gradient clipping. Note that if going with this option, it may be necessary to add a should_gradients_be_clipped hook so Lightning avoids unscaling gradients when it isn't necessary (simulating the current gradient_clip_val=None behavior). A rough sketch of such a check is below.
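
A rough sketch of the kind of check being proposed, using a hypothetical helper name; this is not actual Lightning code, only an illustration of the idea:

from pytorch_lightning import LightningModule
from pytorch_lightning.utilities.exceptions import MisconfigurationException

def validate_clipping_flags(model, gradient_clip_val, gradient_clip_algorithm):
    # Hypothetical helper, not Lightning internals: reject the Trainer flags when
    # the module overrides configure_gradient_clipping, so clipping is configured
    # in exactly one place.
    overridden = (
        type(model).configure_gradient_clipping
        is not LightningModule.configure_gradient_clipping
    )
    if overridden and (gradient_clip_val is not None or gradient_clip_algorithm is not None):
        raise MisconfigurationException(
            "Remove gradient_clip_val/gradient_clip_algorithm from the Trainer "
            "when configure_gradient_clipping is overridden in the LightningModule."
        )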