Bug description

Hello! I found this weird interaction that took me a while to debug, so hopefully someone finds it useful or it's possible to fix something in Lightning.

When constructing large models, it's recommended to use configure_model. To configure the model creation outside the lightning, I've been using factories, so that a fully configured factory can just make a model under the strategy context (e.g. deepspeed).

    def configure_model(self) -> None:
        if self.model is None:
            self.model = self.model_factory() # make a large model

Additionally, I've been using self.save_hyperparameters() and wandb logger for convenience. I found that after certain model size, my setup started hanging. I found that the _sanitize_callable_params function inside log_hyperparams of WandbLogger calls my factory again hence temporarily creating yet another copy of a model.

I can't quite find docs on callable parameters for Modules. Is it a bug or a feature? Why would one resolve the callable second time?

Workaround: self.save_hyperparameters(ignore="model_factory")

What version are you seeing the problem on?

v2.1

How to reproduce the bug

class MyModule(LightningModule):
    def __init__(
        self,
        model_factory,
        **kwargs
    ):
        super().__init__()
        self.save_hyperparameters()
        self.model = None
        self.model_factory = model_factory

    def configure_model(self) -> None:
        if self.model is None:
            self.model = self.model_factory() # make a large model

trainer = Trainer(logger=pl.loggers.WandbLogger(...))
trainer.fit(MyModule(), dataloaders)

Error messages and logs

NCCL hanging for me because rank0 GPU reaches 99% capacity:

Unhandled std::runtime_error exception:
  [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3864, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800838 milliseconds before timing out.
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x1513bd4e3df4]
/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x1514006ac609]
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x15140046b133]
3bf14e68db564437a3846c1f8ce080e500003N:568:1667 [0] NCCL INFO comm 0x55cbb7249fb0 rank 4 nranks 8 cudaDev 4 busId b00000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3864, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800826 milliseconds before timing out.
Unhandled std::runtime_error exception:

but probably can lead to OOMs?

Environment

Current environment

``` #- Lightning Component LightningModule, WandbLogger #- PyTorch Lightning Version (e.g., 1.5.0): 2.1.2 #- PyTorch Version (e.g., 2.0): 2.1.2 #- Python version (e.g., 3.9): 3.9 ```

More info

No response

Suggested Solution:

This issue appears to be caused by the unintended side-effects of the _sanitize_callable_params function in the WandbLogger of PyTorch Lightning. A fix can be implemented by modifying the WandbLogger to avoid creating unnecessary copies of large models. Here's a suggested plan of action:

Confirm the Cause: Start by reproducing the issue using the provided sample code. This will let us confirm if the issue is indeed due to the unintentional launching of the factory a second time when dealing with large models.
Examine _sanitize_callable_params Function: Look closely at the code for _sanitize_callable_params to understand exactly how it's behaving with callable parameters.
Modify WandbLogger: Update the WandbLogger to avoid evaluating the factory function a second time. This could involve setting a flag to indicate whether the factory has already been run or directly setting the model attribute with the instantiating it only once.
Implement the User's Workaround: Incorporate the user's workaround (self.save_hyperparameters(ignore="model_factory")) into the code as a temporary fix until a more permanent solution can be implemented.
Update the Documentation: If this behavior of callable params for Modules is as designed, then update the documentation to make it clear that the logger will create a new model when callable parameters are passed.
Test the Solution: Conduct unit and integration tests on the updated code to ensure the issue is resolved and there are no regressions.
Deploy the Solution: After thoroughly testing the solution, merge the fix into the main codebase and close the GitHub issue. Once everything is done, ensure to communicate to the user and the community about the applied changes and fixed issues.

Checklist

- [X] Modify `src/lightning/pytorch/loggers/wandb.py` ✓ https://github.com/ayulockin/pytorch-lightning/commit/a0f21497e0f689e5bafa45718de93abbb5e02920 [Edit](https://github.com/ayulockin/pytorch-lightning/edit/sweep/nccl_timeout_or_gpu_ooms_when_using_wand/src/lightning/pytorch/loggers/wandb.py#L416-L420) - [X] Running GitHub Actions for `src/lightning/pytorch/loggers/wandb.py` ✓ [Edit](https://github.com/ayulockin/pytorch-lightning/edit/sweep/nccl_timeout_or_gpu_ooms_when_using_wand/src/lightning/pytorch/loggers/wandb.py#L416-L420)

ayulockin / pytorch-lightning

Sweep: NCCL timeout (or GPU OOMs) when using Wandb + configure_model with passing a factory + save_hyperparameters + large models #10

Bug description

What version are you seeing the problem on?

How to reproduce the bug

Error messages and logs

Environment

More info

Suggested Solution:

🚀 Here's the PR! #12

Actions (click)

GitHub Actions✓

Step 1: 🔎 Searching

Step 2: ⌨️ Coding

Step 3: 🔁 Code Review