I ran this in an interactive session on Grid with Lightning 1.2.7:
import os
import torch
from torch.utils.data import Dataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loggers import WandbLogger


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self, my_param: int = 2):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)
        return {"x": loss}

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)
        return {"y": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)

    logger = WandbLogger(project="myproject")
    model = BoringModel()
    trainer = Trainer(
        gpus=-1,
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
        logger=logger,
    )
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
    trainer.test(model, test_dataloaders=test_data)


if __name__ == '__main__':
    run()
gridai@ixsession $ python repro.py
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.
warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
wandb: Currently logged in as: awaelchli (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.26
wandb: Syncing run driven-darkness-4
wandb: ⭐️ View project at https://wandb.ai/awaelchli/myproject
wandb: 🚀 View run at https://wandb.ai/awaelchli/myproject/runs/9tigayd9
wandb: Run data is saved locally in /home/jovyan/wandb/run-20210417_133841-9tigayd9
wandb: Run `wandb offline` to turn off syncing.
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: MASTER_ADDR environment variable is not defined. Set as localhost
warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You are using `accelerator=ddp_spawn` with num_workers=0. For much faster performance, switch to `accelerator=ddp` and set `num_workers>0`
warnings.warn(*args, **kwargs)
Epoch 0: 0%| | 0/2 [00:00<?, ?it/s][W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0: 100%|████████████████████████████████| 2/2 [00:00<00:00, 131.16it/s, loss=-0.0434, v_num=ayd9]
wandb: Currently logged in as: awaelchli (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.26
wandb: Resuming run driven-darkness-4
wandb: ⭐️ View project at https://wandb.ai/awaelchli/myproject
wandb: 🚀 View run at https://wandb.ai/awaelchli/myproject/runs/9tigayd9
wandb: Run data is saved locally in /home/jovyan/wandb/run-20210417_133841-9tigayd9/files/wandb/run-20210417_133848-9tigayd9
wandb: Run `wandb offline` to turn off syncing.
Epoch 0: 100%|████████████████████████████████| 2/2 [00:01<00:00, 1.47it/s, loss=-0.0434, v_num=ayd9]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: cleaning up ddp environment...
warnings.warn(*args, **kwargs)
I don't get the error message you are mentioning. Any hints as to what I need to modify?
Here is an example of trying to log images or audio to wandb that breaks.
The following works (one GPU). Make sure to pip3 install soundfile first:
import os
import torch
from torch.utils.data import Dataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loggers import WandbLogger
import wandb


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self, my_param: int = 2):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        wandb.log({"examples": [wandb.Audio(torch.rand(32).cpu().numpy(), caption="Nice", sample_rate=32)]})
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)
        return {"x": loss}

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)
        return {"y": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)

    logger = WandbLogger(project="myproject")
    model = BoringModel()
    trainer = Trainer(
        # gpus=-1,
        gpus=1,
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
        logger=logger,
    )
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
    trainer.test(model, test_dataloaders=test_data)


if __name__ == '__main__':
    wandb.init()
    run()
If you switch to multiple GPUs, it breaks with:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
---------------------------------------------------------------------------
ProcessExitedException Traceback (most recent call last)
<ipython-input-4-b4487cb8ccc5> in <module>
73 if __name__ == '__main__':
74 wandb.init()
---> 75 run()
<ipython-input-4-b4487cb8ccc5> in run()
67 logger=logger,
68 )
---> 69 trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
70 trainer.test(model, test_dataloaders=test_data)
71
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
497
498 # dispath `start_training` or `start_testing` or `start_predicting`
--> 499 self.dispatch()
500
501 # plugin will finalized fitting (e.g. ddp_spawn will load trained model)
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in dispatch(self)
544
545 else:
--> 546 self.accelerator.start_training(self)
547
548 def train_or_test_or_predict(self):
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py in start_training(self, trainer)
71
72 def start_training(self, trainer):
---> 73 self.training_type_plugin.start_training(trainer)
74
75 def start_testing(self, trainer):
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py in start_training(self, trainer)
106
107 def start_training(self, trainer):
--> 108 mp.spawn(self.new_process, **self.mp_spawn_kwargs)
109 # reset optimizers, since main process is never used for training and thus does not have a valid optim state
110 trainer.optimizers = []
/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon, start_method)
228 ' torch.multiprocessing.start_process(...)' % start_method)
229 warnings.warn(msg)
--> 230 return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
186
187 # Loop on join until it returns True or raises an exception.
--> 188 while not context.join():
189 pass
190
/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
142 error_index=error_index,
143 error_pid=failed_process.pid,
--> 144 exit_code=exitcode
145 )
146
ProcessExitedException: process 0 terminated with exit code 1
If you switch to self.log, you get:
TypeError: log() missing 1 required positional argument: 'value'
(self.log expects a name and a value, i.e. self.log(name, value), so passing only a dict leaves value missing.)
Basically I want to log images, audio, and matplotlib figures to wandb from within DDP.
Thanks. I tried this and can see where the problem is. Do the following: replace
wandb.log({"examples": ... })
with
self.logger.experiment.log(...)
This should work :) I can see the audio samples in the wandb run online. It doesn't play, but I think that's because this dummy sample is too short.
Furthermore, we currently don't support images, audio, etc. in self.log(), since the API depends on the specific logger. There are efforts to standardize this; see #6720.
So for these custom objects, you have to call self.logger.experiment.log (which is basically the same as wandb.log).
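For example, a minimal sketch of the fixed training_step (assuming the imports and WandbLogger setup from the script above, so self.logger.experiment is the underlying wandb run object):

def training_step(self, batch, batch_idx):
    # Log rich media through the logger's wandb run instead of the global
    # wandb module, so it also works inside the spawned DDP processes.
    self.logger.experiment.log(
        {"examples": [wandb.Audio(torch.rand(32).cpu().numpy(), caption="Nice", sample_rate=32)]}
    )
    loss = self(batch).sum()
    self.log("train_loss", loss)  # scalars still go through self.log
    return {"loss": loss}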
EDIT: I tried your code with DDP as well. The fix above applies.
@awaelchli thanks, I will try it. Is this documented somewhere?
We have a small section here: https://pytorch-lightning.readthedocs.io/en/latest/extensions/logging.html#manual-logging. Open to suggestions if it needs improvement.
I see. Thanks.
I'm not exactly sure how to make it more clear, but the headline "Manual Logging" is maybe a bit off-base for me. "Manual Logging to a Supported or Custom Logger"?
I encountered the same issue and found that it can be fixed simply by moving wandb.init to the first line of your main function.
You can try:
import wandb
wandb.init(mode='disabled')
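A minimal sketch of that workaround, assuming the repro script above (wandb is initialized, or disabled, at the top of main, before the Trainer spawns any DDP processes):

import wandb

if __name__ == '__main__':
    wandb.init(mode='disabled')  # or wandb.init(project="myproject") to keep logging
    run()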
š Bug
I'm reopening #1356 because I'm getting this error running my code on grid.ai.
I am getting the same error as in that issue.
Please reproduce using the BoringModel
Not possible, since Colab has only one GPU, unlike grid.ai.
To Reproduce
On grid.ai or a multi-GPU machine, create a trainer with a WandbLogger and do not specify an accelerator. Run with gpus=-1 and hit this error.
Despite https://github.com/PyTorchLightning/pytorch-lightning/pull/2029, the default is ddp_spawn, which triggers this error on grid.ai.
Workarounds: 1) In main, call wandb.init() before constructing the trainer (this seems redundant and potentially dangerous/foot-gunny, since you are already passing a WandbLogger to the trainer). 2) Make sure the trainer has accelerator=ddp defined (see the sketch after this list).
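A minimal sketch of workaround 2, reusing the Trainer arguments from the repro above (Lightning 1.2.x API; note that ddp requires launching from a script rather than an interactive session):

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# Explicitly select ddp so Lightning does not fall back to ddp_spawn.
trainer = Trainer(
    gpus=-1,
    accelerator="ddp",
    logger=WandbLogger(project="myproject"),
)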
Expected behavior
The wandb logger works when the trainer is given a WandbLogger, gpus=-1, and no accelerator, without needing a duplicate wandb.init() call.
Environment
grid.ai