I ran this in an interactive session on Grid with Lightning 1.2.7:
import os
import torch
from torch.utils.data import Dataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loggers import WandbLogger


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self, my_param: int = 2):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)
        return {"x": loss}

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)
        return {"y": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)

    logger = WandbLogger(project="myproject")
    model = BoringModel()
    trainer = Trainer(
        gpus=-1,
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
        logger=logger,
    )
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
    trainer.test(model, test_dataloaders=test_data)


if __name__ == '__main__':
    run()
gridai@ixsession $ python repro.py
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.
warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
wandb: Currently logged in as: awaelchli (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.26
wandb: Syncing run driven-darkness-4
wandb: ⭐️ View project at https://wandb.ai/awaelchli/myproject
wandb: 🚀 View run at https://wandb.ai/awaelchli/myproject/runs/9tigayd9
wandb: Run data is saved locally in /home/jovyan/wandb/run-20210417_133841-9tigayd9
wandb: Run `wandb offline` to turn off syncing.
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: MASTER_ADDR environment variable is not defined. Set as localhost
warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You are using `accelerator=ddp_spawn` with num_workers=0. For much faster performance, switch to `accelerator=ddp` and set `num_workers>0`
warnings.warn(*args, **kwargs)
Epoch 0: 0%| | 0/2 [00:00<?, ?it/s][W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0: 100%|████████████████████████████████| 2/2 [00:00<00:00, 131.16it/s, loss=-0.0434, v_num=ayd9]
wandb: Currently logged in as: awaelchli (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.26
wandb: Resuming run driven-darkness-4
wandb: ⭐️ View project at https://wandb.ai/awaelchli/myproject
wandb: 🚀 View run at https://wandb.ai/awaelchli/myproject/runs/9tigayd9
wandb: Run data is saved locally in /home/jovyan/wandb/run-20210417_133841-9tigayd9/files/wandb/run-20210417_133848-9tigayd9
wandb: Run `wandb offline` to turn off syncing.
Epoch 0: 100%|████████████████████████████████| 2/2 [00:01<00:00, 1.47it/s, loss=-0.0434, v_num=ayd9]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: cleaning up ddp environment...
warnings.warn(*args, **kwargs)
I don't get the error message you are mentioning. Any hints as to what I need to modify?
Here is an example of trying to log images or audio to wandb that breaks.
The following works (one GPU). Make sure to pip3 install soundfile first:
import os
import torch
from torch.utils.data import Dataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loggers import WandbLogger
import wandb


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self, my_param: int = 2):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        wandb.log({"examples": [wandb.Audio(torch.rand(32).cpu().numpy(), caption="Nice", sample_rate=32)]})
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)
        return {"x": loss}

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)
        return {"y": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)

    logger = WandbLogger(project="myproject")
    model = BoringModel()
    trainer = Trainer(
        # gpus=-1,
        gpus=1,
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
        logger=logger,
    )
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
    trainer.test(model, test_dataloaders=test_data)


if __name__ == '__main__':
    wandb.init()
    run()
If you switch to multiple GPUs, it breaks with:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
---------------------------------------------------------------------------
ProcessExitedException Traceback (most recent call last)
<ipython-input-4-b4487cb8ccc5> in <module>
73 if __name__ == '__main__':
74 wandb.init()
---> 75 run()
<ipython-input-4-b4487cb8ccc5> in run()
67 logger=logger,
68 )
---> 69 trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
70 trainer.test(model, test_dataloaders=test_data)
71
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
497
498 # dispath `start_training` or `start_testing` or `start_predicting`
--> 499 self.dispatch()
500
501 # plugin will finalized fitting (e.g. ddp_spawn will load trained model)
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in dispatch(self)
544
545 else:
--> 546 self.accelerator.start_training(self)
547
548 def train_or_test_or_predict(self):
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py in start_training(self, trainer)
71
72 def start_training(self, trainer):
---> 73 self.training_type_plugin.start_training(trainer)
74
75 def start_testing(self, trainer):
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py in start_training(self, trainer)
106
107 def start_training(self, trainer):
--> 108 mp.spawn(self.new_process, **self.mp_spawn_kwargs)
109 # reset optimizers, since main process is never used for training and thus does not have a valid optim state
110 trainer.optimizers = []
/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon, start_method)
228 ' torch.multiprocessing.start_process(...)' % start_method)
229 warnings.warn(msg)
--> 230 return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
186
187 # Loop on join until it returns True or raises an exception.
--> 188 while not context.join():
189 pass
190
/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
142 error_index=error_index,
143 error_pid=failed_process.pid,
--> 144 exit_code=exitcode
145 )
146
ProcessExitedException: process 0 terminated with exit code 1
If you switch to self.log, you get:
TypeError: log() missing 1 required positional argument: 'value'
(self.log expects a name and a value, i.e. self.log(name, value), so passing only a dict leaves value missing.)
Basically I want to log images, audio, and matplotlib figures to wandb from within DDP.
Thanks. I tried this and can see where the problem is. Do the following: replace
wandb.log({"examples": ... })
with
self.logger.experiment.log(...)
This should work :) I can see the audio samples in the wandb run online. It doesn't play, but I think that's because this dummy sample is too short.
Furthermore, we currently don't support images, audio, etc. in self.log(), since the API depends on the specific logger. There are efforts to standardize this; see #6720.
So for these custom objects, you have to call self.logger.experiment.log (which is basically the same as wandb.log).
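For example, a minimal sketch of the fixed training_step (assuming the imports and WandbLogger setup from the script above, so self.logger.experiment is the underlying wandb run object):

def training_step(self, batch, batch_idx):
    # Log rich media through the logger's wandb run instead of the global
    # wandb module, so it also works inside the spawned DDP processes.
    self.logger.experiment.log(
        {"examples": [wandb.Audio(torch.rand(32).cpu().numpy(), caption="Nice", sample_rate=32)]}
    )
    loss = self(batch).sum()
    self.log("train_loss", loss)  # scalars still go through self.log
    return {"loss": loss}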
EDIT: I tried your code with DDP as well. The fix above applies.
@awaelchli thanks, I will try it. Is this documented somewhere?
We have a small section here: https://pytorch-lightning.readthedocs.io/en/latest/extensions/logging.html#manual-logging. Open to suggestions if it needs improvement.
I see. Thanks.
I'm not exactly sure how to make it more clear, but the headline "Manual Logging" is maybe a bit off-base for me. "Manual Logging to a Supported or Custom Logger"?
I encountered the same issue and found that it can be fixed simply by moving wandb.init to the first line of your main function.
You can try:
import wandb
wandb.init(mode='disabled')
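A minimal sketch of that workaround, assuming the repro script above (wandb is initialized, or disabled, at the top of main, before the Trainer spawns any DDP processes):

import wandb

if __name__ == '__main__':
    wandb.init(mode='disabled')  # or wandb.init(project="myproject") to keep logging
    run()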
š Bug
I'm reopening #1356 because I'm getting this error running my code on grid.ai.
I am getting the same error as in that issue.
Please reproduce using the BoringModel
Not possible, since Colab has only one GPU, unlike grid.ai.
To Reproduce
On grid.ai or a multi-GPU machine, create a trainer with a WandbLogger and do not specify an accelerator. Run with gpus=-1 and hit this error.
Despite https://github.com/PyTorchLightning/pytorch-lightning/pull/2029, the default is ddp_spawn, which triggers this error on grid.ai.
Workarounds: 1) In main, call wandb.init() before constructing the trainer (this seems redundant and potentially dangerous/foot-gunny, since you are already passing a WandbLogger to the trainer). 2) Make sure the trainer has accelerator=ddp defined (see the sketch after this list).
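A minimal sketch of workaround 2, reusing the Trainer arguments from the repro above (Lightning 1.2.x API; note that ddp requires launching from a script rather than an interactive session):

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# Explicitly select ddp so Lightning does not fall back to ddp_spawn.
trainer = Trainer(
    gpus=-1,
    accelerator="ddp",
    logger=WandbLogger(project="myproject"),
)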
Expected behavior
The wandb logger works when the trainer is given a WandbLogger, gpus=-1, and no accelerator, without needing a duplicate wandb.init() call.
Environment
grid.ai