Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

gan.py multi-gpu running problems #1223

Closed lobantseff closed 4 years ago

lobantseff commented 4 years ago

Running the gan.py example with Trainer(gpus=2) causes two types of errors:

  1. if Trainer(gpus=2, distributed_backend='dp')
Exception has occurred: AttributeError
'NoneType' object has no attribute 'detach'
  File "/home/user/gan.py", line 146, in training_step
    self.discriminator(self.generated_imgs.detach()), fake)
  2. if Trainer(gpus=2, distributed_backend='ddp')
    • in ./lightning_logs a single run creates two folders: version_0 and version_1
    • Exception caused:
      File "/opt/miniconda3/envs/ctln-gan/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 122, in _del_model
        os.remove(filepath)
      FileNotFoundError: [Errno 2] No such file or directory: '/home/user/pyproj/DCGAN/lightning_logs/version_1/checkpoints/epoch=0.ckpt'

It seems that each subprocess tries to create its own checkpoints and then deletes checkpoints it did not create.
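
For context, a minimal sketch of the caching pattern that breaks under dp (a hypothetical, stripped-down module; the attribute caching mirrors what gan.py does, everything else is reduced, API as in pytorch-lightning 0.7.x):

```python
import torch
import pytorch_lightning as pl


class GAN(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.generator = torch.nn.Linear(100, 784)
        self.discriminator = torch.nn.Linear(784, 1)
        self.generated_imgs = None  # cached between optimizer steps

    def training_step(self, batch, batch_idx, optimizer_idx):
        imgs, _ = batch
        if optimizer_idx == 0:  # generator step
            z = torch.randn(imgs.size(0), 100, device=imgs.device)
            # Under dp this assignment happens on a per-GPU replica and is
            # never copied back to the root module, which keeps None.
            self.generated_imgs = self.generator(z)
            ...
        if optimizer_idx == 1:  # discriminator step
            # The replicas for this step are built from the root module,
            # so self.generated_imgs is still None here:
            # AttributeError: 'NoneType' object has no attribute 'detach'
            fake_pred = self.discriminator(self.generated_imgs.detach())
            ...
```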

Environment:

python 3.7.5
pytorch 1.4.0
pytorch-lightning 0.7.1

github-actions[bot] commented 4 years ago

Hi! Thanks for your contribution! Great first issue!

lobantseff commented 4 years ago

The problem is that the gan.py example is supposed to use the buffered values self.generated_imgs and self.last_imgs; however, during replication and gathering in https://github.com/PyTorchLightning/pytorch-lightning/blob/22a7264e9a77ef70154e3ad7c926133c9f2205cd/pytorch_lightning/overrides/data_parallel.py#L64 these buffered values are neither replicated to the workers nor gathered back to the main LightningModule.
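
A tiny plain-PyTorch demo of the same effect (an illustration, not code from the Lightning repo): attributes assigned inside forward live on the per-device replicas that DataParallel creates and discards, so they never reach the wrapped module.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)
        self.cached = None

    def forward(self, x):
        out = self.fc(x)
        self.cached = out  # set on a replica, not on the original module
        return out


if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(Net().cuda())
    model(torch.randn(8, 4).cuda())
    print(model.module.cached)  # still None: replica state is discarded
```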

Borda commented 4 years ago

@armavox good catch, mind drafting a PR? :robot:

lobantseff commented 4 years ago

yep, I'll try

Borda commented 4 years ago

@armavox how is it going?

lobantseff commented 4 years ago

@Borda I expect to fix it by May

axkoenig commented 4 years ago

@armavox Any updates on this? Having the same issue...

lobantseff commented 4 years ago

Made some updates. Sorry for the wait.

There is an official warning about the use of local (here, buffered) variables during distributed training: https://pytorch.org/docs/stable/nn.html#torch.nn.DataParallel

So I didn't try to add workarounds in the Lightning code and only fixed the example so that it works with dp and ddp.
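
Roughly, the idea is to stop caching tensors on self between optimizer steps. A sketch of one way to do that (an illustration only, not necessarily the exact code of my change; self.adversarial_loss and self.hparams.latent_dim are taken from the gan.py example):

```python
import torch


# belongs inside the example's GAN LightningModule
def training_step(self, batch, batch_idx, optimizer_idx):
    imgs, _ = batch
    z = torch.randn(imgs.size(0), self.hparams.latent_dim, device=imgs.device)

    if optimizer_idx == 0:  # train generator
        valid = torch.ones(imgs.size(0), 1, device=imgs.device)
        return {'loss': self.adversarial_loss(self.discriminator(self(z)), valid)}

    if optimizer_idx == 1:  # train discriminator
        fake_imgs = self(z).detach()  # regenerated locally, no shared state
        valid = torch.ones(imgs.size(0), 1, device=imgs.device)
        fake = torch.zeros(imgs.size(0), 1, device=imgs.device)
        real_loss = self.adversarial_loss(self.discriminator(imgs), valid)
        fake_loss = self.adversarial_loss(self.discriminator(fake_imgs), fake)
        return {'loss': (real_loss + fake_loss) / 2}
```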

lobantseff commented 4 years ago

The problem from point 2 in the opening post seems to have been fixed by someone. But the unused folder for the parallel experiment is still created during ddp training. The problem is in https://github.com/PyTorchLightning/pytorch-lightning/blob/fdbbe968256f6c68a5dbb840a2004b77a618ef61/pytorch_lightning/trainer/callback_config.py#L66, which doesn't use the rank_zero_only decorator or an equivalent guard.

I would propose some fixes, but don't know how to do this elegantly.
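
For what it's worth, one direction could be something like this (a rough sketch with a hypothetical helper, assuming the launcher exposes the local rank via the LOCAL_RANK environment variable; a real fix would probably reuse the rank_zero_only decorator on the path-creation code instead):

```python
import os


def rank_zero_makedirs(path):
    """Create the checkpoint directory only on the rank-0 ddp process."""
    if int(os.environ.get('LOCAL_RANK', 0)) == 0:
        os.makedirs(path, exist_ok=True)
```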

Thanks for your work! Best regards, Artem.