lobantseff closed this 4 years ago
Hi! Thanks for your contribution! Great first issue!
The problem is that the gan.py example is supposed to use the buffered values self.generated_images and self.last_img. However, during replication and gathering in https://github.com/PyTorchLightning/pytorch-lightning/blob/22a7264e9a77ef70154e3ad7c926133c9f2205cd/pytorch_lightning/overrides/data_parallel.py#L64 the buffered values are not replicated and are not gathered back to the main LightningModule model.
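To illustrate the effect, here is a minimal pure-Python sketch rather than the real gan.py or Lightning code. ToyModule and run_on_replicas are hypothetical names, and copy.copy only stands in for the per-device replication that DataParallel performs. It shows that an attribute written on a replica during the step never reaches the original module, while returned values are gathered:

```python
import copy


class ToyModule:
    """Hypothetical stand-in for a LightningModule (not the real gan.py code)."""

    def __init__(self):
        self.generated_images = None  # buffered value, meant to be reused later

    def training_step(self, batch):
        # Writing to self only affects the object this method runs on.
        self.generated_images = [x * 2 for x in batch]
        return sum(self.generated_images)


def run_on_replicas(module, batch, n_replicas=2):
    """Simulate dp: copy the module per device, scatter the batch, gather outputs."""
    chunk = len(batch) // n_replicas
    outputs = []
    for i in range(n_replicas):
        replica = copy.copy(module)  # assumption: stands in for real replication
        outputs.append(replica.training_step(batch[i * chunk:(i + 1) * chunk]))
    return outputs


model = ToyModule()
outputs = run_on_replicas(model, [1, 2, 3, 4])
print(outputs)                 # [6, 14]: return values are gathered
print(model.generated_images)  # None: the buffered value stayed on the replicas
```

In this sketch, anything a later step needs has to travel through the return value, since state stored on self is lost with the replica.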
@armavox good catch, mind drafting a PR? :robot:
yep, I'll try
@armavox how is it going?
@Borda I expect to fix it by May
@armavox Any updates on this? Having the same issue...
Made some updates. Sorry for the wait.
There is an official warning about using local (here, buffered) variables during distributed training: https://pytorch.org/docs/stable/nn.html#torch.nn.DataParallel
So I didn't try to build workarounds in the Lightning code and fixed only the example so that it works with dp and ddp.
The problem from point 2 in the original post seems to have been fixed by someone. But an unused folder for the parallel experiment is still created during ddp training. The problem is in https://github.com/PyTorchLightning/pytorch-lightning/blob/fdbbe968256f6c68a5dbb840a2004b77a618ef61/pytorch_lightning/trainer/callback_config.py#L66, which doesn't use the rank_zero_only decorator or anything similar.
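For reference, the general idea of a rank-zero guard can be sketched as below. This is not Lightning's actual rank_zero_only implementation: rank_zero_only_sketch and make_version_dir are hypothetical names, and reading the rank from the RANK environment variable is an assumption that depends on how the processes are launched.

```python
import functools
import os


def rank_zero_only_sketch(fn):
    """Run fn only in the process with rank 0; other ranks get None."""

    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        # Assumption: the process rank is exposed via the RANK environment
        # variable, defaulting to 0 for a single-process run.
        if int(os.environ.get("RANK", "0")) == 0:
            return fn(*args, **kwargs)
        return None

    return wrapped


@rank_zero_only_sketch
def make_version_dir(root, version):
    """Hypothetical helper that creates root/version_<n> for logs and checkpoints."""
    path = os.path.join(root, f"version_{version}")
    os.makedirs(path, exist_ok=True)
    return path
```

With a guard like this, only one process would create the version folder, so a second, unused version_1 directory would not appear.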
I would propose some good fixes, but I don't know how to do this elegantly.
Thanks for your work! Best regards, Artem.
Running the gan.py example with Trainer(ngpus=2) causes two types of errors:
Trainer(ngpus=2, distributed_backend='dp')
Trainer(ngpus=2, distributed_backend='ddp')
In ./lightning_logs, one run creates two folders: version_0 and version_1
File "/opt/miniconda3/envs/ctln-gan/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 122, in _del_model
    os.remove(filepath)
FileNotFoundError: [Errno 2] No such file or directory: '/home/user/pyproj/DCGAN/lightning_logs/version_1/checkpoints/epoch=0.ckpt'
It seems that each subprocess tries to create its own checkpoints and then delete one that was never created.
Environment version:
python 3.7.5
pytorch 1.4.0
pytorch-lightning 0.7.1