Closed: bwdeng20 closed this issue 2 years ago.
Hi, I suspect this might have been caused by this week's release of lightning v1.5? I'm preparing an update for the template, so perhaps it will be resolved soon.
Likewise, I'm getting the same error.
This is a hydra + DDP issue. If the directory path in mode.default.yaml is changed to the current path, it seems to be temporarily runnable.
Yes, DDP requires the working directory to be the same for each spawned process, which is not compatible with the way hydra manipulates it. However, lightning implements a workaround, and it has been working correctly before. Are you running lightning v1.5? Perhaps that workaround broke in the recent release. I will investigate it later today.
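If you need a quick workaround in the meantime, you could try pinning Hydra's run directory to the current path from the command line (this assumes the template passes standard Hydra overrides through; untested):
python run.py trainer.gpus=[0,1] hydra.run.dir=.
This keeps the working directory stable for the processes DDP launches, at the cost of writing Hydra's outputs into the current directory.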
@bwdeng20 @m-bain @smartdolphin Hi! Do you still experience the issue? I have failed to reproduce it.
The following line:
python run.py trainer.gpus=[0,1]
is incorrect with the template's default settings; you should also specify the ddp accelerator:
python run.py trainer.gpus=[0,1] +trainer.accelerator=ddp
With the accelerator specified, I don't experience the FileNotFoundError.
Please update to the newest template version and let me know if the problem still exists and which PyTorch version you're using.
Is ddp the only option available? When I use the dp option with the following command, the error above is bypassed, but another error is raised.
python run.py trainer.gpus=[0,1] +trainer.strategy=dp
The problem seems to be in torchmetrics, even though its repo says multi-GPU is supported: 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!'
Can you check this issue? Thank you!
@shim94kr There are many strategies available, but I have not tested them.
Take a look at torchmetrics docs for DP: https://torchmetrics.readthedocs.io/en/latest/pages/overview.html#metrics-in-dataparallel-dp-mode
And lightning docs for DP: https://pytorch-lightning.readthedocs.io/en/latest/advanced/multi_gpu.html#data-parallel
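For reference, here is a rough, untested sketch of what DP-safe metric usage might look like based on those docs. LitModel, the 32-dimensional inputs, and the optimizer are placeholders, and the exact Accuracy arguments depend on your torchmetrics version:
import torch
import torchmetrics
from pytorch_lightning import LightningModule

class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(32, 10)
        # dist_sync_on_step=True follows the torchmetrics DP guide: it syncs
        # the replica metric states before DP destroys the replicas
        self.train_acc = torchmetrics.Accuracy(dist_sync_on_step=True)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.net(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        # return preds/targets so DP can gather them onto one device
        return {"loss": loss, "preds": logits.detach(), "targets": y}

    def training_step_end(self, outputs):
        # in DP mode the outputs of all replicas arrive here gathered on the
        # root device, so updating the metric in *_step_end avoids mixing
        # tensors that live on cuda:0 and cuda:1
        self.train_acc(outputs["preds"], outputs["targets"])
        self.log("train/acc", self.train_acc)
        return outputs["loss"].mean()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
The key point is that the metric update happens only after DP has gathered the per-GPU outputs onto one device.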
Generally, DP use is discouraged by PyTorch and Lightning. Is there a reason you want to use DP instead of DDP?
I recommend that everyone download the current template from the main branch, set up a new conda environment, install the requirements, and see if the problem with DDP still occurs.
Thank you for providing the references!
I'm using DP in my project since it was only compatible with DP mode. I've now realized that DDP is the standard for PyTorch and Lightning. Thank you!
And I confirmed that DDP works in the current template!
I checked DDP in the latest template. It works! Thank you!
(Updated to indicate that the affected version is V1.1.)
Thanks for your awesome work!
Reproduce
Directly download the main code (Release V1.1) and run a multi-GPU command such as python run.py trainer.gpus=[0,1].
Environment
Machine
1080 Ti x 4, Ubuntu 18.04
Conda environment, shown with pip list:
Error Info