MultiGPU Error #197

bwdeng20 closed 2 years ago

bwdeng20 commented 3 years ago

(update to indicate the bug version is V1.1)

Thanks for your awesome work!


Directly download the main code (ReleaseV1.1) and execute this line (or any line that uses multiple GPUs):

python trainer.gpus=[0,1]



1080Ti x 4, Ubuntu18.04

conda env shown with pip list

Error Info

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/amax/anaconda3/envs/pyg18/lib/python3.8/multiprocessing/", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/amax/anaconda3/envs/pyg18/lib/python3.8/multiprocessing/", line 125, in _main
  File "/home/amax/anaconda3/envs/pyg18/lib/python3.8/multiprocessing/", line 236, in prepare
  File "/home/amax/anaconda3/envs/pyg18/lib/python3.8/multiprocessing/", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/home/amax/anaconda3/envs/pyg18/lib/python3.8/", line 264, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/home/amax/anaconda3/envs/pyg18/lib/python3.8/", line 234, in _get_code_from_file
    with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/data/dbw/projects/lightning-hydra-template-main/logs/runs/2021-11-04/10-14-23/'

ashleve commented 3 years ago

Hi, I suspect this might have been caused by this week's release of Lightning v1.5. I'm preparing an update for the template, so perhaps it will be resolved soon.

m-bain commented 3 years ago

Likewise, I'm getting the same error.

smartdolphin commented 2 years ago

This is a Hydra + DDP issue. If the dir path in mode.default.yaml is changed to the current path, it seems to run, at least as a temporary workaround.

ashleve commented 2 years ago

Yes, DDP requires the working directory to be the same for each run, which is incompatible with the way Hydra changes it. However, Lightning implements a workaround, and it worked correctly before. Are you running Lightning v1.5? Perhaps that workaround broke in the recent release. I will investigate later today.
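To illustrate the working-directory point, here is a minimal sketch, using only the standard library, of how a relative script path stops resolving once a process moves into a per-run directory. It is not Lightning's or Hydra's actual code; `entry.py`, the `logs/runs/example` directory, and the helper names are hypothetical stand-ins.

```python
import os
import tempfile
from pathlib import Path


def script_visible_from(script, cwd):
    """Return True if the path `script` names a file when resolved from `cwd`."""
    previous = os.getcwd()
    os.chdir(cwd)
    try:
        return Path(script).is_file()
    finally:
        # Always restore the original working directory.
        os.chdir(previous)


def demo():
    with tempfile.TemporaryDirectory() as project:
        project = Path(project).resolve()
        # Hypothetical entry script living in the project root.
        (project / "entry.py").write_text("print('training')\n")
        # Hypothetical per-run output directory, similar in spirit to logs/runs/<date>/<time>.
        run_dir = project / "logs" / "runs" / "example"
        run_dir.mkdir(parents=True)

        relative = "entry.py"
        absolute = str(project / "entry.py")
        return {
            "relative_from_project": script_visible_from(relative, project),
            "relative_from_run_dir": script_visible_from(relative, run_dir),
            "absolute_from_run_dir": script_visible_from(absolute, run_dir),
        }


if __name__ == "__main__":
    print(demo())
```

The relative path resolves only from the project root, while the absolute path resolves from either directory. That is the general shape of the failure in the traceback, where a child process looks for the entry script inside the run directory.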

ashleve commented 2 years ago

@bwdeng20 @m-bain @smartdolphin Hi! Do you still experience the issue? I have failed to reproduce it.

The following line:

python trainer.gpus=[0,1]

is incorrect with the template's default settings; you should also specify the ddp accelerator:

python trainer.gpus=[0,1] +trainer.accelerator=ddp

With the accelerator specified, I don't experience the FileNotFoundError.

Please update to the newest template version and let me know if the problem still exists and which PyTorch version you're using.
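For readers unfamiliar with the override syntax in the command above: in Hydra, a plain `key=value` changes a key that already exists in the config, while a leading `+` adds a key that is not there yet. The sketch below imitates that behaviour on a nested dict. It is a simplified illustration, not Hydra's implementation, and the starting config (only `gpus` under `trainer`) is an assumption about the template defaults.

```python
import ast


def parse_value(text):
    """Parse `[0,1]` as a list and fall back to a plain string such as `ddp`."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_override(cfg, override):
    """Apply one `key.path=value` or `+key.path=value` override to a nested dict."""
    adding = override.startswith("+")
    body = override[1:] if adding else override
    key_path, _, raw = body.partition("=")
    *parents, leaf = key_path.split(".")
    node = cfg
    for name in parents:
        node = node[name]  # parent groups must already exist in this sketch
    if adding and leaf in node:
        raise KeyError(f"{key_path} already exists; drop the leading '+'")
    if not adding and leaf not in node:
        raise KeyError(f"{key_path} is not in the config; prefix it with '+'")
    node[leaf] = parse_value(raw)
    return cfg


# Hypothetical defaults: only `gpus` is present under `trainer`.
cfg = {"trainer": {"gpus": 0}}
for item in ["trainer.gpus=[0,1]", "+trainer.accelerator=ddp"]:
    apply_override(cfg, item)
print(cfg)
```

Under these assumed defaults, `trainer.gpus` is overridden in place and `trainer.accelerator` is added as a new key, which is why only the second override carries the `+`.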

shim94kr commented 2 years ago

Is ddp the only option available? When I use the dp option with the following command, the error above is bypassed, but another error is raised.

python trainer.gpus=[0,1] +trainer.strategy=dp

The problem was in torchmetrics, although the repo says multi-GPU training is supported: 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!'

Can you check this issue? Thank you!

ashleve commented 2 years ago

@shim94kr There are many strategies available, but I have not tested them.

Take a look at torchmetrics docs for DP:

And lightning docs for DP:

Generally, DP use is discouraged by PyTorch and Lightning. Is there a reason you want to use DP instead of DDP?

ashleve commented 2 years ago

I recommend that everyone download the current template from the main branch, set up a new conda environment, install the requirements, and see if the problem with DDP still occurs.

shim94kr commented 2 years ago

Thank you for providing the references!

I'm using DP in my project because it was only compatible with DP mode. I have just learned that DDP is the standard for PyTorch and Lightning. Thank you!

shim94kr commented 2 years ago

And I checked that DDP works in the current template!

smartdolphin commented 2 years ago

I checked DDP in the latest template. It works! Thank you!