ashleve / lightning-hydra-template

PyTorch Lightning + Hydra. A very user-friendly template for ML experimentation. ⚡🔥⚡

MultiGPU Error #197

Closed bwdeng20 closed 2 years ago

bwdeng20 commented 3 years ago

(update to indicate the bug version is V1.1)

Thanks for your awesome work!

Reproduce

Download the main code (Release V1.1) and run this line (or any line that uses multiple GPUs):

python run.py trainer.gpus=[0,1]

Environment

Machine

1080Ti x 4, Ubuntu18.04

Conda environment, as shown by pip list

Package                           Version
--------------------------------- ---------
absl-py                           0.15.0
aiohttp                           3.8.0
aiosignal                         1.2.0
alembic                           1.6.5
antlr4-python3-runtime            4.8
anyio                             3.3.4
argon2-cffi                       21.1.0
async-timeout                     4.0.0
attrs                             21.2.0
autopage                          0.4.0
Babel                             2.9.1
backcall                          0.2.0
backports.entry-points-selectable 1.1.0
backports.functools-lru-cache     1.6.4
black                             21.10b0
bleach                            4.1.0
brotlipy                          0.7.0
cachetools                        4.2.4
certifi                           2021.10.8
cffi                              1.15.0
cfgv                              3.3.1
chardet                           4.0.0
charset-normalizer                2.0.7
click                             8.0.3
cliff                             3.9.0
cmaes                             0.8.2
cmd2                              2.2.0
colorama                          0.4.4
colorlog                          6.5.0
commonmark                        0.9.1
conda                             4.10.3
conda-package-handling            1.7.3
configparser                      5.0.2
cryptography                      35.0.0
cycler                            0.10.0
debugpy                           1.5.1
decorator                         5.1.0
defusedxml                        0.7.1
distlib                           0.3.3
docker-pycreds                    0.4.0
entrypoints                       0.3
filelock                          3.3.1
flake8                            4.0.1
frozenlist                        1.2.0
fsspec                            2021.10.1
future                            0.18.2
gitdb                             4.0.9
GitPython                         3.1.24
google-auth                       2.3.3
google-auth-oauthlib              0.4.6
googledrivedownloader             0.4
greenlet                          1.1.2
grpcio                            1.41.0
hydra-colorlog                    1.1.0
hydra-core                        1.1.1
hydra-optuna-sweeper              1.1.1
identify                          2.3.3
idna                              3.3
importlib-resources               5.4.0
iniconfig                         1.1.1
ipykernel                         6.4.1
ipython                           7.28.0
ipython-genutils                  0.2.0
ipywidgets                        7.6.5
isodate                           0.6.0
isort                             5.9.3
jedi                              0.18.0
Jinja2                            3.0.2
joblib                            1.1.0
json5                             0.9.6
jsonschema                        4.1.0
jupyter-client                    7.0.6
jupyter-core                      4.8.1
jupyter-server                    1.11.1
jupyterlab                        3.2.0
jupyterlab-pygments               0.1.2
jupyterlab-server                 2.8.2
jupyterlab-widgets                1.0.2
kiwisolver                        1.3.2
llvmlite                          0.37.0
Mako                              1.1.5
mamba                             0.17.0
Markdown                          3.3.4
MarkupSafe                        2.0.1
matplotlib                        3.4.3
matplotlib-inline                 0.1.3
mccabe                            0.6.1
mistune                           0.8.4
mkl-fft                           1.3.0
mkl-random                        1.2.2
mkl-service                       2.4.0
msgpack                           1.0.2
multidict                         5.2.0
mypy-extensions                   0.4.3
nbclassic                         0.3.2
nbclient                          0.5.4
nbconvert                         6.2.0
nbformat                          5.1.3
nest-asyncio                      1.5.1
networkx                          2.6.3
nodeenv                           1.6.0
notebook                          6.4.4
numba                             0.54.1
numpy                             1.20.3
oauthlib                          3.1.1
olefile                           0.46
omegaconf                         2.1.1
optuna                            2.10.0
packaging                         21.0
pandas                            1.3.4
pandocfilters                     1.5.0
parso                             0.8.2
pathspec                          0.9.0
pathtools                         0.1.2
pbr                               5.6.0
pexpect                           4.8.0
pickleshare                       0.7.5
Pillow                            8.3.1
pip                               21.3.1
platformdirs                      2.4.0
plotly                            5.3.1
pluggy                            1.0.0
pre-commit                        2.15.0
prettytable                       2.2.1
prometheus-client                 0.11.0
promise                           2.3
prompt-toolkit                    3.0.20
protobuf                          3.18.1
psutil                            5.8.0
ptyprocess                        0.7.0
pudb                              2021.2.2
py                                1.10.0
pyasn1                            0.4.8
pyasn1-modules                    0.2.8
pycodestyle                       2.8.0
pycosat                           0.6.3
pycparser                         2.20
pyDeprecate                       0.3.1
pyflakes                          2.4.0
Pygments                          2.10.0
PyGSP                             0.5.1
pyOpenSSL                         21.0.0
pyparsing                         2.4.7
pyperclip                         1.8.2
pyrsistent                        0.18.0
PySocks                           1.7.1
pytest                            6.2.5
python-dateutil                   2.8.2
python-dotenv                     0.19.1
python-editor                     1.0.4
pytorch-lightning                 1.5.0
pytz                              2021.3
PyYAML                            6.0
pyzmq                             22.3.0
ray                               1.7.0
rdflib                            6.0.2
redis                             3.5.3
regex                             2021.11.1
requests                          2.26.0
requests-oauthlib                 1.3.0
requests-unixsocket               0.2.0
rich                              10.12.0
rsa                               4.7.2
ruamel-yaml-conda                 0.15.80
scikit-learn                      1.0
scikit-sparse                     0.4.6
scikit-umfpack                    0.3.2
scipy                             1.7.1
seaborn                           0.11.2
Send2Trash                        1.8.0
sentry-sdk                        1.4.3
setuptools                        58.0.4
sh                                1.14.2
shortuuid                         1.0.1
six                               1.16.0
sklearn                           0.0
smmap                             5.0.0
sniffio                           1.2.0
SQLAlchemy                        1.4.26
stevedore                         3.5.0
subprocess32                      3.5.4
tenacity                          8.0.1
tensorboard                       2.7.0
tensorboard-data-server           0.6.1
tensorboard-plugin-wit            1.8.0
termcolor                         1.1.0
terminado                         0.12.1
testpath                          0.5.0
thgsp                             0.1.0
threadpoolctl                     3.0.0
toml                              0.10.2
tomli                             1.2.2
torch                             1.8.2
torch-cluster                     1.5.9
torch-geometric                   2.0.1
torch-scatter                     2.0.8
torch-sparse                      0.6.12
torch-spline-conv                 1.2.1
torchaudio                        0.8.2
torchmetrics                      0.5.1
torchvision                       0.9.2
tornado                           6.1
tqdm                              4.62.3
traitlets                         5.1.0
typing-extensions                 3.10.0.2
urllib3                           1.26.7
urwid                             2.1.2
urwid-readline                    0.13
virtualenv                        20.10.0
wandb                             0.12.6
wcwidth                           0.2.5
webencodings                      0.5.1
websocket-client                  1.2.1
Werkzeug                          2.0.2
wheel                             0.37.0
widgetsnbextension                3.5.1
yacs                              0.1.8
yarl                              1.7.2
yaspin                            2.1.0
zipp                              3.6.0

Error Info

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/amax/anaconda3/envs/pyg18/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/amax/anaconda3/envs/pyg18/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/home/amax/anaconda3/envs/pyg18/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/amax/anaconda3/envs/pyg18/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/home/amax/anaconda3/envs/pyg18/lib/python3.8/runpy.py", line 264, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/home/amax/anaconda3/envs/pyg18/lib/python3.8/runpy.py", line 234, in _get_code_from_file
    with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/data/dbw/projects/lightning-hydra-template-main/logs/runs/2021-11-04/10-14-23/run.py'

ashleve commented 3 years ago

Hi, I suspect this might have been caused by this week's release of Lightning v1.5. I'm preparing an update for the template, so perhaps it will be resolved soon.

m-bain commented 3 years ago

Likewise, I'm getting the same error.

smartdolphin commented 2 years ago

This is a Hydra + DDP issue. If the dir path in mode.default.yaml is changed to the current path, it seems to run, as a temporary workaround.

ashleve commented 2 years ago

Yes, DDP requires the working directory to be the same on each run, which is not compatible with the way Hydra changes it. However, Lightning implements a workaround, and it worked correctly before. Are you running Lightning v1.5? Perhaps that workaround broke in the recent release. I will investigate later today.
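
A minimal sketch of the working-directory mismatch described above, using only the standard library. It is an illustration under assumptions, not Hydra's or Lightning's actual code: the helper resolve_script and the directory name demo-run are hypothetical, and the real lookup in the traceback happens inside multiprocessing's spawn machinery. The point is only that a script name resolved relative to the current directory stops resolving once the process has moved into a per-run output directory.

```python
import os
import tempfile
from pathlib import Path


def resolve_script(script_name: str) -> Path:
    """Resolve a relative script name against the current working directory."""
    return Path(os.getcwd()) / script_name


original_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    project_dir = Path(tmp).resolve()
    (project_dir / "run.py").write_text("print('training')\n")

    # Hypothetical per-run output directory, similar in shape to logs/runs/<date>/<time>.
    run_dir = project_dir / "logs" / "runs" / "demo-run"
    run_dir.mkdir(parents=True)

    try:
        # Launched from the project root: run.py is where we expect it.
        os.chdir(project_dir)
        found_from_project = resolve_script("run.py").exists()

        # After the working directory moves into the run directory,
        # the same relative name points at a file that does not exist.
        os.chdir(run_dir)
        found_from_run_dir = resolve_script("run.py").exists()
    finally:
        # Always restore the caller's working directory.
        os.chdir(original_cwd)

print("found from project dir:", found_from_project)
print("found from run dir:", found_from_run_dir)
```

The second lookup fails in the same way as the FileNotFoundError in the report, where run.py is searched for under logs/runs/2021-11-04/10-14-23/ instead of the project root.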

ashleve commented 2 years ago

@bwdeng20 @m-bain @smartdolphin Hi! Do you still experience the issue? I have failed to reproduce it.

The following line:

python run.py trainer.gpus=[0,1]

is incorrect with the template's default settings; you should also specify the ddp accelerator:

python run.py trainer.gpus=[0,1] +trainer.accelerator=ddp

With the accelerator specified, I don't get the FileNotFoundError.

Please update to the newest template version and let me know whether the problem still exists and which PyTorch version you're using.

shim94kr commented 2 years ago

Is DDP the only option available? When I use the dp option with the following command, the error above is bypassed, but another error is raised.

python run.py trainer.gpus=[0,1] +trainer.strategy=dp

The problem was in torchmetrics, although the repo says multiple GPUs are supported: 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!'

Can you check this issue? Thank you!

ashleve commented 2 years ago

@shim94kr There are many strategies available, but I have not tested them.

Take a look at torchmetrics docs for DP: https://torchmetrics.readthedocs.io/en/latest/pages/overview.html#metrics-in-dataparallel-dp-mode

And lightning docs for DP: https://pytorch-lightning.readthedocs.io/en/latest/advanced/multi_gpu.html#data-parallel

Generally, DP use is discouraged by PyTorch and Lightning. Is there a reason you want to use DP instead of DDP?

ashleve commented 2 years ago

I recommend that everyone download the current template from the main branch, set up a new conda environment, install the requirements, and see if the problem with DDP still occurs.

shim94kr commented 2 years ago

Thank you for providing the references!

I was using DP in my project because it was only compatible with DP mode. I have only just learned that DDP is the standard in PyTorch and Lightning. Thank you!

shim94kr commented 2 years ago

And I checked that DDP works in the current template!

smartdolphin commented 2 years ago

I checked DDP in the latest template. It works! Thank you!