Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Stuck at loading the trainer module #18836

Closed: alalith3298 closed this issue 11 months ago

alalith3298 commented 1 year ago

Bug description

I am trying to create the trainer, but the code hangs at trainer = Trainer(accelerator='gpu', devices=1). I am running this on a server with four NVIDIA RTX A6000 GPUs.

What version are you seeing the problem on?

master

How to reproduce the bug

from pytorch_lightning import Trainer

trainer = Trainer(accelerator='cuda', devices=1)

### Error messages and logs


No error is shown, but the code never returns.

### Environment

<details>
  <summary>Current environment</summary>

* CUDA:
        - GPU:
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
        - available:         True
        - version:           11.6
* Lightning:
        - efficientnet-pytorch: 0.7.1
        - lightning-utilities: 0.9.0
        - pytorch-lightning: 2.1.0
        - pytorchvideo:      0.1.5
        - torch:             1.12.1
        - torch-tb-profiler: 0.4.1
        - torch-xla:         1.0
        - torchmetrics:      1.2.0
        - torchvision:       0.13.1
* Packages:
        - absl-py:           2.0.0
        - aiohttp:           3.8.6
        - aiosignal:         1.3.1
        - albumentations:    1.3.1
        - alembic:           1.12.0
        - annotated-types:   0.6.0
        - appdirs:           1.4.3
        - apt-clone:         0.2.1
        - apturl:            0.5.2
        - astunparse:        1.6.2
        - async-timeout:     4.0.3
        - atomicwrites:      1.1.5
        - attrs:             19.3.0
        - av:                10.0.0
        - backcall:          0.1.0
        - bcrypt:            3.1.7
        - beautifulsoup4:    4.8.2
        - bleach:            1.5.0
        - blinker:           1.4
        - blis:              0.7.11
        - blosc:             1.7.0
        - brlapi:            0.7.0
        - cachetools:        4.0.0
        - caffe:             1.0.0
        - catalogue:         2.0.10
        - certifi:           2019.11.28
        - cffi:              1.14.0
        - chardet:           3.0.4
        - charset-normalizer: 3.3.0
        - chrome-gnome-shell: 0.0.0
        - click:             8.1.7
        - cloudpathlib:      0.15.1
        - cloudpickle:       1.3.0
        - cmaes:             0.10.0
        - colorama:          0.4.3
        - colorlog:          6.7.0
        - comm:              0.1.4
        - command-not-found: 0.3
        - confection:        0.1.3
        - cryptography:      2.8
        - cupshelpers:       1.0
        - cycler:            0.10.0
        - cymem:             2.0.8
        - cython:            0.29.14
        - dask:              2.8.1+dfsg
        - dbus-python:       1.2.16
        - decorator:         4.4.2
        - defer:             1.0.6
        - defusedxml:        0.6.0
        - distlib:           0.3.0
        - distro:            1.4.0
        - distro-info:       0.23ubuntu1
        - docker-pycreds:    0.4.0
        - duplicity:         0.8.12.0
        - efficientnet-pytorch: 0.7.1
        - entrypoints:       0.3
        - et-xmlfile:        1.0.1
        - fastai:            2.7.12
        - fastcore:          1.5.29
        - fastdownload:      0.0.7
        - fasteners:         0.14.1
        - fastprogress:      1.0.3
        - filelock:          3.0.12
        - flake8:            3.7.9
        - flatbuffers:       23.5.26
        - frozenlist:        1.4.0
        - fsspec:            2023.9.2
        - future:            0.18.2
        - fvcore:            0.1.5.post20221221
        - gast:              0.4.0
        - gitdb:             4.0.11
        - gitpython:         3.1.40
        - google-auth:       2.23.0
        - google-auth-oauthlib: 1.0.0
        - google-pasta:      0.2.0
        - greenlet:          3.0.0
        - grpcio:            1.58.0
        - h5py:              3.9.0
        - html5lib:          0.9999999
        - httplib2:          0.14.0
        - huggingface-hub:   0.18.0
        - idna:              2.8
        - imageio:           2.31.5
        - imageio-ffmpeg:    0.4.9
        - importlib-metadata: 1.5.0
        - importlib-resources: 6.1.0
        - iopath:            0.1.10
        - ipykernel:         5.2.0
        - ipython:           7.13.0
        - ipython-genutils:  0.2.0
        - ipywidgets:        8.1.1
        - jdcal:             1.0
        - jedi:              0.15.2
        - jinja2:            2.10.1
        - joblib:            0.14.0
        - jsonschema:        3.2.0
        - jupyter-client:    6.1.2
        - jupyter-console:   6.0.0
        - jupyter-core:      4.6.3
        - jupyterlab-widgets: 3.0.9
        - keras:             2.13.1
        - keras-preprocessing: 1.1.2
        - keyring:           18.0.1
        - kiwisolver:        1.0.1
        - langcodes:         3.3.0
        - language-selector: 0.1
        - launchpadlib:      1.10.13
        - lazr.restfulclient: 0.14.2
        - lazr.uri:          1.0.3
        - libclang:          16.0.6
        - lightning-utilities: 0.9.0
        - llvmlite:          0.41.1
        - locket:            0.2.0
        - lockfile:          0.12.2
        - louis:             3.12.0
        - lxml:              4.5.0
        - macaroonbakery:    1.3.1
        - mako:              1.1.0
        - markdown:          3.1.1
        - markupsafe:        2.0.1
        - matplotlib:        3.1.2
        - mccabe:            0.6.1
        - mistune:           0.8.4
        - monotonic:         1.5
        - more-itertools:    4.2.0
        - moviepy:           1.0.3
        - mpi4py:            3.0.3
        - multidict:         6.0.4
        - munch:             4.0.0
        - murmurhash:        1.0.10
        - nbconvert:         5.6.1
        - nbformat:          5.0.4
        - netifaces:         0.10.4
        - networkx:          2.4
        - nose:              1.3.7
        - notebook:          6.0.3
        - numba:             0.58.1
        - numexpr:           2.7.1
        - numpy:             1.23.5
        - nvidia-ml-py3:     7.352.0
        - oauthlib:          3.1.0
        - olefile:           0.46
        - opencv-python-headless: 4.8.1.78
        - openpyxl:          3.1.2
        - opt-einsum:        3.3.0
        - optuna:            3.3.0
        - packaging:         23.2
        - pam:               0.4.2
        - pandas:            2.0.3
        - pandocfilters:     1.4.2
        - parameterized:     0.7.0
        - paramiko:          2.6.0
        - parso:             0.5.2
        - partd:             1.0.0
        - pathtools:         0.1.2
        - pathy:             0.10.2
        - pexpect:           4.6.0
        - pickleshare:       0.7.5
        - pillow:            9.5.0
        - pip:               20.0.2
        - pluggy:            0.13.0
        - ply:               3.11
        - portalocker:       2.8.2
        - preshed:           3.0.9
        - pretrainedmodels:  0.7.4
        - proglog:           0.1.10
        - prometheus-client: 0.7.1
        - prompt-toolkit:    2.0.10
        - protobuf:          4.24.3
        - psutil:            5.5.1
        - py:                1.8.1
        - pyasn1:            0.4.2
        - pyasn1-modules:    0.2.1
        - pycairo:           1.16.2
        - pycodestyle:       2.5.0
        - pycparser:         2.19
        - pycuda:            2019.1.2
        - pycups:            1.9.73
        - pydantic:          2.4.2
        - pydantic-core:     2.10.1
        - pydicom:           2.4.3
        - pydot:             1.4.1
        - pyflakes:          2.1.1
        - pygments:          2.3.1
        - pygobject:         3.36.0
        - pygpu:             0.7.6
        - pyicu:             2.4.2
        - pyinotify:         0.9.6
        - pyjwt:             1.7.1
        - pymacaroons:       0.13.0
        - pynacl:            1.3.0
        - pyopenssl:         19.0.0
        - pyparsing:         2.4.6
        - pyrfc3339:         1.1
        - pyrsistent:        0.15.5
        - pytest:            4.6.9
        - python-apt:        2.0.0+ubuntu0.20.4.8
        - python-dateutil:   2.8.2
        - python-debian:     0.1.36ubuntu1
        - pytools:           2019.1.1
        - pytorch-lightning: 2.1.0
        - pytorchvideo:      0.1.5
        - pytz:              2023.3.post1
        - pywavelets:        0.5.1
        - pyxdg:             0.26
        - pyyaml:            6.0.1
        - pyzmq:             18.1.1
        - qudida:            0.0.4
        - reportlab:         4.0.4
        - requests:          2.31.0
        - requests-oauthlib: 1.0.0
        - requests-unixsocket: 0.2.0
        - roc-utils:         0.2.2
        - rsa:               4.0
        - safetensors:       0.4.0
        - scikit-cuda:       0.5.3
        - scikit-image:      0.16.2
        - scikit-learn:      0.22.2.post1
        - scipy:             1.3.3
        - screen-resolution-extra: 0.0.0
        - seaborn:           0.12.2
        - secretstorage:     2.3.1
        - send2trash:        1.5.0
        - sentry-sdk:        1.32.0
        - setproctitle:      1.3.3
        - setuptools:        45.2.0
        - shap:              0.43.0
        - simpleitk:         2.3.0
        - simplejson:        3.16.0
        - six:               1.14.0
        - slicer:            0.0.7
        - smart-open:        6.4.0
        - smmap:             5.0.1
        - soupsieve:         1.9.5
        - spacy:             3.7.1
        - spacy-legacy:      3.0.12
        - spacy-loggers:     1.0.5
        - sqlalchemy:        2.0.21
        - srsly:             2.4.8
        - ssh-import-id:     5.10
        - systemd-python:    234
        - tables:            3.6.1
        - tabulate:          0.9.0
        - tensorboard:       2.13.0
        - tensorboard-data-server: 0.7.1
        - tensorflow:        2.13.0
        - tensorflow-estimator: 2.13.0
        - tensorflow-gpu:    2.9.1
        - tensorflow-io-gcs-filesystem: 0.34.0
        - tensorflow-tensorboard: 1.5.1
        - termcolor:         1.1.0
        - terminado:         0.8.2
        - testpath:          0.4.4
        - theano:            1.0.4
        - thinc:             8.2.1
        - timm:              0.9.7
        - toolz:             0.9.0
        - torch:             1.12.1
        - torch-tb-profiler: 0.4.1
        - torch-xla:         1.0
        - torchmetrics:      1.2.0
        - torchvision:       0.13.1
        - tornado:           5.1.1
        - tqdm:              4.66.1
        - traitlets:         4.3.3
        - typer:             0.9.0
        - typing-extensions: 4.8.0
        - tzdata:            2023.3
        - ubuntu-advantage-tools: 27.11.3
        - ubuntu-drivers-common: 0.0.0
        - ufw:               0.36
        - unattended-upgrades: 0.1
        - urllib3:           2.0.7
        - usb-creator:       0.3.7
        - virtualenv:        20.0.17
        - wadllib:           1.3.3
        - wandb:             0.15.12
        - wasabi:            1.1.2
        - wcwidth:           0.1.8
        - weasel:            0.3.2
        - webencodings:      0.5.1
        - werkzeug:          2.3.7
        - wheel:             0.34.2
        - widgetsnbextension: 4.0.9
        - wrapt:             1.11.2
        - xgboost:           2.0.0
        - xkit:              0.0.0
        - xlrd:              1.1.0
        - xlwt:              1.3.0
        - yacs:              0.1.8
        - yarl:              1.9.2
        - zipp:              3.17.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.8.10
        - release:           5.15.0-76-generic
        - version:           #83~20.04.1-Ubuntu SMP Wed Jun 21 20:23:31 UTC 2023

</details>

### More info

_No response_

cc @justusschock @awaelchli

awaelchli commented 1 year ago

@alalith3298 Could you please run a regular pytorch example on the GPU to ensure that your torch install is working: https://github.com/pytorch/examples/tree/main/mnist
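If running the full MNIST example is inconvenient, a much smaller check may already tell you whether torch can reach the GPU. The following is only a rough sketch, not a replacement for the linked example: the helper name `check_cuda` and the report keys are made up for illustration, and it only uses `torch.cuda.is_available()`, a small matrix multiplication, and `torch.cuda.synchronize()`. If this script also hangs, the problem is likely in the torch/CUDA setup rather than in Lightning.

```python
import importlib.util


def check_cuda():
    """Return a short report on whether torch can run a computation on the GPU.

    `check_cuda` and the report keys are hypothetical names used for this sketch.
    """
    # Avoid an ImportError so the script still reports something useful.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False, "cuda_available": False, "matmul_ok": False}

    import torch

    report = {
        "torch_installed": True,
        "cuda_available": bool(torch.cuda.is_available()),
        "matmul_ok": False,
    }
    if report["cuda_available"]:
        # Run a small computation on the default CUDA device and wait for it
        # to finish, so a broken driver or runtime shows up here.
        x = torch.randn(256, 256, device="cuda")
        y = x @ x
        torch.cuda.synchronize()
        report["matmul_ok"] = bool(torch.isfinite(y).all())
    return report


if __name__ == "__main__":
    print(check_cuda())
```

Posting the printed report in the issue would help narrow things down.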

I don't think there is any reason to get stuck in the Trainer initialization unless something is wrong with the torch install. The other possibility is that you are misinterpreting where the script gets stuck. Please record the output of your program in the bug report above.
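One way to see where a script is actually stuck is Python's standard `faulthandler` module, which can print the traceback of every thread if the program is still running after a timeout. Below is a minimal sketch under that assumption. The wrapper name `run_with_hang_dump` and the 60-second default are chosen only for illustration; the `faulthandler` calls themselves are part of the standard library.

```python
import faulthandler
import sys


def run_with_hang_dump(fn, timeout_s=60.0, file=None):
    """Call fn(); if it is still running after timeout_s seconds, dump the
    traceback of all threads to `file` (stderr by default) without exiting.

    `run_with_hang_dump` is a hypothetical helper name for this sketch.
    """
    if file is None:
        file = sys.stderr
    # Schedule a traceback dump from a watchdog thread after the timeout.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)
    try:
        return fn()
    finally:
        # Cancel the pending dump once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()


# Usage for the snippet from the report (commented out because it needs
# pytorch_lightning and a GPU):
#
# from pytorch_lightning import Trainer
# trainer = run_with_hang_dump(lambda: Trainer(accelerator="cuda", devices=1))
```

If the call hangs, the dumped traceback shows the line that is blocking, which is the output worth adding to the bug report.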

awaelchli commented 1 year ago

@alalith3298 Could you please take another look?