Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Trainer initialization freezes when mpi4py is installed #20049

Open Robinysh opened 2 months ago

Robinysh commented 2 months ago

Bug description

Trainer freezes on initialization when mpi4py is installed.

I suspect the following issues are encountering the same problem: #18836 #19768

What version are you seeing the problem on?

master

How to reproduce the bug

This freezes:

pip install lightning mpi4py
python -c "import lightning; lightning.Trainer(accelerator='cpu')"

This does not freeze:

pip install lightning
python -c "import lightning; lightning.Trainer(accelerator='cpu')"

Error messages and logs

No response
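Since a hang produces no error output, one way to collect a log is Python's standard `faulthandler` module, which can dump the stack of every thread after a timeout. The following is a minimal sketch, not part of Lightning; the helper name `run_with_hang_dump` and the 30-second default are assumptions chosen for illustration.

```python
import faulthandler
import sys
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_hang_dump(fn: Callable[[], T], timeout: float = 30.0) -> T:
    """Run fn(); if it has not returned after `timeout` seconds,
    dump all thread tracebacks to stderr and exit the process.

    `run_with_hang_dump` is a hypothetical helper, not a Lightning API.
    """
    # Arm a watchdog: after `timeout` seconds, print every thread's stack
    # to the original stderr and terminate the interpreter.
    faulthandler.dump_traceback_later(timeout, exit=True, file=sys.__stderr__)
    try:
        return fn()
    finally:
        # fn() returned (or raised) in time, so disarm the watchdog.
        faulthandler.cancel_dump_traceback_later()


# Usage for this issue (assumes lightning is installed):
#   import lightning
#   run_with_hang_dump(lambda: lightning.Trainer(accelerator="cpu"))
```

If the Trainer hangs, the dumped traceback should show which call is blocked, which could then be pasted into this section.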

Environment

Current environment

* CUDA:
    - GPU:
        - NVIDIA GeForce RTX 3090
    - available: True
    - version: 12.1
* Lightning:
    - lightning: 2.3.2
    - lightning-utilities: 0.11.3.post0
    - pytorch-lightning: 2.3.2
    - torch: 2.3.1
    - torchmetrics: 1.4.0.post0
* Packages:
    - aiohttp: 3.9.5
    - aiosignal: 1.3.1
    - attrs: 23.2.0
    - filelock: 3.15.4
    - frozenlist: 1.4.1
    - fsspec: 2024.6.1
    - gitdb: 4.0.11
    - gitpython: 3.1.40
    - globus-cli: 3.23.0
    - globus-sdk: 3.34.0
    - idna: 3.7
    - jinja2: 3.1.4
    - jupyter-server-mathjax: 0.2.6
    - lightning: 2.3.2
    - lightning-utilities: 0.11.3.post0
    - markupsafe: 2.1.5
    - mpi4py: 3.1.6
    - mpmath: 1.3.0
    - multidict: 6.0.5
    - nbdime: 4.0.1
    - networkx: 3.3
    - numpy: 2.0.0
    - nvidia-cublas-cu12: 12.1.3.1
    - nvidia-cuda-cupti-cu12: 12.1.105
    - nvidia-cuda-nvrtc-cu12: 12.1.105
    - nvidia-cuda-runtime-cu12: 12.1.105
    - nvidia-cudnn-cu12: 8.9.2.26
    - nvidia-cufft-cu12: 11.0.2.54
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-nccl-cu12: 2.20.5
    - nvidia-nvjitlink-cu12: 12.5.82
    - nvidia-nvtx-cu12: 12.1.105
    - packaging: 24.1
    - pip: 24.0
    - pyopenssl: 23.2.0
    - pytorch-lightning: 2.3.2
    - pyyaml: 6.0.1
    - setuptools: 70.1.1
    - smmap: 5.0.1
    - sympy: 1.12.1
    - torch: 2.3.1
    - torchmetrics: 1.4.0.post0
    - tqdm: 4.66.4
    - triton: 2.3.1
    - types-python-dateutil: 2.8.19.20240106
    - typing-extensions: 4.12.2
    - wheel: 0.43.0
    - yarl: 1.9.4
* System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor:
    - python: 3.11.9
    - release: 6.8.9-arch1-2
    - version: #1 SMP PREEMPT_DYNAMIC Tue, 07 May 2024 21:35:54 +0000

Conda environment that freezes:

name: debuglightning
channels:
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - bzip2=1.0.8=hd590300_5
  - ca-certificates=2024.7.4=hbcca054_0
  - ld_impl_linux-64=2.40=hf3520f5_7
  - libexpat=2.6.2=h59595ed_0
  - libffi=3.4.2=h7f98852_5
  - libgcc-ng=14.1.0=h77fa898_0
  - libgomp=14.1.0=h77fa898_0
  - libnsl=2.0.1=hd590300_0
  - libsqlite=3.46.0=hde9e2c9_0
  - libuuid=2.38.1=h0b41bf4_0
  - libxcrypt=4.4.36=hd590300_1
  - libzlib=1.3.1=h4ab18f5_1
  - ncurses=6.5=h59595ed_0
  - openssl=3.3.1=h4ab18f5_1
  - pip=24.0=pyhd8ed1ab_0
  - python=3.11.9=hb806964_0_cpython
  - readline=8.2=h8228510_1
  - setuptools=70.1.1=pyhd8ed1ab_0
  - tk=8.6.13=noxft_h4845f30_101
  - tzdata=2024a=h0c530f3_0
  - wheel=0.43.0=pyhd8ed1ab_1
  - xz=5.2.6=h166bdaf_0
  - pip:
      - aiohttp==3.9.5
      - aiosignal==1.3.1
      - attrs==23.2.0
      - filelock==3.15.4
      - frozenlist==1.4.1
      - fsspec==2024.6.1
      - idna==3.7
      - jinja2==3.1.4
      - lightning==2.3.2
      - lightning-utilities==0.11.3.post0
      - markupsafe==2.1.5
      - mpi4py==3.1.6
      - mpmath==1.3.0
      - multidict==6.0.5
      - networkx==3.3
      - numpy==2.0.0
      - nvidia-cublas-cu12==12.1.3.1
      - nvidia-cuda-cupti-cu12==12.1.105
      - nvidia-cuda-nvrtc-cu12==12.1.105
      - nvidia-cuda-runtime-cu12==12.1.105
      - nvidia-cudnn-cu12==8.9.2.26
      - nvidia-cufft-cu12==11.0.2.54
      - nvidia-curand-cu12==10.3.2.106
      - nvidia-cusolver-cu12==11.4.5.107
      - nvidia-cusparse-cu12==12.1.0.106
      - nvidia-nccl-cu12==2.20.5
      - nvidia-nvjitlink-cu12==12.5.82
      - nvidia-nvtx-cu12==12.1.105
      - packaging==24.1
      - pytorch-lightning==2.3.2
      - pyyaml==6.0.1
      - sympy==1.12.1
      - torch==2.3.1
      - torchmetrics==1.4.0.post0
      - tqdm==4.66.4
      - triton==2.3.1
      - typing-extensions==4.12.2
      - yarl==1.9.4
prefix: /home/robinysh/.conda/envs/debuglightning

cc @awaelchli

awaelchli commented 2 months ago

Hi @Robinysh

Based on this code: https://github.com/Lightning-AI/pytorch-lightning/blob/3730e980e388c23f7e9d1f535793e8d614633362/src/lightning/fabric/plugins/environments/mpi.py#L71-L78

As I read it, when mpi4py is installed and the MPI world size is greater than 1, the Trainer detects that it is running on an MPI cluster. If you're not launching on an MPI cluster, then I guess this will not work, and the hang is understandable. Can I ask what you have mpi4py installed for, and whether you intended to run on an MPI cluster?
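To see what such a check would observe on a given machine, the MPI query can be run in isolation. The sketch below is an assumption-laden diagnostic, not Lightning's code: it runs `MPI.COMM_WORLD.Get_size()` from mpi4py in a child process with a timeout, so a hang inside MPI initialization is reported instead of blocking the shell. The helper name `probe_mpi_world_size` and the 10-second default are hypothetical.

```python
import importlib.util
import subprocess
import sys

# One-liner executed in a child interpreter; importing mpi4py.MPI
# initializes MPI, which is the step that may block.
PROBE = "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_size())"


def probe_mpi_world_size(timeout: float = 10.0) -> str:
    """Describe what an 'mpi4py installed and world size > 1' check would see.

    Hypothetical diagnostic helper; not part of Lightning or mpi4py.
    """
    if importlib.util.find_spec("mpi4py") is None:
        return "mpi4py not installed"
    try:
        result = subprocess.run(
            [sys.executable, "-c", PROBE],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child was killed after `timeout` seconds without finishing.
        return "timed out"
    if result.returncode != 0:
        return "failed: " + result.stderr.strip()[-200:]
    return "world size " + result.stdout.strip()


if __name__ == "__main__":
    print(probe_mpi_world_size())
```

A result of `world size 1` would suggest the MPI branch is not taken, while `timed out` would suggest the hang happens inside the MPI query itself rather than later in the Trainer.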