Open Robinysh opened 2 months ago
Hi @Robinysh
Based on this code: https://github.com/Lightning-AI/pytorch-lightning/blob/3730e980e388c23f7e9d1f535793e8d614633362/src/lightning/fabric/plugins/environments/mpi.py#L71-L78
As I read it, when MPI is installed and the world size is > 1, the Trainer will detect that it is running on an MPI cluster. If you're not actually launching on an MPI cluster, then I guess this will not work and the hang is understandable. Can I ask what you have mpi4py installed for, and whether you intended to run on an MPI cluster?
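To check whether that detection condition would apply in a given environment, something like the following could be run on its own. It is only a rough sketch of the check as described above (mpi4py importable and world size > 1), not the code from the linked file; the helper names `mpi_world_size` and `looks_like_mpi_job` are made up for illustration.

```python
import importlib.util


def mpi_world_size():
    """Return the MPI world size, or None if mpi4py or the MPI library is unavailable."""
    if importlib.util.find_spec("mpi4py") is None:
        return None
    try:
        # mpi4py can be installed without a working MPI library underneath.
        # Note that importing mpi4py.MPI initializes MPI by default.
        from mpi4py import MPI
    except ImportError:
        return None
    return MPI.COMM_WORLD.Get_size()


def looks_like_mpi_job(world_size):
    """Mirror the described condition: MPI is usable and the world size is > 1."""
    return world_size is not None and world_size > 1
```

Calling `looks_like_mpi_job(mpi_world_size())` in the environment that freezes, launched the same way as the Trainer script, should show whether the process would be treated as part of an MPI job.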
Bug description
Trainer freezes on initialization when mpi4py is installed.
I suspect the following issues are encountering the same problem: #18836 #19768
What version are you seeing the problem on?
master
How to reproduce the bug
This freezes
This does not freeze
Error messages and logs
No response
Environment
Current environment
* CUDA:
    - GPU:
        - NVIDIA GeForce RTX 3090
    - available: True
    - version: 12.1
* Lightning:
    - lightning: 2.3.2
    - lightning-utilities: 0.11.3.post0
    - pytorch-lightning: 2.3.2
    - torch: 2.3.1
    - torchmetrics: 1.4.0.post0
* Packages:
    - aiohttp: 3.9.5
    - aiosignal: 1.3.1
    - attrs: 23.2.0
    - filelock: 3.15.4
    - frozenlist: 1.4.1
    - fsspec: 2024.6.1
    - gitdb: 4.0.11
    - gitpython: 3.1.40
    - globus-cli: 3.23.0
    - globus-sdk: 3.34.0
    - idna: 3.7
    - jinja2: 3.1.4
    - jupyter-server-mathjax: 0.2.6
    - lightning: 2.3.2
    - lightning-utilities: 0.11.3.post0
    - markupsafe: 2.1.5
    - mpi4py: 3.1.6
    - mpmath: 1.3.0
    - multidict: 6.0.5
    - nbdime: 4.0.1
    - networkx: 3.3
    - numpy: 2.0.0
    - nvidia-cublas-cu12: 12.1.3.1
    - nvidia-cuda-cupti-cu12: 12.1.105
    - nvidia-cuda-nvrtc-cu12: 12.1.105
    - nvidia-cuda-runtime-cu12: 12.1.105
    - nvidia-cudnn-cu12: 8.9.2.26
    - nvidia-cufft-cu12: 11.0.2.54
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-nccl-cu12: 2.20.5
    - nvidia-nvjitlink-cu12: 12.5.82
    - nvidia-nvtx-cu12: 12.1.105
    - packaging: 24.1
    - pip: 24.0
    - pyopenssl: 23.2.0
    - pytorch-lightning: 2.3.2
    - pyyaml: 6.0.1
    - setuptools: 70.1.1
    - smmap: 5.0.1
    - sympy: 1.12.1
    - torch: 2.3.1
    - torchmetrics: 1.4.0.post0
    - tqdm: 4.66.4
    - triton: 2.3.1
    - types-python-dateutil: 2.8.19.20240106
    - typing-extensions: 4.12.2
    - wheel: 0.43.0
    - yarl: 1.9.4
* System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor:
    - python: 3.11.9
    - release: 6.8.9-arch1-2
    - version: #1 SMP PREEMPT_DYNAMIC Tue, 07 May 2024 21:35:54 +0000

Conda environment that freezes:
cc @awaelchli