Open OswaldHe opened 2 months ago
Try launching with "srun python3 train.py", i.e. replace python with python3.
I tried python3, but the issue still remains.
I have the same issue. It works fine when launched interactively with srun, but it hangs when submitted as a job with sbatch.
This is a serious bottleneck, especially if the environment you work in only allows sbatch submissions and not interactive srun runs.
Bug description
I'm working on a Slurm cluster with 8 AMD MI100 GPUs spread across 2 nodes, 4 GPUs per node. I followed the instructions (https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html) to submit a multi-node training job, but the job gets stuck at "Initializing distributed:...". I checked all the related issues, and none of them solves the problem.
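One thing worth ruling out with a 2-node, 4-GPU-per-node setup like this is a mismatch between what Slurm allocated and what the Trainer is told to expect. The sketch below is only a diagnostic under stated assumptions: `check_slurm_topology` is a hypothetical helper (not part of Lightning), and it assumes the job was submitted with `--nodes` and `--ntasks-per-node`, so that Slurm exports `SLURM_NNODES`, `SLURM_NTASKS_PER_NODE`, and `SLURM_NTASKS`. It does not explain the hang by itself.

```python
from typing import List, Mapping


def check_slurm_topology(env: Mapping[str, str], num_nodes: int, devices: int) -> List[str]:
    """Hypothetical helper: compare Slurm's exported job layout with the
    num_nodes/devices values passed to the Trainer.

    Returns a list of human-readable mismatches (empty if none are found).
    """
    # What Slurm should report if the allocation matches the Trainer settings.
    expected = {
        "SLURM_NNODES": num_nodes,
        "SLURM_NTASKS_PER_NODE": devices,
        "SLURM_NTASKS": num_nodes * devices,
    }
    problems = []
    for name, want in expected.items():
        raw = env.get(name)
        if raw is None:
            # Variable missing, e.g. the matching #SBATCH option was not given.
            problems.append(f"{name} is not set")
        elif not raw.isdigit() or int(raw) != want:
            problems.append(f"{name}={raw!r}, expected {want}")
    return problems
```

Calling `check_slurm_topology(os.environ, num_nodes=2, devices=4)` at the top of the training script (after `import os`) and printing the result on every rank would show whether each process sees the layout described above, both under srun and inside an sbatch job.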
What version are you seeing the problem on?
v2.2
How to reproduce the bug
Training Script:
SLURM batch script:
Error messages and logs
Environment
Current environment
* CUDA:
    - GPU:
        - AMD Instinct MI100
        - AMD Instinct MI100
        - AMD Instinct MI100
        - AMD Instinct MI100
    - available: True
    - version: None
* Lightning:
    - lightning: 2.2.1
    - lightning-utilities: 0.11.2
    - pytorch-lightning: 2.2.1
    - pytorch-triton-rocm: 2.2.0
    - torch: 2.2.0+rocm5.6
    - torchaudio: 2.2.0+rocm5.6
    - torchmetrics: 1.3.2
    - torchvision: 0.17.0+rocm5.6
* Packages:
    - absl-py: 2.1.0
    - aiohttp: 3.9.3
    - aiosignal: 1.3.1
    - annotated-types: 0.6.0
    - async-timeout: 4.0.3
    - attrs: 23.2.0
    - certifi: 2022.12.7
    - charset-normalizer: 2.1.1
    - deepspeed: 0.14.0
    - filelock: 3.9.0
    - frozenlist: 1.4.1
    - fsspec: 2023.4.0
    - future: 1.0.0
    - grpcio: 1.62.1
    - hjson: 3.1.0
    - idna: 3.4
    - imageio: 2.34.0
    - jinja2: 3.1.2
    - lightning: 2.2.1
    - lightning-utilities: 0.11.2
    - markdown: 3.6
    - markupsafe: 2.1.3
    - mpmath: 1.3.0
    - multidict: 6.0.5
    - networkx: 3.2.1
    - ninja: 1.11.1.1
    - numpy: 1.26.3
    - packaging: 24.0
    - pandas: 2.2.1
    - pillow: 10.2.0
    - pip: 23.3.1
    - protobuf: 5.26.1
    - psutil: 5.9.8
    - py-cpuinfo: 9.0.0
    - pydantic: 2.7.0
    - pydantic-core: 2.18.1
    - pynvml: 11.5.0
    - python-dateutil: 2.9.0.post0
    - pytorch-lightning: 2.2.1
    - pytorch-triton-rocm: 2.2.0
    - pytz: 2024.1
    - pyyaml: 6.0.1
    - requests: 2.28.1
    - setuptools: 68.2.2
    - six: 1.16.0
    - sympy: 1.12
    - tensorboard: 2.16.2
    - tensorboard-data-server: 0.7.2
    - test-tube: 0.7.5
    - torch: 2.2.0+rocm5.6
    - torchaudio: 2.2.0+rocm5.6
    - torchmetrics: 1.3.2
    - torchvision: 0.17.0+rocm5.6
    - tqdm: 4.66.2
    - typing-extensions: 4.8.0
    - tzdata: 2024.1
    - urllib3: 1.26.13
    - werkzeug: 3.0.1
    - wheel: 0.41.2
    - yarl: 1.9.4
* System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.10.14
    - release: 5.14.0-162.18.1.el9_1.x86_64
    - version: SMP PREEMPT_DYNAMIC Wed Mar 1 22:02:24 UTC 2023
More info
No response