NVIDIA / deepops

Tools for building GPU clusters
BSD 3-Clause "New" or "Revised" License

Deepops Slurm NCCL Fail #1303

Closed andrevianadf closed 7 months ago

andrevianadf commented 10 months ago

Environment:

ansible [core 2.11.12]
  config file = None
  configured module search path = ['/nfs/home/fsuser/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /opt/deepops/env/lib/python3.10/site-packages/ansible
  ansible collection location = /nfs/home/fsuser/.ansible/collections:/usr/share/ansible/collections
  executable location = /opt/deepops/env/bin/ansible
  python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
  jinja version = 2.11.3
  libyaml = True

Operating System: Ubuntu 22.04
DeepOps version: 23.08

Issue Description:

I've deployed a Slurm cluster using DeepOps. I have one master node and 8 GPU nodes (8xH100) with InfiniBand. Everything seems to work fine, but the slurm-validation.yml playbook fails. I have tested IB, NCCL locally, and Docker with GPUs, and srun nvidia-smi works across multiple nodes.

Error Message:

ansible [core 2.11.12]
  config file = None
  configured module search path = ['/nfs/home/fsuser/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /opt/deepops/env/lib/python3.10/site-packages/ansible
  ansible collection location = /nfs/home/fsuser/.ansible/collections:/usr/share/ansible/collections
  executable location = /opt/deepops/env/bin/ansible
  python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
  jinja version = 2.11.3
  libyaml = True

github-actions[bot] commented 8 months ago

This issue is stale because it has been open for 60 days with no activity. Please update the issue or it will be closed in 7 days.

mkunin-work commented 7 months ago

From your error message, it is not possible to see what the failure is. Could you please attach the full log (from a run with -vvv)?
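
For anyone hitting the same problem, a verbose log of the kind requested above could be captured roughly as follows. This is only a sketch: the playbook path, the log file name, and the helper name run_validation are assumptions, not something stated in this thread, and the path in particular may need adjusting to match your DeepOps checkout and inventory.

```shell
# Assumption: run from the root of the DeepOps checkout. The playbook path
# and log file name below are illustrative defaults, not taken from the thread.
PLAYBOOK="${PLAYBOOK:-playbooks/slurm-cluster/slurm-validation.yml}"
LOG="${LOG:-slurm-validation-vvv.log}"
ANSIBLE_PLAYBOOK_BIN="${ANSIBLE_PLAYBOOK_BIN:-ansible-playbook}"

# Hypothetical helper: runs the playbook with -vvv (higher Ansible verbosity),
# merges stderr into stdout, and uses tee to show the output on screen while
# also saving it to $LOG so the file can be attached to the issue.
run_validation() {
    "$ANSIBLE_PLAYBOOK_BIN" -vvv "$PLAYBOOK" 2>&1 | tee "$LOG"
}
```

Calling run_validation afterwards should leave the complete output in the log file, including the failing task and the stderr of the failing command, which is the part missing from the report above.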