Closed: andrevianadf closed this issue 7 months ago
This issue is stale because it has been open for 60 days with no activity. Please update the issue or it will be closed in 7 days.
From your error message, it is not possible to see what the failure is. Could you please attach the full log (running with -vvv)?
Environment:
ansible [core 2.11.12]
  config file = None
  configured module search path = ['/nfs/home/fsuser/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /opt/deepops/env/lib/python3.10/site-packages/ansible
  ansible collection location = /nfs/home/fsuser/.ansible/collections:/usr/share/ansible/collections
  executable location = /opt/deepops/env/bin/ansible
  python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
  jinja version = 2.11.3
  libyaml = True
Operating System: Ubuntu 22.04
DeepOps version: 23.08
Issue Description:
I've deployed a Slurm cluster using DeepOps. I have one master node and 8 GPU nodes (8xH100) with InfiniBand. Everything seems to work fine, but the slurm-validation.yml playbook fails. I tested IB and NCCL locally, as well as Docker with GPUs, and srun nvidia-smi works across multiple nodes.
Error Message:
ansible [core 2.11.12]
  config file = None
  configured module search path = ['/nfs/home/fsuser/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /opt/deepops/env/lib/python3.10/site-packages/ansible
  ansible collection location = /nfs/home/fsuser/.ansible/collections:/usr/share/ansible/collections
  executable location = /opt/deepops/env/bin/ansible
  python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
  jinja version = 2.11.3
  libyaml = True