NVIDIA / deepops

Tools for building GPU clusters
BSD 3-Clause "New" or "Revised" License

roles and tasks are skipped when running -l slurm-cluster playbooks/slurm-cluster.yml #1062

Closed biocyberman closed 2 years ago

biocyberman commented 2 years ago

Hi, I am testing DeepOps with one VMware virtual machine as the Slurm master and management node, and one DGX-1 as the Slurm compute node. I ran:

ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml

I am facing various issues: the Slurm build and installation do not happen on either the master or the compute node.

Running ansible-playbook --tags build -l slurm-cluster playbooks/slurm-cluster.yml got Slurm built but not installed. That is expected, given the --tags argument. Next I had to run:

ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm.yml

This installs Slurm, but the deployment is still incomplete: some tasks fail, others are skipped, and directories are not created even though they do not exist. For example, of the directories listed in the output below, only /etc/slurm exists.

TASK [create slurm directories] *******************************************************************************************************************************************************************************************
skipping: [testmgmt] => (item=/etc/slurm) 
skipping: [testmgmt] => (item=/var/spool/slurm/ctld) 
skipping: [testmgmt] => (item=/var/log/slurm) 
skipping: [compute1] => (item=/etc/slurm) 
skipping: [compute1] => (item=/var/spool/slurm/ctld) 
skipping: [compute1] => (item=/var/log/slurm) 
.....
TASK [ensure all slurm services are stopped] ******************************************************************************************************************************************************************************
failed: [testmgmt] (item=slurmctld) => changed=false 
  ansible_loop_var: item
  item: slurmctld
  msg: 'Could not find the requested service slurmctld: host'
failed: [compute1] (item=slurmctld) => changed=false 
  ansible_loop_var: item
  item: slurmctld
  msg: 'Could not find the requested service slurmctld: host'
failed: [testmgmt] (item=slurmd) => changed=false 
  ansible_loop_var: item
  item: slurmd
  msg: 'Could not find the requested service slurmd: host'
ok: [compute1] => (item=slurmd)
failed: [testmgmt] (item=slurmdbd) => changed=false 
  ansible_loop_var: item
  item: slurmdbd
  msg: 'Could not find the requested service slurmdbd: host'
...ignoring
failed: [compute1] (item=slurmdbd) => changed=false 
  ansible_loop_var: item
  item: slurmdbd
  msg: 'Could not find the requested service slurmdbd: host'
...ignoring

.....
TASK [slurm : install dependencies] ***************************************************************************************************************************************************************************************
changed: [testmgmt] => (item=['mariadb-server', 'python3-mysqldb', 's-nail', 'ssmtp'])

TASK [slurm : install dependencies] ***************************************************************************************************************************************************************************************
skipping: [testmgmt] => (item=[]) 

TASK [slurm : install dependencies] ***************************************************************************************************************************************************************************************
skipping: [testmgmt] => (item=[]) 

TASK [slurm : Allow mysql to read libaio.so.1] ****************************************************************************************************************************************************************************
skipping: [testmgmt]

TASK [slurm : Apply new SELinux file context to filesystem] ***************************************************************************************************************************************************************
skipping: [testmgmt]

TASK [slurm : start mariadb] **********************************************************************************************************************************************************************************************
skipping: [testmgmt]

TASK [setup slurm db user] ************************************************************************************************************************************************************************************************
[WARNING]: Module did not set no_log for update_password
fatal: [dlitest]: FAILED! => changed=false 
  msg: The PyMySQL (Python 2.7 and Python 3.X) or MySQL-python (Python 2.X) module is required.
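
For anyone triaging output like the log above, a small script can tally results per task and host to show where skips and failures cluster. This is only a sketch: the `tally` helper and both regular expressions are hypothetical, written against the line shapes visible in this log, and are not part of DeepOps or Ansible. The sample text is a shortened copy of lines from the log.

```python
import re
from collections import Counter

# Hypothetical patterns, based only on the line shapes seen in the log above,
# e.g. "skipping: [testmgmt] => (item=/etc/slurm)" or "fatal: [dlitest]: FAILED! => ...".
RESULT_RE = re.compile(r"^(ok|changed|skipping|failed|fatal): \[([^\]]+)\]")
TASK_RE = re.compile(r"^TASK \[(.+?)\] \*+")


def tally(log_text):
    """Count (task, host, status) triples in Ansible console output."""
    counts = Counter()
    task = None  # name of the most recent TASK header seen
    for line in log_text.splitlines():
        header = TASK_RE.match(line)
        if header:
            task = header.group(1)
            continue
        result = RESULT_RE.match(line)
        if result:
            status, host = result.groups()
            counts[(task, host, status)] += 1
    return counts


# Shortened excerpt of the log above, used as sample input.
sample = """\
TASK [create slurm directories] ***
skipping: [testmgmt] => (item=/etc/slurm)
skipping: [testmgmt] => (item=/var/spool/slurm/ctld)
skipping: [compute1] => (item=/etc/slurm)
TASK [ensure all slurm services are stopped] ***
failed: [testmgmt] (item=slurmctld) => changed=false
ok: [compute1] => (item=slurmd)
"""

summary = tally(sample)
for (task, host, status), n in sorted(summary.items()):
    print(f"{task} | {host} | {status}: {n}")
```

A task that is skipped on every host usually means a `when:` condition or a host-group match evaluated false. The tally only shows where that happened, not why.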
ajdecon commented 2 years ago

It's difficult to diagnose what may be happening here without more information. Can you please share the following information?

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 60 days with no activity. Please update the issue or it will be closed in 7 days.