NVIDIA / deepops

Tools for building GPU clusters
BSD 3-Clause "New" or "Revised" License

Adding a Lua submission script #1309

Status: Closed (clemsgrs closed 1 month ago)

clemsgrs commented 1 month ago

Hi, I'd like to use a Lua job submission script to enforce some QoS-related things. However, I cannot get it to work.

Here is a step-by-step description of what I did:

  1. Updated config/group_vars/all.yml to install the Lua dev package:

      ################################################################################
      # SOFTWARE                                                                     #
      ################################################################################
      # Extra software to install or remove
      # Playbook: software
      software_extra_packages:
        - liblua5.2-dev

  2. Updated roles/slurm/defaults/main.yml to add the --with-lua flag to slurm_configure:

      slurm_configure: './configure --prefix={{ slurm_install_prefix }} --disable-dependency-tracking --disable-debug --disable-x11 --enable-really-no-cray --enable-salloc-kill-cmd --with-hdf5=no --sysconfdir={{ slurm_config_dir }} --enable-pam --with-pam_dir={{ slurm_pam_lib_dir }} --with-shared-libslurm --without-rpath --with-pmix={{ pmix_install_prefix }} --with-hwloc={{ hwloc_install_prefix }} --with-lua'

  3. Updated slurm.conf to add the following:

      # SUBMISSION FILTERS
      JobSubmitPlugins=lua

  4. Deployed the cluster through Ansible:

      ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml -k -K

Once the cluster was deployed, I checked /var/log/slurm/slurmctld.log, which shows the following error:

slurmctld version 23.02.2 started on cluster deepops
error: Couldn't find the specified plugin name for job_submit/lua looking at all files
error: cannot find job_submit plugin for job_submit/lua
error: cannot create job_submit context for job_submit/lua
fatal: failed to initialize job_submit plugin

When I look under /usr/local/lib/slurm/, there is no job_submit_lua.so! I'm a bit lost as to what I should do now.

itzsimpl commented 1 month ago

There should be no need to pass --with-lua, as it is enabled by default (https://github.com/SchedMD/slurm/blob/e9e167304503cad92c57ed78f59ea3cab7b7ddb0/configure#L26665), and Lua support gets built if Slurm finds the Lua package installed. Double-check that the package is indeed installed; according to the Slurm configure file, Lua versions 5.1 to 5.4 seem to be supported.
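
The package check suggested above could be scripted roughly as follows. This is only a sketch: it assumes a Debian/Ubuntu host where dpkg-query is available, the function name lua_pkg_installed is hypothetical, and liblua5.2-dev is simply the package named earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print "installed" if the given Debian package is
# fully installed, otherwise "missing". Assumes dpkg-query exists; if it
# does not, the check falls through to "missing".
lua_pkg_installed() {
    pkg="${1:-liblua5.2-dev}"
    if dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null \
        | grep -q "install ok installed"; then
        echo "installed"
    else
        echo "missing"
    fi
}
```

Running `lua_pkg_installed` on the node where Slurm is compiled shows whether the development package is there before the build starts. If it prints `missing`, configure will most likely skip Lua support regardless of the flags passed.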

If you did not update the Slurm version, you may need to force a recompile of Slurm by setting the variable slurm_force_rebuild: yes; see https://github.com/NVIDIA/deepops/blob/6186cf17726f51dede6d8c277f0184c16d21752c/roles/slurm/defaults/main.yml#L21.
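
A rough sketch of forcing the rebuild and then confirming that the plugin exists. It assumes the variable is overridden on the command line through Ansible's -e (extra vars) option rather than edited in the config files; the helper names rebuild_slurm and plugin_status are hypothetical, and the default directory is the /usr/local/lib/slurm path mentioned earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: re-run the playbook from this thread with the rebuild
# forced. The JSON form of -e keeps the value a real boolean, not a string.
rebuild_slurm() {
    ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml \
        -e '{"slurm_force_rebuild": true}' -k -K
}

# Hypothetical helper: report whether the Lua job_submit plugin was built.
# The directory argument defaults to the install path seen in this thread.
plugin_status() {
    dir="${1:-/usr/local/lib/slurm}"
    if [ -f "$dir/job_submit_lua.so" ]; then
        echo "present"
    else
        echo "absent"
    fi
}
```

After `rebuild_slurm` finishes, `plugin_status` run on the controller node would be expected to print `present` if the Lua development files were found at configure time. If it still prints `absent`, the configure output from the build is the next place to look.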

clemsgrs commented 1 month ago

that did the trick, thank you for the fast reply! much appreciated :)