Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

keep_alive.conf removal error #197

Closed: themorey closed this issue 7 months ago

themorey commented 7 months ago

CycleCloud (CC): 8.5.0; Project Release: 3.0.5

I receive the following error when removing nodes from keep_alive:

[root@jm-slurm-beeond-scheduler ~]# azslurm keep_alive -r --node-list jm-slurm-beeond-hpc-[1-2]
scontrol: error: Parse error in file /etc/slurm/keep_alive.conf line 1: "SuspendExcNodes = "
scontrol: error: "Include" failed in file /etc/slurm/slurm.conf line 52
scontrol: fatal: Unable to process configuration file
Error 'Command '['scontrol', 'reconfig']' returned non-zero exit status 1.': See the rest in the log file

The resulting keep_alive.conf contents:

[root@jm-slurm-beeond-scheduler ~]# cat /etc/slurm/keep_alive.conf
SuspendExcNodes =

WORKAROUND: comment out the "SuspendExcNodes =" line in /etc/slurm/keep_alive.conf and restart slurmctld.

aditigaur4 commented 7 months ago

Hi Jerry, we have marked this for a fix. You don't actually need to restart slurmctld. For now, just run echo "" > /sched/<clustername>/keep_alive.conf and then run scontrol reconfigure.
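
For anyone who wants to script the interim workaround, here is a minimal sketch. It assumes the file is broken only when it contains a "SuspendExcNodes =" line with no value, as in the report above. The function name fix_keep_alive is hypothetical and is not part of azslurm. It only blanks the file; running scontrol reconfigure afterwards is left to the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of azslurm): blank out keep_alive.conf when it
# holds an empty "SuspendExcNodes =" assignment, which scontrol fails to parse.
fix_keep_alive() {
    local conf="$1"
    # Match the key, "=", and nothing but optional whitespace after it.
    if grep -Eq '^[[:space:]]*SuspendExcNodes[[:space:]]*=[[:space:]]*$' "$conf"; then
        echo "" > "$conf"    # same effect as the suggested workaround
        echo "emptied"
    else
        echo "unchanged"     # a non-empty node list is left alone
    fi
}

# Usage (substitute your own cluster name in the path):
#   fix_keep_alive /sched/<clustername>/keep_alive.conf && scontrol reconfigure
```

The check is deliberately narrow: a file that still lists nodes is left untouched, so the helper should be safe to run after any azslurm keep_alive -r call.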