Closed joeiznogood closed 5 years ago
Thanks for reporting this.
We have been able to reproduce the symptoms. After stopping and starting a management node in the Oracle console we see:
[opc@mgmt ~]$ sudo systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; disabled; vendor preset: disabled)
Active: inactive (dead)
[opc@mgmt ~]$ sbatch test.slm
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)
You can fix this without destroying and recreating your cluster by killing the slurmctld that was started with sudo and then starting it via systemctl, which will ensure the environment is correct. E.g.
[opc@mgmt ~]$ ps aux | grep slurmctld
slurm 16080 0.0 0.1 680276 9288 ? Sl 10:06 0:00 slurmctld
opc 56723 0.0 0.0 112720 2328 pts/0 S+ 10:12 0:00 grep --color=auto slurmctld
[opc@mgmt ~]$ sudo kill 16080
[opc@mgmt ~]$ sudo systemctl start slurmctld
(substituting the particular PID of slurmctld on your instance).
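The kill-and-restart step above can be sketched as a short script. This is only an illustration, not something from the thread: a background `sleep` stands in for the sudo-started slurmctld so the pattern can be demonstrated without a running cluster, and the final systemctl step is left as a comment since it needs the real node.

```shell
# Kill-and-restart pattern from the fix above, demonstrated on a placeholder
# process ('sleep') so it can run anywhere without Slurm installed.
sleep 300 &                # stands in for the slurmctld started with sudo
pid=$!                     # on the real node: find it with ps aux | grep slurmctld
kill "$pid"                # on the real node: sudo kill <PID>
wait "$pid" 2>/dev/null    # reap the killed process; its exit status is ignored
echo "stopped $pid"
# then, on the real node: sudo systemctl start slurmctld
```

Using systemctl for the restart is the important part: systemd launches the daemon with the environment defined in its unit file, whereas a daemon started by hand under sudo inherits whatever shell environment happened to be active.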
This is probably my own fault, but it used to work with the old CitC setup I had. After finishing some scaling studies yesterday, I stopped (not terminated) the management node in the OCI dashboard to save credits. Today I started it again; it booted fine and I logged in. However, when I tried to submit jobs, it told me Slurm was not running, so I started the Slurm daemon by hand:
Then when I submitted my job it was listed as configuring, but in the OCI dashboard no compute instances were being provisioned.
Is there a way to restart things now, so I don't have to destroy and rebuild the cluster? And should I avoid doing this in the future? As I wrote at the start, this used to work fine with the old setup (where compute nodes were just stopped rather than terminated).