clusterinthecloud / terraform

Terraform config for Cluster in the Cloud
https://cluster-in-the-cloud.readthedocs.io
MIT License

Slurm not working after stop and start of mgmt node #21

Closed joeiznogood closed 5 years ago

joeiznogood commented 5 years ago

This is probably my own fault, but it used to work with the old CitC setup I had. After finishing some scaling studies yesterday, I stopped (not terminated) the management node in the OCI dashboard to save credits. Today I started it again; it booted fine and I logged in. However, when I tried to submit jobs, it told me SLURM was not running. So I started the SLURM daemon:

[opc@mgmt ~]$ sudo slurmctld

When I then submitted my job, it was listed as configuring, but no compute instances were being provisioned in the OCI dashboard.

Is there a way to restart things now, so I don't have to destroy and rebuild the cluster setup? And should I avoid doing this in the future? As I said at the start, this used to work fine with the old setup (where compute nodes were just stopped rather than terminated).

christopheredsall commented 5 years ago

Thanks for reporting this.

We have been able to reproduce the symptoms. After stopping and starting the management node in the Oracle console, we see:

[opc@mgmt ~]$ sudo systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
[opc@mgmt ~]$ sbatch test.slm 
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)

You can fix this without destroying and recreating your cluster by killing the slurmctld that was started with sudo and starting it via systemctl, which will ensure the environment is correct. E.g.

[opc@mgmt ~]$ ps aux | grep slurmctld
slurm    16080  0.0  0.1 680276  9288 ?        Sl   10:06   0:00 slurmctld
opc      56723  0.0  0.0 112720  2328 pts/0    S+   10:12   0:00 grep --color=auto slurmctld
[opc@mgmt ~]$ sudo kill 16080
[opc@mgmt ~]$ sudo systemctl start slurmctld

(substituting the PID of slurmctld on your own instance)
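
If you would rather not copy the PID by hand, the same steps can be sketched as a small shell snippet. This is only a minimal sketch, not something shipped with CitC: the function names `slurmctld_pids` and `restart_slurmctld` are hypothetical, and it assumes the `ps aux` column layout shown above (PID in column 2, command in column 11).

```shell
#!/bin/sh
# Hypothetical helper (not part of CitC): read `ps aux`-style output on
# stdin and print the PID of every slurmctld process. Matching the command
# column exactly skips the `grep ... slurmctld` line seen in the listing.
slurmctld_pids() {
    awk '$11 == "slurmctld" || $11 ~ /\/slurmctld$/ { print $2 }'
}

# Hypothetical helper: kill any slurmctld that was started by hand, then
# start the daemon through systemd so it gets the correct environment.
restart_slurmctld() {
    pids=$(ps aux | slurmctld_pids)
    if [ -n "$pids" ]; then
        # $pids is left unquoted on purpose so several PIDs become
        # separate arguments to kill.
        sudo kill $pids
    fi
    sudo systemctl start slurmctld
    sudo systemctl status slurmctld --no-pager
}
```

After sourcing the snippet on the management node, run `restart_slurmctld`. If the start fails because the old process has not finished exiting, wait a moment and run `sudo systemctl start slurmctld` again.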