SLURM not automatically spawning instances

joeiznogood commented 5 years ago

I have set up a management node using the newest version. Everything is set up nicely and I have updated my limits.yaml file. However, when I then start submitting jobs, it seems that compute instances are not initialised.

My limits.yaml:

VM.Standard2.1:
  1: 1
  2: 1
  3: 1
VM.Standard2.2:
  1: 2
  2: 2
  3: 2

My slurm script:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --time=10:00

srun -l hostname

sinfo gives this (after a few attempts):

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute      up   infinite      1 alloc# vm-standard2-1-ad2-0001
compute      up   infinite      1  idle# vm-standard2-2-ad3-0002
compute*     up   infinite      1  down# vm-standard2-2-ad1-0001

And nothing shows up in the OCI dashboard under instances - only the management node.
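For anyone hitting the same symptom: according to the sinfo man page, a `#` suffix on a node state means the node is currently being powered up or configured, so the listing above shows SLURM trying to start all three nodes. A minimal sketch for picking those nodes out of the default sinfo output follows. The function name `nodes_powering_up` is hypothetical, and it assumes STATE is the fifth column and NODELIST the sixth, as in the listing above.

```shell
#!/bin/bash
# Hypothetical helper: print the NODELIST entry of every sinfo row whose
# STATE ends in '#' (node is being powered up or configured).
# Reads default-format `sinfo` output on stdin; assumes STATE is column 5
# and NODELIST is column 6. NODELIST may be a compressed host list.
nodes_powering_up() {
    # NR > 1 skips the header row.
    awk 'NR > 1 && $5 ~ /#$/ { print $6 }'
}

# Usage on a live cluster:
#   sinfo | nodes_powering_up
```

If nodes stay in a `#` state for a long time while nothing appears in the OCI dashboard, that suggests the resume step is not actually creating instances, which matches what is reported here.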

christopheredsall commented 5 years ago

Hi Joe, thanks very much for reporting that. We have identified the issue and are working on the fix - should be ready soon.

christopheredsall commented 5 years ago

Sorry about the delay.

We have fixed the issue and confirmed it works with the latest version (https://github.com/ACRC/oci-cluster-terraform/commit/827d73d5f4ef3ae6d7d6e4f071a6d6f20cb1d7d7)

[ce16990@mgmt ~]$ sbatch test.slm
Submitted batch job 2
[ce16990@mgmt ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2   compute test.slm  ce16990 CF       0:05      1 vm-standard2-1-ad2-0001

Wait while node boots...

[ce16990@mgmt ~]$ cat slurm-2.out 
0: vm-standard2-1-ad2-0001
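The "wait while node boots" step in the session above can be scripted. This is only a sketch: `wait_for_job_start` is a hypothetical helper, not part of the repository, and it assumes that `squeue -h -j <id> -o %T` prints the long-form job state (for example CONFIGURING, which squeue abbreviates to CF) and prints nothing once the job has left the queue. The `NOT_IN_QUEUE` label is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: poll squeue until the job is no longer PENDING or
# CONFIGURING, then print the state it reached.
# Usage: wait_for_job_start JOB_ID [POLL_INTERVAL_SECONDS]
wait_for_job_start() {
    local job_id="$1" interval="${2:-10}" state
    while true; do
        # -h drops the header, -o '%T' prints the long-form job state.
        state=$(squeue -h -j "$job_id" -o '%T' 2>/dev/null)
        if [ "$state" != "CONFIGURING" ] && [ "$state" != "PENDING" ]; then
            # Empty output means the job is no longer listed by squeue.
            echo "${state:-NOT_IN_QUEUE}"
            return 0
        fi
        sleep "$interval"
    done
}

# Usage on the management node, for the job submitted above:
#   wait_for_job_start 2 && cat slurm-2.out
```

The loop has no timeout, so if the node never boots it will keep polling; interrupt it and check sinfo in that case.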

This might now work on your currently booted cluster, depending on when you cloned the repo before running terraform apply.

The cleanest thing to do would be to terraform destroy your current cluster, git pull the repo again, and then run terraform plan; terraform apply to build a new one.

joeiznogood commented 5 years ago

I can confirm that it now works - cheers!

clusterinthecloud / terraform

SLURM not automatically spawning instances #17
