Hi Joe, thanks very much for reporting that. We have identified the issue and are working on a fix; it should be ready soon.
Sorry about the delay.
We have fixed the issue and confirmed that it works with the latest version (https://github.com/ACRC/oci-cluster-terraform/commit/827d73d5f4ef3ae6d7d6e4f071a6d6f20cb1d7d7):
```
[ce16990@mgmt ~]$ sbatch test.slm
Submitted batch job 2
[ce16990@mgmt ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2   compute test.slm  ce16990 CF       0:05      1 vm-standard2-1-ad2-0001
```

Wait while node boots...

```
[ce16990@mgmt ~]$ cat slurm-2.out
0: vm-standard2-1-ad2-0001
```
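The `test.slm` itself isn't shown in the thread. A minimal sketch that would produce the rank-and-hostname output above (the job name, partition, and exact `srun` command are assumptions, not the actual script) could look like:

```bash
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks=1

# Each task prints its rank and the node it ran on,
# e.g. "0: vm-standard2-1-ad2-0001".
srun bash -c 'echo "${SLURM_PROCID}: $(hostname)"'
```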
This might now work on your currently running cluster, depending on when you cloned the repo before running `terraform apply`.
The cleanest thing to do would be to `terraform destroy` your current cluster, `git pull` the repo again, and then `terraform plan; terraform apply` to build a new one, as in the sequence below.
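Concretely, the rebuild sequence would be something like this, run from the repository directory:

```shell
terraform destroy   # tear down the cluster built from the old code
git pull            # pull the commit containing the fix
terraform plan      # review what will be created
terraform apply     # build a fresh cluster
```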
I can confirm that it now works - cheers!
I have set up a management node using the newest version. Everything is set up nicely and I have updated my `limits.yaml` file. However, when I start submitting jobs, the compute instances do not seem to be initialised.
My `limits.yaml`:
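(The reporter's file contents were not captured in this thread. For illustration only, and assuming this project's layout of mapping an instance shape to the number of nodes allowed per availability domain, a `limits.yaml` might look like the sketch below; the shape name and counts are hypothetical, and the exact schema may differ.)

```yaml
# Hypothetical example: permit one VM.Standard2.1 node
# in each of the three availability domains.
VM.Standard2.1:
  1: 1
  2: 1
  3: 1
```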
My Slurm script:
`sinfo` gives this (after a few attempts):
And nothing shows up in the OCI dashboard under Instances, only the management node.