This repo contains scripts and ansible playbooks for creating a virtual cluster in an Openstack environment, specifically aimed at the Jetstream2 resource.
The basic structure is to have a single image act as headnode, with compute nodes managed by SLURM via the openstack API. A customized image is created for worker nodes, which contains configuration and software specific to that cluster. The Slurm daemon on the headnode dynamically creates and destroys worker nodes in response to jobs in the queue (refer to the figure below). The current version is based on Rocky Linux 8, using RPMs from the OpenHPC project.
As current installation scripts work for Rocky Linux 8 distribution it is expected to have a virtual machine created from the latest Rocky Linux 8 base image in Jestream 2 before proceeding with the installation
sudo su - rocky
If you'd like to modify your cluster, now is a good time!
node_size=
line
in slurm_resume.sh
must be changed.compute_build_base_img.yml
. The task named "install basic packages"
can be easily extended to install anything available from a yum
repository. If you need to add a repo, you can copy the task
titled "Add OpenHPC 1.3.? Repo". For more detailed configuration,
it may be easiest to build your software in /export on the headnode,
and only install the necessary libraries via the compute_build_base_img
(or ensure that they're available in the shared filesystem).cluster_create_local.sh
as the rocky user. This will take around
30 minutes to fully install the cluster.cluster_destroy_local.sh
. This will decommission the SLURM cluster and any
runnning compute nodes. If you need to delete the headnode as well, pass the -d paramaeter to cluster destroy scriptSlurm will run the suspend/resume scripts in response to
scontrol update nodename=compute-[0-1] state=power_down
or
scontrol update nodename=compute-[0-1] state=power_up
If compute instances got stuck in a bad state, it's often helpful to cycle through the following:
scontrol update nodename=compute-[?] state=down reason=resetting
scontrol update nodename=compute-[?] state=idle
or to re-run the suspend/resume scripts as above (if the instance power state doesn't match the current state as seen by slurm). Instances in a failed state within Openstack may simply be deleted, as they will be built anew by slurm the next time they are needed.