aws-samples / aws-eda-slurm-cluster

AWS Slurm Cluster for EDA Workloads
MIT No Attribution

[FEATURE] Support HA configuration #149

Open cartalla opened 10 months ago

cartalla commented 10 months ago

Is your feature request related to a problem? Please describe. Slurm supports multiple controllers for HA. Add support for multiple controllers, each in a separate AZ.

Describe the solution you'd like Currently the Slurm config, binaries, and state save location are stored on the EBS volumes of the controller. These would need to be moved to a multi-AZ file system such as EFS or FSxN. EFS may not have the IOPS required to keep the state save location from becoming a bottleneck. Add a configuration parameter for the head node to allow 1-3 controllers. Create the specified number of controllers and update slurm.conf appropriately.
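
To make the slurm.conf change concrete, here is a rough sketch of what the generated controller section might look like. Slurm accepts several `SlurmctldHost` entries, where the first is the primary and the rest are backups in order, and every controller must see the same `StateSaveLocation`. The function name, hostnames, and path below are hypothetical and are not taken from this repo; the 1-3 limit is the one proposed above.

```python
from typing import List

MIN_CONTROLLERS = 1
MAX_CONTROLLERS = 3  # upper bound proposed in this issue, not a value read from the repo


def render_controller_lines(hostnames: List[str], state_save_location: str) -> str:
    """Render the HA-related slurm.conf lines for 1-3 controllers.

    hostnames: controller hostnames in failover order (first is primary).
    state_save_location: path on the shared multi-AZ file system (assumed
    to be mounted at the same path on every controller).
    """
    if not MIN_CONTROLLERS <= len(hostnames) <= MAX_CONTROLLERS:
        raise ValueError(
            f"expected {MIN_CONTROLLERS}-{MAX_CONTROLLERS} controllers, got {len(hostnames)}"
        )
    if len(set(hostnames)) != len(hostnames):
        raise ValueError("controller hostnames must be unique")

    # One SlurmctldHost line per controller, in failover order.
    lines = [f"SlurmctldHost={host}" for host in hostnames]
    # All controllers share one state directory so a backup can take over.
    lines.append(f"StateSaveLocation={state_save_location}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical hostnames and mount point, for illustration only.
    print(render_controller_lines(["slurmctl1", "slurmctl2"], "/opt/slurm/statesave"), end="")
```

With a single hostname this produces the same shape as a non-HA config, so the new parameter could presumably default to 1 without changing existing deployments.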