aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

Support automatic replacement of head node in case of failures, aka Head Node HA #2442

Open elgalu opened 3 years ago

elgalu commented 3 years ago

How does aws-parallelcluster provide high availability on the head node?

I couldn't find which process brings the head node (master) back if it goes down.

tilne commented 3 years ago

Hi @elgalu. We don't currently offer HA support for the head node. Manual intervention is required if it goes down. I'm marking this as a feature request.

cartalla commented 1 year ago

See #1447. Slurm supports up to 3 controllers and will fail over to the extra controllers if there is a failure. I agree that this is an important feature.
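
For readers unfamiliar with the Slurm mechanism mentioned above, here is a rough sketch of what backup controllers look like in `slurm.conf`. This illustrates plain Slurm only. It is not something ParallelCluster sets up today, and the hostnames and the shared path are hypothetical placeholders.

```
# slurm.conf (excerpt): illustrative only; hostnames and paths are hypothetical
SlurmctldHost=head-1                    # first entry is the primary controller
SlurmctldHost=head-2                    # later entries are backups, tried in order
SlurmctldHost=head-3
StateSaveLocation=/shared/slurm/state   # must be on storage all controllers can read and write
SlurmctldTimeout=120                    # seconds a backup waits before taking over
```

Note that this only covers failover of the `slurmctld` daemon. Anything else hosted on the head node (shared file systems, login access) would still need its own recovery story.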