aws / ec2-spot-instances-integrations-roadmap


Improved EC2 Spot Instances support in Slurm Workload Manager #10

Open schmutze opened 4 years ago

schmutze commented 4 years ago

We are working on improved EC2 Spot Instance integration in Slurm Workload Manager to get customers up and running quickly with EC2 Spot Instance best practices when running Slurm on AWS.

We are building a new open source compute resource manager for Slurm. It makes it easy to provision the desired number of EC2 instances to finish jobs cost-effectively and at scale, and it manages the lifecycle of those instances, terminating them when they are no longer needed. The solution comes with a configuration file that lets you set the desired capacity, EC2 instance types, and more. To get started, you create a Slurm cluster and deploy the compute resource manager solution, which runs alongside the Slurm controller.
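The configuration file format hasn't been shared yet, so purely as a sketch, the kind of settings described might look like this (every key below is hypothetical):

```python
# Hypothetical settings for the compute resource manager; no official
# schema has been published, so these keys are illustrative only.
manager_config = {
    "slurm_partition": "aws",          # partition the manager scales
    "max_nodes": 100,                  # upper bound on fleet size
    "instance_types": [                # diversify types, a Spot best practice
        "c5.4xlarge", "c5a.4xlarge", "m5.4xlarge",
    ],
    "purchase_option": "spot",
    "allocation_strategy": "capacity-optimized",
    "scale_down_idle_seconds": 300,    # terminate nodes idle this long
}
```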

After initialization, the compute resource manager continuously monitors the job queues for pending jobs and dynamically scales the fleet up and down accordingly. Once jobs complete and the compute is no longer needed, it terminates the instances and idles until more jobs enter the queue. It also monitors the state of each EC2 instance: if an instance is no longer in use, or if the controller receives a Spot interruption notice, it automatically drains the node and removes it from the Slurm cluster.
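To make that loop concrete, here is a minimal sketch of the monitor-and-scale cycle described above, assuming a hypothetical partition named "aws" and a pre-existing EC2 launch template named "slurm-compute"; the real manager's logic has not been published:

```python
import subprocess
import time

import boto3

ec2 = boto3.client("ec2")

def pending_jobs(partition: str) -> int:
    """Count pending jobs in a Slurm partition via squeue."""
    out = subprocess.check_output(
        ["squeue", "-h", "-p", partition, "-t", "PENDING", "-o", "%i"],
        text=True,
    )
    return len(out.split())

def add_spot_node() -> None:
    """Request one Spot instance through an EC2 Fleet instant request."""
    ec2.create_fleet(
        Type="instant",
        SpotOptions={"AllocationStrategy": "capacity-optimized"},
        TargetCapacitySpecification={
            "TotalTargetCapacity": 1,
            "DefaultTargetCapacityType": "spot",
        },
        LaunchTemplateConfigs=[{
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "slurm-compute",  # hypothetical template
                "Version": "$Latest",
            },
        }],
    )

def drain_node(node: str, reason: str) -> None:
    """Drain a node so Slurm stops scheduling new work onto it."""
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}",
         "State=DRAIN", f"Reason={reason}"],
        check=True,
    )

while True:
    if pending_jobs("aws") > 0:
        add_spot_node()
    # A real manager would also scale down idle nodes and drain nodes
    # that receive a Spot interruption notice (via drain_node above).
    time.sleep(60)
```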

We'd love to hear your feedback on this integration!

ssbhat commented 3 years ago

+1

cread commented 3 years ago

I'd be interested in how this would be different from https://github.com/aws/aws-parallelcluster, which I already find a quick and easy way to use Slurm with Spot.

schmutze commented 3 years ago

We are looking at ways to improve Spot support in ParallelCluster through this integration. May I take this as a +1 for this ask?

jillmon commented 3 years ago

Development for this initiative is still ongoing. We hope to update this ticket monthly with our status. Right now we are projected to have something released within the first half of 2021.

cartalla commented 3 years ago

Is the SLURM scheduler being leveraged at all for determining, for example, when a node needs to be created and when a node is idle? One issue I have with ParallelCluster is the replication of functionality that is already done by the scheduler. I also want to make sure that security is a high priority. For example, the solution should be able to run in a VPC with no internet access.

dchelupati commented 3 years ago

Is the SLURM scheduler being leveraged at all for determining, for example, when a node needs to be created and when a node is idle?

Hi Allan, the controller agent we are building will periodically monitor the Slurm partitions to see whether nodes need to be added or removed. The Slurm scheduler (slurmctld) then schedules jobs onto the available nodes.
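As a rough illustration of that polling (not the agent's actual code, which hasn't been released), idle nodes in a partition can be listed with sinfo and treated as scale-down candidates:

```python
import subprocess

def idle_nodes(partition: str) -> list[str]:
    """Return hostnames of idle nodes in a partition (scale-down candidates)."""
    out = subprocess.check_output(
        ["sinfo", "-N", "-h", "-p", partition, "-t", "idle", "-o", "%n"],
        text=True,
    )
    return out.split()
```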

ad-m commented 2 years ago

Right now we are projected to have something released within the first half of 2021

Is there any update?