aws-samples / aws-eda-slurm-cluster

AWS Slurm Cluster for EDA Workloads
MIT No Attribution

[FEATURE] Enable exclusive scheduling by default #194

Open cartalla opened 5 months ago

cartalla commented 5 months ago

Is your feature request related to a problem? Please describe.

Currently, users specify core and memory requirements for jobs so that Slurm can pick the best compute node instance type for the job. This works by running a job like:

srun -c 1 --mem 1G toolname

But what if I really don't want multiple jobs per instance? Here's the issue. Say a job requires 2 cores and 500G of memory, so it lands on an r7 instance due to the memory requirement. That instance now has idle cores and unused memory, so Slurm schedules a long-running job with 1 core and 2G of memory on it. After the large-memory job finishes, the low-memory job is left running on an expensive instance that is severely underutilized. If many jobs can be packed onto the instance, utilization may be acceptable, but as the cluster scales down the instance may be left running in an underutilized state.

Analysis of real workloads suggests that the most cost-effective scheduling policy is to not share compute nodes between jobs. So what I effectively want is to specify the job requirements for the purpose of instance type selection, but always allocate the whole node. Using -N 1 doesn't reserve the whole node, but --exclusive does; in fact it overrides -c and allocates all of the cores on the node to the job, which handles the scheduling issue. The only remaining wrinkle is that the job can only use the requested memory when it could technically use all of it. Not a huge issue.
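As a concrete sketch of that submission pattern (toolname and the resource numbers are just the hypothetical values from the example above), the job would still state its sizing so Slurm can pick a suitable instance type, but add --exclusive so no other job lands on the node:

srun -c 2 --mem 500G --exclusive toolname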

Describe the solution you'd like

One suggestion is to allow configuring the cluster without memory-based scheduling. In that case, Slurm will still use the memory request to pick a compute node with the required memory, but will not treat memory as a consumable resource. Combined with the --exclusive option, this would prevent over-subscription of memory.
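For reference, a minimal sketch of what that could look like at the slurm.conf level, assuming the cons_tres select plugin; the partition name is hypothetical and the exact settings would depend on how the cluster generates its config:

# Track only CPUs as consumable resources. Memory requests are still
# compared against each node's RealMemory when picking a node, but memory
# is no longer enforced as a shared, consumable limit.
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU

# Optionally force whole-node allocation for every job in the partition,
# equivalent to submitting each job with --exclusive.
PartitionName=compute Nodes=ALL OverSubscribe=EXCLUSIVE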

The remaining issue would be what happens if a large-memory instance is already running and a low-memory job is submitted. I think Slurm would allocate the job exclusively to the running instance instead of powering up a lower-memory compute node. I don't know if there is an option to get Slurm to pick the best node regardless of power state.
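One lever that might be worth testing here is the per-node Weight parameter, which tells Slurm to prefer lower-weight nodes when several node types satisfy a job's requirements. I have not verified whether it wins out over an already-powered-up node under power saving; the node names and sizes below are purely hypothetical:

# Hypothetical node definitions: lower Weight is preferred when multiple
# node types can run the job, so small instances would be tried first.
NodeName=c7-dy-[1-10] CPUs=4 RealMemory=8000   Weight=1  State=CLOUD
NodeName=r7-dy-[1-10] CPUs=4 RealMemory=512000 Weight=10 State=CLOUD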