Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure

SLURM cluster with multiple batch pools assigned to a single partition #335

Open pansapiens opened 4 years ago

pansapiens commented 4 years ago

Feature Request Description

I would like to be able to assign multiple batch pools (i.e. VM sizes) to a single SLURM partition, so that SLURM can do resource management using the --mem and --cpus-per-task flags. Currently, attempting to submit an sbatch/srun job using these flags with the unmodified slurm.conf generated by shipyard fails (e.g. srun: error: Unable to allocate resources: Requested node configuration is not available).

Currently, jobs can only be targeted at a specific batch pool via the --partition or --constraint flags, since the NodeName= lines in the generated slurm.conf don't contain resource specifications such as CoresPerSocket or RealMemory.
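For illustration, the kind of resource-annotated node definition I have in mind would look roughly like this (the node-name pattern, socket layout and RealMemory value are guesses for a STANDARD_F32s_v2 node, not what shipyard currently emits):

    NodeName=x32core64G-[0-1] CPUs=32 Sockets=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=63488 State=CLOUD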


I'd like to be able to use a configuration like this: two (or more) pre-created batch pools, x32core64G and x8core16G (VM sizes STANDARD_F32s_v2 and STANDARD_F8s_v2), mapped to a single SLURM partition mypartition (trimmed example):

slurm:
  slurm_options:
    elastic_partitions:
      mypartition:
        batch_pools:
          x32core64G:
            compute_node_type: dedicated
            max_compute_nodes: 2
            weight: 4
            reclaim_exclude_num_nodes: 0
          x8core16G:
            compute_node_type: low_priority
            max_compute_nodes: 4
            weight: 3
            reclaim_exclude_num_nodes: 0
        default: true
        max_runtime_limit: 7.00:00:00

I can manually add CoresPerSocket and RealMemory values to slurm.conf on the login and controller nodes, restart slurmctld, and then submit jobs using --mem and --cpus-per-task.
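For reference, the manual workaround looks roughly like this (node names and resource values are illustrative, and the restart command assumes a systemd-managed slurmctld):

    # added by hand to slurm.conf on the controller and login nodes, e.g.:
    #   NodeName=x8core16G-[0-3] CPUs=8 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15360 State=CLOUD
    sudo systemctl restart slurmctld
    # resource-based requests then work:
    srun --partition=mypartition --cpus-per-task=4 --mem=8G hostname
    sbatch --partition=mypartition --cpus-per-task=32 --mem=60G job.sh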

However, I find that with a single default partition (mypartition) mapping to multiple batch pools, only the final batch pool (x8core16G in this case) ever receives jobs and autoscales. I believe this is because the shipyardslurm Table only holds a single BatchPoolId per partition (the last one defined in slurm.yaml/slurm.conf), so only one batch pool ever autoscales in this case?
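As a conceptual sketch of what I think is happening (this is not the actual shipyardslurm Table schema or slurm.py code, just an illustration of the overwrite behaviour and one possible shape of a fix):

    # keyed only by partition, each pool definition overwrites the previous one
    partition_to_pool = {}
    for pool_id in ["x32core64G", "x8core16G"]:
        partition_to_pool["mypartition"] = pool_id
    print(partition_to_pool)  # {'mypartition': 'x8core16G'} -- only the final pool ever autoscales

    # possible fix: map each partition to a weighted list of pools, and have the
    # autoscale / power-saving script choose among all of them
    partition_to_pools = {"mypartition": [("x32core64G", 4), ("x8core16G", 3)]}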

I'd like to be able to use a configuration like this with a single default partition and multiple batch pools, and have SLURM / shipyard automatically assign jobs to the correct node type / batch pool based on the --cpus-per-task and --mem flags.

Describe Preferred Solution

  1. Shipyard should query the VM specifications for each batch pool and add CoresPerSocket and RealMemory (or similar) values to each NodeName line in the generated slurm.conf (see the sketch after this list).

  2. Make the autoscaling / power-saving scripts (e.g. /var/batch-shipyard/slurm.py and the shipyardslurm Table partition-to-batch-pool mappings) work when a partition maps to multiple batch pools. I'm unsure exactly which changes are required to make this part work.
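To make item 1 concrete, the generated slurm.conf for the trimmed example above could contain something along these lines (node-name patterns, core/memory values and weights are illustrative guesses, not current shipyard output):

    NodeName=x32core64G-[0-1] CPUs=32 Sockets=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=63488 Weight=4 State=CLOUD
    NodeName=x8core16G-[0-3] CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15360 Weight=3 State=CLOUD
    PartitionName=mypartition Nodes=x32core64G-[0-1],x8core16G-[0-3] Default=YES MaxTime=7-00:00:00 State=UP

With resource specifications like these in place, a request such as --cpus-per-task=16 --mem=32G could only be satisfied by the x32core64G nodes, while smaller requests would be placed on the lower-weight x8core16G nodes first.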