Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure
MIT License

'default' should not be allowed as a partition name in slurm.yaml #333

Open pansapiens opened 4 years ago

pansapiens commented 4 years ago

Problem Description

If default is used as a single partition name in slurm.yaml (under elastic_partitions:), the slurmctld controller fails to start. /var/log/slurm/slurmctld.log suggests that the PartitionName in slurm.conf is missing/invalid.

It turns out that default is not a valid partition name: Slurm reserves it for setting default values that apply to all partitions (e.g. https://www.mail-archive.com/slurm-dev@schedmd.com/msg08392.html).

Batch Shipyard Version

3.9.0

Steps to Reproduce

Take a working configuration with a single Batch pool and a single partition, and rename the partition under elastic_partitions in slurm.yaml to default. Provision the cluster and attempt to run an sbatch job (it should fail). Log in to the controller node (shipyard slurm ssh controller), confirm that slurmctld isn't running, and check sudo tail /var/log/slurm/slurmctld.log.

Expected Results

Expect shipyard cluster create to fail fast during schema validation if there are partitions named default.
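
As a rough illustration of the kind of fail-fast check meant here, the sketch below rejects reserved partition names before any provisioning happens. It is not Batch Shipyard's actual validation code: the function names and the RESERVED_PARTITION_NAMES set are hypothetical, the input is assumed to be the mapping found under elastic_partitions after slurm.yaml has been parsed, and the case-insensitive comparison is an assumption about how Slurm treats the keyword.

```python
# Hypothetical sketch; names below are illustrative, not Batch Shipyard APIs.
# Assumption: Slurm treats "default" as reserved regardless of letter case.
RESERVED_PARTITION_NAMES = frozenset({"default"})


def find_reserved_partition_names(elastic_partitions):
    """Return the partition names that collide with a reserved keyword.

    `elastic_partitions` is assumed to be the mapping parsed from the
    elastic_partitions section of slurm.yaml (partition name -> settings).
    """
    return sorted(
        name
        for name in elastic_partitions
        if str(name).strip().lower() in RESERVED_PARTITION_NAMES
    )


def validate_elastic_partitions(elastic_partitions):
    """Raise ValueError if any partition uses a reserved name."""
    bad = find_reserved_partition_names(elastic_partitions)
    if bad:
        raise ValueError(
            "invalid partition name(s) under elastic_partitions: "
            + ", ".join(bad)
            + " (reserved by Slurm for partition defaults)"
        )
```

Calling validate_elastic_partitions on the parsed configuration before cluster creation would surface the problem as a configuration error instead of a slurmctld startup failure on the controller.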

Actual Results

Cluster appears to provision successfully but slurmctld fails to start.

alfpark commented 4 years ago

Thanks for the issue report, this will be fixed in the next release.