Problem Description
If default is used as a single partition name in slurm.yaml (under elastic_partitions:), the slurmctld controller fails to start. /var/log/slurm/slurmctld.log suggests that the PartitionName in slurm.conf is missing or invalid.
It turns out that default is not a valid partition name; it is used for setting defaults for all partitions (e.g., https://www.mail-archive.com/slurm-dev@schedmd.com/msg08392.html).
Batch Shipyard Version
3.9.0
Steps to Reproduce
Take a working configuration with a single Batch Pool and a single partition, and change the elastic_partition in slurm.yaml to be named default.
Provision the cluster and attempt to run an sbatch job (it should fail).
Log in to the controller node (shipyard slurm ssh controller), confirm that slurmctld isn't running, and check the log with sudo tail /var/log/slurm/slurmctld.log.
Expected Results
Expect shipyard cluster create to fail fast during schema validation if there are partitions named default.
Actual Results
Cluster appears to provision successfully but slurmctld fails to start.
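The fail-fast check requested under Expected Results could look roughly like the sketch below. It is not Batch Shipyard's actual validation code. The function names and the RESERVED_PARTITION_NAMES set are hypothetical, and it assumes the elastic_partitions section has already been parsed into a mapping whose keys are the partition names. The case-insensitive comparison is also an assumption, chosen because Slurm documents the reserved name as DEFAULT while this report hit the problem with lowercase default.

```python
from typing import Any, List, Mapping

# Hypothetical constant: names that cannot be used for a real partition
# because Slurm treats PartitionName=DEFAULT as a record of default values.
RESERVED_PARTITION_NAMES = frozenset({"default"})


def find_reserved_partition_names(
    elastic_partitions: Mapping[str, Any],
) -> List[str]:
    """Return the partition names that collide with a reserved name.

    The comparison ignores case and surrounding whitespace (an assumption,
    see the note above the code).
    """
    return sorted(
        str(name)
        for name in elastic_partitions
        if str(name).strip().lower() in RESERVED_PARTITION_NAMES
    )


def validate_partition_names(elastic_partitions: Mapping[str, Any]) -> None:
    """Raise ValueError before provisioning if a partition name is reserved."""
    reserved = find_reserved_partition_names(elastic_partitions)
    if reserved:
        raise ValueError(
            "invalid elastic partition name(s): {}; 'default' is reserved "
            "by Slurm for partition defaults".format(", ".join(reserved))
        )
```

Calling validate_partition_names({"default": {}}) raises ValueError, while validate_partition_names({"partition_1": {}}) returns without error, so a check like this could run before any cluster resources are created.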