CC = 8.4.0
Slurm Cluster-init = 3.0.1
Slurm version = 22.05.8-1
ISSUE
azure.conf uses a miscalculated DefMemPerCPU for HTC partitions (slurm.hpc = false)
STEPS TO REPRODUCE
Create a Slurm cluster with an HTC partition.
Inspect the azure.conf file:
root@jm-slurm-multi2-hn:~# cat /sched/azure.conf
# Creating dynamic nodeset and partition using slurm.dynamic_config=-Z --conf "Feature=dyn"
Nodeset=dynamicns Feature=dyn
PartitionName=dynamic Nodes=dynamicns
# Note: CycleCloud reported a RealMemory of 446273536 but we reduced it by -1 (i.e. max(1gb, -1%)) to account for OS/VM overhead which
# would result in the nodes being rejected by Slurm if they report a number less than defined here.
# To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
PartitionName=hpc Nodes=jm-slurm-mutli2-hpc-[1-3] Default=YES DefMemPerCPU=18158 MaxTime=INFINITE State=UP
Nodename=jm-slurm-mutli2-hpc-[1-3] Feature=cloud STATE=CLOUD CPUs=24 ThreadsPerCore=1 RealMemory=435814 Gres=gpu:4
# Note: CycleCloud reported a RealMemory of 3145728 but we reduced it by -1 (i.e. max(1gb, -1%)) to account for OS/VM overhead which
# would result in the nodes being rejected by Slurm if they report a number less than defined here.
# To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
PartitionName=htc Nodes=jm-slurm-mutli2-htc-[1-5] Default=NO DefMemPerCPU=3072 MaxTime=INFINITE State=UP
Nodename=jm-slurm-mutli2-htc-[1-5] Feature=cloud STATE=CLOUD CPUs=2 ThreadsPerCore=1 RealMemory=3072
The HTC partition has CPUs=2 and RealMemory=3072, which is correct. The expected DefMemPerCPU is 1536 (3072/2), but it is configured as 3072.
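The arithmetic can be checked with a short script. This is only a sketch: it assumes the expected value is RealMemory divided by CPUs, rounded down (which matches the hpc partition above, 435814 / 24 = 18158), and that a partition's Nodes= string is identical to the Nodename= string of its nodes. The function names and the SAMPLE text are illustrative and not part of azslurm.

```python
# Sketch: compare the configured DefMemPerCPU with RealMemory // CPUs.
# Assumption: floor division is the intended formula (it matches the hpc partition).

SAMPLE = """\
PartitionName=hpc Nodes=jm-slurm-mutli2-hpc-[1-3] Default=YES DefMemPerCPU=18158 MaxTime=INFINITE State=UP
Nodename=jm-slurm-mutli2-hpc-[1-3] Feature=cloud STATE=CLOUD CPUs=24 ThreadsPerCore=1 RealMemory=435814 Gres=gpu:4
PartitionName=htc Nodes=jm-slurm-mutli2-htc-[1-5] Default=NO DefMemPerCPU=3072 MaxTime=INFINITE State=UP
Nodename=jm-slurm-mutli2-htc-[1-5] Feature=cloud STATE=CLOUD CPUs=2 ThreadsPerCore=1 RealMemory=3072
"""


def parse_kv(line):
    """Split a config line into a dict of Key=Value tokens (keys lowercased)."""
    pairs = (tok.split("=", 1) for tok in line.split() if "=" in tok)
    return {key.lower(): value for key, value in pairs}


def check_def_mem_per_cpu(conf_text):
    """Return {partition: (configured, expected)} for partitions with a matching node line."""
    partitions = {}  # Nodes= string -> (partition name, configured DefMemPerCPU)
    nodes = {}       # Nodename= string -> (CPUs, RealMemory)
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        kv = parse_kv(line)
        if "partitionname" in kv and "defmempercpu" in kv and "nodes" in kv:
            partitions[kv["nodes"]] = (kv["partitionname"], int(kv["defmempercpu"]))
        elif "nodename" in kv and "cpus" in kv and "realmemory" in kv:
            nodes[kv["nodename"]] = (int(kv["cpus"]), int(kv["realmemory"]))

    results = {}
    for nodelist, (name, configured) in partitions.items():
        if nodelist in nodes:
            cpus, real_memory = nodes[nodelist]
            results[name] = (configured, real_memory // cpus)
    return results


if __name__ == "__main__":
    for name, (configured, expected) in check_def_mem_per_cpu(SAMPLE).items():
        status = "OK" if configured == expected else "MISMATCH"
        print(f"{name}: configured={configured} expected={expected} {status}")
```

Against the lines from this report, the hpc partition is consistent and only the htc partition is flagged as a mismatch.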
WORKAROUND
Manually update azure.conf whenever azslurm scale is run.