Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

Slurm 3.0.1 DefMemPerCpu miscalculated for htc partitions #120

Closed by themorey 1 year ago

themorey commented 1 year ago

CC = 8.4.0
Slurm Cluster-init = 3.0.1
Slurm version = 22.05.8-1

ISSUE: azure.conf uses a miscalculated DefMemPerCPU for HTC partitions (slurm.hpc = false).

STEPS TO REPRODUCE

  1. Create a Slurm cluster with HTC partition

  2. Inspect the azure.conf file:

    root@jm-slurm-multi2-hn:~# cat /sched/azure.conf
    # Creating dynamic nodeset and partition using slurm.dynamic_config=-Z --conf "Feature=dyn"
    Nodeset=dynamicns Feature=dyn
    PartitionName=dynamic Nodes=dynamicns
    # Note: CycleCloud reported a RealMemory of 446273536 but we reduced it by -1 (i.e. max(1gb, -1%)) to account for OS/VM overhead which
    # would result in the nodes being rejected by Slurm if they report a number less than defined here.
    # To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
    PartitionName=hpc Nodes=jm-slurm-mutli2-hpc-[1-3] Default=YES DefMemPerCPU=18158 MaxTime=INFINITE State=UP
    Nodename=jm-slurm-mutli2-hpc-[1-3] Feature=cloud STATE=CLOUD CPUs=24 ThreadsPerCore=1 RealMemory=435814 Gres=gpu:4
    # Note: CycleCloud reported a RealMemory of 3145728 but we reduced it by -1 (i.e. max(1gb, -1%)) to account for OS/VM overhead which
    # would result in the nodes being rejected by Slurm if they report a number less than defined here.
    # To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
    PartitionName=htc Nodes=jm-slurm-mutli2-htc-[1-5] Default=NO DefMemPerCPU=3072 MaxTime=INFINITE State=UP
    Nodename=jm-slurm-mutli2-htc-[1-5] Feature=cloud STATE=CLOUD CPUs=2 ThreadsPerCore=1 RealMemory=3072
  3. The HTC partition has CPUs=2 and RealMemory=3072, which is correct. The expected DefMemPerCPU is 1536 (3072/2), but it is set to 3072. The hpc partition, by contrast, is consistent: 435814/24 rounds down to 18158.
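
The check in step 3 can be sketched as a small script. This is only an illustration, not the cyclecloud-slurm implementation: the function names `parse_kv` and `check_def_mem_per_cpu` are hypothetical, the sample text is trimmed from the azure.conf output above, and it assumes that DefMemPerCPU should equal RealMemory divided by CPUs, rounded down, and that a partition's `Nodes=` value matches the `Nodename=` expression exactly.

```python
# Sample lines copied from the azure.conf output in this issue (comments trimmed).
SAMPLE_CONF = """\
Nodeset=dynamicns Feature=dyn
PartitionName=dynamic Nodes=dynamicns
PartitionName=hpc Nodes=jm-slurm-mutli2-hpc-[1-3] Default=YES DefMemPerCPU=18158 MaxTime=INFINITE State=UP
Nodename=jm-slurm-mutli2-hpc-[1-3] Feature=cloud STATE=CLOUD CPUs=24 ThreadsPerCore=1 RealMemory=435814 Gres=gpu:4
PartitionName=htc Nodes=jm-slurm-mutli2-htc-[1-5] Default=NO DefMemPerCPU=3072 MaxTime=INFINITE State=UP
Nodename=jm-slurm-mutli2-htc-[1-5] Feature=cloud STATE=CLOUD CPUs=2 ThreadsPerCore=1 RealMemory=3072
"""


def parse_kv(line):
    """Split a config line into a dict of KEY=value tokens, keys lowercased."""
    tokens = (tok.split("=", 1) for tok in line.split() if "=" in tok)
    return {key.lower(): value for key, value in tokens}


def check_def_mem_per_cpu(conf_text):
    """Return {partition: (configured, expected)} for mismatched partitions.

    Assumption: expected DefMemPerCPU = RealMemory // CPUs of the node
    definition whose Nodename equals the partition's Nodes value.
    """
    expected_by_nodes = {}
    partitions = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        kv = parse_kv(line)
        if "nodename" in kv and "realmemory" in kv and "cpus" in kv:
            expected_by_nodes[kv["nodename"]] = int(kv["realmemory"]) // int(kv["cpus"])
        elif "partitionname" in kv and "defmempercpu" in kv:
            partitions.append((kv["partitionname"], kv.get("nodes"), int(kv["defmempercpu"])))

    mismatches = {}
    for name, nodes, configured in partitions:
        expected = expected_by_nodes.get(nodes)
        if expected is not None and configured != expected:
            mismatches[name] = (configured, expected)
    return mismatches


if __name__ == "__main__":
    for name, (configured, expected) in check_def_mem_per_cpu(SAMPLE_CONF).items():
        print(f"{name}: DefMemPerCPU={configured}, expected {expected}")
```

Run against the sample, only the htc partition is flagged; the hpc partition passes because 435814 // 24 is 18158.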

WORKAROUND: Manually update azure.conf whenever azslurm scale is run.
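
For anyone stuck on 3.0.1, the manual edit could be scripted along these lines. This is a rough sketch, not part of azslurm: `patch_def_mem_per_cpu` is a hypothetical helper, and it assumes the correct value is RealMemory // CPUs taken from the `Nodename=` line whose name matches the partition's `Nodes=` value.

```python
import re


def patch_def_mem_per_cpu(conf_text):
    """Return conf_text with DefMemPerCPU rewritten to RealMemory // CPUs.

    Assumption: each partition line has the form
    'PartitionName=<name> Nodes=<expr> ...' and <expr> matches a Nodename= line.
    """
    lines = conf_text.splitlines()

    # First pass: map each Nodename expression to its per-CPU memory.
    per_cpu = {}
    for line in lines:
        node = re.match(r"\s*Nodename=(\S+)", line, re.IGNORECASE)
        if not node:
            continue
        mem = re.search(r"\bRealMemory=(\d+)", line, re.IGNORECASE)
        cpus = re.search(r"\bCPUs=(\d+)", line, re.IGNORECASE)
        if mem and cpus:
            per_cpu[node.group(1)] = int(mem.group(1)) // int(cpus.group(1))

    # Second pass: rewrite DefMemPerCPU on partition lines with a known node set.
    patched = []
    for line in lines:
        part = re.match(r"\s*PartitionName=\S+\s+Nodes=(\S+)", line, re.IGNORECASE)
        if part and part.group(1) in per_cpu:
            line = re.sub(
                r"\bDefMemPerCPU=\d+",
                f"DefMemPerCPU={per_cpu[part.group(1)]}",
                line,
                flags=re.IGNORECASE,
            )
        patched.append(line)

    trailing_newline = "\n" if conf_text.endswith("\n") else ""
    return "\n".join(patched) + trailing_newline
```

The result would be written back to /sched/azure.conf, presumably followed by `scontrol reconfigure` so that Slurm rereads its configuration. As the workaround notes, this has to be repeated after each azslurm scale run.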

ryanhamel commented 1 year ago

Fixed in 3.0.3