ccoulombe opened this issue 3 years ago
The default partition is configured with oversubscription in slurm.conf.epp:
https://github.com/ComputeCanada/puppet-magic_castle/blob/a20033dc4366ca716ebe92062211fbbcaa11616b/site/profile/templates/slurm/slurm.conf.epp#L24
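For context, the rendered partition definition looks roughly like the sketch below; the partition name, node list, and other options are taken from the scontrol output further down and are only illustrative. The key point is that OverSubscribe=YES appears without a count:

```
# Illustrative rendering of the partition line from slurm.conf.epp -- only
# OverSubscribe=YES (with no count) reflects the template line linked above.
PartitionName=cpubase_bycore_b1 Nodes=node[1-2] Default=YES DefaultTime=01:00:00 DefMemPerCPU=256 OverSubscribe=YES
```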
However, the count of 4 is not specified in the template, so it must be a default applied by Slurm when oversubscription is enabled without an explicit limit.
From Slurm Partition Configuration - Oversubscribe:
[Oversubscribe=YES] May be followed with a colon and maximum number of jobs in running or suspended state. For example "OverSubscribe=YES:4" enables each node, socket or core to execute up to four jobs at once.
scontrol shows:
```
$ scontrol show part
PartitionName=cpubase_bycore_b1
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=01:00:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=compute-node[1-2],node[1-2]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=48 TotalNodes=4 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=256 MaxMemPerNode=UNLIMITED
```
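To pull out just the effective oversubscription setting, something along these lines works (partition name taken from the output above):

```
$ scontrol show partition cpubase_bycore_b1 | grep -o 'OverSubscribe=[^ ]*'
OverSubscribe=YES:4
```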
Currently the default Slurm partition contains `OverSubscribe=YES:4`, which means that up to 4 jobs can be allocated per core on a single node. Depending on the number of cores in the node, this either works fine or Slurm starts leaving jobs pending (e.g. 16 cores means up to 64 jobs can be allocated on that node). For heavier workloads, 4 jobs per core might be too many, while for lighter workloads the limit could be higher.
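As a sketch of what an explicit limit could look like (the values and node list here are illustrative, not a proposed default):

```
# Illustrative slurm.conf line: set the oversubscription count explicitly
# instead of relying on Slurm's implicit 4 when only OverSubscribe=YES is given.
# For heavier jobs, e.g.:
PartitionName=cpubase_bycore_b1 Nodes=node[1-2] Default=YES OverSubscribe=YES:2
# ...or, for lighter jobs, YES:8 instead (only one definition per partition).
```

With 16 cores per node, YES:2 would cap a node at 32 concurrent jobs and YES:8 at 128, versus 64 with the current implicit 4.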