ComputeCanada / puppet-magic_castle

Puppet Environment repo for Magic Castle - https://github.com/ComputeCanada/magic_castle
MIT License

Allow oversubscribe to be configurable #124

Open ccoulombe opened 3 years ago

ccoulombe commented 3 years ago

Currently the default Slurm partition contains OverSubscribe=YES:4, which means that up to 4 jobs can be allocated per core on a single node (e.g. 16 cores means 64 jobs can be allocated on that node). Depending on the number of cores in the node, this either works fine or Slurm starts leaving additional jobs pending.

For heavier workloads, 4 jobs per core might be too many, while for lighter workloads the limit could be higher.
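
One way this could be made configurable (a minimal sketch, assuming a hypothetical $oversubscribe parameter on the profile class, wired into the existing slurm.conf.epp template; the names below are illustrative, not the actual repo code):

class profile::slurm::base (
  # Hypothetical parameter, overridable from hieradata.
  # Accepts any valid OverSubscribe value: 'NO', 'YES', 'YES:2', 'FORCE:4'.
  String $oversubscribe = 'YES:4',
) {
  file { '/etc/slurm/slurm.conf':
    content => epp('profile/slurm/slurm.conf.epp',
      { 'oversubscribe' => $oversubscribe }
    ),
  }
}

and in slurm.conf.epp the partition line would interpolate it (other partition options omitted here):

<%- | String $oversubscribe = 'YES:4' | -%>
PartitionName=DEFAULT OverSubscribe=<%= $oversubscribe %>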

cmd-ntrf commented 3 years ago

The default partition with oversubscription in slurm.conf.epp: https://github.com/ComputeCanada/puppet-magic_castle/blob/a20033dc4366ca716ebe92062211fbbcaa11616b/site/profile/templates/slurm/slurm.conf.epp#L24

However, the number 4 is not specified there, so it must be a default value applied by Slurm when oversubscription is enabled.

From Slurm Partition Configuration - Oversubscribe:

[Oversubscribe=YES] May be followed with a colon and maximum number of jobs in running or suspended state. For example "OverSubscribe=YES:4" enables each node, socket or core to execute up to four jobs at once.
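
Concretely, the colon syntax looks like this in slurm.conf (illustrative partition lines; omitting the count falls back to Slurm's default of 4, which matches the YES:4 observed below):

# No oversubscription:
PartitionName=cpu OverSubscribe=NO
# Oversubscription with Slurm's default job count (4):
PartitionName=cpu OverSubscribe=YES
# Oversubscription capped at 2 jobs per core/socket/node:
PartitionName=cpu OverSubscribe=YES:2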

ccoulombe commented 3 years ago

scontrol shows:

$ scontrol show part
PartitionName=cpubase_bycore_b1
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=compute-node[1-2],node[1-2]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=48 TotalNodes=4 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=256 MaxMemPerNode=UNLIMITED
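
For experimenting before making it configurable, the value can also be changed on a live cluster without touching the template (assuming admin privileges and a Slurm version whose scontrol accepts OverSubscribe in partition updates; the change does not survive a slurmctld restart with an unmodified slurm.conf):

$ sudo scontrol update PartitionName=cpubase_bycore_b1 OverSubscribe=YES:2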