Hi, it seems that autoscaling no longer works with CentOS 8.

Tested with:
CycleCloud version: 8.1.0-1275, cyclecloud-slurm: 2.4.2

Results:
CentOS 8 + Slurm 20.11.0-1 = no autoscaling
CentOS 7 + Slurm 20.11.0-1 = autoscaling works

Details captured from the failing cluster below.
[root@ip-0A781804 slurmctld]# cat slurmctld.log
[2020-12-18T00:15:27.016] debug: Log file re-opened
[2020-12-18T00:15:27.020] debug: creating clustername file: /var/spool/slurmd/clustername
[2020-12-18T00:15:27.021] error: Configured MailProg is invalid
[2020-12-18T00:15:27.021] slurmctld version 20.11.0 started on cluster asdasd
[2020-12-18T00:15:27.021] cred/munge: init: Munge credential signature plugin loaded
[2020-12-18T00:15:27.021] debug: auth/munge: init: Munge authentication plugin loaded
[2020-12-18T00:15:27.021] select/cons_res: common_init: select/cons_res loaded
[2020-12-18T00:15:27.021] select/cons_tres: common_init: select/cons_tres loaded
[2020-12-18T00:15:27.021] select/cray_aries: init: Cray/Aries node selection plugin loaded
[2020-12-18T00:15:27.021] select/linear: init: Linear node selection plugin loaded with argument 20
[2020-12-18T00:15:27.021] preempt/none: init: preempt/none loaded
[2020-12-18T00:15:27.021] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2020-12-18T00:15:27.021] debug: acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
[2020-12-18T00:15:27.021] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2020-12-18T00:15:27.021] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2020-12-18T00:15:27.022] debug: jobacct_gather/none: init: Job accounting gather NOT_INVOKED plugin loaded
[2020-12-18T00:15:27.022] ext_sensors/none: init: ExtSensors NONE plugin loaded
[2020-12-18T00:15:27.022] debug: switch/none: init: switch NONE plugin loaded
[2020-12-18T00:15:27.022] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:15:27.022] accounting_storage/none: init: Accounting storage NOT INVOKED plugin loaded
[2020-12-18T00:15:27.022] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/assoc_usage`, No such file or directory
[2020-12-18T00:15:27.022] debug: Reading slurm.conf file: /etc/slurm/slurm.conf
[2020-12-18T00:15:27.023] debug: NodeNames=hpc-pg0-[1-4] setting Sockets=60 based on CPUs(60)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2020-12-18T00:15:27.023] debug: NodeNames=htc-[1-5] setting Sockets=60 based on CPUs(60)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2020-12-18T00:15:27.023] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2020-12-18T00:15:27.023] topology/tree: init: topology tree plugin loaded
[2020-12-18T00:15:27.023] debug: No DownNodes
[2020-12-18T00:15:27.023] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/last_config_lite`, No such file or directory
[2020-12-18T00:15:27.140] debug: Log file re-opened
[2020-12-18T00:15:27.141] sched: Backfill scheduler plugin loaded
[2020-12-18T00:15:27.141] debug: topology/tree: _read_topo_file: Reading the topology.conf file
[2020-12-18T00:15:27.141] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
[2020-12-18T00:15:27.141] debug: topology/tree: _log_switches: Switch level:0 name:hpc-Standard_HB60rs-pg0 nodes:hpc-pg0-[1-4] switches:(null)
[2020-12-18T00:15:27.141] debug: topology/tree: _log_switches: Switch level:0 name:htc nodes:htc-[1-5] switches:(null)
[2020-12-18T00:15:27.141] route/default: init: route default plugin loaded
[2020-12-18T00:15:27.141] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/node_state`, No such file or directory
[2020-12-18T00:15:27.141] error: Could not open node state file /var/spool/slurmd/node_state: No such file or directory
[2020-12-18T00:15:27.141] error: NOTE: Trying backup state save file. Information may be lost!
[2020-12-18T00:15:27.141] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/node_state.old`, No such file or directory
[2020-12-18T00:15:27.141] No node state file (/var/spool/slurmd/node_state.old) to recover
[2020-12-18T00:15:27.141] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state`, No such file or directory
[2020-12-18T00:15:27.141] error: Could not open job state file /var/spool/slurmd/job_state: No such file or directory
[2020-12-18T00:15:27.141] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-12-18T00:15:27.141] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state.old`, No such file or directory
[2020-12-18T00:15:27.142] No job state file (/var/spool/slurmd/job_state.old) to recover
[2020-12-18T00:15:27.142] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gpu/generic: init: init: GPU Generic plugin loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: Updating partition uid access list
[2020-12-18T00:15:27.142] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/resv_state`, No such file or directory
[2020-12-18T00:15:27.143] error: Could not open reservation state file /var/spool/slurmd/resv_state: No such file or directory
[2020-12-18T00:15:27.143] error: NOTE: Trying backup state save file. Reservations may be lost
[2020-12-18T00:15:27.143] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/resv_state.old`, No such file or directory
[2020-12-18T00:15:27.143] No reservation state file (/var/spool/slurmd/resv_state.old) to recover
[2020-12-18T00:15:27.143] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/trigger_state`, No such file or directory
[2020-12-18T00:15:27.143] error: Could not open trigger state file /var/spool/slurmd/trigger_state: No such file or directory
[2020-12-18T00:15:27.143] error: NOTE: Trying backup state save file. Triggers may be lost!
[2020-12-18T00:15:27.143] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/trigger_state.old`, No such file or directory
[2020-12-18T00:15:27.143] No trigger state file (/var/spool/slurmd/trigger_state.old) to recover
[2020-12-18T00:15:27.143] read_slurm_conf: backup_controller not specified
[2020-12-18T00:15:27.143] Reinitializing job accounting state
[2020-12-18T00:15:27.143] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2020-12-18T00:15:27.143] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-18T00:15:27.143] Running as primary controller
[2020-12-18T00:15:27.143] debug: No backup controllers, not launching heartbeat.
[2020-12-18T00:15:27.143] debug: priority/basic: init: Priority BASIC plugin loaded
[2020-12-18T00:15:27.143] No parameter for mcs plugin, default values set
[2020-12-18T00:15:27.143] mcs: MCSParameters = (null). ondemand set.
[2020-12-18T00:15:27.143] debug: mcs/none: init: mcs none plugin loaded
[2020-12-18T00:15:57.143] debug: sched/backfill: _attempt_backfill: beginning
[2020-12-18T00:15:57.143] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2020-12-18T00:16:27.212] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-12-18T00:16:27.212] debug: sched: Running job scheduler
[2020-12-18T00:17:27.284] debug: sched: Running job scheduler
[2020-12-18T00:17:27.285] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:18:27.362] debug: sched: Running job scheduler
[2020-12-18T00:19:27.438] debug: sched: Running job scheduler
[2020-12-18T00:19:27.438] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:20:27.512] debug: sched: Running job scheduler
[2020-12-18T00:20:27.513] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state`, No such file or directory
[2020-12-18T00:20:27.513] error: Could not open job state file /var/spool/slurmd/job_state: No such file or directory
[2020-12-18T00:20:27.513] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-12-18T00:20:27.513] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state.old`, No such file or directory
[2020-12-18T00:20:27.513] No job state file (/var/spool/slurmd/job_state.old) found
[2020-12-18T00:21:27.676] debug: sched: Running job scheduler
[2020-12-18T00:21:27.676] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:22:27.749] debug: sched: Running job scheduler
[2020-12-18T00:23:27.821] debug: sched: Running job scheduler
[2020-12-18T00:23:27.821] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:24:27.892] debug: sched: Running job scheduler
[2020-12-18T00:25:27.966] debug: Updating partition uid access list
[2020-12-18T00:25:27.966] debug: sched: Running job scheduler
[2020-12-18T00:25:28.067] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:26:27.139] debug: sched: Running job scheduler
[2020-12-18T00:27:27.212] debug: sched: Running job scheduler
[2020-12-18T00:27:28.214] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:28:27.286] debug: sched: Running job scheduler
[2020-12-18T00:29:27.361] debug: sched: Running job scheduler
[2020-12-18T00:29:28.362] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:30:27.437] debug: sched: Running job scheduler
[2020-12-18T00:30:55.711] req_switch=-2 network='(null)'
[2020-12-18T00:30:55.711] Setting reqswitch to 1.
[2020-12-18T00:30:55.711] returning.
[2020-12-18T00:30:55.712] sched: _slurm_rpc_allocate_resources JobId=2 NodeList=htc-1 usec=1268
[2020-12-18T00:30:56.261] debug: sched/backfill: _attempt_backfill: beginning
[2020-12-18T00:30:56.261] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2020-12-18T00:30:57.263] error: power_save: program exit status of 1
[2020-12-18T00:31:27.588] debug: sched: Running job scheduler
[2020-12-18T00:31:28.589] debug: shutting down backup controllers (my index: 0)
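
The relevant failure in this log is the power_save error at 00:30:57: JobId=2 gets allocated to htc-1, but the ResumeProgram exits with status 1, so the cloud node is never powered up. As a next step (a manual check, not part of the capture above), the resume script can be run by hand as the Slurm user with the node name as its argument, the same way slurmctld invokes it, to see its output and exit status:

sudo -u slurm /opt/cycle/jetpack/system/bootstrap/slurm/resume_program.sh htc-1; echo "exit status: $?"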
[root@ip-0A781804 slurmctld]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2       htc hostname  andreim CF       1:37      1 htc-1
[root@ip-0A781804 slurmctld]# sinfo -V
slurm 20.11.0
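
The job stays in CF (configuring) because htc-1 never finishes powering up. An additional check (not captured here) is the node record itself; its State and Reason fields show whether the node is still flagged as powering up or has been marked down after the failed resume:

scontrol show node htc-1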
[root@ip-0A781804 slurmctld]# systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/slurmctld.service.d
└─override.conf
Active: active (running) since Fri 2020-12-18 00:15:26 UTC; 17min ago
Main PID: 3980 (slurmctld)
Tasks: 8
Memory: 5.5M
CGroup: /system.slice/slurmctld.service
└─3980 /usr/sbin/slurmctld -D
Dec 18 00:15:26 ip-0A781804 systemd[1]: Started Slurm controller daemon.
[root@ip-0A781804 slurm]# cat topology.conf
SwitchName=hpc-Standard_HB60rs-pg0 Nodes=hpc-pg0-[1-4]
SwitchName=htc Nodes=htc-[1-5]
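
Side note: the "no switch can reach all nodes through its descendants" warning in slurmctld.log comes from these two top-level switches having no common parent, and is most likely unrelated to the autoscaling problem. If a single root were wanted, topology.conf allows aggregating switches under a parent switch, e.g. (a sketch, with the name "root" chosen arbitrarily):

SwitchName=root Switches=hpc-Standard_HB60rs-pg0,htc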
[root@ip-0A781804 slurm]# cat slurm.conf
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=2
PropagateResourceLimits=ALL
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser="slurm"
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
GresTypes=gpu
SelectTypeParameters=CR_Core_Memory
ClusterName="ASDASD"
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld/slurmctld.log
SlurmctldParameters=idle_on_node_suspend
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd/slurmd.log
TopologyPlugin=topology/tree
JobSubmitPlugins=job_submit/cyclecloud
PrivateData=cloud
TreeWidth=65533
ResumeTimeout=1800
SuspendTimeout=600
SuspendTime=300
ResumeProgram=/opt/cycle/jetpack/system/bootstrap/slurm/resume_program.sh
ResumeFailProgram=/opt/cycle/jetpack/system/bootstrap/slurm/resume_fail_program.sh
SuspendProgram=/opt/cycle/jetpack/system/bootstrap/slurm/suspend_program.sh
SchedulerParameters=max_switch_wait=24:00:00
AccountingStorageType=accounting_storage/none
Include cyclecloud.conf
SlurmctldHost=ip-0A781804
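
Side note on the "Configured MailProg is invalid" error in the log: no MailProg is set above, so slurmctld falls back to its default mail binary, which may not be present on a minimal CentOS 8 image. This is cosmetic and unrelated to autoscaling; if desired, it can be silenced by pointing MailProg at any existing executable, e.g.:

MailProg=/usr/bin/true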
[root@ip-0A781804 slurm]# cat cyclecloud.conf
# Note: CycleCloud reported a RealMemory of 228884 but we reduced it by 11444 (i.e. max(1gb, 5%)) to account for OS/VM overhead which
# would result in the nodes being rejected by Slurm if they report a number less than defined here.
# To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
PartitionName=hpc Nodes=hpc-pg0-[1-4] Default=YES DefMemPerCPU=3624 MaxTime=INFINITE State=UP
Nodename=hpc-pg0-[1-4] Feature=cloud STATE=CLOUD CPUs=60 CoresPerSocket=1 RealMemory=217440
# Note: CycleCloud reported a RealMemory of 228884 but we reduced it by 11444 (i.e. max(1gb, 5%)) to account for OS/VM overhead which
# would result in the nodes being rejected by Slurm if they report a number less than defined here.
# To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
PartitionName=htc Nodes=htc-[1-5] Default=NO DefMemPerCPU=3624 MaxTime=INFINITE State=UP
Nodename=htc-[1-5] Feature=cloud STATE=CLOUD CPUs=60 CoresPerSocket=1 RealMemory=217440
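
For reference, the memory-dampening arithmetic in the comments above is consistent: 5% of 228884 MB is 11444 MB, which is larger than 1 GB, so max(1gb, 5%) = 11444, and 228884 - 11444 = 217440, matching RealMemory=217440.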