Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

No autoscaling when using Centos8 base image #49

Open mahalel opened 3 years ago

mahalel commented 3 years ago

Hi, it seems that autoscaling no longer works with CentOS 8.

Tested with:

CycleCloud version: 8.1.0-1275
cyclecloud-slurm: 2.4.2

Results:

CentOS 8 + Slurm 20.11.0-1 = no autoscaling
CentOS 7 + Slurm 20.11.0-1 = autoscaling works
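
The failure is visible from the submit side: on the CentOS 8 cluster a job allocated to a powered-down cloud node sits in CF (configuring) and the node is never started in CycleCloud (see the squeue output further down). A minimal reproduction sketch, assuming the htc partition defined in cyclecloud.conf below (commands are illustrative, not the exact ones used for this report):

srun -p htc hostname    # any trivial job that lands on a powered-down cloud node
squeue                  # on the CentOS 8 cluster the job stays in CF and the node never starts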

[root@ip-0A781804 slurmctld]# cat slurmctld.log
[2020-12-18T00:15:27.016] debug:  Log file re-opened
[2020-12-18T00:15:27.020] debug:  creating clustername file: /var/spool/slurmd/clustername
[2020-12-18T00:15:27.021] error: Configured MailProg is invalid
[2020-12-18T00:15:27.021] slurmctld version 20.11.0 started on cluster asdasd
[2020-12-18T00:15:27.021] cred/munge: init: Munge credential signature plugin loaded
[2020-12-18T00:15:27.021] debug:  auth/munge: init: Munge authentication plugin loaded
[2020-12-18T00:15:27.021] select/cons_res: common_init: select/cons_res loaded
[2020-12-18T00:15:27.021] select/cons_tres: common_init: select/cons_tres loaded
[2020-12-18T00:15:27.021] select/cray_aries: init: Cray/Aries node selection plugin loaded
[2020-12-18T00:15:27.021] select/linear: init: Linear node selection plugin loaded with argument 20
[2020-12-18T00:15:27.021] preempt/none: init: preempt/none loaded
[2020-12-18T00:15:27.021] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2020-12-18T00:15:27.021] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2020-12-18T00:15:27.021] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2020-12-18T00:15:27.021] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2020-12-18T00:15:27.022] debug:  jobacct_gather/none: init: Job accounting gather NOT_INVOKED plugin loaded
[2020-12-18T00:15:27.022] ext_sensors/none: init: ExtSensors NONE plugin loaded
[2020-12-18T00:15:27.022] debug:  switch/none: init: switch NONE plugin loaded
[2020-12-18T00:15:27.022] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:15:27.022] accounting_storage/none: init: Accounting storage NOT INVOKED plugin loaded
[2020-12-18T00:15:27.022] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/assoc_usage`, No such file or directory
[2020-12-18T00:15:27.022] debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
[2020-12-18T00:15:27.023] debug:  NodeNames=hpc-pg0-[1-4] setting Sockets=60 based on CPUs(60)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2020-12-18T00:15:27.023] debug:  NodeNames=htc-[1-5] setting Sockets=60 based on CPUs(60)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2020-12-18T00:15:27.023] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2020-12-18T00:15:27.023] topology/tree: init: topology tree plugin loaded
[2020-12-18T00:15:27.023] debug:  No DownNodes
[2020-12-18T00:15:27.023] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/last_config_lite`, No such file or directory
[2020-12-18T00:15:27.140] debug:  Log file re-opened
[2020-12-18T00:15:27.141] sched: Backfill scheduler plugin loaded
[2020-12-18T00:15:27.141] debug:  topology/tree: _read_topo_file: Reading the topology.conf file
[2020-12-18T00:15:27.141] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
[2020-12-18T00:15:27.141] debug:  topology/tree: _log_switches: Switch level:0 name:hpc-Standard_HB60rs-pg0 nodes:hpc-pg0-[1-4] switches:(null)
[2020-12-18T00:15:27.141] debug:  topology/tree: _log_switches: Switch level:0 name:htc nodes:htc-[1-5] switches:(null)
[2020-12-18T00:15:27.141] route/default: init: route default plugin loaded
[2020-12-18T00:15:27.141] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/node_state`, No such file or directory
[2020-12-18T00:15:27.141] error: Could not open node state file /var/spool/slurmd/node_state: No such file or directory
[2020-12-18T00:15:27.141] error: NOTE: Trying backup state save file. Information may be lost!
[2020-12-18T00:15:27.141] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/node_state.old`, No such file or directory
[2020-12-18T00:15:27.141] No node state file (/var/spool/slurmd/node_state.old) to recover
[2020-12-18T00:15:27.141] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state`, No such file or directory
[2020-12-18T00:15:27.141] error: Could not open job state file /var/spool/slurmd/job_state: No such file or directory
[2020-12-18T00:15:27.141] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-12-18T00:15:27.141] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state.old`, No such file or directory
[2020-12-18T00:15:27.142] No job state file (/var/spool/slurmd/job_state.old) to recover
[2020-12-18T00:15:27.142] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gpu/generic: init: init: GPU Generic plugin loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  Updating partition uid access list
[2020-12-18T00:15:27.142] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/resv_state`, No such file or directory
[2020-12-18T00:15:27.143] error: Could not open reservation state file /var/spool/slurmd/resv_state: No such file or directory
[2020-12-18T00:15:27.143] error: NOTE: Trying backup state save file. Reservations may be lost
[2020-12-18T00:15:27.143] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/resv_state.old`, No such file or directory
[2020-12-18T00:15:27.143] No reservation state file (/var/spool/slurmd/resv_state.old) to recover
[2020-12-18T00:15:27.143] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/trigger_state`, No such file or directory
[2020-12-18T00:15:27.143] error: Could not open trigger state file /var/spool/slurmd/trigger_state: No such file or directory
[2020-12-18T00:15:27.143] error: NOTE: Trying backup state save file. Triggers may be lost!
[2020-12-18T00:15:27.143] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/trigger_state.old`, No such file or directory
[2020-12-18T00:15:27.143] No trigger state file (/var/spool/slurmd/trigger_state.old) to recover
[2020-12-18T00:15:27.143] read_slurm_conf: backup_controller not specified
[2020-12-18T00:15:27.143] Reinitializing job accounting state
[2020-12-18T00:15:27.143] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2020-12-18T00:15:27.143] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-18T00:15:27.143] Running as primary controller
[2020-12-18T00:15:27.143] debug:  No backup controllers, not launching heartbeat.
[2020-12-18T00:15:27.143] debug:  priority/basic: init: Priority BASIC plugin loaded
[2020-12-18T00:15:27.143] No parameter for mcs plugin, default values set
[2020-12-18T00:15:27.143] mcs: MCSParameters = (null). ondemand set.
[2020-12-18T00:15:27.143] debug:  mcs/none: init: mcs none plugin loaded
[2020-12-18T00:15:57.143] debug:  sched/backfill: _attempt_backfill: beginning
[2020-12-18T00:15:57.143] debug:  sched/backfill: _attempt_backfill: no jobs to backfill
[2020-12-18T00:16:27.212] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-12-18T00:16:27.212] debug:  sched: Running job scheduler
[2020-12-18T00:17:27.284] debug:  sched: Running job scheduler
[2020-12-18T00:17:27.285] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:18:27.362] debug:  sched: Running job scheduler
[2020-12-18T00:19:27.438] debug:  sched: Running job scheduler
[2020-12-18T00:19:27.438] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:20:27.512] debug:  sched: Running job scheduler
[2020-12-18T00:20:27.513] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state`, No such file or directory
[2020-12-18T00:20:27.513] error: Could not open job state file /var/spool/slurmd/job_state: No such file or directory
[2020-12-18T00:20:27.513] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-12-18T00:20:27.513] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state.old`, No such file or directory
[2020-12-18T00:20:27.513] No job state file (/var/spool/slurmd/job_state.old) found
[2020-12-18T00:21:27.676] debug:  sched: Running job scheduler
[2020-12-18T00:21:27.676] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:22:27.749] debug:  sched: Running job scheduler
[2020-12-18T00:23:27.821] debug:  sched: Running job scheduler
[2020-12-18T00:23:27.821] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:24:27.892] debug:  sched: Running job scheduler
[2020-12-18T00:25:27.966] debug:  Updating partition uid access list
[2020-12-18T00:25:27.966] debug:  sched: Running job scheduler
[2020-12-18T00:25:28.067] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:26:27.139] debug:  sched: Running job scheduler
[2020-12-18T00:27:27.212] debug:  sched: Running job scheduler
[2020-12-18T00:27:28.214] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:28:27.286] debug:  sched: Running job scheduler
[2020-12-18T00:29:27.361] debug:  sched: Running job scheduler
[2020-12-18T00:29:28.362] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:30:27.437] debug:  sched: Running job scheduler
[2020-12-18T00:30:55.711] req_switch=-2 network='(null)'
[2020-12-18T00:30:55.711] Setting reqswitch to 1.
[2020-12-18T00:30:55.711] returning.
[2020-12-18T00:30:55.712] sched: _slurm_rpc_allocate_resources JobId=2 NodeList=htc-1 usec=1268
[2020-12-18T00:30:56.261] debug:  sched/backfill: _attempt_backfill: beginning
[2020-12-18T00:30:56.261] debug:  sched/backfill: _attempt_backfill: no jobs to backfill
[2020-12-18T00:30:57.263] error: power_save: program exit status of 1
[2020-12-18T00:31:27.588] debug:  sched: Running job scheduler
[2020-12-18T00:31:28.589] debug:  shutting down backup controllers (my index: 0)
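
The telling line in the log above is the power_save error at 00:30:57: slurmctld calls the CycleCloud ResumeProgram for htc-1 and it exits with status 1, so the VM is never created and the job stays in CF. One way to narrow down where it fails (a sketch, not output captured from this cluster) is to run the resume program by hand for the node Slurm is trying to power up, ideally as the slurm user since that is how slurmctld invokes it:

/opt/cycle/jetpack/system/bootstrap/slurm/resume_program.sh htc-1   # same script configured as ResumeProgram below
echo $?                                                             # a nonzero status here matches the power_save error above
# cyclecloud-slurm also writes its own resume/autoscale logs; in this setup they should be
# next to slurmctld.log (e.g. /var/log/slurmctld/resume.log, if present).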
[root@ip-0A781804 slurmctld]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2       htc hostname  andreim CF       1:37      1 htc-1

[root@ip-0A781804 slurmctld]# sinfo -V
slurm 20.11.0

[root@ip-0A781804 slurmctld]# systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmctld.service.d
           └─override.conf
   Active: active (running) since Fri 2020-12-18 00:15:26 UTC; 17min ago
 Main PID: 3980 (slurmctld)
    Tasks: 8
   Memory: 5.5M
   CGroup: /system.slice/slurmctld.service
           └─3980 /usr/sbin/slurmctld -D

Dec 18 00:15:26 ip-0A781804 systemd[1]: Started Slurm controller daemon.
[root@ip-0A781804 slurm]# cat topology.conf
SwitchName=hpc-Standard_HB60rs-pg0 Nodes=hpc-pg0-[1-4]
SwitchName=htc Nodes=htc-[1-5]

[root@ip-0A781804 slurm]# cat slurm.conf
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=2
PropagateResourceLimits=ALL
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser="slurm"
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
GresTypes=gpu
SelectTypeParameters=CR_Core_Memory
ClusterName="ASDASD"
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld/slurmctld.log
SlurmctldParameters=idle_on_node_suspend
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd/slurmd.log
TopologyPlugin=topology/tree
JobSubmitPlugins=job_submit/cyclecloud
PrivateData=cloud
TreeWidth=65533
ResumeTimeout=1800
SuspendTimeout=600
SuspendTime=300
ResumeProgram=/opt/cycle/jetpack/system/bootstrap/slurm/resume_program.sh
ResumeFailProgram=/opt/cycle/jetpack/system/bootstrap/slurm/resume_fail_program.sh
SuspendProgram=/opt/cycle/jetpack/system/bootstrap/slurm/suspend_program.sh
SchedulerParameters=max_switch_wait=24:00:00
AccountingStorageType=accounting_storage/none
Include cyclecloud.conf
SlurmctldHost=ip-0A781804
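
All of the autoscale plumbing is in the Resume*/Suspend* settings above. As a sanity check (not part of the original report), the values the running controller actually loaded can be compared against the file with scontrol:

scontrol show config | grep -Ei 'resumeprogram|resumefailprogram|resumetimeout|suspendprogram|suspendtime|suspendtimeout'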

[root@ip-0A781804 slurm]# cat cyclecloud.conf
# Note: CycleCloud reported a RealMemory of 228884 but we reduced it by 11444 (i.e. max(1gb, 5%)) to account for OS/VM overhead which
# would result in the nodes being rejected by Slurm if they report a number less than defined here.
# To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
PartitionName=hpc Nodes=hpc-pg0-[1-4] Default=YES DefMemPerCPU=3624 MaxTime=INFINITE State=UP
Nodename=hpc-pg0-[1-4] Feature=cloud STATE=CLOUD CPUs=60 CoresPerSocket=1 RealMemory=217440
# Note: CycleCloud reported a RealMemory of 228884 but we reduced it by 11444 (i.e. max(1gb, 5%)) to account for OS/VM overhead which
# would result in the nodes being rejected by Slurm if they report a number less than defined here.
# To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
PartitionName=htc Nodes=htc-[1-5] Default=NO DefMemPerCPU=3624 MaxTime=INFINITE State=UP
Nodename=htc-[1-5] Feature=cloud STATE=CLOUD CPUs=60 CoresPerSocket=1 RealMemory=217440
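
For reference, the dampen_memory knob mentioned in the generated comments above is set in the nodearray's Configuration section of the CycleCloud cluster template; a hypothetical example (the value 10 is illustrative, not one used in this cluster):

[[nodearray htc]]
    [[[configuration]]]
    slurm.dampen_memory = 10   # reserve 10% of reported RealMemory instead of the default max(1 GB, 5%)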