Thanks for the report; this looks like a bug in our integration with SLURM (or possibly a bug in SLURM itself, still digging). The SQS messages from the compute nodes starting up all show up about 2 minutes after scale-up, which is what you would expect for an instance to boot. The process that sits on the SQS queue for new-instance events then takes 5 minutes to process each message. It looks like it's running slow because restarting slurm on the compute node is timing out. If I log into a compute node on my cluster and run /etc/init.d/slurm restart, it takes multiple minutes to fail. When we fix that, the scale-up times should fix themselves (and get back to the normal slow ramp-up/ramp-down problems, which usually at least don't cycle, but just go slower than we might like).
Example failure:
root@ip-172-31-60-170:/etc/init.d# time /etc/init.d/slurm restart
[....] Restarting slurm (via systemctl): slurm.serviceJob for slurm.service failed because a timeout was exceeded. See "systemctl status slurm.service" and "journalctl -xe" for details.
failed!
real 5m0.162s
user 0m0.000s
sys 0m0.004s
It looks like all systemd-based platforms are impacted (CentOS6 and alinux, which don't use systemd, are not). The problem is how the SysV auto-conversion works in systemd: it's pulling the last pidfile variable it sees in /etc/init.d/slurm, which is the slurmctld.pid file. That works fine on the master node, since it only runs slurmctld, but not on the compute nodes, which only run slurmd and create a slurmd.pid file. It looks like the best solution is to stop using the SysV compatibility mode, but that means updating the sqswatcher plugin to run the right command when restarting a compute node's slurm daemon. I think that's ok, because we require compute and master to run the same OS.
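For anyone who wants to check their own cluster, the problem looks roughly like this (an illustrative fragment, not the actual init script, and the pidfile paths are what I'd expect rather than something I've copied out of it):
# A combined SysV script like /etc/init.d/slurm typically declares one pidfile
# per daemon, and the auto-conversion keeps only the last one it sees, e.g.:
#   pidfile: /var/run/slurmd.pid      <- what a compute node actually writes
#   pidfile: /var/run/slurmctld.pid   <- what the converted unit ends up watching
# On a compute node you can see which pidfile the converted unit picked with:
systemctl show slurm.service -p PIDFile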
@mkuchnik, I can't find any rational reason that we're restarting the compute node's slurmd when a node gets added. While we're working on a couple of fixes for systemd issues as part of 1.3.3, a quick fix is to edit the sqswatcher slurm plugin on the master node (/usr/local/lib/python2.7/dist-packages/sqswatcher/plugins/slurm.py) and remove lines 59 - 62. That removes the ssh restart whose 5 minute timeout is causing the slow cluster start. You'll then want to run killall sqswatcher to load the new version.
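If it helps, something along these lines should apply that quick fix on the master node (the line numbers only apply to the version of cfncluster I'm looking at, so double-check them before editing):
# keep a backup of the plugin, then comment out the restart block mentioned above
sudo cp /usr/local/lib/python2.7/dist-packages/sqswatcher/plugins/slurm.py{,.bak}
sudo sed -i '59,62 s/^/#/' /usr/local/lib/python2.7/dist-packages/sqswatcher/plugins/slurm.py
# sqswatcher gets restarted automatically (with the edited plugin) once killed
sudo killall sqswatcher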
@bwbarrett For my clarification, are these the steps to fix the slow cluster start problem if I have to use slurm: edit /usr/local/lib/python2.7/dist-packages/sqswatcher/plugins/slurm.py (removing lines 59-62) and restart sqswatcher?
Sorry, I wasn't clear. There are three options:
1. Use an OS that doesn't use systemd (CentOS6 or alinux), which isn't affected by this bug.
2. Use a different scheduler instead of SLURM.
3. Apply the quick fix above: edit the sqswatcher slurm plugin on the master node and restart sqswatcher.
Any of the three will solve the problem (or, since the second isn't an option, either 1 or 3 in your case).
The root issue is that we always install SLURM's SysV init script, and systemd's SysV compatibility layer converts it incorrectly, so it can't restart the daemons properly and instead times out after 5 minutes. This will happen on any systemd distro (ubuntu1604, CentOS7), but not on SysV or upstart distros (CentOS6, alinux). We'll fix that in CfnCluster 1.3.3, but I can't give a timetable for the release yet.
Thanks @bwbarrett. In my case, using centos or alinux is also not an option. I have my AMIs created for ubuntu. I will try 3.
If I create an ami with this change, will that work automatically without restarting sqswatcher?
Basic question: How do I restart sqswatcher?
Santi
If you make the change in the ami, then you should be good to go (I'm pretty sure, anyway). Easiest way to restart sqswatcher is just "killall sqswatcher" as root; there's a nanny that will restart it when it is killed.
@bwbarrett I tried commenting lines 59-62 and restarting sqswatcher. Nodes are added right away. But the ones that are added after the restart are in an unknown state. Is this expected?
Lines commented:
Sinfo output:
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute      up   infinite      3    unk ip-172-xx,ip-172-xx,ip-172-xx
compute*     up   infinite      1   idle ip-172-xx
Odd; that wasn't happening to me (they entered the "up" state and I could submit jobs). It's possible that there's a race I'm just lucking out on, which is why the restart was there in the first place. In that case, there aren't a lot of good solutions. We need to remove the SysV init script and install systemd unit files instead, but I think you'll end up fighting with Chef recipes if you do that by hand.
I don't want to fight with Chef recipes :). I am building a centos6 AMI. I will update the thread with my findings.
@bwbarrett Building a custom cfncluster centos6 AMI using instructions here worked for me. Addition of instances to slurm was almost immediate.
If your AMI depends on recent versions of gcc/g++ be prepared for several hours of installation and > 20GB EBS storage.
@bwbarrett I tried centos6 but there are too many libraries that are not easy to install on centos6 compared to ubuntu1604. I would like to use ubuntu1604 instead. Is there a timeline on CfnCluster 1.3.3, where SLURM's SysV init script issue is resolved?
@adavanisanti, unfortunately, we don't have a timeline for when we'll have a fix. I'm hoping to have a fix for this issue in git this week, but there are other issues we need to address before releasing the next version of CfnCluster.
@bwbarrett Just wondering if the slurm issue is resolved. Is there anything I need to do other than upgrading my cfncluster on my notebook to latest git version?
Thanks Santi
@adavanisanti This fix will be available in the next release of cfncluster. Unfortunately there is no timeline for when that will be released, but we will keep you informed when there is.
Thanks! Jordan
Thanks @jocherry
@bwbarrett @adavanisanti
You can either change line 59 of /usr/local/lib/python2.7/dist-packages/sqswatcher/plugins/slurm.py on the master node to read:
command = 'sudo sh -c "cd /etc/init.d; ./slurm restart 2>&1 > /tmp/slurmdstart.log"'
(don't comment out any lines)
OR
on each compute node, change line 20 of /etc/init.d/functions to: _use_systemctl=0
With either of these, all the compute nodes come up straight away, sit in the idle state, and jobs can be submitted.
The first option is safer; the second will undoubtedly cause problems. Plus, I've tested that you can bake the first change into the AMI and it works every time.
This works because there is a case statement in /etc/init.d/functions that checks whether the command being run (i.e. $0) matches "/etc/init.d/*"; if it does, the script hands off to systemctl, otherwise it falls through and starts slurmd directly.
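For context, the check looks roughly like this (paraphrased from memory, not the exact CentOS 7 code); the first option dodges it because running ./slurm from inside /etc/init.d means $0 doesn't match the /etc/init.d/* pattern, so the script falls through to the SysV path and manages slurmd itself:
# rough sketch of the redirection logic in /etc/init.d/functions on a systemd distro
if [ -d /run/systemd/system ]; then
    case "$0" in
        /etc/init.d/*|/etc/rc.d/init.d/*)
            _use_systemctl=1
            ;;
    esac
fi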
Obviously, this is just a workaround.
@brundle56uk Thanks for the workaround. I will try this out. Which OS version(s) did you test this on?
@adavanisanti I tested this on CentOS 7. Specifically, I built a cluster on the following custom AMI:
ami-9cb9aef8
This is the CentOS 7 AMI for eu-west-2; full list here:
https://github.com/awslabs/cfncluster-cookbook/pull/52 and https://github.com/awslabs/cfncluster-node/pull/16 should take care of this issue.
Closing
Launching many nodes (e.g. 15 C4.8xlarge instances) results in nodes taking a very long time to register with Slurm, and ultimately results in downscaling. Inspecting sinfo, about 1 node registers with Slurm every 5-10 minutes. Further, the logic for scaling down the node count checks whether nodes were running jobs in the past hour (outlined here: http://cfncluster.readthedocs.io/en/latest/processes.html#sqswatcher). High-node-count jobs cannot run since there aren't enough nodes in a ready state, so the cluster scales down even though jobs are pending.
There may be a problem with how nodes are registered with Slurm, as registration takes too long. The autoscaling logic also needs to be revised, since the policies for scaling up and scaling down can produce cyclic scaling behavior: nodes may not register with the scheduler (not necessarily Slurm) in time, triggering a scale-down event, and since jobs cannot run and are therefore left pending, scale-up events will follow.
One possible fix for the downscaling behavior would be to only allow downscaling if no jobs are pending. Downscaling could also be allowed while jobs are pending if none of them can ever run because they require more nodes than autoscaling allows (e.g. the only job in the queue needs 10 nodes, but the max node count is 5). In other words, allow downscaling when:
!jobs_pending || (max_allowed_node_count < min_required_nodes(jobs))