Thanks for the report; this looks like a bug in our integration with SLURM (or possibly a bug in SLURM itself, still digging). The SQS messages from the compute nodes starting up all show up about 2 minutes after scale-up, which is what you would expect for an instance to boot. The process that sits on the SQS queue for new-instance events then takes 5 minutes to process each message. It looks like it's running slow because restarting slurm on the compute node is timing out. If I log into a compute node on my cluster and run /etc/init.d/slurm restart, it takes multiple minutes to fail. When we fix that, the scale-up times should fix themselves (and get back to the normal slow ramp-up/ramp-down problems, which usually at least don't cycle, but just go slower than we might like).
Example failure:
root@ip-172-31-60-170:/etc/init.d# time /etc/init.d/slurm restart
[....] Restarting slurm (via systemctl): slurm.serviceJob for slurm.service failed because a timeout was exceeded. See "systemctl status slurm.service" and "journalctl -xe" for details.
failed!
real 5m0.162s
user 0m0.000s
sys 0m0.004s
It looks like all systemd-based platforms are impacted (CentOS6 and alinux, which don't use systemd, are not). The problem is how the SysV auto-conversion works in systemd: it's pulling the last pidfile variable it sees in /etc/init.d/slurm, which is the slurmctld.pid file. That works fine on the master node, since it only runs slurmctld, but not on the compute nodes, which only run slurmd and create a slurmd.pid file. It looks like the best solution is to stop using the SysV compatibility mode, but that means updating the sqswatcher plugin to run the right command when restarting a compute node's slurm daemon. I think that's ok, because we require compute and master to run the same OS.
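For anyone who wants to check their own cluster, the problem looks roughly like this (an illustrative fragment, not the actual init script, and the pidfile paths are what I'd expect rather than something I've copied out of it):
# A combined SysV script like /etc/init.d/slurm typically declares one pidfile
# per daemon, and the auto-conversion keeps only the last one it sees, e.g.:
#   pidfile: /var/run/slurmd.pid      <- what a compute node actually writes
#   pidfile: /var/run/slurmctld.pid   <- what the converted unit ends up watching
# On a compute node you can see which pidfile the converted unit picked with:
systemctl show slurm.service -p PIDFile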
@mkuchnik, I can't find any rational reason that we're restarting the compute node's slurmd when a node gets added. While we're working on a couple of fixes for systemd issues as part of 1.3.3, a quick fix is to edit the sqswatcher slurm plugin on the master node (/usr/local/lib/python2.7/dist-packages/sqswatcher/plugins/slurm.py) and remove lines 59 - 62. That removes the ssh restart whose 5 minute timeout is causing the slow cluster start. You'll then want to run killall sqswatcher to load the new version.
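If it helps, something along these lines should apply that quick fix on the master node (the line numbers only apply to the version of cfncluster I'm looking at, so double-check them before editing):
# keep a backup of the plugin, then comment out the restart block mentioned above
sudo cp /usr/local/lib/python2.7/dist-packages/sqswatcher/plugins/slurm.py{,.bak}
sudo sed -i '59,62 s/^/#/' /usr/local/lib/python2.7/dist-packages/sqswatcher/plugins/slurm.py
# sqswatcher gets restarted automatically (with the edited plugin) once killed
sudo killall sqswatcher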
@bwbarrett For my clarification, are these the steps to fix the slow cluster start problem if I have to use slurm: edit /usr/local/lib/python2.7/dist-packages/sqswatcher/plugins/slurm.py (removing lines 59-62) and restart sqswatcher?
Sorry, I wasn't clear. There are three options:
1. Use an OS that doesn't use systemd (CentOS6 or alinux), which isn't affected by this bug.
2. Use a different scheduler instead of SLURM.
3. Apply the quick fix above: edit the sqswatcher slurm plugin on the master node and restart sqswatcher.
Any of the three will solve the problem (or, since the second isn't an option, either 1 or 3 in your case).
The root issue is that we always install SLURM's SysV init script, and systemd's SysV compatibility layer converts it incorrectly, so it can't restart the daemons properly and instead times out after 5 minutes. This will happen on any systemd distro (ubuntu1604, CentOS7), but not on SysV or upstart distros (CentOS6, alinux). We'll fix that in CfnCluster 1.3.3, but I can't give a timetable for the release yet.
Thanks @bwbarrett. In my case, using centos or alinux is also not an option. I have my AMIs created for ubuntu. I will try 3.
If I create an ami with this change, will that work automatically without restarting sqswatcher?
Basic question: How do I restart sqswatcher?
Santi
If you make the change in the ami, then you should be good to go (I'm pretty sure, anyway). Easiest way to restart sqswatcher is just "killall sqswatcher" as root; there's a nanny that will restart it when it is killed.
@bwbarrett I tried commenting lines 59-62 and restarting sqswatcher. Nodes are added right away. But the ones that are added after the restart are in an unknown state. Is this expected?
Lines commented:
Sinfo output:
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute      up   infinite      3    unk ip-172-xx,ip-172-xx,ip-172-xx
compute*     up   infinite      1   idle ip-172-xx
Odd; that wasn't happening to me (they entered the "up" state and I could submit jobs). It's possible that there's a race I'm just lucking out on, which is why the restart was there in the first place. In that case, there aren't a lot of good solutions. We need to remove the SysV init script and install systemd unit files instead, but I think you'll end up fighting with Chef recipes if you do that by hand.
I don't want to fight with Chef recipes :). I am building a centos6 AMI. I will update the thread with my findings.
@bwbarrett Building a custom cfncluster centos6 AMI using instructions here worked for me. Addition of instances to slurm was almost immediate.
If your AMI depends on recent versions of gcc/g++ be prepared for several hours of installation and > 20GB EBS storage.
@bwbarrett I tried centos6 but there are too many libraries that are not easy to install on centos6 compared to ubuntu1604. I would like to use ubuntu1604 instead. Is there a timeline on CfnCluster 1.3.3, where SLURM's SysV init script issue is resolved?
@adavanisanti, unfortunately, we don't have a timeline for when we'll have a fix. I'm hoping to have a fix for this issue in git this week, but there are other issues we need to address before releasing the next version of CfnCluster.
@bwbarrett Just wondering if the slurm issue is resolved. Is there anything I need to do other than upgrading my cfncluster on my notebook to latest git version?
Thanks Santi
@adavanisanti This fix will be available in the next release of cfncluster. Unfortunately there is no timeline for when that will be released, but we will keep you informed when there is.
Thanks! Jordan
Thanks @jocherry
@bwbarrett @adavanisanti
You can either change line 59 of /usr/local/lib/python2.7/dist-packages/sqswatcher/plugins/slurm.py on the master node to read:
command = 'sudo sh -c "cd /etc/init.d; ./slurm restart 2>&1 > /tmp/slurmdstart.log"'
(don't comment out any lines)
OR
on each compute node, change line 20 of /etc/init.d/functions to: _use_systemctl=0
With either of these, all the compute nodes come up straight away, sit in the idle state, and jobs can be submitted.
The first option is safer; the second will undoubtedly cause problems. Plus, I've tested that you can bake the first change into the AMI and it works every time.
This works because there is a case statement in /etc/init.d/functions that checks whether the command being run (i.e. $0) matches "/etc/init.d/*"; if it does, the script hands off to systemctl, otherwise it falls through and starts slurmd directly.
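For context, the check looks roughly like this (paraphrased from memory, not the exact CentOS 7 code); the first option dodges it because running ./slurm from inside /etc/init.d means $0 doesn't match the /etc/init.d/* pattern, so the script falls through to the SysV path and manages slurmd itself:
# rough sketch of the redirection logic in /etc/init.d/functions on a systemd distro
if [ -d /run/systemd/system ]; then
    case "$0" in
        /etc/init.d/*|/etc/rc.d/init.d/*)
            _use_systemctl=1
            ;;
    esac
fi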
Obviously, this is just a workaround.
@brundle56uk Thanks for the workaround. I will try this out. Which OS version(s) did you test this on?
@adavanisanti I tested this on CentOS 7. Specifically, I built a cluster on the following custom AMI:
ami-9cb9aef8
This is the CentOS 7 AMI for eu-west-2; full list here:
https://github.com/awslabs/cfncluster-cookbook/pull/52 and https://github.com/awslabs/cfncluster-node/pull/16 should take care of this issue.
Closing
Launching many nodes (e.g. 15 C4.8xlarge instances) results in nodes taking a very long time to register with Slurm, and ultimately results in downscaling. Inspecting sinfo, about 1 node registers with Slurm every 5-10 minutes. Further, the logic for scaling down the node count checks whether nodes were running jobs in the past hour (outlined here: http://cfncluster.readthedocs.io/en/latest/processes.html#sqswatcher). High-node-count jobs cannot run since there aren't enough nodes in a ready state, so the cluster scales down even though jobs are pending.
There may be a problem with how nodes are registered with Slurm, as registration takes too long. The autoscaling logic also needs to be revised, since the policies for scaling up and scaling down can produce cyclic scaling behavior: nodes may not register with the scheduler (not necessarily Slurm) in time, triggering a scale-down event, and since jobs cannot run and are therefore left pending, scale-up events will follow.
One possible fix for the downscaling behavior would be to only allow downscaling if no jobs are pending. Downscaling could also be allowed while jobs are pending if none of them can ever run because they require more nodes than autoscaling allows (e.g. the only job in the queue needs 10 nodes, but the max node count is 5). In other words, allow downscaling when:
!jobs_pending || (max_allowed_node_count < min_required_nodes(jobs))