aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

SGE job stuck in "qw" state while compute nodes "Live, Die & Repeat" #1247

Closed ganderaj closed 5 years ago

ganderaj commented 5 years ago

Environment:

Bug description and how to reproduce: Submit a job to the cluster requesting N cores. The ASG brings up the corresponding number of compute nodes, but SGE fails to register the hosts with the master queue (a rough reproduction is sketched after the list below). Please refer to the attached logs for further details:

  1. Problems executing the remote commands via sqswatcher (please refer to the attached logs).
  2. Some daemon fails to add the hosts to the @allhosts group with the error "error writing object "all.q/ip-x.x.x.x.yyyyyy.zzz.com" to spooling database" (during the first try).
  3. Error executing the remote command, with a log file attached (during the retry).
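
A rough sketch of the reproduction described above; the job script name, parallel environment name, and slot count are illustrative, not taken from the original report:

$ qsub -pe mpi 8 my_job.sh    # request 8 slots; the ASG launches matching compute nodes
$ qstat -u '*'                # the job remains in the "qw" state
$ qhost                       # the new nodes never appear because SGE failed to register them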

NOTE

  1. We recently experienced a scenario where a couple of Compute Nodes were still showing up in the qhost list, so I had to intervene manually and remove them from the queue by following the instructions on this thread (a sketch of that manual removal follows this list).
  2. We recently attached a Scheduled Action to the ASG for business reasons and then removed it. We have been facing this issue ever since.
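
The manual removal referenced above typically looks something like the following; the hostname is illustrative and the exact steps depend on the thread that was followed:

$ qconf -dattr hostgroup hostlist ip-10-246-40-78 @allhosts   # drop the host from @allhosts
$ qconf -purge queue slots all.q@ip-10-246-40-78              # clear queue-instance overrides for the host
$ qconf -de ip-10-246-40-78                                   # delete the execution host
$ qconf -ds ip-10-246-40-78                                   # delete the submit host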

Attachments: execd_install_ip-10-246-40-78_2019-07-31_23.15.10.log, jobwatcher.log, sqswatcher.log

keien commented 5 years ago

Why is this issue closed? I don't see any resolution in this one, nor the other referenced issue.

We are running into this same issue intermittently, with basically the same setup as above except for pcluster version 2.4.1

In our case, the cluster works fine for a while, then suddenly starts hitting the same issue described above when running qconf -aattr hostgroup hostlist from the master (the full form of the call is shown below).
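
The full form of that call, as it appears later in this thread (the hostname is a placeholder):

$ qconf -aattr hostgroup hostlist <compute-node-hostname> @allhosts   # adds the new compute node to the @allhosts group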

I did try removing /etc/init.d/sgeexecd.p6444 and rerunning cd /opt/sge && /opt/sge/inst_sge -noremote -x -auto /opt/parallelcluster/templates/sge/sge_inst.conf, which did do something:

$ cat /opt/sge/default/common/install_logs/execd_install_ip-172-31-105-56_2019-08-23_14:19:49.log

Your $SGE_ROOT directory: /opt/sge

Using cell: >default<

Using local execd spool directory [/var/spool/sge]

Creating local configuration for host >ip-172-31-105-56.us-west-2.compute.internal<
sgeadmin@ip-172-31-105-56.us-west-2.compute.internal modified "ip-172-31-105-56.us-west-2.compute.internal" in configuration list
Local configuration for host >ip-172-31-105-56.us-west-2.compute.internal< created.

cp /opt/sge/default/common/sgeexecd /etc/init.d/sgeexecd.p6444
/usr/lib/lsb/install_initd /etc/init.d/sgeexecd.p6444

   Starting Grid Engine execution daemon

Execd on host ip-172-31-105-56 is running!

Unfortunately, this was not enough, as the master was still unable to add the node.

One interesting thing I found was that the all.q directory, which is normally present under /opt/sge/default/spool/qmaster/qinstances/ on a healthy cluster, was missing on this broken cluster; /opt/sge/default/spool/qmaster/qinstances/ was simply empty.
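
A quick way to compare against a healthy cluster (paths taken from this thread):

$ ls -l /opt/sge/default/spool/qmaster/qinstances/   # healthy master: contains an all.q directory; broken master: empty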

ganderaj commented 5 years ago

@keien

Try running the following commands:

sudo mkdir /opt/sge/default/spool/qmaster/qinstances/all.q
sudo chmod 755 /opt/sge/default/spool/qmaster/qinstances/all.q
sudo chown sgeadmin:sgeadmin /opt/sge/default/spool/qmaster/qinstances/all.q

keien commented 5 years ago

@ganderaj that worked

# sudo mkdir /opt/sge/default/spool/qmaster/qinstances/all.q
# sudo chmod 755 /opt/sge/default/spool/qmaster/qinstances/all.q
# sudo chown sgeadmin:sgeadmin /opt/sge/default/spool/qmaster/qinstances/all.q
# qconf -aattr hostgroup hostlist ip-172-31-105-56 @allhosts
root@ip-172-31-110-89.us-west-2.compute.internal modified "@allhosts" in host group list

With that, the cluster seems to be back to normal as well.

That being said, I would love to know why the directory disappeared in the first place, or more broadly, what chain of events caused the cluster to end up in this situation.

sean-smith commented 5 years ago

@keien Are you using a custom DNS setup?

keien commented 5 years ago

@sean-smith I don't know if this is exactly what you mean, but this has come up in other issues we've filed: we have DHCP in our VPC, so EC2 metadata calls like the following return two values:

$ curl -s http://169.254.169.254/latest/meta-data/hostname
ip-172-31-103-142.us-west-2.compute.internal cerebras.aws

which I understand is currently incompatible with parallelcluster
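
A quick way to check whether a host is affected (these commands are purely illustrative and are not part of the workaround script below):

$ curl -s http://169.254.169.254/latest/meta-data/hostname | awk '{print NF}'   # prints 2 on affected hosts, 1 otherwise
$ hostname -f                                                                   # should print a single FQDN once the fixups below are applied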

we have a few fixes in the pre-install script to work around this:

hostnamectl set-hostname `curl -s http://169.254.169.254/latest/meta-data/hostname | cut -f 1 -d' '`

# As of pcluster 2.3.1, we need to do a similar fixup in their compute_ready script which uses the
# above URL to get a hostname
sed -i 's/local_hostname=$(curl --retry 3 --retry-delay 0 --silent --fail ${local_hostname_url})/local_hostname=$(curl --retry 3 --retry-delay 0 --silent --fail ${local_hostname_url} | cut -f 1 -d" ")/' /opt/parallelcluster/scripts/compute_ready

# The addition to /etc/hosts below accomplishes the following
# 1. It reduces reliance on DNS lookups to work (which sge appears to rely on)
# 2. It allows reverse lookups to work as sge expects.
# Note that the subnet here: 172.31.[112-127].XXX is the subnet cfncluster uses when
# clusters are created.
if ! grep -q cfn-hostname-fixup.sh /etc/hosts; then
echo -e "# entries below added with s3://cb-cfncluster/cfn-hostname-fixup.sh" >> /etc/hosts
for i in `seq 96 111`; do
  for j in `seq 0 255`; do
    echo -e "172.31.$i.$j\tip-172-31-$i-$j.us-west-2.compute.internal ip-172-31-$i-$j";
  done;
done >> /etc/hosts;
fi
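
A quick spot check that forward and reverse lookups now resolve the way SGE expects (the address below is illustrative):

$ getent hosts 172.31.96.5    # should print the /etc/hosts entry: 172.31.96.5  ip-172-31-96-5.us-west-2.compute.internal ip-172-31-96-5
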
lukeseawalker commented 5 years ago

Hi all, we believe that manually running commands to remove a dead host from the SGE configuration is what broke the all.q described in the initial post.

To recap, the following is the current behaviour (2.4.1) when managing a forced shutdown of compute nodes (e.g. Spot interruptions):

Node with a single-node job running:

Node with a multi-node job running:

Since everyone was able to restore the cluster to a working state and the discussion has shifted away from its original question, I'm going to mark this ticket for autoclose. Feel free to open another ticket if you run into any other issue.

Thanks Luca

no-response[bot] commented 5 years ago

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.