Why is this issue closed? I don't see any resolution in this one, nor in the other referenced issue.

We are running into this same issue intermittently, with basically the same setup as above except for pcluster version 2.4.1.

In our case, the cluster works fine for a while, then suddenly starts hitting the same issue above when running `qconf -aattr hostgroup hostlist` from the master.
I did try removing `/etc/init.d/sgeexecd.p6444` and rerunning `cd /opt/sge && /opt/sge/inst_sge -noremote -x -auto /opt/parallelcluster/templates/sge/sge_inst.conf`, which did do something:
```
$ cat /opt/sge/default/common/install_logs/execd_install_ip-172-31-105-56_2019-08-23_14:19:49.log
Your $SGE_ROOT directory: /opt/sge
Using cell: >default<
Using local execd spool directory [/var/spool/sge]
Creating local configuration for host >ip-172-31-105-56.us-west-2.compute.internal<
sgeadmin@ip-172-31-105-56.us-west-2.compute.internal modified "ip-172-31-105-56.us-west-2.compute.internal" in configuration list
Local configuration for host >ip-172-31-105-56.us-west-2.compute.internal< created.
cp /opt/sge/default/common/sgeexecd /etc/init.d/sgeexecd.p6444
/usr/lib/lsb/install_initd /etc/init.d/sgeexecd.p6444
Starting Grid Engine execution daemon
Execd on host ip-172-31-105-56 is running!
```
Unfortunately this was not enough, as the master was still unable to add the node.

One interesting thing I found was that the `all.q` directory, which is normally present in `/opt/sge/default/spool/qmaster/qinstances/` on a healthy cluster, was missing on this bad cluster: `/opt/sge/default/spool/qmaster/qinstances/` was just empty.
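For anyone checking for the same state, a minimal sketch (the spool path is the default used by ParallelCluster's SGE install; the warning text is ours):

```bash
# Warn if the qmaster has lost its all.q queue-instance directory.
QINST=/opt/sge/default/spool/qmaster/qinstances
if [ ! -d "$QINST/all.q" ]; then
  echo "all.q missing from $QINST -- qconf cannot add hosts to it" >&2
fi
```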
@keien Try running the following commands:

```
sudo mkdir /opt/sge/default/spool/qmaster/qinstances/all.q
sudo chmod 755 /opt/sge/default/spool/qmaster/qinstances/all.q
sudo chown sgeadmin:sgeadmin /opt/sge/default/spool/qmaster/qinstances/all.q
```
@ganderaj that worked:

```
# sudo mkdir /opt/sge/default/spool/qmaster/qinstances/all.q
# sudo chmod 755 /opt/sge/default/spool/qmaster/qinstances/all.q
# sudo chown sgeadmin:sgeadmin /opt/sge/default/spool/qmaster/qinstances/all.q
# qconf -aattr hostgroup hostlist ip-172-31-105-56 @allhosts
root@ip-172-31-110-89.us-west-2.compute.internal modified "@allhosts" in host group list
```

With that, the cluster seems back to normal as well.
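To double-check the recovery, these standard SGE queries should show the host again (a sketch; the hostname is the one from this thread):

```bash
source /etc/profile.d/sge.sh
qconf -shgrp @allhosts   # the recovered node should appear in the hostlist
qhost                    # and report load/memory instead of '-'
```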
That being said, I would love to know why the directory disappeared in the first place, or more broadly what chain of events caused the cluster to end up in this situation.
@keien Are you using a custom DNS setup?
@sean-smith I don't know if this is exactly what you mean, but this has come up in other issues we've filed: we have a custom DHCP options set in our VPC, so EC2 metadata calls like the following return two values:

```
$ curl -s http://169.254.169.254/latest/meta-data/hostname
ip-172-31-103-142.us-west-2.compute.internal cerebras.aws
```
which I understand is currently incompatible with ParallelCluster. We have a few fixes in our pre-install script to work around this:
```bash
hostnamectl set-hostname `curl -s http://169.254.169.254/latest/meta-data/hostname | cut -f 1 -d' '`

# As of pcluster 2.3.1, we need to do a similar fixup in their compute_ready script which uses the
# above URL to get a hostname
sed -i 's/local_hostname=$(curl --retry 3 --retry-delay 0 --silent --fail ${local_hostname_url})/local_hostname=$(curl --retry 3 --retry-delay 0 --silent --fail ${local_hostname_url} | cut -f 1 -d" ")/' /opt/parallelcluster/scripts/compute_ready

# The addition to /etc/hosts below accomplishes the following:
# 1. It reduces reliance on DNS lookups (which sge appears to rely on)
# 2. It allows reverse lookups to work as sge expects.
# Note that the subnet here, 172.31.[96-111].XXX, is the subnet cfncluster uses when
# clusters are created.
if ! grep -q cfn-hostname-fixup.sh /etc/hosts; then
  echo -e "# entries below added with s3://cb-cfncluster/cfn-hostname-fixup.sh" >> /etc/hosts
  for i in `seq 96 111`; do
    for j in `seq 0 255`; do
      echo -e "172.31.$i.$j\tip-172-31-$i-$j.us-west-2.compute.internal ip-172-31-$i-$j"
    done
  done >> /etc/hosts
fi
```
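For reference, a script like this gets hooked in through the cluster config; a minimal sketch assuming the pcluster 2.x INI format (the bucket/script name is the one mentioned in the script above, the other values are illustrative):

```ini
# ~/.parallelcluster/config (sketch)
[cluster default]
scheduler = sge
# Runs on each node before ParallelCluster's own setup, so the hostname
# fixups are in place before sge_execd registers with the qmaster.
pre_install = s3://cb-cfncluster/cfn-hostname-fixup.sh
```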
Hi all,
we believe that manually running commands to remove a dead host from the SGE configuration is what broke the `all.q` of the initial post.
To recap, the following is the current behaviour (2.4.1) when compute nodes are forcibly shut down (e.g. Spot interruption):

Node with single-node running job/s: the job is automatically rescheduled if it was submitted as rerunnable (`qsub -r y`) or the queue is configured with the rerun flag enabled. The compute node is correctly removed from the scheduler queue. This behaviour is dictated by the following SGE configuration parameters (a sketch for inspecting them follows the list):

* `reschedule_unknown 00:00:30`
* `ENABLE_FORCED_QDEL_IF_UNKNOWN`
* `ENABLE_RESCHEDULE_KILL=1`
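These live in SGE's global configuration; a minimal sketch for inspecting them with standard SGE commands (the grep patterns are just illustrative):

```bash
source /etc/profile.d/sge.sh
qconf -sconf | grep reschedule_unknown   # global reschedule timeout
qconf -sconf | grep qmaster_params       # the ENABLE_* flags are set here
```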
Node with multi-node running job/s: the job is not automatically rescheduled and has to be deleted manually (`qdel <jobid>`). The node will still be displayed in the hosts list (`qhost`), although this will not affect the scaling of the cluster or any ParallelCluster feature. The `sqswatcher` daemon will retry to remove the host three times. After that, to remove the host from the list you can execute the following command after having replaced `<hostname>` accordingly:

```
sudo -- bash -c 'source /etc/profile.d/sge.sh; qconf -dattr hostgroup hostlist <hostname> @allhosts; qconf -de <hostname>'
```
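If several dead hosts have accumulated, the same removal can be looped over a list; a small sketch (the hostnames below are placeholders):

```bash
# Remove each dead node from the @allhosts group and the execution host list.
for h in ip-172-31-105-56 ip-172-31-105-57; do
  sudo -- bash -c "source /etc/profile.d/sge.sh; \
    qconf -dattr hostgroup hostlist $h @allhosts; \
    qconf -de $h"
done
```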
Since everyone was able to restore their cluster to a working state and the discussion has shifted from its original question, I'm going to mark this ticket for autoclose; feel free to open another ticket if you have any other issues.

Thanks, Luca
This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.
Environment:
Bug description and how to reproduce: Submit a job to the cluster requesting N cores. A corresponding number of compute nodes is brought up by the ASG. Upon investigating further, I found that SGE fails to register the hosts with the master queue. Please refer to the attached logs for further details:
NOTE
execd_install_ip-10-246-40-78_2019-07-31_23.15.10.log
jobwatcher.log
sqswatcher.log