clean_cluster currently uses qhost to find nodes that are present in SGE but not in the cluster. Once these nodes are found, it uses the output of qstat -u "*" -xml to delete any jobs on them before removing them from SGE.
qhost displays only the hostname (node001), whereas qstat displays the FQDN if present (node001.blah.com). This creates a problem when --dns-suffix is used with the cluster: the queue names reported by qstat are all.q@node001.blah.com instead of all.q@node001, so they never match the hostnames taken from qhost. As a result, any stuck jobs on these nodes do not get deleted, and the node cannot be removed via _remove_from_sge.
The workaround is to parse the output of qhost -xml, which gives the FQDN that matches qstat.
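For illustration, here is a minimal sketch of that parsing step. It is not the code from my branch. It assumes qhost -xml emits one `<host name="...">` element per node plus a `global` pseudo-host; the sample XML and the helper names `parse_qhost_xml` and `queue_instance` are hypothetical, so check them against real output from your SGE version.

```python
import xml.etree.ElementTree as ET

# Assumed shape of `qhost -xml` output (trimmed); verify against your SGE install.
SAMPLE_QHOST_XML = """<?xml version='1.0'?>
<qhost>
 <host name='global'>
   <hostvalue name='arch_string'>-</hostvalue>
 </host>
 <host name='node001.blah.com'>
   <hostvalue name='arch_string'>lx-amd64</hostvalue>
 </host>
</qhost>
"""


def parse_qhost_xml(xml_text):
    """Return the host names listed in qhost -xml output, skipping 'global'."""
    root = ET.fromstring(xml_text)
    return [
        host.get("name")
        for host in root.iter("host")
        if host.get("name") != "global"
    ]


def queue_instance(host, queue="all.q"):
    """Build the queue instance name as it would appear in qstat output."""
    return "%s@%s" % (queue, host)


if __name__ == "__main__":
    for name in parse_qhost_xml(SAMPLE_QHOST_XML):
        print(queue_instance(name))
```

Because the names come back with the DNS suffix attached, they can be compared directly against the queue names from qstat -u "*" -xml without stripping or appending the suffix.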
@FinchPowers I'm not sure if anyone else has this issue. I've tested a fix in my branch and am happy to submit a PR if this is useful.