datacratic / StarCluster

StarCluster is a utility for creating and managing computing clusters hosted on Amazon's Elastic Compute Cloud (EC2).
http://star.mit.edu/cluster
GNU Lesser General Public License v3.0

clean_cluster does not delete jobs on failed nodes with dns suffix #68

Closed vasisht closed 7 years ago

vasisht commented 7 years ago

clean_cluster currently uses qhost to find nodes that are present in SGE but not in the cluster. Once these nodes are found, it attempts to delete any jobs on them, using the output of qstat -u "*" -xml, before removing them from SGE.
qhost displays only the short hostname (node001), whereas qstat displays the FQDN if one is present (node001.blah.com). This creates a problem when --dns-suffix is used with the cluster: the queue names are all.q@node001.blah.com instead of all.q@node001. Any stuck jobs on these nodes are therefore not deleted, and the node cannot be removed via _remove_from_sge.

The workaround is to parse the output of qhost -xml, which gives the FQDN that matches qstat.
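
To illustrate the idea, here is a minimal sketch, not the code from my branch. It assumes that qhost -xml emits one <host name="..."> element per host plus a "global" pseudo-host, and that cluster aliases are short names like node001. The sample XML and all function names below are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like `qhost -xml` output (an assumption for this
# sketch); real output carries many more <hostvalue> entries per host.
SAMPLE_QHOST_XML = """<?xml version='1.0'?>
<qhost>
  <host name='global'>
    <hostvalue name='arch_string'>-</hostvalue>
  </host>
  <host name='node001.blah.com'>
    <hostvalue name='arch_string'>lx-amd64</hostvalue>
  </host>
  <host name='node002.blah.com'>
    <hostvalue name='arch_string'>lx-amd64</hostvalue>
  </host>
</qhost>
"""


def sge_hosts_from_qhost_xml(xml_text):
    """Return host names exactly as SGE reports them, skipping 'global'."""
    root = ET.fromstring(xml_text)
    return [h.get("name") for h in root.iter("host")
            if h.get("name") != "global"]


def stale_hosts(sge_hosts, cluster_aliases):
    """Return SGE hosts whose short name is not a live cluster alias."""
    live = set(cluster_aliases)
    return [h for h in sge_hosts if h.split(".", 1)[0] not in live]


def queue_is_on_host(queue_name, host):
    """Check whether a qstat queue name (queue@host) refers to the given host."""
    return queue_name.split("@", 1)[-1] == host


if __name__ == "__main__":
    hosts = sge_hosts_from_qhost_xml(SAMPLE_QHOST_XML)
    # Pretend only master and node002 are still part of the cluster.
    for host in stale_hosts(hosts, ["master", "node002"]):
        print(host, queue_is_on_host("all.q@node001.blah.com", host))
```

Because the stale host name now carries the DNS suffix, it matches the queue name reported by qstat. Comparing against the bare node001 would not match, which is the failure described above.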

@FinchPowers I'm not sure if anyone else has this issue. I've tested a fix in my branch and am happy to submit a PR if this is useful.

FinchPowers commented 7 years ago

I can't tell if it affects anyone else at the moment, but if there is a fix then you should submit it. :)

vasisht commented 7 years ago

Sure, here it is https://github.com/datacratic/StarCluster/pull/69