bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

torque is hanging indefinitely #416

Closed caddymob closed 10 years ago

caddymob commented 10 years ago

Hey there -

In trying the updates for #386 we have killed our development install with 756be0ac - any job we try to run, be it human, rat, mouse, or the broken dogs, hangs indefinitely with torque. The nodes get checked out and the engine and clients look to be running via qstat or showq - however nothing is happening on the nodes when I look at top or ps aux. There are plenty of free nodes, so this doesn't seem to be a queue issue. The jobs all hang until they hit the timeout and that's all I get. I don't see anything in the logs/ipython logs - engines appear to have started successfully... I've rubbed my eyes and wiped my work dirs a few times to no avail. I checked and indeed running -t local works... Any suggestions, or additional info I can provide?

Thanks!

mjafin commented 10 years ago

@jpeden1 In your previous post you mention 172.17.1.33 is the IP of eth5 on the machine that doesn't work, is that right? In the above json dump the IP is 172.19.1.134 - is this an IP for something else?

jpeden1 commented 10 years ago

So the submit node where we start bcbio is on a 172.17.X.X network. When the newer version of ipython starts, it is somehow selecting compute nodes that are on a 172.19.X.X network. That is an InfiniBand network, and our submit node does not have access to it. The older version of ipython chooses the 172.17.X.X network correctly to start jobs on. What it appears we need is a way to tell bcbio to ONLY choose compute nodes that are on 172.17.X.X. How would we do this?

jpeden1 commented 10 years ago

The other finding is that all of our 172.19.X.X compute nodes also have a 172.17.X.X address. So if I get on a machine that has access to the 172.19.X.X network and get the hostname of that machine, I can connect to it by hostname from the submit node that is hanging. In other words, if a machine has an address of 172.19.1.134, it also has a 172.17.1.134 address that is associated with a hostname in DNS. If bcbio called the compute nodes by hostname and not by IP, bcbio would not hang.
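For illustration, the name/address relationship in miniature (a sketch in Python; the DNS behaviour is the one described above, and the addresses are the ones from this thread):

import socket

# Each compute node has two addresses; only the Ethernet one is
# registered in DNS and reachable from the submit node (assumed setup).
eth_addr = "172.17.1.134"  # Ethernet address, in DNS
ib_addr = "172.19.1.134"   # InfiniBand address, unreachable from the submit node

# Reverse lookup on the Ethernet address gives the node's hostname...
hostname = socket.gethostbyaddr(eth_addr)[0]

# ...and a forward lookup on that hostname resolves back to the
# reachable 172.17.X.X address, so connecting by name would work.
print(hostname, socket.gethostbyname(hostname))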

mjafin commented 10 years ago

Seems like our problems are related, both stemming from ipython 2.x being more liberal in how it chooses IPs from the pool.

chapmanb commented 10 years ago

Jim and Miika; Thanks again for all the debugging. It sounds like we're at the root cause of the issue. To summarize, IPython 2.x uses a more thorough approach to discovering IP addresses and doesn't always choose the right one in complex cases. For Miika, this was due to VM-based interfaces. For Jim, it looks like it's due to interfaces being bound to 2 IPs with only one working globally on the network.

Jim, when you ran the interface debugging command earlier, did you run it on the problem submit node (172.19.1.134) or on a different machine? If not on the problem submit node, my guess is that:

./anaconda/bin/python -c 'import netifaces; print [(iface, [x["addr"] for x in netifaces.ifaddresses(iface).get(netifaces.AF_INET, [])]) for iface in netifaces.interfaces()]'

would give you something like:

[..., ('eth5', ['172.17.1.134', '172.19.1.134'])]

If that's right, then I think the right fix is to pick the first valid non-local address found for each interface. I pushed a new version which does this; you can upgrade with:

./bcbio/anaconda/bin/pip install --upgrade ipython-cluster-helper

You should get 0.2.22. Fingers crossed that will work. I don't think the other workarounds generalize, since IPs are more likely to work than assuming clusters have correct DNS resolution everywhere, and I don't know another way to express that it should prefer the 172.17.xxx IPs over 172.19.xxx. From your side, it might also be worth making both network ranges fully visible over the network, since IPython might not be the only software that gets tripped up by this.
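For reference, the heuristic amounts to something like this (a sketch, not the actual ipython-cluster-helper 0.2.22 code):

import netifaces

def first_nonlocal_per_interface():
    # For each interface, keep the first non-loopback IPv4 address
    # rather than the last one bound, which IPython 2.x could end up using.
    picks = {}
    for iface in netifaces.interfaces():
        for entry in netifaces.ifaddresses(iface).get(netifaces.AF_INET, []):
            addr = entry["addr"]
            if not addr.startswith("127."):
                picks[iface] = addr  # first valid address wins
                break
    return picks

# On the eth5 example above this picks 172.17.1.134, not 172.19.1.134.
print(first_nonlocal_per_interface())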

Hope the new version of ipython-cluster-helper fixes it. Thanks again for all the help debugging.

jpeden1 commented 10 years ago

Brad,

I did run the debugging on the problem submit node:

[('lo', ['127.0.0.1']), ('eth0', ['10.48.66.33']), ('eth1', []), ('eth2', []), ('eth3', []), ('eth4', []), ('eth5', ['172.17.1.33'])]

The problem submit node (172.17.1.33) cannot talk to anything on 172.19.X.X. That is an InfiniBand network and the submit node does not have an InfiniBand NIC. I don't understand how the problem submit node is even getting a 172.19.X.X address?

I'll do the upgrade and let you know the result.

Thanks

mjafin commented 10 years ago

I might be adding to the confusion, but in my case it was the compute node that was causing the problem. The json files list compute node IPs, if I'm not mistaken (I might be!). If the compute node reports 172.19.x.x then obviously the submit node wouldn't be able to see it, right?

What do the compute nodes report for the eth interfaces?

jpeden1 commented 10 years ago

@mjafin You are correct that the json file lists compute nodes (see above). In our case the json file is showing the location as 172.19.1.134. That compute node has two interfaces; the other interface has an IP of 172.17.1.134.

jpeden1 commented 10 years ago

@chapmanb I did the upgrade for ipython-cluster-helper and reran. Same problem: bcbio hangs and the json files have compute nodes on the 172.19.X.X network. Is there a way to have it only select compute nodes on the 172.17.X.X network, or to have them called by name instead of by IP? Where is bcbio selecting compute nodes?

chapmanb commented 10 years ago

Jim; Thanks for testing and sorry that we're still running into issues. It appears the problem is with resolving names on the compute node that the ipcontroller gets assigned to. The submit node may be a red herring -- or perhaps the problem submit node specifically schedules to different nodes than the working submit node does. In response to your question, the cluster scheduler chooses this, not bcbio. You can look at the torque_controller* file that gets created to see if it has any clues about why things are scheduled in certain places.

My suggestion to debug would be to start a cluster, note the compute node that it gets assigned to, which is likely the 172.17.1.134 machine, and then ssh in there and run the interfaces command to give us a better idea of what is happening on that machine:

./anaconda/bin/python -c 'import netifaces; print [(iface, [x["addr"] for x in netifaces.ifaddresses(iface).get(netifaces.AF_INET, [])]) for iface in netifaces.interfaces()]'

Hopefully that will provide more insight. Sorry for any confusion from my side; I'm not totally sure about your setup so am making best guesses here but hope this helps.

jpeden1 commented 10 years ago

Brad,

As you requested, I ssh'd to the compute node (172.17.1.134) and ran your debug code:

[('lo', ['127.0.0.1']), ('eth0', ['172.17.1.134']), ('eth1', []), ('ib0', ['172.19.1.134']), ('ib1', [])]

Also, the older version, 0.7.9a, runs fine from the problem submit node.

chapmanb commented 10 years ago

Jim; Thank you, this is super helpful. The issue is that you have two interfaces with different IPs, eth0 and ib0, and it's not clear to IPython which should be preferred. It arbitrarily picks the last one in the list, so it defaults to the ib0 address, which is not configured to be reachable on your network.

I pushed a new version of ipython-cluster-helper (0.2.23) that prioritizes eth interfaces that will hopefully resolve the issue:

./bcbio/anaconda/bin/pip install --upgrade ipython-cluster-helper

As a small aside, this is independent of the version of bcbio and related to the version of IPython (2.x is problematic, 1.x works). The changes I'm pushing work around it by monkey patching IPython.
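The prioritization amounts to roughly the following (a sketch of the 0.2.23 behaviour described above, not the actual monkey patch):

def prefer_eth(candidates):
    # Order (interface, address) pairs: Ethernet first, then anything
    # else, then InfiniBand, with loopback last.
    def rank(pair):
        iface, _addr = pair
        if iface.startswith("eth"):
            return 0
        if iface.startswith("ib"):
            return 2
        if iface.startswith("lo"):
            return 3
        return 1
    return sorted(candidates, key=rank)

# On Jim's compute node, eth0 now sorts ahead of ib0, so the reachable
# 172.17.1.134 address is chosen instead of 172.19.1.134.
print(prefer_eth([("lo", "127.0.0.1"), ("eth0", "172.17.1.134"), ("ib0", "172.19.1.134")]))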

So, fingers crossed that this will get things working for you and let you update at will. Thanks again for all the patience debugging.

jpeden1 commented 10 years ago

I'll give that a try.

Someone here pointed out that it might fix the issue if we could change the "--ip=*" to only allow bcbio to use our 10GbE. Is that possible, and where would I make that change?

Thanks again for all the help.

jpeden1 commented 10 years ago

Did the install --upgrade and reran bcbio. The .json's are CORRECT. :) It has gotten past the point where it was hanging. It will take a little while for it to finish this test job, but it looks promising!

chapmanb commented 10 years ago

Jim; Awesome that this worked. Thanks again for all the patience working through the issues. Did everything finish up okay?

jpeden1 commented 10 years ago

Brad, I've had a couple of different jobs finish without issue. Still wondering about the "--ip=" option and whether we could use that to have IPython select an eth interface? Thanks again for all the help and fixes!

chapmanb commented 10 years ago

Jim; Brilliant, glad to have this working. Sorry about forgetting to respond to the --ip suggestion. You could use it to specify specific IPs, but I don't know of a way to generalize this better in the ipython-cluster-helper code than how IPython does it. You'd need to be able to adjust the IP to point to the correct eth interface on every machine the ipcontroller starts on, which is essentially what IPython tries to do when we specify '--ip=*'. Thanks again for all the work on this.
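For illustration only, pinning the controller to a specific interface would look roughly like this (eth0 as the interface name and launching ipcontroller directly are assumptions; bcbio normally drives this through ipython-cluster-helper):

import subprocess
import netifaces

# Hypothetical: look up this machine's eth0 IPv4 address and pin the
# controller to it instead of letting '--ip=*' pick an interface.
eth0_addr = netifaces.ifaddresses("eth0")[netifaces.AF_INET][0]["addr"]
subprocess.check_call(["ipcontroller", "--ip=%s" % eth0_addr])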