bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

torque is hanging indefinitely #416

Closed. caddymob closed this issue 10 years ago

caddymob commented 10 years ago

Hey there -

In trying the updates for #386 we have killed our development install with 756be0ac - any job we try to run, be it human, rat, mouse, or the broken dogs, hangs indefinitely with Torque. The nodes get checked out and the engine and clients look to be running via qstat or showq - however, nothing is happening on the nodes when I look at top or ps aux. There are plenty of free nodes, so this doesn't seem to be a queue issue. The jobs all hang until they hit the timeout and that's all I get. I don't see anything in the logs/ipython logs - engines appear to have started successfully... I've rubbed my eyes and wiped my work dirs a few times to no avail. I checked and indeed running -t local works... Any suggestions, or additional info I can provide?

Thanks!

chapmanb commented 10 years ago

Jason; Sorry about the issues. That's unexpected; I don't think we've done anything on Torque recently -- at least on purpose. If you manually install an older version of ipython-cluster-helper, does that resolve it:

/usr/local/share/bcbio-nextgen/anaconda/bin/pip install ipython-cluster-helper==0.2.17

Hopefully that will at least get you running. If so, we can try to trace back to see what might have broken it.

One other thing to try is to look for old ipython controller processes from previous runs. I haven't seen that recently, but I know it solved the issue for a couple of people. Sorry again about the issue.

caddymob commented 10 years ago

Thanks Brad - The version in our 0.7.9 development build is ./bcbio/anaconda/lib/python2.7/site-packages/ipython_cluster_helper-0.2.19-py2.7.egg-info

Our stable old build (which still works) has ./anaconda/lib/python2.7/site-packages/ipython_cluster_helper-0.2.15-py2.7.egg-info

Working on rolling back versions now, but wanted to report. There are no dangling ipython processes out on the nodes - been there!

chapmanb commented 10 years ago

Jason; Thanks much for helping with debugging. To isolate the issue, it would help to know whether this is related to an ipython-cluster-helper version or to bcbio-nextgen itself. If you can fix it by downgrading just ipython-cluster-helper, that would help us focus there. Otherwise I'll look at the bcbio-nextgen changes. I already have some improvements to make there to reduce memory usage, but if there are other issues I'm happy to shake them out. We're working on finalizing before the next release, so this is super helpful. Sorry again about the problems.

mjafin commented 10 years ago

Hi guys, I'm getting the same thing with SGE. qstat reports the slots being allocated correctly, but bcbio hangs at:

2014-04-30 20:02:30.698 [IPClusterStart] Config changed:
2014-04-30 20:02:30.698 [IPClusterStart] {'ProfileDir': {'location': u'/ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work/log/ipython'}, 'SGELauncher': {'queue': 'highmem.q'}, 'BcbioSGEEngineSetLauncher': {'mem': '16.2', 'cores': 8, 'pename': u'smp', 'tag': '', 'resources': ''}, 'IPClusterEngines': {'early_shutdown': 240}, 'Application': {'log_level': 10}, 'BcbioSGEControllerLauncher': {'mem': '16.2', 'tag': '', 'resources': ''}, 'BaseParallelApplication': {'log_to_file': True, 'cluster_id': u'4f9b3d0e-e8b5-4efc-a3ab-0b375278d533'}, 'IPClusterStart': {'delay': 10, 'n': 3, 'daemonize': True, 'engine_launcher_class': u'cluster_helper.cluster.BcbioSGEEngineSetLauncher', 'controller_launcher_class': u'cluster_helper.cluster.BcbioSGEControllerLauncher'}}
2014-04-30 20:02:30.704 [IPClusterStart] Using existing profile dir: u'/ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work/log/ipython'
2014-04-30 20:02:30.704 [IPClusterStart] Searching path [u'/ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work', u'/ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work/log/ipython'] for config files
2014-04-30 20:02:30.704 [IPClusterStart] Attempting to load config file: ipython_config.py
2014-04-30 20:02:30.704 [IPClusterStart] Loaded config file: /ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work/log/ipython/ipython_config.py
2014-04-30 20:02:30.705 [IPClusterStart] Attempting to load config file: ipcluster_4f9b3d0e_e8b5_4efc_a3ab_0b375278d533_config.py
2014-04-30 20:02:30.705 [IPClusterStart] Loaded config file: /ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work/log/ipython/ipcontroller_config.py
2014-04-30 20:02:30.705 [IPClusterStart] Attempting to load config file: ipcluster_4f9b3d0e_e8b5_4efc_a3ab_0b375278d533_config.py
2014-04-30 20:02:30.706 [IPClusterStart] Loaded config file: /ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work/log/ipython/ipengine_config.py
2014-04-30 20:02:30.706 [IPClusterStart] Attempting to load config file: ipcluster_4f9b3d0e_e8b5_4efc_a3ab_0b375278d533_config.py
2014-04-30 20:02:30.707 [IPClusterStart] Loaded config file: /ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work/log/ipython/ipcluster_config.py

roryk commented 10 years ago

Hi guys,

Hmm, we're not having trouble running on SLURM or LSF, and we haven't touched the SGE and Torque bits of ipython-cluster-helper, so I'm not sure where to start debugging. If you install the newest version of ipython-cluster-helper from https://github.com/roryk/ipython-cluster-helper and run the example.py script in ipython-cluster-helper, does it work? That would help narrow down whether it is something with ipython-cluster-helper or something in bcbio-nextgen. Thanks a lot for the troubleshooting; sorry I can't be of more help.
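
For reference, example.py boils down to roughly this use of cluster_view (a minimal sketch based on the project's README; the scheduler and queue values are placeholders you would swap for your own, and square is just a stand-in function):

from cluster_helper.cluster import cluster_view

def square(n):
    # any picklable function; it is shipped to the engines and run in parallel
    return n * n

if __name__ == "__main__":
    # starts a controller plus num_jobs engines through the scheduler, waits for
    # them to register, then distributes the work across them
    with cluster_view(scheduler="sge", queue="highmem.q", num_jobs=3) as view:
        results = view.map(square, range(10))
        print list(results)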

caddymob commented 10 years ago

Hey guys - super sorry about this one - it turns out the DNS entry changed somewhere, somehow, on our galaxy machine that does the bcbio submissions. #facepalm

roryk commented 10 years ago

Hi Jason. Thanks for following up-- why can't all bugs solve themselves? I reopened this so we can tackle Miika's issue.

chapmanb commented 10 years ago

Jason -- thanks for this. Glad to have one problem solved.

Miika -- is your problem hanging or timeouts? If hanging, could you ssh into the controller and engine machines and check memory usage? I've realized that with larger runs memory usage is becoming a problem, and I'm working on tactics to avoid that now. That's the only potential problem I know of right now, so it's my best guess.

mjafin commented 10 years ago

Brad, it's a problem with hanging. There's plenty of memory available on the controller/engine node (256GB), and things work fine when I use 0.7.8 stable.

Rory, how did you want me to run the example.py script?

mjafin commented 10 years ago

Hmm, now that I've submitted it to another queue, off it goes... false alarm I guess, and it beats me what the reason might be!

roryk commented 10 years ago

Two for two, this must be a dream. highmem sounds like a good queue for bcbio-nextgen to work on. Has it worked before on the highmem queue? If you run the example.py script like this:

python example/example.py --scheduler SGE --num_jobs 3 --queue highmem.q

Does it work? Is the smp parallel environment that is being selected correct?

mjafin commented 10 years ago

So, 0.7.8 worked with highmem.q on this same data. smp is the only parallel environment we use, so it's correct.

With the other queue, ngs.q, the example.py script produces more output; with highmem.q it doesn't. I'll try to figure out why.

roryk commented 10 years ago

Hi Miika,

ipython-cluster-helper not working correctly is a good clue. It seems like this behavior sometimes happens because the engines and controller can't talk to each other; we've seen it when the node the controller is running on is walled off from the engines. I'm interested in knowing what you find out.

caddymob commented 10 years ago

Me again...

I take it back - it wasn't a DNS thing; I was reading the IPs wrong while chasing this down. So, Rory, we pulled down your example.py script and gave that a spin - same deal, with a slight deviation for our queue as follows:

python example.py --scheduler torque --num_jobs 3 --queue batch

This does the same thing: it works fine in our stable version, but in our development clone we hang until the timeout - even though the nodes are indeed checked out.

This is where it sits:

[ProfileCreate] Generating default config file: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipython_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipython_notebook_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipython_nbconvert_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipcontroller_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipengine_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipcluster_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/iplogger_config.py'
2014-04-30 14:45:24.522 [IPClusterStart] Config changed:
2014-04-30 14:45:24.522 [IPClusterStart] {'BaseIPythonApplication': {'profile': u'bb7ea384-d0b0-11e3-854e-d4ae52679fa3'}, 'BcbioTORQUEEngineSetLauncher': {'mem': '1', 'cores': 1, 'tag': '', 'resources': ''}, 'IPClusterEngines': {'early_shutdown': 240}, 'Application': {'log_level': 10}, 'BaseParallelApplication': {'log_to_file': True, 'cluster_id': u'f36d5a2c-9394-4af5-85d4-225ef9951f09'}, 'TORQUELauncher': {'queue': 'batch'}, 'BcbioTORQUEControllerLauncher': {'mem': '1', 'tag': '', 'resources': ''}, 'IPClusterStart': {'delay': 10, 'n': 3, 'daemonize': True, 'engine_launcher_class': u'cluster_helper.cluster.BcbioTORQUEEngineSetLauncher', 'controller_launcher_class': u'cluster_helper.cluster.BcbioTORQUEControllerLauncher'}}
2014-04-30 14:45:24.523 [IPClusterStart] IPYTHONDIR set to: /home/jcorneveaux/.ipython
2014-04-30 14:45:24.528 [IPClusterStart] Using existing profile dir: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3'
2014-04-30 14:45:24.529 [IPClusterStart] Searching path [u'/home/jcorneveaux/example', u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3'] for config files
2014-04-30 14:45:24.529 [IPClusterStart] Attempting to load config file: ipython_config.py
2014-04-30 14:45:24.529 [IPClusterStart] Loaded config file: /home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipython_config.py
2014-04-30 14:45:24.530 [IPClusterStart] Attempting to load config file: ipcluster_f36d5a2c_9394_4af5_85d4_225ef9951f09_config.py
2014-04-30 14:45:24.530 [IPClusterStart] Loaded config file: /home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipcontroller_config.py
2014-04-30 14:45:24.530 [IPClusterStart] Attempting to load config file: ipcluster_f36d5a2c_9394_4af5_85d4_225ef9951f09_config.py
2014-04-30 14:45:24.531 [IPClusterStart] Loaded config file: /home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipengine_config.py
2014-04-30 14:45:24.531 [IPClusterStart] Attempting to load config file: ipcluster_f36d5a2c_9394_4af5_85d4_225ef9951f09_config.py
2014-04-30 14:45:24.532 [IPClusterStart] Loaded config file: /home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipcluster_config.py

This is specific to our 'galaxy' head node - if we try from the standard submission node or from a compute node running interactively, both branches work as expected. On the stable version, all submission hosts work fine with ipython-cluster, but on the galaxy node, where we must submit bcbio jobs, the development branch hangs... These are all CentOS 6 machines. Is there additional debugging or logs to look at, or is there a place in the source where we can add some debugging?

roryk commented 10 years ago

Hi Jason,

Great -- thanks for looking into this. @chapmanb will hopefully have some more ideas, but taking a crack at it: are you by any chance running the bcbio_nextgen.py command from the head node, or are you submitting that as a job too? We've seen a couple of cluster configurations where the compute nodes are all allowed to open random ports on each other, but not to the head node, so the controller and engines can never connect when the job runs on the head node, yet everything is fine when the job is submitted to run on a compute node. It sounds like something specific about the galaxy node setup is preventing the engines and controller from registering with each other.

In the .ipython/profile_uuid/ directory there might be some IPython-specific log files with more information; in particular, there are JSON files with the connection details, which might help narrow down which hosts can't talk to each other or whether a port range is being blocked.
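
If it helps, the connection details can be pulled out of those JSON files with a few lines of Python (a minimal sketch; the glob path is a placeholder, so point it at the security/ subdirectory of whichever profile directory your run created):

import glob
import json

# placeholder path -- adjust to your run's IPython profile directory
for fname in glob.glob("log/ipython/security/ipcontroller-*.json"):
    with open(fname) as in_handle:
        info = json.load(in_handle)
    # 'location' is the address the client and engines try to reach; the numeric
    # fields are the ZeroMQ ports that must be reachable between the hosts
    print fname, info.get("location"), info.get("registration")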

Hope something in there is on the right path!

roryk commented 10 years ago

Hi Jason,

Did this ever end up getting resolved? If so, what was the ultimate solution? If not, how can we help? :)

caddymob commented 10 years ago

Sorry I left this one hanging - no, we didn't get it solved; I've been punting and running on compute nodes. We actually just went down the rabbit hole of building a copy on the galaxy node, and need to update SSL and reboot and so on... Not sure where or what is different, but we decided to try building afresh and hope that clears up the blockage.

mjafin commented 10 years ago

Rory, no luck here! It gets the slots:

4368430 0.50500 bcbio-c    klrl262      r     05/08/2014 09:31:45 highmem.q@ukapdlnx119              1 1
4368431 0.60500 bcbio-e    klrl262      r     05/08/2014 09:32:00 highmem.q@ukapdlnx119              8 1
4368431 0.60500 bcbio-e    klrl262      r     05/08/2014 09:32:00 highmem.q@ukapdlnx119              8 2
4368431 0.60500 bcbio-e    klrl262      r     05/08/2014 09:32:00 highmem.q@ukapdlnx119              8 3
4368431 0.60500 bcbio-e    klrl262      r     05/08/2014 09:32:00 highmem.q@ukapdlnx119              8 4

but then is left hanging indefinitely here:

2014-05-08 09:31:35.085 [IPClusterStart] Config changed:
2014-05-08 09:31:35.086 [IPClusterStart] {'ProfileDir': {'location': u'/ngs/RDI/Analysis/NA12878_validation/work/log/ipython'}, 'SGELauncher': {'queue': 'highmem.q'}, 'BcbioSGEEngineSetLauncher': {'mem': '16.2', 'cores': 8, 'pename': u'smp', 'tag': '', 'resources': ''}, 'IPClusterEngines': {'early_shutdown': 240}, 'Application': {'log_level': 10}, 'BcbioSGEControllerLauncher': {'mem': '16.2', 'tag': '', 'resources': ''}, 'BaseParallelApplication': {'log_to_file': True, 'cluster_id': u'fced528b-5bd6-4cbc-87a2-f0701356db87'}, 'IPClusterStart': {'delay': 10, 'n': 4, 'daemonize': True, 'engine_launcher_class': u'cluster_helper.cluster.BcbioSGEEngineSetLauncher', 'controller_launcher_class': u'cluster_helper.cluster.BcbioSGEControllerLauncher'}}
2014-05-08 09:31:35.090 [IPClusterStart] Using existing profile dir: u'/ngs/RDI/Analysis/NA12878_validation/work/log/ipython'
2014-05-08 09:31:35.091 [IPClusterStart] Searching path [u'/ngs/RDI/Analysis/NA12878_validation/work', u'/ngs/RDI/Analysis/NA12878_validation/work/log/ipython'] for config files
2014-05-08 09:31:35.091 [IPClusterStart] Attempting to load config file: ipython_config.py
2014-05-08 09:31:35.091 [IPClusterStart] Loaded config file: /ngs/RDI/Analysis/NA12878_validation/work/log/ipython/ipython_config.py
2014-05-08 09:31:35.091 [IPClusterStart] Attempting to load config file: ipcluster_fced528b_5bd6_4cbc_87a2_f0701356db87_config.py
2014-05-08 09:31:35.092 [IPClusterStart] Loaded config file: /ngs/RDI/Analysis/NA12878_validation/work/log/ipython/ipcontroller_config.py
2014-05-08 09:31:35.092 [IPClusterStart] Attempting to load config file: ipcluster_fced528b_5bd6_4cbc_87a2_f0701356db87_config.py
2014-05-08 09:31:35.092 [IPClusterStart] Loaded config file: /ngs/RDI/Analysis/NA12878_validation/work/log/ipython/ipengine_config.py
2014-05-08 09:31:35.093 [IPClusterStart] Attempting to load config file: ipcluster_fced528b_5bd6_4cbc_87a2_f0701356db87_config.py
2014-05-08 09:31:35.093 [IPClusterStart] Loaded config file: /ngs/RDI/Analysis/NA12878_validation/work/log/ipython/ipcluster_config.py

roryk commented 10 years ago

Hi Jason,

Thanks for the update and sorry for the problems. Let us know if rebuilding the node doesn't fix it up.

Miika, hmm -- I see from the processes that the actual bcbio job isn't on the list; is that job running on the head node? There is a client, the controller, and a bunch of engines. The client sends jobs to the controller, which distributes them to the engines, so the client needs to be able to connect to the ZMQ sockets the controller sets up. Sometimes clusters are set up such that the compute nodes can talk to each other but the head node can't directly connect to ports on the compute nodes. If the bcbio_nextgen job is submitted to highmem.q as well, does that work? If not, does running that example.py script from ipython-cluster-helper work? Does it work on any queue, or just that one?
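
One way to see whether the client side can reach the controller at all is to build the IPython parallel Client by hand from the connection file ipcontroller writes (a minimal sketch; the url_file path is a placeholder for the ipcontroller-*-client.json under your profile's security/ directory):

from IPython.parallel import Client

# placeholder path -- use the client connection file from your own run
url_file = "log/ipython/security/ipcontroller-client.json"

# constructing the client connects to the controller's ZMQ sockets; if the host
# running this cannot reach those ports, the call hangs or times out
rc = Client(url_file, timeout=60)
print rc.ids  # ids of registered engines; an empty list means none have connected yet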

mjafin commented 10 years ago

Rory, both ngs.q and highmem.q work when running 0.7.8. With 0.7.9a, ngs.q works but highmem.q doesn't. I tried both actual jobs and example.py.

Was anything changed in ipython-cluster-helper recently (after the 0.7.8 release)?

I'm usually not submitting the actual bcbio job itself to the queue, just running it in a screen session locally.

mjafin commented 10 years ago

OK, so after some further investigation: if I run example.py from one of the ngs.q hosts, only ngs.q works (0.7.9a). However, on the highmem.q host, both ngs.q and highmem.q work when running example.py.

So did something about the communication (the ZMQ sockets) change in 0.7.9a?

mjafin commented 10 years ago

Sorry about the spam, but when I CTRL-C example.py after it hangs, I can see this:

  File "example.py", line 48, in <module>
    profile=args.profile, extra_params=args.resources) as view:
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cluster_helper/cluster.py", line 803, in cluster_view
    while not _is_up(url_file, num_jobs):
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cluster_helper/cluster.py", line 745, in _is_up
    client = Client(url_file, timeout=60)
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/IPython/parallel/client/client.py", line 496, in __init__
    self._connect(sshserver, ssh_kwargs, timeout)
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/IPython/parallel/client/client.py", line 613, in _connect
    evts = poller.poll(timeout*1000)
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/zmq/sugar/poll.py", line 110, in poll
    return zmq_poll(self.sockets, timeout=timeout)
  File "_poll.pyx", line 115, in zmq.backend.cython._poll.zmq_poll (zmq/backend/cython/_poll.c:1678)
  File "checkrc.pxd", line 11, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/_poll.c:1879)
KeyboardInterrupt

roryk commented 10 years ago

Hi Miika,

Is that trace from submitting to highmem.q from an ngs.q host? It definitely seems like when you submit to the highmem queue from the ngs.q host, the client (the main bcbio job) can't connect to the controller.

If you submit the example.py job as a job, does it work on both queues? That way there would be the example.py job, a controller job, and a set of engine jobs all running on the scheduler, not just the controller and engines.

Thanks for helping debug what is going on, don't worry about spamming-- more info is awesome.

chapmanb commented 10 years ago

Miika; We didn't purposely change anything in the communication along these lines, but I agree with Rory that it looks like some kind of connectivity issue. I wonder if it's an IPython difference. What versions are you running on the working and non-working installs:

/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/bin/conda list ipython

If they are different, does converting the broken install's IPython to the working version fix it (or vice versa, although you might not want to break the one that works)? Just trying to isolate the root cause a bit. Hope this helps.
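
If the versions do differ, something like the following should pin the development install back to the working IPython release (a guess at the exact command; adjust the version and path as needed):

/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/bin/conda install ipython=1.2.1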

mjafin commented 10 years ago

Rory, so qsubbing a wrapper script that calls example.py to highmem.q did indeed run just fine, so it must be something about the communication between the ngs.q nodes and the highmem.q host. So the answer to your question ("If you submit the example.py job as a job, does it work on both queues?") is yes.

Brad, here's the list for 0.7.9a:

# packages in environment at /apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda:
#
ipython                   2.0.0                    py27_0
ipython-cluster-helper    0.2.19                    <pip>

and here's the list for 0.7.8:

# packages in environment at /apps/bcbio-nextgen/0.7.8/rhel6-x64/anaconda:
#
ipython                   1.2.1                    py27_0
ipython-cluster-helper    0.2.17                    <pip>

So 1.2.1 vs. 2.0.0

Edit: downgrading ipython-cluster-helper to 0.2.17 within 0.7.9a didn't help, so the culprit must be ipython 2.0.0.

roryk commented 10 years ago

Hi Miika,

Score! This seems like it might be related to https://github.com/chapmanb/bcbio-nextgen/issues/138. There is a solution to one type of network configuration problem here: https://bcbio-nextgen.readthedocs.org/en/latest/search.html?q=network+troubleshooting&check_keywords=yes&area=default#troubleshooting

If you find out what needs to be changed in your machine's configuration to let the ngs and highmem nodes talk to each other, could you follow up here so we can add it to the docs?

mjafin commented 10 years ago

Thanks Rory, I'll see if I can figure something out. On both the ngs.q and highmem.q machines the /etc/hosts file is identical and lists all the hosts and their corresponding IP addresses. I also checked that they're correct by running python -c 'import socket; print socket.gethostbyname(socket.gethostname())' on all the hosts, and, as mentioned, things worked with the old ipython version.

Even if I can't figure out the underlying reason, it's not really blocking me from getting jobs run with the workarounds - obviously it would be nice to nail down the real reason though.

roryk commented 10 years ago

It also might be that access to the ZeroMQ ports on the highmem.q machines is blocked when coming from the ngs.q machines. Maybe the ZeroMQ port range is blocked? iptables --list on the highmem.q machines might have some clues.

mjafin commented 10 years ago

Thanks, will ask our system admin to check the ports (requires sudo)!

chapmanb commented 10 years ago

Miika and Rory; Thanks for all the detective work. It sounds like there might be different ports required between 1.2.1 and 2.0.0, but searching around I couldn't find a definitive change log that would identify them. The best way to check might be to start a parallel process with both the old/working and new/not-working versions and look at the generated JSON files defining the ZeroMQ ports in log/ipython/security/ipcontroller-*client.json and log/ipython/security/ipcontroller-*engine.json. Are there different port requirements between the two? This might give the admin folks exact open/closed ports to look at. It looks like we'd be looking for a range or other differences, not exact numbers. The code appears to randomly pick open ports using the Python socket interface:

https://github.com/ipython/ipython/blob/master/IPython/parallel/util.py#L278 http://stackoverflow.com/a/1365284/252589
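
In other words, the ports come from the operating system's ephemeral range rather than a fixed list. The usual trick, roughly as in the linked StackOverflow answer (a sketch of the idea, not the exact IPython code), is:

import socket

def select_random_port():
    # binding to port 0 asks the OS for any free ephemeral port, which is why
    # each run ends up on different, essentially arbitrary port numbers
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("", 0))
    port = s.getsockname()[1]
    s.close()
    return port

print select_random_port()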

We'd love to hear what you find out so we can update the docs with tips. We're guessing on our side because we haven't run into this, but it's great to capture these sorts of issues. Jason, is any of this helping with your debugging?

mjafin commented 10 years ago

I took a look at the json files and found a suspect. Here are the entries from the working, old ipython 1.2.1:

{
  "control": 56515,
  "task": 58746,
  "notification": 58667,
  "task_scheme": "lru",
  "mux": 45754,
  "iopub": 49942,
  "ssh": "",
  "key": "727dc910-d63c-49a8-9143-6c960fb2a54f",
  "registration": 43854,
  "interface": "tcp://*",
  "signature_scheme": "hmac-sha256",
  "pack": "json",
  "unpack": "json",
  "location": "10.145.85.157"
}{
  "control": 35995,
  "task": 38806,
  "hb_ping": 32912,
  "mux": 54166,
  "pack": "json",
  "hb_pong": 51138,
  "ssh": "",
  "key": "727dc910-d63c-49a8-9143-6c960fb2a54f",
  "registration": 43854,
  "interface": "tcp://*",
  "iopub": 51856,
  "signature_scheme": "hmac-sha256",
  "unpack": "json",
  "location": "10.145.85.157"
}

And these are from the non-working version:

{
  "control": 41690,
  "task": 59471,
  "notification": 55996,
  "task_scheme": "lru",
  "mux": 41958,
  "iopub": 42830,
  "ssh": "",
  "key": "9359cf54-7b67-4e3e-a841-49150e1680bf",
  "registration": 46937,
  "interface": "tcp://*",
  "signature_scheme": "hmac-sha256",
  "pack": "json",
  "unpack": "json",
  "location": "192.168.122.1"
}{
  "control": 39676,
  "task": 45770,
  "hb_ping": 40036,
  "mux": 59565,
  "pack": "json",
  "hb_pong": 57220,
  "ssh": "",
  "key": "9359cf54-7b67-4e3e-a841-49150e1680bf",
  "registration": 46937,
  "interface": "tcp://*",
  "iopub": 59501,
  "signature_scheme": "hmac-sha256",
  "unpack": "json",
  "location": "192.168.122.1"
}

The location seems off in the second case (i.e. 10.145.85.157 is correct, 192.168.122.1 isn't). Upon closer inspection on the highmem.q node, there is a network interface called virbr0:

virbr0    Link encap:Ethernet  HWaddr 52:...
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:29 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:4255 (4.1 KiB)

What I don't understand is why ipython 2.0.0 would try to connect to this instead of eth2.

chapmanb commented 10 years ago

Miika; Thanks so much for the detective work. I think I've isolated the issue and pushed a fix to ipython-cluster-helper. If you upgrade on the problem version:

/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/bin/pip install --upgrade ipython-cluster-helper

It should hopefully now give you 10.145.85.157 as a location instead of 192.168.122.1. If this works, I'll push the fix upstream to IPython.

virbr0 is used by VMs for networking (http://askubuntu.com/questions/246343/what-is-the-virbr0-interface-used-for), so it is likely related to Docker work if you were testing on that machine. The underlying cause is that network detection was made more resilient in IPython 2.0.0. Here's the new version:

https://github.com/ipython/ipython/blob/master/IPython/utils/localinterfaces.py

versus the old:

https://github.com/ipython/ipython/blob/rel-1.2.1/IPython/utils/localinterfaces.py

Unfortunately the fixes appear to allow it to grab more interfaces, so things like docker0 and virbr0 get pulled in and potentially used. The fix I pushed should avoid this.
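
The idea of the fix is roughly to skip loopback and virtual bridge interfaces when choosing the address to advertise (a sketch of the approach, not the actual ipython-cluster-helper patch):

import netifaces

def external_ips():
    # ignore loopback and VM/container bridges (virbr*, docker*) so the
    # controller advertises an address other machines can actually reach
    ips = []
    for iface in netifaces.interfaces():
        if iface == "lo" or iface.startswith(("virbr", "docker")):
            continue
        for addr in netifaces.ifaddresses(iface).get(netifaces.AF_INET, []):
            ips.append(addr["addr"])
    return ips

print external_ips()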

Jason, can you check whether this is the underlying cause of your issue as well? A quick way to check is whether ifconfig reports interfaces that could throw off your networking. Hopefully this fixes things for you too.

Thanks again for all the help debugging this. It's great to finally have a handle on an underlying cause. Fingers crossed this fixes things.

mjafin commented 10 years ago

Thanks Brad, I think we're on the right track! Now something even weirder has happened after the update: location now reports the IP of an interface called bond0.104 on the submitting host!

chapmanb commented 10 years ago

Miika; Ugh, sorry - that is my mistake. I was so focused on testing on a machine that had Docker that I didn't realize I was patching this in the wrong place for distributed execution. I pushed a new version which should correctly run this on the controller machine and actually provide the right address. Can you upgrade again:

/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/bin/pip install --upgrade ipython-cluster-helper

Fingers crossed this will fix it.

mjafin commented 10 years ago

Bingo! example.py now runs to completion, although the reported IP address in the .json file is from bond0.104 and not eth2 of the highmem.q machine. The IP of the bond0.104 interface is not in /etc/hosts, but that doesn't seem to prevent the job from running. Do you think it makes any difference?

chapmanb commented 10 years ago

Brilliant. The bond0 interface is probably networking magic combining multiple physical cards into a single higher-throughput pipe. I think you normally set these up so the bond0 interface has an external IP and the constituent eth cards do not, but I'm not an expert on this. Practically, working is the most important thing, so all we need is for your external IPs from ifconfig to be resolvable from the network.

Jason, if this fixes it for you I'll push the change to IPython and finalize this. Nice to have some closure.

jpeden1 commented 10 years ago

IPython is still hanging from the one machine Jason talked about. I went so far as to have the machine reimaged to match all of the nodes in our cluster. Here is the output I get when I control-C bcbio:

2014-05-15 10:05:57.536 [IPClusterStart] Loaded config file: /scratch/jpeden/GALAXY_BC_TEST/work/log/ipython/ipython_config.py
2014-05-15 10:05:57.537 [IPClusterStart] Attempting to load config file: ipcluster_bffa14d9_a60e_481e_86c2_58940cd2f032_config.py
2014-05-15 10:05:57.537 [IPClusterStart] Loaded config file: /scratch/jpeden/GALAXY_BC_TEST/work/log/ipython/ipcontroller_config.py
2014-05-15 10:05:57.538 [IPClusterStart] Attempting to load config file: ipcluster_bffa14d9_a60e_481e_86c2_58940cd2f032_config.py
2014-05-15 10:05:57.539 [IPClusterStart] Loaded config file: /scratch/jpeden/GALAXY_BC_TEST/work/log/ipython/ipengine_config.py
2014-05-15 10:05:57.539 [IPClusterStart] Attempting to load config file: ipcluster_bffa14d9_a60e_481e_86c2_58940cd2f032_config.py
2014-05-15 10:05:57.540 [IPClusterStart] Loaded config file: /scratch/jpeden/GALAXY_BC_TEST/work/log/ipython/ipcluster_config.py
^C2014-05-15 10:26:52.110 [IPClusterStop] Using existing profile dir: u'/scratch/jpeden/GALAXY_BC_TEST/work/log/ipython'
2014-05-15 10:26:52.134 [IPClusterStop] Stopping cluster [pid=11203] with [signal=2]
Traceback (most recent call last):
  File "/packages/bcbio/0.7.9/tooldir/bin/bcbio_nextgen.py", line 62, in <module>
    main(**kwargs)
  File "/packages/bcbio/0.7.9/tooldir/bin/bcbio_nextgen.py", line 40, in main
    run_main(**kwargs)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 44, in run_main
    fc_dir, run_info_yaml)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 82, in _run_toplevel
    for xs in pipeline.run(config, config_file, parallel, dirs, pipeline_items):
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 312, in run
    multiplier=alignprep.parallel_multiplier(samples)) as run_parallel:
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/prun.py", line 45, in start
    with ipython.create(parallel, dirs, config) as view:
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/contextlib.py", line 17, in **enter**
    return self.gen.next()
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/cluster_helper/cluster.py", line 803, in cluster_view
    while not _is_up(url_file, num_jobs):
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/cluster_helper/cluster.py", line 745, in _is_up
    client = Client(url_file, timeout=60)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/client.py", line 496, in __init__
    self._connect(sshserver, ssh_kwargs, timeout)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/client.py", line 613, in _connect
    evts = poller.poll(timeout*1000)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/zmq/sugar/poll.py", line 110, in poll
    return zmq_poll(self.sockets, timeout=timeout)
  File "_poll.pyx", line 115, in zmq.backend.cython._poll.zmq_poll (zmq/backend/cython/_poll.c:1678)
  File "checkrc.pxd", line 11, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/_poll.c:1879)
KeyboardInterrupt

The other logs are basically empty.

When I run this same job on an un-updated 0.7.9a, it works.

Our working copy has:


jpeden@galaxy-> ./anaconda/bin/conda list ipython

# packages in environment at /packages/bcbio/0.7.4/anaconda:

#
ipython                   1.2.1                    py27_0  
ipython-cluster-helper    0.2.15                    <pip>

Our non working copy (upgraded last night) has:


jpeden@galaxy-> ./anaconda/bin/conda list ipython

# packages in environment at /packages/bcbio/0.7.4/anaconda:

#
ipython                   1.2.1                    py27_0  
ipython-cluster-helper    0.2.15                    <pip>

Any help would be great!

Jim

chapmanb commented 10 years ago

Jim; Sorry for the continued problems. Can you clarify the differences between the two environments? Your ./anaconda/bin/conda list ipython results are confusing, as they both show the same IPython and ipython-cluster-helper versions. Is that right? If you're updating to the latest, it should at least force ipython-cluster-helper >= 0.2.19, so it seems like something is off. Maybe you accidentally listed the same install twice? Thanks for the help debugging this.

jpeden1 commented 10 years ago

I upgraded our ipython-cluster-helper from 0.2.19 to 0.2.21. Same problem. No change.

mjafin commented 10 years ago

@jpeden1 in our case it was helpful to look at the ifconfig output and compare it to the ipython json files to see whether a network adapter was being used that isn't visible to the other machines.

jpeden1 commented 10 years ago

I see a .json file in my work directory under log/ipython, but I'm not sure how to decode it. In our case, eth5 is the adapter that talks to the cluster. Below are the contents of the json file:

{"next_id": 2, "engines": {"0": "8585ba26-e167-459a-8303-f19de845d7b4", "1": "a0d9243b-8980-44dd-b104-c52d6f79d956"}}

Jim

chapmanb commented 10 years ago

Jim; Sorry it's still not working. As Miika mentioned, it could still be an adapter issue we didn't anticipate. The files with the IP addresses you want to look at are log/ipython/security/ipcontroller-*client.json and log/ipython/security/ipcontroller-*engine.json. The IP address in location should match the IP of your adapter, eth5. If it doesn't, could you send along what it is, plus the output of ifconfig, so we can see which adapter it does use? Thanks for the help debugging.

jpeden1 commented 10 years ago

The security directory is empty. I had let bcbio run for 5-10 minutes before I hit control-C.

chapmanb commented 10 years ago

Jim; Sorry. If that file is not produced, we'll probably need to understand your network configuration a bit more to offer additional advice. It seems like some configuration is making the new IPython version unhappy. What does this command report?

./anaconda/bin/python -c 'import netifaces; print [(iface, [x["addr"] for x in netifaces.ifaddresses(iface).get(netifaces.AF_INET, [])]) for iface in netifaces.interfaces()]'

Ideally we'd like to see it pointing to your eth5 adapter. Thanks.

jpeden1 commented 10 years ago

Here is what I get:

[('lo', ['127.0.0.1']), ('eth0', ['10.48.66.33']), ('eth1', []), ('eth2', []), ('eth3', []), ('eth4', []), ('eth5', ['172.17.1.33'])]

chapmanb commented 10 years ago

Jim; Thanks, that looks right, assuming 172.17.1.33 is the machine address you'd like to have exported. If the log/ipython/security/ipcontroller* files aren't being produced, that is the next point to debug. Sorry I don't have anything to suggest right now, but I can read the IPython code to try to guess what might be causing it to fail to write these. If you have any thoughts about how this machine is configured differently from the others that work, that would be super helpful. We're basically just guessing here, so anything to go on might help identify the underlying cause. Thanks again.

mjafin commented 10 years ago

@chapmanb I wonder if the ipcontroller files are removed as part of the temporary profile cleanup, i.e. when hitting CTRL-C? @jpeden1 maybe if you start a process that hangs, don't CTRL-C it, but instead check from your terminal where the ipython profile is and see whether the ipcontroller files are there?

Out of curiosity, is eth0 visible from the other machines? Maybe ipython is trying to use eth0 rather than eth5.

jpeden1 commented 10 years ago

@mjafin I reran bcbio and the json files are there while it is running. They must get deleted when I hit CTRL-C.

Here are the contents of the files:

[security]jpeden@galaxy-> more ipcontroller-d776da1d-16ec-4c85-9993-22f5b088bda5-client.json
{
  "control": 39396,
  "task": 57524,
  "notification": 43872,
  "task_scheme": "lru",
  "mux": 33443,
  "patch": "VMFixIPControllerApp",
  "iopub": 48118,
  "ssh": "",
  "key": "2f424d43-2f6f-4088-b394-835845567dee",
  "registration": 45187,
  "interface": "tcp://*",
  "signature_scheme": "hmac-sha256",
  "pack": "json",
  "unpack": "json",
  "location": "172.19.1.134"
}

[security]jpeden@galaxy-> more ipcontroller-d776da1d-16ec-4c85-9993-22f5b088bda5-engine.json
{
  "control": 34699,
  "task": 51469,
  "hb_ping": 45104,
  "mux": 47048,
  "patch": "VMFixIPControllerApp",
  "pack": "json",
  "hb_pong": 52224,
  "ssh": "",
  "key": "2f424d43-2f6f-4088-b394-835845567dee",
  "registration": 45187,
  "interface": "tcp://*",
  "iopub": 51241,
  "signature_scheme": "hmac-sha256",
  "unpack": "json",
  "location": "172.19.1.134"
}

Bcbio is still hanging. What is interesting is that the old 0.7.9a version, which runs ipython 1.2.1 and ipython-cluster-helper 0.2.15, works when run from the same machine where the later bcbio hangs.

jpeden1 commented 10 years ago

@mjafin eth0 is not visible from the compute nodes.

mjafin commented 10 years ago

@jpeden1 What do the json files look like when you use the version that works?