bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Increase timeout for hub connection (or --timeout does nothing?) #60

Closed: lbeltrame closed this issue 11 years ago

lbeltrame commented 11 years ago

I have a cluster of 34 nodes, 32 of which are totally diskless over NFS, running SGE: while most of the heavy-I/O paths are in RAM, there is still a noticeable delay in getting the IPython cluster set up because all the nodes access the NFS folder holding the work directory.

This is particularly evident when I want to use all the cores available (700+).

bcbio-nextgen keeps failing with:

IPython.parallel.error.TimeoutError: Hub connection request timed out

no matter what value I set the timeout option to.

So, is the timeout option what I need, or is this a shortcoming of IPython, or is the value hardcoded?

chapmanb commented 11 years ago

Luca; The bcbio_nextgen.py --timeout parameter controls a different timeout: it handles how long it takes for the cluster to spin up. It sounds like the timeout you found is this setting for how long the engine should wait to connect to the controller:

https://github.com/roryk/ipython-cluster-helper/blob/master/cluster_helper/cluster.py#L31

We've already tripled this over the default, but it sounds like it needs to be longer to work on your system. Do you have a sense of a good setting which will capture the lag time you're seeing?
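
Roughly, that setting guards the point where a client or engine attaches to the controller through its JSON connection file; here is a minimal sketch of the client-side version of that connection, the same kind of call that produces the Hub connection TimeoutError (the file path and the 60 second value are placeholders, and the exact keyword may vary between IPython versions):

# Minimal sketch, not cluster_helper's actual code: connect to the hub via
# the controller's JSON connection file with an explicit connect timeout.
# The file path and the 60 second value are placeholders.
from IPython.parallel import Client
from IPython.parallel.error import TimeoutError

try:
    client = Client("log/ipython/security/ipcontroller-client.json", timeout=60)
    print("connected, %d engines registered" % len(client.ids))
except TimeoutError:
    print("Hub connection request timed out")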

lbeltrame commented 11 years ago

On Sunday, 21 July 2013 18:06:10, Brad Chapman wrote:

Hello Brad,

We've already tripled this over the default, but it sounds like it needs to be longer to work on your system. Do you have a sense of a good setting

Where is this set in bcbio-nextgen? I can make some local modifications, then trigger runs and see if I can find an acceptable timeout. It seems to be a function of core count: up to 100 cores the cluster starts fine (so far; I plan more tests later today), but close to the maximum of 792 it still generates a timeout.

chapmanb commented 11 years ago

Luca; To test, clone the ipython-cluster-helper repository and adjust the --timeout=30 parameter manually, then install it inside your bcbio-nextgen virtualenv with /path/to/virtualenv/bin/python setup.py install. If you can identify a timeout that gives enough time to connect on your larger clusters, I'll set it as the default and push a new version of ipython-cluster-helper. I'd like to get the fix in there generally so it helps other projects beyond bcbio-nextgen that use it.

Thanks for all the help.

lbeltrame commented 11 years ago

On Monday, 22 July 2013 04:22:46, Brad Chapman wrote:

Hello Brad,

To test, clone the ipython-cluster-helper repository and adjust the --timeout=30 parameter manually, then install inside your bcbio-nextgen

I'll adjust the parameter from my install, thanks for the tip (I don't use a virtualenv: my pip packages are handled through puppet to ensure consistency across the cluster's nodes).

I'll be sure to experiment and report back.

lbeltrame commented 11 years ago

It looks like the timeout option does nothing: I set it to 5 minutes (300 seconds), yet after less than a minute the cluster stops with the error.

lbeltrame commented 11 years ago

EDIT: It looks like registration doesn't complete in time for all the items, but as I said earlier, it doesn't seem that the timeout is involved: I set it to a very large value (1200) and there are still issues.

So far I've run with up to 330-335 cores: numbers above that threshold always generate issues. I don't think there is an issue with SGE, as the jobs are dispatched in a few seconds at most (the nodes have the SGE directories on a RAM disk).

Some more details: the maximum number of analyzable chunks the pipeline produces with my data is 660.

chapmanb commented 11 years ago

Luca; It's also worth checking that the *_controller* and *_engines* files that do the submissions have the timeouts you're setting. This is a good sanity check to ensure the settings you're changing are actually being supplied.

Beyond that, can you provide more detail on the error messages? Is it only the IPython.parallel.error.TimeoutError: Hub connection request timed out message you're seeing? Where is this reported? Do the ipython logs have more details?
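
For the sanity check, something along these lines run from the work directory will print whatever timeout settings made it into the generated submission scripts (a sketch; the sge_controller*/sge_engine* patterns are an assumption based on the batch script names the launcher writes):

# Sketch of the sanity check: show any timeout-related lines in the SGE
# submission scripts that ipython-cluster-helper wrote into the work
# directory. The file name patterns are assumptions based on the batch
# script names visible in the cluster logs.
import glob

for script in sorted(glob.glob("sge_controller*") + glob.glob("sge_engine*")):
    with open(script) as handle:
        for line in handle:
            if "timeout" in line.lower():
                print("%s: %s" % (script, line.strip()))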

lbeltrame commented 11 years ago

This is what I get:


2013-07-22 20:19:37.263 [IPClusterStart] Searching path [u'/mnt/data/projects/testproject/work', u'/mnt/data/projects/testproject/work/log/ipython'] for config files
2013-07-22 20:19:37.263 [IPClusterStart] Attempting to load config file: ipython_config.py
2013-07-22 20:19:37.263 [IPClusterStart] Config file ipython_config.py not found
2013-07-22 20:19:37.264 [IPClusterStart] Attempting to load config file: ipcluster_config.py
2013-07-22 20:19:37.264 [IPClusterStart] Config file not found, skipping: ipcluster_config.py
2013-07-22 20:20:48.235 [IPClusterStop] Using existing profile dir: u'/mnt/data/projects/testproject/work/log/ipython'
2013-07-22 20:20:48.261 [IPClusterStop] Stopping cluster [pid=24791] with [signal=2]
Traceback (most recent call last):
  File "/usr/local/bin/bcbio_nextgen.py", line 5, in <module>
    pkg_resources.run_script('bcbio-nextgen==0.7.0a', 'bcbio_nextgen.py')
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 540, in run_script
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1455, in run_script
  File "/usr/local/lib/python2.7/dist-packages/bcbio_nextgen-0.7.0a-py2.7.egg/EGG-INFO/scripts/bcbio_nextgen.py", line 129, in <module>
    main(**kwargs)
  File "/usr/local/lib/python2.7/dist-packages/bcbio_nextgen-0.7.0a-py2.7.egg/EGG-INFO/scripts/bcbio_nextgen.py", line 69, in main
    fc_dir, run_info_yaml)
  File "/usr/local/lib/python2.7/dist-packages/bcbio_nextgen-0.7.0a-py2.7.egg/bcbio/pipeline/main.py", line 55, in run_main
    for xs in pipeline.run(config, config_file, run_parallel, parallel, dirs, pipeline_items):
  File "/usr/local/lib/python2.7/dist-packages/bcbio_nextgen-0.7.0a-py2.7.egg/bcbio/pipeline/main.py", line 227, in run
    multiplier=len(regions["analysis"])) as parallel:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/bcbio_nextgen-0.7.0a-py2.7.egg/bcbio/distributed/ipython.py", line 165, in global_parallel
    with _view_from_parallel(parallel, work_dir, config) as view:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/ipython_cluster_helper-0.1.8-py2.7.egg/cluster_helper/cluster.py", line 347, in cluster_view
    while not _is_up(url_file, num_jobs):
  File "/usr/local/lib/python2.7/dist-packages/ipython_cluster_helper-0.1.8-py2.7.egg/cluster_helper/cluster.py", line 300, in _is_up
    client = Client(url_file)
  File "/usr/local/lib/python2.7/dist-packages/IPython/parallel/client/client.py", line 485, in __init__
    self._connect(sshserver, ssh_kwargs, timeout)
  File "/usr/local/lib/python2.7/dist-packages/IPython/parallel/client/client.py", line 602, in _connect
    raise error.TimeoutError("Hub connection request timed out")
IPython.parallel.error.TimeoutError: Hub connection request timed out

lbeltrame commented 11 years ago

Cluster logs:

2013-07-22 20:19:37.274 [IPClusterStart] Starting ipcluster with [daemon=True]
2013-07-22 20:19:37.280 [IPClusterStart] Creating pid file: /mnt/data/projects/testproject/work/log/ipython/pid/ipcluster-d213f144-c998-4839-9dcd-e2793e12ab3d.pid
2013-07-22 20:19:37.281 [IPClusterStart] Starting Controller with cluster_helper.cluster.BcbioSGEControllerLauncher
2013-07-22 20:19:37.281 [IPClusterStart] Starting BcbioSGEControllerLauncher: ['qsub', u'./sge_controllerf76034dd-942b-4fbb-97fd-4f07ac252357']
2013-07-22 20:19:37.282 [IPClusterStart] adding PBS queue settings to batch script
2013-07-22 20:19:37.282 [IPClusterStart] adding job array settings to batch script
2013-07-22 20:19:37.282 [IPClusterStart] Writing batch script: ./sge_controllerf76034dd-942b-4fbb-97fd-4f07ac252357
2013-07-22 20:19:37.306 [IPClusterStart] Job submitted with job id: u'505'
2013-07-22 20:19:37.306 [IPClusterStart] Process 'qsub' started: u'505'
2013-07-22 20:19:47.314 [IPClusterStart] Starting 500 Engines with cluster_helper.cluster.BcbioSGEEngineSetLauncher
2013-07-22 20:19:47.314 [IPClusterStart] Starting BcbioSGEEngineSetLauncher: ['qsub', u'./sge_enginea904bc7e-e681-472c-b5b2-f83870c40a2e']
2013-07-22 20:19:47.314 [IPClusterStart] Writing batch script: ./sge_enginea904bc7e-e681-472c-b5b2-f83870c40a2e
2013-07-22 20:19:47.339 [IPClusterStart] Job submitted with job id: u'506'
2013-07-22 20:19:47.339 [IPClusterStart] Process 'qsub' started: u'506'
2013-07-22 20:20:48.261 [IPClusterStart] SIGINT received, stopping launchers...
2013-07-22 20:20:48.280 [IPClusterStart] Process 'qsub' stopped: {'output': u'einar has registered the job-array task 505.1 for deletion\n', 'job_id': u'505'}
2013-07-22 20:20:48.280 [IPClusterStart] ERROR | IPython cluster: stopping
2013-07-22 20:20:48.280 [IPClusterStart] Stopping Engines...

chapmanb commented 11 years ago

Luca; Thank you for the detailed log and tracebacks, that was a huge help in tracking down the issue. Could you try the latest version of ipython-cluster-helper and see if that resolves the issues on larger clusters? I increased the timeout both in the engines and in the local calls that check the status as the cluster spins up. If this gets it working cleanly, I'll roll a new version and update the bcbio-nextgen requirements.

lbeltrame commented 11 years ago

I'll update and rerun. I'll keep you posted.

lbeltrame commented 11 years ago

It looks to me like some clients don't get the JSON file for the cluster, even though they all mount the same FS:

2013-07-23 12:37:54.869 [IPEngineApp] Using existing profile dir: u'/mnt/data/projects/testproject/work/log/ipython'
2013-07-23 12:37:54.874 [IPEngineApp] WARNING | url_file u'/mnt/data/projects/testproject/work/log/ipython/security/ipcontroller-06dc62c8-3993-4232-a3c1-cd88b9552f49-engine.json' not found
2013-07-23 12:37:54.874 [IPEngineApp] WARNING | Waiting up to 960.0 seconds for it to arrive.

lbeltrame commented 11 years ago

I also checked I/O on the server which hosts the work dir, and it is minimal at this stage. Likewise for the server hosting the SSI image used by the nodes. The number of processes with active IDs doesn't exceed 300 even several minutes after the pipeline has started.

UPDATE: It reached 336, then timed out.

lbeltrame commented 11 years ago

Looks like there's some sort of race: the number of assigned IDs fluctuates between 300 and 450, and never reaches the maximum (660).

EDIT: I ran several (failed) tests and I'm fairly convinced this is some sort of race condition that gets exposed somehow. The increased timeouts in the newest cluster-helper don't seem to help; the cluster just sits there doing nothing while new IDs don't get registered (everything happens in the first 2 minutes, then the rest of the time is spent idle).

lbeltrame commented 11 years ago

I did yet more tests, moving the work directory to a RAM disk to ensure I/O wasn't the culprit, and the cluster processes with assigned IDs still top out around 330.

chapmanb commented 11 years ago

Luca; What error messages do you see? Timeouts? Anything in the ipython logs? The json warning you mention is only letting you know that the controller hasn't yet written the json file, but it shouldn't be a problem if the file eventually gets written.

I've been running this with no problems on a 500 node cluster, so there isn't any expected cutoff around 300. I'm not sure what you mean by a fluctuating number of assigned IDs. How are you measuring/counting them? In log/ipython/log/ipcontroller-GUID.log, are you seeing lines like:

[IPControllerApp] engine::Engine Connected:

When it times out, are you seeing any error messages in the scheduler log files for the ipengine job array? Any clues will help dig into this more.
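
One way to count those is to tally the Engine Connected lines directly from the controller log rather than grepping the SGE output (a sketch; the glob pattern follows the log/ipython/log/ipcontroller-GUID.log layout above):

# Sketch: count engine registrations in the ipcontroller log. The glob
# pattern follows the log/ipython/log/ipcontroller-GUID.log layout
# mentioned above.
import glob

for log_path in glob.glob("log/ipython/log/ipcontroller-*.log"):
    with open(log_path) as handle:
        connected = sum(1 for line in handle if "engine::Engine Connected" in line)
    print("%s: %d engines connected" % (log_path, connected))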

lbeltrame commented 11 years ago

Previously I was grepping for ID registrations in the SGE logs and counting the results, but I'll check the controller log, since it's more precise. By "fluctuating" I mean what I described earlier: the number of assigned IDs moves up and down between roughly 300 and 450 without ever reaching the maximum.

Some more details about the setup:

• Two nodes with disks: one is the SGE qmaster and also serves the PXE boot images for the other nodes (the setup came through a donation to my cash-strapped institution, so I have little possibility of intervention); the second holds the actual data (BWA indexes, FASTA files, results, etc.).
• 32 diskless nodes, which mount their main FS from the first node and also mount the storage area from the second node over NFS, which is where the analyses occur.
• All nodes have an SGE execd running.

I said around 300 because this phenomenon occurs only at those sizes; below that there are no issues. I thought at first that there were NFS bottlenecks, but I/O and network activity are basically absent while the nodes sit idle waiting for the JSON files to appear. I'll be testing this further tomorrow.

chapmanb commented 11 years ago

Luca; Thanks for all the details. My suggestion would be to try and identify the machines where the json file is not found by the engines, directly ssh into those and look at the NFS directory. Can you see the json there or are there other NFS issues? The json file clearly exists since other engines are finding it, but it's not being transmitted/found for some reason.

Is the Hub timeout exception raised from the same location as before, or somewhere else? Seeing the traceback might help give clues. This might be masking other timeouts, so we could raise it to expose other potential sources of errors.

Other ideas: can you start up two separate 300 core processes without issues or does this fail to start as well? What versions of ipython-dev and pyzmq do you have?

Thanks for all the patience with this. Debugging scale issues is definitely hard work.

lbeltrame commented 11 years ago

Further investigation: the ipcontroller log is full of

2013-07-24 09:29:43.647 [IPControllerApp] WARNING | iopub::IOPub message lacks parent: {'parent_header': {}, 'msg_type': 'status', 'msg_id': '1be77dbe-e820-49c8-b4d2-1caee5e8235f', 'content': {'execution_state': 'starting'}, 'header': {'date': datetime.datetime(2013, 7, 24, 9, 29, 43, 646067), 'username': 'einar', 'session': 'a8a5228e-ef03-49e3-bf45-10daa11836c1', 'msg_id': '1be77dbe-e820-49c8-b4d2-1caee5e8235f', 'msg_type': 'status'}, 'buffers': [], 'metadata': {}}

Then a lot of

2013-07-24 09:30:58.469 [IPControllerApp] registration::purging stalled registration: 0
2013-07-24 09:30:58.471 [IPControllerApp] registration::purging stalled registration: 1
2013-07-24 09:30:58.472 [IPControllerApp] registration::purging stalled registration: 2

And

2013-07-24 09:31:13.507 [IPControllerApp] heartbeat::ignoring new heart: '4d7547d8-da5e-46d1-9d52-8165e08b929e'

lbeltrame commented 11 years ago

I decided to start fresh because I wasn't sure the log was perfectly clean. Starting the pipeline at the step where it fails (variant calling), near the end (where it "looks" idle, doing nothing) there are no more connections (Engine Connected), but instead a lot of:

2013-07-24 09:44:51.293 [IPControllerApp] client::client '\x00ox\x14@' requested 'connection_request'
2013-07-24 09:44:51.294 [IPControllerApp] client::client ['\x00ox\x14@'] connected
2013-07-24 09:45:01.313 [IPControllerApp] client::client '\x00ox\x14A' requested 'connection_request'
2013-07-24 09:45:01.314 [IPControllerApp] client::client ['\x00ox\x14A'] connected
2013-07-24 09:45:11.333 [IPControllerApp] client::client '\x00ox\x14B' requested 'connection_request'
2013-07-24 09:45:11.333 [IPControllerApp] client::client ['\x00ox\x14B'] connected

There are still warnings as before (the "lacks parent" warning). Afterwards there are new connection requests, but they aren't acknowledged.

FYI this occurs with IPython master from this morning. libzmq 3.2.3, latest pyzmq from pip.

lbeltrame commented 11 years ago

I located one of the stalled jobs and checked the SGE output.

What it says

2013-07-24 09:41:00.875 [IPEngineApp] Using existing profile dir: u'/mnt/data/projects/testproject/work/log/ipython'
2013-07-24 09:41:00.880 [IPEngineApp] WARNING | url_file u'/mnt/data/projects/testproject/work/log/ipython/security/ipcontroller-9bfdaa31-3255-4eb5-b4dd-1b1b00ea2db7-engine.json' not found
2013-07-24 09:41:00.880 [IPEngineApp] WARNING | Waiting up to 960.0 seconds for it to arrive.
2013-07-24 09:51:00.951 [IPEngineApp] Loading url_file u'/mnt/data/projects/testproject/work/log/ipython/security/ipcontroller-9bfdaa31-3255-4eb5-b4dd-1b1b00ea2db7-engine.json'
2013-07-24 09:51:00.969 [IPEngineApp] Registering with controller at tcp://10.248.212.3:38141
2013-07-24 09:53:01.068 [IPEngineApp] CRITICAL | Registration timed out after 120.0 seconds

Then again in the logs I see the warnings ("lacks parent") followed by purges and "ignoring new heart" messages. Now I'm checking whether the errors are concentrated on a single node or spread over different nodes.

chapmanb commented 11 years ago

Luca; The two diagnostic messages I see are purging stalled registration from the controller and Registration timed out from the engine. It looks like everything is fine with NFS and the json file and you can at least do an initial connection to the controller to register, but that's not completing. Also, I see the iopub parent warnings as well so I don't think that is a root cause.

I checked in updates to increase the timeout for purging stalled registrations. We were already at 75s (versus a default of 10s) but I upped this to 3 minutes and set a matching timeout for engines.

With the updated settings, do you still see purged registrations in the log/ipython/log/ipcontroller-GUID.log? It might be worth posting the full controller log as a gist if you still have issues and I can compare with what I'm seeing here.
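
For anyone who wants to experiment with these values locally, they end up in the IPython profile configuration that ipython-cluster-helper generates; a hedged sketch of the kind of settings involved (trait names and values are assumptions inferred from the log messages above and may differ between IPython versions):

# Hedged sketch of profile-level settings related to the timeouts discussed
# here, as they might appear in the generated ipcontroller_config.py /
# ipengine_config.py. Trait names and values are assumptions inferred from
# the log messages above and may not match every IPython version.
c = get_config()

# Engine side: how long to wait for the controller's JSON file to appear
# ("Waiting up to ... seconds for it to arrive" in the engine logs).
c.IPEngineApp.wait_for_url_file = 960

# Engine side: how long to wait for the hub's registration reply
# ("Registration timed out after ... seconds" in the engine logs).
c.EngineFactory.timeout = 180

# Controller side: heartbeat period in ms ("monitor the heartbeat signal
# from the hub every ... ms" in the engine logs).
c.HeartMonitor.period = 180000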

lbeltrame commented 11 years ago

I'll update and report back.

lbeltrame commented 11 years ago

Interesting: right now I can grep only two connections maximum, and again I get a hub connection timeout. I really wonder what is going on here. I'll be testing things further.

EDIT: it's quite odd: the controller log registers just two connected engines, while the output from SGE reports more ID registrations...

I'll be pasting the controller log as a gist as well.

lbeltrame commented 11 years ago

Done: https://gist.github.com/lbeltrame/6070334 Note: I don't see purged registrations anymore, but I still get a hub connection timeout.

chapmanb commented 11 years ago

Luca; Is that gist truncated? It gets right to the interesting part where engines are connecting, then stops. If it fails at that point I'd at least expect to see a terminating message at the end when it gets shut down. Further, can you paste the exact error message you're seeing from bcbio-nextgen? I'm not clear exactly when/how it's timing out.

In general this looks equivalent to what I see, but slower. In my tests with 500 cores it takes 3 minutes from start-up to a fully connected controller and engines, and 1 minute before engines start registering. It looks like you're taking 2+ minutes to begin registering. It might be a matter of turning up some of the timeouts, but I'd need more information on the timeout to tweak them further.

lbeltrame commented 11 years ago

On Wednesday, 24 July 2013 08:12:47, Brad Chapman wrote:

can you paste the exact error message you're seeing from bcbio-nextgen? I'm not clear exactly when/how it's timing out.

I'll be pasting more (failed) runs soon; I'm doing tests by running two 300-core runs in parallel (the cluster by itself has enough capacity to handle this).

I'm not sure why the file got truncated: I'll be checking other results just in case.

lbeltrame commented 11 years ago

Is there any way to force the controller onto specific hosts (e.g. by using QUEUENAME@host)? I want to see whether I still get the timeouts when it runs on the diskful machines as opposed to the diskless ones. Otherwise I can just hack it into the source of the helper to check whether this makes a difference.

EDIT: I did, and it makes no difference.

lbeltrame commented 11 years ago

Full error: https://gist.github.com/lbeltrame/6071659

Full log: (yes, it is truncated like before, with no shutdown notice) https://gist.github.com/lbeltrame/6071711

chapmanb commented 11 years ago

Luca; Thanks, the traceback is very helpful. It looks like the timeouts have moved back to the point where we try to contact the controller. I pushed another fix to ipython-cluster-helper on GitHub to avoid dying here. Could you try with these updates and see if it can progress further? I'm not sure why the running process is having so much trouble contacting the controller, but hopefully this cleans it up. Let me know if this helps and, if we get stuck again, where the next errors are. Thanks for all the back and forth.

lbeltrame commented 11 years ago

Currently running, but I'm seeing purged registrations again (got up to ~5 Engine::Connected). I am puzzled as to why the size (> 330 or so) matters, because it shouldn't change anything. There aren't network issues (I think it's a Gb network), and I'm not seeing any I/O issues when the pipeline is running... it looks pretty much like a "heisenbug" to me.

I'll keep this running till it works (or errors out) then take a look.

lbeltrame commented 11 years ago

With these changes, it just hangs:

2013-07-25 08:54:51.972 [IPClusterStart] Config changed:
2013-07-25 08:54:51.973 [IPClusterStart] {'SGELauncher': {'queue': 'main.q'}, 'BcbioSGEEngineSetLauncher': {'cores': 1, 'pename': u'orte', 'resources': u''}, 'IPClusterEngines': {'early_shutdown': 240}, 'Application': {'log_level': 10}, 'ProfileDir': {'location': u'/mnt/data/projects/testproject/work/log/ipython'}, 'BaseParallelApplication': {'log_to_file': True, 'cluster_id': u'cff92bed-1515-41fb-a40e-a50e1153cc12'}, 'IPClusterStart': {'delay': 10, 'controller_launcher_class': u'cluster_helper.cluster.BcbioSGEControllerLauncher', 'daemonize': True, 'engine_launcher_class': u'cluster_helper.cluster.BcbioSGEEngineSetLauncher', 'n': 600}}
2013-07-25 08:54:52.056 [IPClusterStart] Using existing profile dir: u'/mnt/data/projects/testproject/work/log/ipython'
2013-07-25 08:54:52.056 [IPClusterStart] Searching path [u'/mnt/data/projects/testproject/work', u'/mnt/data/projects/testproject/work/log/ipython'] for config files
2013-07-25 08:54:52.056 [IPClusterStart] Attempting to load config file: ipython_config.py
2013-07-25 08:54:52.057 [IPClusterStart] Config file ipython_config.py not found
2013-07-25 08:54:52.057 [IPClusterStart] Attempting to load config file: ipcluster_cff92bed_1515_41fb_a40e_a50e1153cc12_config.py
2013-07-25 08:54:52.058 [IPClusterStart] Config file not found, skipping: ipcontroller_config.py
2013-07-25 08:54:52.059 [IPClusterStart] Attempting to load config file: ipcluster_cff92bed_1515_41fb_a40e_a50e1153cc12_config.py
2013-07-25 08:54:52.060 [IPClusterStart] Config file not found, skipping: ipengine_config.py
2013-07-25 08:54:52.060 [IPClusterStart] Attempting to load config file: ipcluster_cff92bed_1515_41fb_a40e_a50e1153cc12_config.py
2013-07-25 08:54:52.061 [IPClusterStart] Config file not found, skipping: ipcluster_config.py
2013-07-25 09:40:25.085 [IPClusterStop] Using existing profile dir: u'/mnt/data/projects/testproject/work/log/ipython'
2013-07-25 09:40:25.116 [IPClusterStop] Stopping cluster [pid=4724] with [signal=2]
Traceback (most recent call last):
  File "/usr/local/bin/bcbio_nextgen.py", line 5, in <module>
    pkg_resources.run_script('bcbio-nextgen==0.7.0a', 'bcbio_nextgen.py')
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 540, in run_script
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1455, in run_script
  File "/usr/local/lib/python2.7/dist-packages/bcbio_nextgen-0.7.0a-py2.7.egg/EGG-INFO/scripts/bcbio_nextgen.py", line 129, in <module>
    main(**kwargs)
  File "/usr/local/lib/python2.7/dist-packages/bcbio_nextgen-0.7.0a-py2.7.egg/EGG-INFO/scripts/bcbio_nextgen.py", line 69, in main
    fc_dir, run_info_yaml)
  File "/usr/local/lib/python2.7/dist-packages/bcbio_nextgen-0.7.0a-py2.7.egg/bcbio/pipeline/main.py", line 55, in run_main
    for xs in pipeline.run(config, config_file, run_parallel, parallel, dirs, pipeline_items):
  File "/usr/local/lib/python2.7/dist-packages/bcbio_nextgen-0.7.0a-py2.7.egg/bcbio/pipeline/main.py", line 227, in run
    multiplier=len(regions["analysis"])) as parallel:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/bcbio_nextgen-0.7.0a-py2.7.egg/bcbio/distributed/ipython.py", line 165, in global_parallel
    with _view_from_parallel(parallel, work_dir, config) as view:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "build/bdist.linux-x86_64/egg/cluster_helper/cluster.py", line 356, in cluster_view
IOError: Cluster startup timed out.

Log: https://gist.github.com/lbeltrame/6077776

chapmanb commented 11 years ago

Luca; It looks like it is still working when it runs into the purge timeouts, so I pushed another increase in those values to ipython-cluster-helper. Can you try again with these updates to see how much further we get? I'm not sure why the process is so slow to register. Some thoughts/questions:

Hope this helps.

lbeltrame commented 11 years ago

On Thursday, 25 July 2013 03:01:55, Brad Chapman wrote:

  • How long does it take jobs in the array to get scheduled on SGE? Is

It used to take a long time, but now it's almost instant (I moved the spool directories of the nodes from NFS to ramdisks); I would wager it takes less than a minute to move from "qw" to "r" for 600 jobs.

I also ran 790-core MPI jobs over there without issues.

Unfortunately the other cluster I have access to is too small (96 cores) to reproduce this problem.

long unaccounted for lags in registration. My guess would be some kind of networking issue that is causing routing problems. I'm not super good at

I will investigate, hoping that it is not the cause: as I said, this is a donated resource with no sysadmin support (I'm doing this when my duties as a bioinformatician allow for it...).

Hope this helps.

If all else fails, I'll try to convince the donors to allow me to put the OS on disks and try again.

lbeltrame commented 11 years ago

Interestingly, it's the controller that does not respond, rather than engines being slow. I ran some additional processes by hand and they can't seem to connect (but "telnet" reaches the port and connects). I also noticed the controller process pegging CPU at 100%. I'm going to attach gdb to it and have a look.

EDIT: It's stuck in the zmq polling. Not sure if that may mean anything. EDIT2: Some nodes fail to register after 5 minutes, probably due to this unresponsiveness from the controller process.

chapmanb commented 11 years ago

Luca; Thanks for the additional information. Could you try the latest update to ipython-cluster-helper to see if that improves things? This staggers launching of jobs in an array by a half second to try and avoid overloading the controller. My hope is that this will let nodes register in regular intervals without overloading it with a lot of simultaneous requests.

lbeltrame commented 11 years ago

I'm running a test from scratch right now, I'll let you know ASAP.

lbeltrame commented 11 years ago

It doesn't seem to improve the situation. After a while (not immediately) CPU on the controller goes to 100% again, and registrations are purged, like before. Manually launching an instance from a node (via the command line) doesn't even generate a registration request in the controller.

lbeltrame commented 11 years ago

This is an example log from a registration

2013-07-25 16:49:27.273 [IPEngineApp] Using existing profile dir: u'/mnt/data/projects/testproject/work/log/ipython'
2013-07-25 16:49:27.276 [IPEngineApp] Loading url_file u'/mnt/data/projects/testproject/work/log/ipython/security/ipcontroller-2c23b966-fc9a-4020-b1f4-5866825693ba-engine.json'
2013-07-25 16:49:27.292 [IPEngineApp] Registering with controller at tcp://10.248.212.205:46444
2013-07-25 16:49:27.379 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 72010 ms.
2013-07-25 16:49:27.383 [IPEngineApp] Using existing profile dir: u'/mnt/data/projects/testproject/work/log/ipython'
2013-07-25 16:49:27.385 [IPEngineApp] Completed registration with id 124
2013-07-25 16:53:03.480 [IPEngineApp] WARNING | No heartbeat in the last 72010 ms (1 time(s) in a row).
2013-07-25 16:54:15.491 [IPEngineApp] WARNING | No heartbeat in the last 72010 ms (2 time(s) in a row).
2013-07-25 16:55:27.429 [IPEngineApp] WARNING | No heartbeat in the last 72010 ms (3 time(s) in a row).
2013-07-25 16:56:39.511 [IPEngineApp] WARNING | No heartbeat in the last 72010 ms (4 time(s) in a row).

lbeltrame commented 11 years ago

gdb trace:

(gdb) py-list
  92                timeout = -1
  93            
  94            timeout = int(timeout)
  95            if timeout < 0:
  96                timeout = -1
 >97            return zmq_poll(list(self.sockets.items()), timeout=timeout)
  98    

Called in /usr/local/lib/python2.7/dist-packages/zmq/sugar/poll.py.

chapmanb commented 11 years ago

Luca; Bummer, it looks like the controller becomes unresponsive for minutes at a time. The polling tracepoint indicates to me it's waiting on requests but is overloaded with work.

Does creating a longer stagger between jobs help? If you adjust 0.5 to 1 or 2, do you get more registrations finished? https://github.com/roryk/ipython-cluster-helper/blob/master/cluster_helper/cluster.py#L88

If that doesn't help it might be worth a problem report to the ipython team to see if they have any other tuning suggestions:

https://github.com/ipython/ipython/issues?state=open

Min is super helpful and may have other ideas to investigate.
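
The idea behind the stagger is simply to spread engine start-up over time so registration requests don't all hit the controller at once; an illustrative sketch of the pattern (not the actual cluster_helper implementation):

# Illustrative sketch of staggered engine start-up, not the actual
# cluster_helper code: each task in the SGE job array sleeps in proportion
# to its task id before launching its engine, so registrations reach the
# controller spread out instead of simultaneously.
import os
import subprocess
import time

stagger = 2.0  # seconds between launches; the value being tuned above
task_id = int(os.environ.get("SGE_TASK_ID", "1"))
time.sleep(stagger * (task_id - 1))

# placeholder for however the batch script actually starts the engine
subprocess.call(["ipengine", "--profile-dir=log/ipython"])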

lbeltrame commented 11 years ago

I'm trying the following:

lbeltrame commented 11 years ago

We're getting there. It is still running, but far more registrations have occurred than before. So far I used a stagger of 2. I'll try lowering it if things time out.

EDIT: spoke too soon. After about 250 registrations, CPU spiked again (notice that it still manages to register a few new connections). I'll check with a newer pyzmq.

lbeltrame commented 11 years ago

Over 250 connected engines seems to really trip up the controller. I'll file a bug with IPython.

lbeltrame commented 11 years ago

In the end it was SGE and its "exotic" configuration that caused this issue (see the linked issue from IPython). However, what prevented me from finding the cause immediately is that the ipcontroller SGE script saves its log in the node's home dir (the default for submitted jobs) and not where bcbio-nextgen is running. Only there could I see the error "too many open files".

Perhaps it would be nice to have the controller set the output dir (-o in SGE terms) to the same dir where the pipeline is running.

chapmanb commented 11 years ago

Luca; Glad you got this working and thanks for all the hard work catching the problem. I pushed a fix to report ipcontroller logs in the working directory: this was missing the -cwd flag.

For ulimit, how did you need to tweak this specifically for SGE in the end? I'd like to put together documentation on setting ulimit since I expect this to be a common scaling up problem. Thanks again for all the patience figuring this out.
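
For reference, the change amounts to adding the -cwd directive to the generated controller submission script, so SGE writes the job's output next to the pipeline rather than in the user's home directory; a hedged sketch of such a template (illustrative, not the exact one ipython-cluster-helper uses):

# Hedged sketch of an SGE controller submission template with the -cwd
# directive, written as the kind of Python template string a launcher would
# render. This is illustrative, not the exact ipython-cluster-helper template.
# "#$ -cwd" makes SGE run the job, and write its -o/-e logs, in the
# submission (work) directory instead of the user's home directory.
SGE_CONTROLLER_TEMPLATE = """#!/bin/sh
#$ -V
#$ -cwd
#$ -S /bin/sh
#$ -N ipcontroller
ipcontroller --profile-dir="{profile_dir}" --cluster-id="{cluster_id}"
"""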

lbeltrame commented 11 years ago

SGE needs configuration at the qmaster level: after invoking qconf -mconf from a host with admin privileges, one needs to find the execd_params line and put S_DESCRIPTORS= there, like this (snippet from my own cluster):

execd_params                 S_DESCRIPTORS=8192

One could also use H_DESCRIPTORS (S_ is for the soft limit, H_ for the hard limit), but it is enough to set one (the other will be set to the same value).

Reference: http://gridengine.org/pipermail/users/2012-November/005226.html
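
To confirm the new limit is actually picked up, a small test job submitted through SGE can report the descriptor limits it sees (a sketch; compare the output with the configured S_DESCRIPTORS value):

# Sketch of a quick check that the execd_params change took effect: submit
# this through SGE and compare the reported limits with the configured
# S_DESCRIPTORS value (8192 in the snippet above).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open file descriptor limit: soft=%d hard=%d" % (soft, hard))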