Closed dwaggott closed 9 years ago
Daryl;
This isn't expected. It should keep sending jobs independent of the state of all the engines. The controller just pushes out jobs to any waiting engines and there is no logic in there about how many are yet to spin up. An engine should be able to run multiple jobs unless it failed or got killed. I'm really confused as to what is going on with this. Any useful information would be either in the IPython log files (log/ipython/log) or in the engine log files, which should be named like bcbio-ipengine* and bcbio-ipcontroller*. Does it behave correctly if you run in multicore mode on a single machine, or with fewer overall cores? Sorry to not have a better idea, but hope this helps.
Oh...
2015-01-28 21:53:11.577 [IPEngineApp] Using existing profile dir: u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'
2015-01-28 21:53:11.588 [IPEngineApp] Loading url_file u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython/security/ipcontroller-dc297844-7484-4263-b837-f63c7cd3e377-engine.json'
2015-01-28 21:53:11.602 [IPEngineApp] Registering with controller at tcp://10.210.47.127:55758
2015-01-28 21:53:11.699 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 60010 ms.
2015-01-28 21:53:11.702 [IPEngineApp] Using existing profile dir: u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'
2015-01-28 21:53:11.704 [IPEngineApp] Completed registration with id 6
slurmstepd: *** JOB 1437888 CANCELLED AT 2015-01-29T00:34:51 *** on sh-3-17
slurmstepd: Exceeded step memory limit at some point. Step may have been partially swapped out to disk.
slurmstepd: Exceeded job memory limit at some point. Job may have been partially swapped out to disk.
slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_261851/job_1437888/step_batch: Device or resource busy
2015-01-28 21:53:11.659 [IPEngineApp] Using existing profile dir: u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'
2015-01-28 21:53:11.670 [IPEngineApp] Loading url_file u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython/security/ipcontroller-dc297844-7484-4263-b837-f63c7cd3e377-engine.json'
2015-01-28 21:53:11.684 [IPEngineApp] Registering with controller at tcp://10.210.47.127:55758
2015-01-28 21:53:11.780 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 60010 ms.
2015-01-28 21:53:11.783 [IPEngineApp] Using existing profile dir: u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'
2015-01-28 21:53:11.785 [IPEngineApp] Completed registration with id 7
slurmstepd: *** JOB 1437889 CANCELLED AT 2015-01-29T00:34:51 *** on sh-3-18
slurmstepd: Exceeded step memory limit at some point. oom-killer likely killed a process.
slurmstepd: Exceeded job memory limit at some point. oom-killer likely killed a process.
slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_261851/job_1437889/step_batch: Device or resource busy
It only looks like 11GB max RAM consumption was reported. That should be fine, so maybe it's a SLURM reporting issue.
sacct -o MaxRSSNode -j 1437888
MaxRSS
----------
11541180K
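As a side note, sacct reports MaxRSS with a unit suffix; a small sketch (not bcbio code) for converting that string into GB so it can be compared against the job's memory limit:

```python
# Hedged sketch: convert sacct's MaxRSS strings (e.g. '11541180K') into GB
# to sanity-check reported peak memory against the SLURM job limit.
def maxrss_to_gb(maxrss):
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    if maxrss and maxrss[-1] in units:
        return int(maxrss[:-1]) * units[maxrss[-1]] / float(1024 ** 3)
    return int(maxrss) / float(1024 ** 3)

print(round(maxrss_to_gb("11541180K"), 1))  # → 11.0
```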
Daryl;
Sorry about the issue. A memory spike that SLURM cannot detect is unexpected, but it looks like something triggers the OOM killer on the system. Can you tell from the failed bcbio-ipengine* SLURM log files where in the process the memory consumption occurred? This would help narrow down where we are seeing issues. If this is during alignment, you could also try increasing the memory available per core in your bcbio_system.yaml to give it more room to work with. Hope this helps.
galaxy/bcbio_system.yaml doesn't have a default memory specified. Based on the log I'm assuming it's 2g / cpu. Does that sound right? I'll push to 3g.
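For what it's worth, a hedged sketch of what bumping the per-core memory might look like in bcbio_system.yaml (the key names follow the usual resources layout but should be checked against your bcbio version's docs):

```yaml
# Assumed bcbio_system.yaml fragment -- verify key names against your version.
resources:
  default:
    memory: 3G   # per-core memory; raised from the assumed 2G default
    cores: 16
```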
An update. Rerunning the pipeline with the original memory request consistently causes these idle nodes. The logs stop reporting shortly after initialization and are only informative after the jobs get killed due to timelimits or manual intervention.
Is there a way to monitor the engine processes for these types of shutdowns?
Daryl;
The best way to monitor these is through sacct and the bcbio-ipengine* logs from SLURM. Unfortunately IPython does not get back any information on the process being killed, so it can't tell whether it simply can't reach an engine because of networking problems or whether the engine died. IPython will eventually time out and kill the entire run, but will try to keep going as long as it has any working engines. This makes it resilient to one-off nodes dying in the general case.
Does increasing the memory in bcbio_system.yaml help to resolve the underlying issue?
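To automate that check, here's a hypothetical helper (not part of bcbio) that greps the bcbio-ipengine* SLURM logs for kill markers like the CANCELLED/oom-killer lines shown earlier:

```python
import glob
import re

# Hypothetical helper (not bcbio code): scan the SLURM engine logs in the
# work directory for lines that explain why an engine died.
KILL_MARKERS = re.compile(r"CANCELLED|oom-killer|Exceeded (?:job|step) memory")

def find_engine_failures(pattern="bcbio-ipengine*"):
    failures = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            hits = [line.strip() for line in handle if KILL_MARKERS.search(line)]
        if hits:
            failures[path] = hits
    return failures
```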
I increased the memory and the jobs are still sitting empty. This time I don't see the memory error in the logs. I'll upload the logs, maybe you can see something.
I hung out on a node and watched. Looks like it's the alignment step that is dying. I'm using local scratch, based on an environment variable, for tmp (code updated to development). Nothing looks to be in the tmp directory; might that indicate an issue? If so, would a failure to create tmp directories be reflected in the logs?
Leaving tmp as default. I can see files being generated but still all the nodes go dormant.
# example of seemingly complete alignment that is still stuck in the tx directory.
├── [4.0K Jan 23 0:36] 3_09
│ └── [4.0K Feb 3 12:55] tx
│ ├── [4.0K Feb 3 13:28] tmp0fzrhU
│ │ ├── [ 24M Feb 3 13:28] 15_2015-01-22_project_n143-fb-sort-sr.bam
│ │ └── [3.6M Feb 3 13:28] 15_2015-01-22_project_n143-fb-sort-sr.bam.bai
│ ├── [4.0K Feb 3 13:28] tmptBchFW
│ │ ├── [5.0G Feb 3 13:30] 15_2015-01-22_project_n143-fb-sort.bam
│ │ └── [6.6M Feb 3 13:30] 15_2015-01-22_project_n143-fb-sort.bam.bai
│ └── [4.0K Feb 3 13:28] tmpucQjYH
│ ├── [ 75M Feb 3 13:28] 15_2015-01-22_project_n143-fb-sort-disc.bam
│ └── [5.0M Feb 3 13:28] 15_2015-01-22_project_n143-fb-sort-disc.bam.bai
# example of nearly complete alignment.
├── [4.0K Jan 23 3:24] 3_49
│ └── [4.0K Feb 3 12:55] tx
│ ├── [4.0K Feb 3 12:55] tmpKmAbvc
│ ├── [4.0K Feb 3 13:35] tmpUm_crr
│ │ ├── [ 22M Feb 3 13:35] 45_2015-01-22_project_n143-fb-sort-sr.bam
│ │ └── [3.5M Feb 3 13:35] 45_2015-01-22_project_n143-fb-sort-sr.bam.bai
│ └── [4.0K Feb 3 13:35] tmpZUF01z
│ ├── [ 89M Feb 3 13:35] 45_2015-01-22_project_n143-fb-sort-disc.bam
│ └── [5.1M Feb 3 13:35] 45_2015-01-22_project_n143-fb-sort-disc.bam.bai
# most samples look like this
├── [4.0K Jan 23 1:34] 3_11
│ ├── [6.6G Jan 23 1:34] 16_2015-01-22_project_n143-fb-sort.bam
│ ├── [6.7M Jan 23 1:34] 16_2015-01-22_project_n143-fb-sort.bam.bai
│ ├── [ 94M Jan 23 1:31] 16_2015-01-22_project_n143-fb-sort-disc.bam
│ ├── [5.1M Jan 23 1:31] 16_2015-01-22_project_n143-fb-sort-disc.bam.bai
│ ├── [ 28M Jan 23 1:31] 16_2015-01-22_project_n143-fb-sort-sr.bam
│ ├── [3.7M Jan 23 1:31] 16_2015-01-22_project_n143-fb-sort-sr.bam.bai
│ └── [4.0K Jan 23 1:34] tx
# example of dormant node via top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13437 dwaggott 20 0 15436 1708 944 R 0.3 0.0 0:00.14 top
41568 dwaggott 20 0 103m 1224 1016 S 0.0 0.0 0:00.00 slurm_script
41577 dwaggott 20 0 1044m 107m 16m S 0.0 0.2 0:02.30 python
41880 dwaggott 20 0 0 0 0 Z 0.0 0.0 0:00.00 bash <defunct>
43235 dwaggott 20 0 137m 2316 932 S 0.0 0.0 0:00.70 sshd
43236 dwaggott 20 0 109m 3924 1564 S 0.0 0.0 0:00.28 bash
43311 dwaggott 20 0 57704 1048 404 S 0.0 0.0 0:00.00 ssh-agent
Daryl;
Sorry about the continued problems. It seems like the process is still being killed/dying, but it's hard to tell exactly what is happening from this information. The SLURM log files, which should be produced in the work directory as bcbio-ipengine*, are the first place to look for more details.
I dug around in the different tools and it looks like @lomereiter has been working on eliminating some memory leaks in sambamba (lomereiter/sambamba#116). I grabbed the latest development version, so if you update the tools with:
bcbio_nextgen.py upgrade --tools
maybe this will avoid the underlying issue and keep things running. Sorry about all the problems and please let us know if this doesn't fix it.
What a pleasure it is getting code to work on other people's systems! Appreciate the help.
The out logs are empty and the err logs stop reporting at the same time the nodes go idle. Not sure if the few lines that show up after scancel are informative.
2015-02-03 12:51:49.512 [IPEngineApp] Using existing profile dir: u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'
2015-02-03 12:51:49.521 [IPEngineApp] Loading url_file u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython/security/ipcontroller-381cb9e8-94
2015-02-03 12:51:49.526 [IPEngineApp] Registering with controller at tcp://10.210.47.197:50393
2015-02-03 12:51:49.597 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 60010 ms.
2015-02-03 12:51:49.600 [IPEngineApp] Using existing profile dir: u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'
2015-02-03 12:51:49.602 [IPEngineApp] Completed registration with id 4
After scancel:
slurmstepd: *** JOB 1479012 CANCELLED AT 2015-02-04T06:33:25 *** on sh-2-3
slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_261851/job_1479012/step_batch: Device or resource busy
Looks like the upgrade had an error with cmake.
Running the failed line:
$ export CC=${CC:-`which gcc`} && export CXX=${CXX:-`which g++`} && export PERL5LIB=/home/dwaggott/apps/lib/perl5:${PERL5LIB} && /home/dwaggott/apps/bin/brew install --env=inherit cmake
==> Downloading http://www.cmake.org/files/v3.1/cmake-3.1.1.tar.gz
Already downloaded: /home/dwaggott/.cache/Homebrew/cmake-3.1.1.tar.gz
==> Downloading https://github.com/Kitware/CMake/commit/c5d9a828.diff
Already downloaded: /home/dwaggott/.cache/Homebrew/cmake--patch-61b15b638c1409233f36e6e3383b98cab514c3bb.diff
==> Patching
patching file Modules/FindOpenSSL.cmake
==> Downloading https://pypi.python.org/packages/source/P/Pygments/Pygments-2.0.2.tar.gz
######################################################################## 100.0%
==> python -c import setuptools... --no-user-cfg install --prefix=/tmp/cmake-XUerJG/cmake-3.1.1/sphinx --single-version-externally-managed --record=installed.txt
or: -c --help [cmd1 cmd2 ...]
or: -c --help-commands
or: -c cmd --help
error: option --no-user-cfg not recognized
READ THIS: https://github.com/Homebrew/linuxbrew/blob/master/share/doc/homebrew/Troubleshooting.md#troubleshooting
These open issues may also help:
cmake reports: CMake Error: Could not find CMAKE_ROOT !!! (https://github.com/Homebrew/homebrew/issues/35888)
cmake builds fail on CLT-only with --env=std (https://github.com/Homebrew/homebrew/issues/29101)
Installing cmake without docs avoids the Sphinx problem and allows the bcbio upgrade to complete.
/home/dwaggott/apps/bin/brew install -v --without-docs --env=inherit cmake
Daryl; Sorry about the re-build issues. It looks like there was a new revision of cmake and we weren't handling it right in CloudBioLinux. I updated it to fix this for future runs. I hope the new sambamba will resolve the underlying issue. I'm totally lost on why you're not getting more information about these failures, but will keep my fingers crossed that this is the underlying issue.
Using the latest update seems to have got me over the alignment hump. Thanks.
I've run into a similar problem at another point: the slurm jobs still seem to be running, but the nodes are essentially idle and logging has stopped. Checking the nodes I can see python and slurm_script processes running under my name, but nothing happening.
Attached are some selected sections from the submission err.log.
### notable errors
[2015-02-12T17:25Z] sh-1-7.local: bedtools genomecov: 9 : 6_2015-01-22_project_n143-fb-sort.bam
[2015-02-12T17:25Z] sh-1-7.local: sambamba-view: BAM index file (.bai) must be provided
[2015-02-12T17:25Z] sh-1-7.local: Failed to open BAM file stdin
[2015-02-12T17:25Z] sh-1-7.local: Uncaught exception occurred
Traceback (most recent call last):
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
_do_run(cmd, checks, log_stdout)
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /share/PI/euan/apps/bin/sambamba view -F 'mapping_quality > 0' -L /scratch/PI/euan/projects/athletes/bcbio/align/1_08/6_2015-01-22_project_n143-fb-sort-callable-split/tx/tmpprYtwo/6_2015-01-22_project_n143-fb-sort-9-callable-coverageregi
sambamba-view: BAM index file (.bai) must be provided
Failed to open BAM file stdin
' returned non-zero exit status 1
[2015-02-12T17:25Z] sh-1-7.local: Index BAM file: 6_2015-01-22_project_n143-fb-sort.bam
[2015-02-12T17:29Z] sh-1-7.local: Unexpected error
Traceback (most recent call last):
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 38, in _setup_logging
yield config
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 91, in postprocess_alignment
return ipython.zip_args(apply(sample.postprocess_alignment, *args))
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/sample.py", line 149, in postprocess_alignment
callable.block_regions(data["work_bam"], data["sam_ref"], data["config"])
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/bam/callable.py", line 298, in block_regions
callable_bed = parallel_callable_loci(in_bam, ref_file, config)
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/bam/callable.py", line 46, in parallel_callable_loci
"callable_bed", ["config"])[0]
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/split.py", line 59, in parallel_split_combine
split_output = parallel_fn(parallel_name, split_args)
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
return run_multicore(fn, items, config, parallel=parallel)
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 84, in run_multicore
for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 660, in __call__
self.retrieve()
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 543, in retrieve
raise exception_type(report)
TypeError: __init__() takes at least 3 arguments (2 given)
[2015-02-12T19:31Z] sh-2-33.local: bedtools genomecov: 14 : 45_2015-01-22_project_n143-fb-sort.bam
[2015-02-12T19:31Z] sh-2-33.local: sambamba-view: BAM index file (.bai) must be provided
[2015-02-12T19:31Z] sh-2-33.local: Failed to open BAM file stdin
[2015-02-12T19:31Z] sh-2-33.local: Uncaught exception occurred
Traceback (most recent call last):
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
_do_run(cmd, checks, log_stdout)
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /share/PI/euan/apps/bin/sambamba view -F 'mapping_quality > 0' -L /scratch/PI/euan/projects/athletes/bcbio/align/3_49/45_2015-01-22_project_n143-fb-sort-callable-split/tx/tmp_dsCW9/45_2015-01-22_project_n143-fb-sort-14-callable-coverager
sambamba-view: BAM index file (.bai) must be provided
Failed to open BAM file stdin
' returned non-zero exit status 1
[2015-02-12T19:41Z] sh-2-33.local: Unexpected error
Traceback (most recent call last):
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 38, in _setup_logging
yield config
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 91, in postprocess_alignment
return ipython.zip_args(apply(sample.postprocess_alignment, *args))
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/sample.py", line 149, in postprocess_alignment
callable.block_regions(data["work_bam"], data["sam_ref"], data["config"])
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/bam/callable.py", line 298, in block_regions
callable_bed = parallel_callable_loci(in_bam, ref_file, config)
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/bam/callable.py", line 46, in parallel_callable_loci
"callable_bed", ["config"])[0]
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/split.py", line 59, in parallel_split_combine
split_output = parallel_fn(parallel_name, split_args)
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
return run_multicore(fn, items, config, parallel=parallel)
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 84, in run_multicore
for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 660, in __call__
self.retrieve()
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 543, in retrieve
raise exception_type(report)
TypeError: __init__() takes at least 3 arguments (2 given)
### end of log
2015-02-12 16:54:38.261 [IPClusterStart] Attempting to load config file: ipcluster_dba89f61_7858_41e1_bb65_46eb0d241eff_config.py
2015-02-12 16:54:38.262 [IPClusterStart] Loaded config file: /scratch/PI/euan/projects/athletes/bcbio/log/ipython/ipcluster_config.py
[2015-02-13T00:57Z] sh-1-6.local: Timing: alignment post-processing
[2015-02-13T00:57Z] sh-1-6.local: ipython: piped_bamprep
/home/dwaggott/projects1/athletes/bcbio/call_variants.sh: line 5: 12961 Killed bcbio_nextgen.py --timeout 10000 --retries 2 project_n143-fb/config/project_n143-fb.yaml -t ipython -n 160 -q normal -s slurm -r "timelimit=7-00:00:00;qos=long"
slurmstepd: Exceeded job memory limit at some point. Job may have been partially swapped out to disk.
srun: error: sh-1-6: task 0: Exited with exit code 137
slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_261851/job_1555498/step_0: Device or resource busy
slurmstepd: Exceeded job memory limit at some point. Job may have been partially swapped out to disk.
Daryl;
It looks like you might not have the latest version of sambamba (0.5.1). The syntax for indexing changed, and there are a lot of complaints about missing bai files, which indicates there might be earlier issues where sambamba failed to index. Could you try updating the tools with:
bcbio_nextgen.py upgrade --tools
to see if that resolves the issue? Hope this also works to avoid the memory errors you're seeing from SLURM. Thanks much for the patience and help debugging.
Getting closer. Any idea what step I should try bumping the memory for?
[2015-02-13T19:27Z] sh-5-8.local: multiprocessing: combine_sample_regions
[2015-02-13T19:27Z] sh-5-8.local: Identified 267 parallel analysis blocks
Block sizes:
min: 383
5%: 27985.8
25%: 4120160.0
median: 15509980.0
75%: 15511678.5
95%: 15537540.2
99%: 19358684.44
max: 43304603
Between block sizes:
min: 123
5%: 204.15
25%: 862.75
median: 1716.5
75%: 3436.25
95%: 9433.25
99%: 17274.35
max: 22003
[2015-02-13T19:27Z] sh-5-8.local: Timing: coverage
[2015-02-13T19:27Z] sh-5-8.local: Resource requests: freebayes, gatk, picard; memory: 2.00, 3.50, 3.50; cores: 1, 1, 1
[2015-02-13T19:27Z] sh-5-8.local: Configuring 160 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
2015-02-13 11:27:50.016 [IPClusterStart] Config changed:
2015-02-13 11:27:50.016 [IPClusterStart] {'IPClusterEngines': {'early_shutdown': 240}, 'Application': {'log_level': 10}, 'BcbioSLURMControllerLauncher': {'account': u'euan', 'mem': '3.50', 'timelimit': '07-00:00:00', 'tag': '', 'cores': 1, 'resources': 'qos=long'}, 'ProfileDir': {'l
2015-02-13 11:27:50.023 [IPClusterStart] Using existing profile dir: u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'
2015-02-13 11:27:50.023 [IPClusterStart] Searching path [u'/scratch/PI/euan/projects/athletes/bcbio', u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'] for config files
2015-02-13 11:27:50.023 [IPClusterStart] Attempting to load config file: ipython_config.py
2015-02-13 11:27:50.024 [IPClusterStart] Loaded config file: /scratch/PI/euan/projects/athletes/bcbio/log/ipython/ipython_config.py
2015-02-13 11:27:50.025 [IPClusterStart] Attempting to load config file: ipcluster_b982ab29_c437_470e_b1a8_8b97e9ea763d_config.py
2015-02-13 11:27:50.025 [IPClusterStart] Loaded config file: /scratch/PI/euan/projects/athletes/bcbio/log/ipython/ipcontroller_config.py
2015-02-13 11:27:50.026 [IPClusterStart] Attempting to load config file: ipcluster_b982ab29_c437_470e_b1a8_8b97e9ea763d_config.py
2015-02-13 11:27:50.027 [IPClusterStart] Loaded config file: /scratch/PI/euan/projects/athletes/bcbio/log/ipython/ipengine_config.py
2015-02-13 11:27:50.027 [IPClusterStart] Attempting to load config file: ipcluster_b982ab29_c437_470e_b1a8_8b97e9ea763d_config.py
2015-02-13 11:27:50.028 [IPClusterStart] Loaded config file: /scratch/PI/euan/projects/athletes/bcbio/log/ipython/ipcluster_config.py
[2015-02-13T21:05Z] sh-5-8.local: Timing: alignment post-processing
[2015-02-13T21:06Z] sh-5-8.local: ipython: piped_bamprep
/home/dwaggott/projects1/athletes/bcbio/call_variants.sh: line 5: 15366 Killed bcbio_nextgen.py --timeout 10000 --retries 2 project_n143-fb/config/project_n143-fb.yaml -t ipython -n 160 -q normal -s slurm -r "timelimit=7-00:00:00;qos=long"
slurmstepd: Exceeded job memory limit at some point. Job may have been partially swapped out to disk.
slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_261851/job_1562343/step_0: Device or resource busy
srun: error: sh-5-8: task 0: Exited with exit code 137
slurmstepd: Exceeded job memory limit at some point. Job may have been partially swapped out to disk.
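As an aside, the "3.50g of memory reserved" line in the log above follows from taking the maximum memory request across the step's programs (freebayes 2.00, gatk 3.50, picard 3.50). A rough sketch of that reservation rule; the real logic lives inside bcbio's distributed scheduling code and may differ in detail:

```python
# Rough sketch: reserve the maximum per-core memory requested across the
# programs that may run in a step (assumed simplification of bcbio's logic).
def reserve_memory(requests):
    return max(requests.values())

requests = {"freebayes": 2.00, "gatk": 3.50, "picard": 3.50}
print("%.2fg reserved per job" % reserve_memory(requests))  # → 3.50g reserved per job
```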
Daryl;
It looks like the main process (bcbio_nextgen.py...) is the one running out of memory. It does some processing itself and manages distribution of jobs over the worker nodes. You'll have to increase the memory allocated to it in the /home/dwaggott/projects1/athletes/bcbio/call_variants.sh submission script. Hope this helps resolve the problem and keep things running.
The latest log shows some gatk errors. I didn't realize gatk was used in the freebayes joint calling pipeline. Regardless, looks like freebayes is running on the nodes now that I increased the memory on the submission script.
Daryl; If present, we use GATK to add additional annotations to the FreeBayes calls. This is not required/done if GATK is missing but just provides extra information for filtering downstream.
The error you're seeing is the same as before: the input files are missing BAM indexes. This is going to make FreeBayes really slow, as it'll have to read over the entire BAM files instead of jumping directly to the regions of interest.
I'm confused as to why we're lacking indexes. What version of bcbio and sambamba do you have? It's trying to index the files earlier in the process but fails for some reason. You'll want to resolve that to get reasonable speed out of the calling. Hope this helps.
bcbio_nextgen.py -v
/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/pytz/__init__.py:29: UserWarning: Module pysam was already imported from /share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/pysam/__init__.pyc, but /share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/pysam-0.8.1-py2.7-linux-x86_64.egg is being added to sys.path
from pkg_resources import resource_stream
0.8.6a
###
$ sambamba -v
sambamba v0.5.1
Usage: sambamba [command] [args...]
Available commands: 'view', 'index', 'merge', 'sort',
'flagstat', 'slice', 'markdup', 'depth', 'mpileup'
To get help on a particular command, just call it without args.
Leave bug reports and feature requests at
https://github.com/lomereiter/sambamba/issues
$ which sambamba
/share/PI/euan/apps/bin/sambamba
dwaggott@sherlock-ln01: ~/apps
$ ll /share/PI/euan/apps/bin/sambamba
lrwxrwxrwx 1 dwaggott euan 56 Feb 13 11:23 /share/PI/euan/apps/bin/sambamba -> ../Cellar/sambamba-binary/0.5.2a-2015-02-02/bin/sambamba
Is it looking for the *sort.bam indices in the default align directory? I see nonempty indices for all the bams, but there were a few that still had tmp files i.e.
for i in `find align/ -name "*sort.bam"`; do ls -lhtr $i.bai >> tmp; done;
-rw-r--r-- 1 dwaggott euan 6.3M Feb 14 11:48 align/4_29/111_2015-01-22_project_n143-fb-sort.bam.bai
-rw-r--r-- 1 dwaggott euan 6.4M Feb 14 11:49 align/4_20/95_2015-01-22_project_n143-fb-sort.bam.bai
-rw-r--r-- 1 dwaggott euan 6.6M Feb 3 13:27 align/4_192/tx/tmpq0gimi/89_2015-01-22_project_n143-fb-sort.bam.bai
-rw-r--r-- 1 dwaggott euan 6.6M Feb 14 11:50 align/4_192/89_2015-01-22_project_n143-fb-sort.bam.bai
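To check for missing indexes programmatically, a hypothetical helper (not bcbio code) along the lines of the find loop above, reporting sorted BAMs without a neighbouring .bai:

```python
import glob
import os

# Hypothetical check (not bcbio code): list sorted BAMs directly under the
# per-sample align/ subdirectories that lack a .bam.bai next to them.
def missing_indexes(align_dir):
    bams = glob.glob(os.path.join(align_dir, "*", "*sort.bam"))
    return sorted(b for b in bams if not os.path.exists(b + ".bai"))
```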
From the log how do I figure out what files are missing?
[2015-02-14T19:52Z] sh-1-2.local: Index BAM file: 64_2015-01-22_project_n143-fb-sort.bam
[2015-02-14T19:52Z] sh-1-2.local: Index BAM file: 64_2015-01-22_project_n143-fb-sort.bam
[2015-02-14T19:52Z] sh-1-2.local: Index BAM file: 64_2015-01-22_project_n143-fb-sort.bam
[2015-02-14T19:52Z] sh-1-2.local: Index BAM file: 97_2015-01-22_project_n143-fb-sort.bam
[2015-02-14T19:53Z] sh-1-2.local: Genotyping with FreeBayes
[2015-02-14T19:53Z] sh-1-2.local: Opened BAM reader without index file, jumping is disabled.
[2015-02-14T19:53Z] sh-1-2.local: Index BAM file: 64_2015-01-22_project_n143-fb-sort.bam
[2015-02-14T19:53Z] sh-1-2.local: Index BAM file: 64_2015-01-22_project_n143-fb-sort.bam
[2015-02-14T19:54Z] sh-1-2.local: Genotyping with FreeBayes
Daryl; Sorry about this, I'm confused as to why you have some indexes being built and others not. Is there any kind of heterogeneity in the cluster PATHs, so that some nodes have different versions of sambamba or bcbio? My best suggestion would be to run in a single thread and see if you can reproduce the error on a single machine. This should hopefully give some insight into the BAMs with missing indexes, or why they are failing during index builds. Hope this helps some.
I'll start running a separate test case. The issues might be related to the frequent restarting using different versions of the code.
I doubled the memory for the submission script but it still failed. Based on sacct it looks like it's the controller job that is going over the default 2G. Where do I increase the memory for that step?
Daryl; Sorry about the controller memory problem. Out of curiosity, how many samples are you running together? I increased the default to 4Gb and made an option to set this using the resources tag, so you can increase further by adding:
-r conmem=6
to bump it to 6Gb, for instance. You'll need to run bcbio_nextgen.py upgrade -u development to get the latest ipython-cluster-helper. Hope this works for you and gets things running. Thanks again for all the patience.
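For anyone wondering about the -r format: it's a semicolon-separated list of key=value pairs, as in the commands above. A rough sketch of the parsing; the real implementation lives in ipython-cluster-helper:

```python
# Sketch of the semicolon-separated key=value resource format used by -r
# (assumed simplification of ipython-cluster-helper's parsing).
def parse_resources(spec):
    out = {}
    for item in spec.split(";"):
        if item:
            key, _, value = item.partition("=")
            out[key] = value
    return out

print(parse_resources("timelimit=7-00:00:00;qos=long;conmem=6"))
```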
Excellent!
Simply restarting the pipeline got it moving again and it's running the freebayes steps. No clue if it's some type of heterogeneity issue. I'll keep testing.
Going forward, if there is anything you can do on the logging side to simplify identifying what job/sample/step is failing, that would be very helpful.
Is the observed behaviour of idle jobs consistent with a killed controller process? I would have thought all the jobs would die, if possible, upon some crucial failure; if not, they just keep running pointlessly.
This was only 143 high coverage exomes, so not too much. I'm going to try on a second batch of 25 and a third batch of 400. I'll setup another issue on how to do some type of final joint calling on separate batches.
Things are currently running at the qc step but I'm seeing errors in the log:
...
[2015-02-17T22:06Z] sh-5-17.local: WARNING: The input VCF has no snpEFF annotations. Variant impact will be set to unknown
[2015-02-17T22:06Z] sh-5-17.local: WARNING: The input VCF has no snpEFF annotations. Variant impact will be set to unknown
[2015-02-17T22:06Z] sh-5-17.local: WARNING: The input VCF has no snpEFF annotations. Variant impact will be set to unknown
[2015-02-17T22:06Z] sh-5-17.local: Traceback (most recent call last):
[2015-02-17T22:06Z] sh-5-17.local: File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
[2015-02-17T22:06Z] sh-5-17.local: gemini.gemini_main.main()
[2015-02-17T22:06Z] sh-5-17.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
[2015-02-17T22:06Z] sh-5-17.local: args.func(parser, args)
[2015-02-17T22:06Z] sh-5-17.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 226, in loadchunk_fn
[2015-02-17T22:06Z] sh-5-17.local: gemini_load_chunk.load(parser, args)
[2015-02-17T22:06Z] sh-5-17.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 681, in load
[2015-02-17T22:06Z] sh-5-17.local: gemini_loader.populate_from_vcf()
[2015-02-17T22:06Z] sh-5-17.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 129, in populate_from_vcf
[2015-02-17T22:06Z] sh-5-17.local: os.remove(extra_file)
[2015-02-17T22:06Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk14-extra.json'
[2015-02-17T22:09Z] sh-5-16.local: Traceback (most recent call last):
[2015-02-17T22:09Z] sh-5-16.local: File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
[2015-02-17T22:09Z] sh-5-16.local: gemini.gemini_main.main()
[2015-02-17T22:09Z] sh-5-16.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
[2015-02-17T22:09Z] sh-5-16.local: args.func(parser, args)
[2015-02-17T22:09Z] sh-5-16.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 244, in mergechunk_fn
[2015-02-17T22:09Z] sh-5-16.local: gemini_merge_chunks.merge_chunks(parser, args)
[2015-02-17T22:09Z] sh-5-16.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
[2015-02-17T22:09Z] sh-5-16.local: merge_db_chunks(args)
[2015-02-17T22:09Z] sh-5-16.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
[2015-02-17T22:09Z] sh-5-16.local: append_sample_genotype_counts(main_curr, db)
[2015-02-17T22:09Z] sh-5-16.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: main_curr.execute(cmd)
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[... the same sqlite3.OperationalError traceback repeated 8 more times ...]
[2015-02-17T22:09Z] sh-5-16.local: Indexing /scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.gz with grabix.
[2015-02-17T22:09Z] sh-5-16.local: Loading 3429621 variants.
[2015-02-17T22:09Z] sh-5-16.local: Breaking /scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.gz into 16 chunks.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 0.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 1.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 2.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 3.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 4.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 5.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 6.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 7.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 8.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 9.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 10.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 11.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 12.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 13.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 14.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 15.
[2015-02-17T22:09Z] sh-5-16.local: Done loading 3429621 variants in 16 chunks.
[2015-02-17T22:09Z] sh-5-16.local: 2015-02-17 14:08:34 merging 16 chunks.
[2015-02-17T22:09Z] sh-5-16.local: Traceback (most recent call last):
[2015-02-17T22:09Z] sh-5-16.local: File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
[2015-02-17T22:09Z] sh-5-16.local: gemini.gemini_main.main()
[2015-02-17T22:09Z] sh-5-16.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
[2015-02-17T22:09Z] sh-5-16.local: args.func(parser, args)
[2015-02-17T22:09Z] sh-5-16.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 136, in load_fn
[2015-02-17T22:09Z] sh-5-16.local: gemini_load.load(parser, args)
[2015-02-17T22:09Z] sh-5-16.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 50, in load
[2015-02-17T22:09Z] sh-5-16.local: load_multicore(args)
[2015-02-17T22:09Z] sh-5-16.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 75, in load_multicore
[2015-02-17T22:09Z] sh-5-16.local: merge_chunks_multicore(chunks, args.db)
[2015-02-17T22:09Z] sh-5-16.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 137, in merge_chunks_multicore
[2015-02-17T22:09Z] sh-5-16.local: cleanup_temp_db_files(chunks)
[2015-02-17T22:09Z] sh-5-16.local: File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 304, in cleanup_temp_db_files
[2015-02-17T22:09Z] sh-5-16.local: os.remove(chunk_db)
[2015-02-17T22:09Z] sh-5-16.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk1.db'
[2015-02-17T22:09Z] sh-5-16.local: Uncaught exception occurred
Traceback (most recent call last):
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
_do_run(cmd, checks, log_stdout)
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /share/PI/euan/apps/bin/gemini load --passonly --skip-gerp-bp -v /scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.gz -t snpEff --cores 16 /scratch/PI/euan/projects/athletes/bcbio/gemini/tx/tmp1HyQZi/b
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
merge_db_chunks(args)
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
append_sample_genotype_counts(main_curr, db)
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
main_curr.execute(cmd)
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[... the same sqlite3.OperationalError traceback repeated 3 more times ...]
Indexing /scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.gz with grabix.
Loading 3429621 variants.
Breaking /scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.gz into 16 chunks.
Loading chunk 0.
Loading chunk 1.
Loading chunk 2.
Loading chunk 3.
Loading chunk 4.
Loading chunk 5.
Loading chunk 6.
Loading chunk 7.
...
$ fgrep Error call_variants_fb.err
[2015-02-17T21:44Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk15-extra.json'
[2015-02-17T22:02Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk7-extra.json'
[2015-02-17T22:03Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk4-extra.json'
[2015-02-17T22:03Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk3-extra.json'
[2015-02-17T22:03Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk10-extra.json'
[2015-02-17T22:03Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk2-extra.json'
[2015-02-17T22:03Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk1-extra.json'
[2015-02-17T22:04Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk8-extra.json'
[2015-02-17T22:04Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk5-extra.json'
[2015-02-17T22:04Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk6-extra.json'
[2015-02-17T22:04Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk0-extra.json'
[2015-02-17T22:04Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk11-extra.json'
[2015-02-17T22:06Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk13-extra.json'
[2015-02-17T22:06Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk9-extra.json'
[2015-02-17T22:06Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk12-extra.json'
[2015-02-17T22:06Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk14-extra.json'
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[... repeated 7 more times ...]
[2015-02-17T22:09Z] sh-5-16.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk1.db'
raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /share/PI/euan/apps/bin/gemini load --passonly --skip-gerp-bp -v /scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.gz -t snpEff --cores 16 /scratch/PI/euan/projects/athletes/bcbio/gemini/tx/tmp1HyQZi/batch1-freebayes.db
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk1.db'
Daryl; Sorry about the GEMINI load problem. I'm confused as to what is going on here. It looks like some sort of race condition where two databases are trying to load at similar times but I'm not sure how that would get triggered. Is it possible there are two simultaneous bcbio runs happening or something else? If the GEMINI databases eventually get created correctly and it proceeds on to QC, it sounds like you are okay but I wish I knew the underlying reason. Hope things at least finish up okay on this run.
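The chunk-file errors above fit that picture: one run's cleanup removes a chunk's -extra.json or .db file, then a second concurrent run's cleanup tries to remove the same path and hits OSError ENOENT. A minimal sketch of the failure mode and the usual tolerant-cleanup guard (illustrative only, not GEMINI's actual code):

```python
import errno
import os
import tempfile


def cleanup(path):
    """Remove a temp file, tolerating a concurrent process having
    already removed it (the race surfaces as OSError ENOENT)."""
    try:
        os.remove(path)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise


# A bare os.remove on an already-deleted path raises the exact error
# seen in the logs; the guarded version simply returns.
missing = os.path.join(tempfile.mkdtemp(), "chunk1-extra.json")
cleanup(missing)  # no exception
```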
Ah, multiple processes could be the issue.
I have an sbatch script that submits the bcbio job. It looked like it was running out of memory, so I increased the allocation from the default 4G to 8G; due to the cluster setup, that also meant increasing the core count from 1 to 2. After that I was seeing multiple bcbio processes in squeue, but I assumed it was just a reporting quirk.
My setup is below.
$ cat call_variants.sbatch
#set a job name
#SBATCH --job-name=call_variants_fb
#################
#a file for job output, you can check job progress
#SBATCH --output=call_variants_fb.out
#################
# a file for errors from the job
#SBATCH --error=call_variants_fb.err
#################
#time you think you need; default is one hour
#here specified as days-hours:minutes:seconds
#SBATCH --time=7-00:00:00
#################
#quality of service; think of it as job priority
#SBATCH --qos=long
#################
#number of nodes you are requesting
#SBATCH --nodes=1
#################
# access more resources
#SBATCH --partition=normal
#################
#memory per node; default is 4000 MB per CPU
#SBATCH --mem=8000
#you could use --mem-per-cpu instead; CPUs here are what we are calling cores
#################
#tasks to run per node; each node has 16 cores, and a "task" is effectively a core
#SBATCH --ntasks-per-node=2
#don't use --cpus-per-task on this cluster
#################
#get emailed about job BEGIN, END, and FAIL
##SBATCH --mail-type=ALL
#################
#who to send email to; please change to your email
##SBATCH --mail-user=dwaggott@stanford.edu
#################
#now run normal batch commands
module load python/2.7.5
module load openmpi/1.6.5
srun /home/dwaggott/projects1/athletes/bcbio/call_variants.sh
$ cat call_variants.sh
#!/bin/bash
module load python/2.7.5
# call variants
bcbio_nextgen.py --timeout 10000 --retries 2 project_n143-fb/config/project_n143-fb.yaml -t ipython -n 160 -q normal -s slurm -r "timelimit=7-00:00:00;qos=long"
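For what it's worth (an assumption worth checking against the SLURM docs, not something confirmed in this thread): srun launches one copy of its command per allocated task, so with --ntasks-per-node=2 the wrapper script, and with it bcbio, would be started twice. A minimal sketch of that reading:

```shell
# With --ntasks-per-node=2 in the allocation, a bare `srun script.sh`
# starts one copy of the script per task, i.e. two bcbio processes.
# Pinning srun to a single task avoids the duplicate:
srun --ntasks=1 /home/dwaggott/projects1/athletes/bcbio/call_variants.sh
# or drop srun entirely and invoke the script directly from the batch script.
```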
Reverting to 1 CPU and restarting triggers a memory issue in the submission script, although sacct reports the job completed without error.
[2015-02-18T22:01Z] sh-2-21.local: Storing in local filesystem: /scratch/PI/euan/projects/athletes/final/3_32/3_32-freebayes.vcf.gz.tbi
[2015-02-18T22:01Z] sh-2-21.local: Storing in local filesystem: /scratch/PI/euan/projects/athletes/final/3_07/3_07-freebayes.vcf.gz.tbi
[2015-02-18T22:01Z] sh-2-21.local: Storing in local filesystem: /scratch/PI/euan/projects/athletes/final/3_57/3_57-freebayes.vcf.gz.tbi
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/threading.py", line 810, in __bootstrap_inner
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/threading.py", line 763, in run
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/logbook/queues.py", line 265, in _target
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/logbook/base.py", line 238, in pop_thread
File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/logbook/base.py", line 203, in pop_thread
File "logbook/_speedups.pyx", line 233, in logbook._speedups.ContextStackManager.pop_thread (logbook/_speedups.c:5839)
File "logbook/_speedups.pyx", line 236, in logbook._speedups.ContextStackManager.pop_thread (logbook/_speedups.c:5660)
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
slurmstepd: Exceeded job memory limit at some point. oom-killer likely killed a process.
slurmstepd: Exceeded job memory limit at some point. oom-killer likely killed a process.
Daryl;
Glad it seems to be finishing up, or at least almost finishing. I'm a bit confused about your submission setup and how it might be affecting things. Why is the bcbio call in a separate shell script instead of going directly in your call_variants.sbatch submission script? I'm not enough of a SLURM expert to understand how the nested srun or changing --ntasks-per-node influences the run. I normally put everything in one batch script and use --cpus-per-task and --mem to adjust requirements directly. Hope that helps some.
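A sketch of that consolidated layout, with the resource values carried over from the scripts above (an assumption: --cpus-per-task is usable here despite the cluster note advising against it, so adjust to local policy):

```shell
#!/bin/bash
# Single consolidated submission script: no wrapper script, no nested srun,
# resources set directly via --cpus-per-task and --mem.
#SBATCH --job-name=call_variants_fb
#SBATCH --output=call_variants_fb.out
#SBATCH --error=call_variants_fb.err
#SBATCH --time=7-00:00:00
#SBATCH --qos=long
#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8000

module load python/2.7.5
# A single task means only one bcbio controller process starts.
bcbio_nextgen.py --timeout 10000 --retries 2 \
    project_n143-fb/config/project_n143-fb.yaml \
    -t ipython -n 160 -q normal -s slurm -r "timelimit=7-00:00:00;qos=long"
```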
I'll rejig the submission script. I was just following some examples. The srun in this case is unnecessary.
I need a bit of help understanding how the scheduling is working.
I've restarted a joint-calling pipeline on a SLURM cluster using IPython (500 cores requested). A limited set of engine jobs has been kicked off; they initially complete alignments and then go dormant, with only a python script running. Is this expected? It could take a while for the other engine jobs in the array to start. Do all jobs in the array have to run concurrently, or can engines that finish a stage be recycled?
I guess I'm trying to get a sense of whether I should drop the core request to a much more conservative number when the cluster is busy.