bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

nodes sitting idle #741

Closed dwaggott closed 9 years ago

dwaggott commented 9 years ago

I need a bit of help understanding how the scheduling is working.

I've restarted a joint calling pipeline on a slurm cluster using ipython (500 cores requested). A limited set of engine jobs has been kicked off; they initially complete alignments and then go dormant with only a python script running. Is this expected? It could take a while for the other engine jobs in the array to start. Do all jobs in the array have to be running concurrently, or can jobs that have completed a stage be recycled?

I guess I'm trying to get a sense of whether I should drop the number of cores requested to an extremely conservative number when on a busy cluster.

chapmanb commented 9 years ago

Daryl; This isn't expected. It should keep sending jobs independent of the state of all the engines. The controller just pushes out jobs to any waiting engines and there is no logic in there about how many are yet to spin up. An engine should be able to run multiple jobs unless it failed or got killed. I'm really confused as to what is going on with this. Any useful information would be either in the IPython log files (log/ipython/log) or in the engine log files, which should be named like bcbio-ipengine* and bcbio-ipcontroller*. Does it behave correctly if you run in multicore mode on a single machine or with fewer overall cores? Sorry to not have a better idea but hope this helps.

dwaggott commented 9 years ago

Oh...

2015-01-28 21:53:11.577 [IPEngineApp] Using existing profile dir: u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'
2015-01-28 21:53:11.588 [IPEngineApp] Loading url_file u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython/security/ipcontroller-dc297844-7484-4263-b837-f63c7cd3e377-engine.json'
2015-01-28 21:53:11.602 [IPEngineApp] Registering with controller at tcp://10.210.47.127:55758
2015-01-28 21:53:11.699 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 60010 ms.
2015-01-28 21:53:11.702 [IPEngineApp] Using existing profile dir: u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'
2015-01-28 21:53:11.704 [IPEngineApp] Completed registration with id 6
slurmstepd: *** JOB 1437888 CANCELLED AT 2015-01-29T00:34:51 *** on sh-3-17
slurmstepd: Exceeded step memory limit at some point. Step may have been partially swapped out to disk.
slurmstepd: Exceeded job memory limit at some point. Job may have been partially swapped out to disk.
slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_261851/job_1437888/step_batch: Device or resource busy

2015-01-28 21:53:11.659 [IPEngineApp] Using existing profile dir: u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'
2015-01-28 21:53:11.670 [IPEngineApp] Loading url_file u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython/security/ipcontroller-dc297844-7484-4263-b837-f63c7cd3e377-engine.json'
2015-01-28 21:53:11.684 [IPEngineApp] Registering with controller at tcp://10.210.47.127:55758
2015-01-28 21:53:11.780 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 60010 ms.
2015-01-28 21:53:11.783 [IPEngineApp] Using existing profile dir: u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'
2015-01-28 21:53:11.785 [IPEngineApp] Completed registration with id 7
slurmstepd: *** JOB 1437889 CANCELLED AT 2015-01-29T00:34:51 *** on sh-3-18
slurmstepd: Exceeded step memory limit at some point. oom-killer likely killed a process.
slurmstepd: Exceeded job memory limit at some point. oom-killer likely killed a process.
slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_261851/job_1437889/step_batch: Device or resource busy
dwaggott commented 9 years ago

It looks like only 11GB max RAM consumption was reported. That should be fine, so maybe it's a slurm reporting issue.

https://marylou.byu.edu/wiki/index.php?page=Job+ran+out+of+memory+but+the+reported+MaxRSS+value+was+lower+than+the+requested+amount

 sacct -o MaxRSS -j 1437888
    MaxRSS 
---------- 
 11541180K 
chapmanb commented 9 years ago

Daryl; Sorry about the issue. A memory spike that SLURM cannot detect is unexpected, but it looks like something triggers the OOM killer on the system. Can you tell from the failed bcbio-ipengine* slurm log files where in the process the memory consumption occurred? This would help narrow down where we are seeing issues. If this is during alignment, you could also try increasing the memory available per core in your bcbio_system.yaml to give it more room to work with. Hope this helps.

dwaggott commented 9 years ago

galaxy/bcbio_system.yaml doesn't have a default memory specified. Based on the log I'm assuming it's 2g / cpu. Does that sound right? I'll push to 3g.
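
For reference, I'm assuming the bump looks roughly like this under the resources section of bcbio_system.yaml (sketch only):

resources:
  default:
    memory: 3G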

dwaggott commented 9 years ago

An update. Rerunning the pipeline with the original memory request consistently causes these idle nodes. The logs stop reporting shortly after initialization and are only informative after the jobs get killed due to time limits or manual intervention.

Is there a way to monitor the engine processes for these types of shutdowns?

chapmanb commented 9 years ago

Daryl; The best way to monitor these is through sacct and the bcbio-ipengine* logs from SLURM. Unfortunately IPython does not get back any information on the process being killed, so it can't tell whether it simply can't reach an engine because of networking problems or the engine died. IPython will eventually time out and kill the entire thing but will try to keep running if it has any engines that are working. This makes it resilient to one-off nodes dying in the general case.
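
For example, something like this against one of the engine job IDs should show whether SLURM killed it and how much memory it used (standard sacct fields; adjust the job ID and format for your site):

sacct -j 1437888 -o JobID,JobName%20,State,ExitCode,MaxRSS,ReqMem,Elapsed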

Does increasing the memory in bcbio_system.yaml help to resolve the underlying issue?

dwaggott commented 9 years ago

I increased the memory and the jobs are still sitting idle. This time I don't see the memory error in the logs. I'll upload the logs; maybe you can see something.

dwaggott commented 9 years ago

I hung out on a node and watched. Looks like it's the alignment step that is dying. I'm using local scratch (via an environment variable) for tmp (code updated to development). Nothing looks to be in the tmp directory; might that indicate an issue? If so, would the failure to create tmp directories be reflected in the logs?
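
For context, my understanding is that the bcbio-side way to do this is a resources/tmp entry in bcbio_system.yaml rather than the environment variable alone; a sketch with an illustrative path:

resources:
  tmp:
    dir: /path/to/local/scratch  # illustrative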

dwaggott commented 9 years ago

Leaving tmp as default. I can see files being generated but still all the nodes go dormant.

# example of seemingly complete alignment that is still stuck in the tx directory.

├── [4.0K Jan 23  0:36]  3_09
│   └── [4.0K Feb  3 12:55]  tx
│       ├── [4.0K Feb  3 13:28]  tmp0fzrhU
│       │   ├── [ 24M Feb  3 13:28]  15_2015-01-22_project_n143-fb-sort-sr.bam
│       │   └── [3.6M Feb  3 13:28]  15_2015-01-22_project_n143-fb-sort-sr.bam.bai
│       ├── [4.0K Feb  3 13:28]  tmptBchFW
│       │   ├── [5.0G Feb  3 13:30]  15_2015-01-22_project_n143-fb-sort.bam
│       │   └── [6.6M Feb  3 13:30]  15_2015-01-22_project_n143-fb-sort.bam.bai
│       └── [4.0K Feb  3 13:28]  tmpucQjYH
│           ├── [ 75M Feb  3 13:28]  15_2015-01-22_project_n143-fb-sort-disc.bam
│           └── [5.0M Feb  3 13:28]  15_2015-01-22_project_n143-fb-sort-disc.bam.bai

# example of nearly complete alignment. 

├── [4.0K Jan 23  3:24]  3_49
│   └── [4.0K Feb  3 12:55]  tx
│       ├── [4.0K Feb  3 12:55]  tmpKmAbvc
│       ├── [4.0K Feb  3 13:35]  tmpUm_crr
│       │   ├── [ 22M Feb  3 13:35]  45_2015-01-22_project_n143-fb-sort-sr.bam
│       │   └── [3.5M Feb  3 13:35]  45_2015-01-22_project_n143-fb-sort-sr.bam.bai
│       └── [4.0K Feb  3 13:35]  tmpZUF01z
│           ├── [ 89M Feb  3 13:35]  45_2015-01-22_project_n143-fb-sort-disc.bam
│           └── [5.1M Feb  3 13:35]  45_2015-01-22_project_n143-fb-sort-disc.bam.bai

# most samples look like this 

├── [4.0K Jan 23  1:34]  3_11
│   ├── [6.6G Jan 23  1:34]  16_2015-01-22_project_n143-fb-sort.bam
│   ├── [6.7M Jan 23  1:34]  16_2015-01-22_project_n143-fb-sort.bam.bai
│   ├── [ 94M Jan 23  1:31]  16_2015-01-22_project_n143-fb-sort-disc.bam
│   ├── [5.1M Jan 23  1:31]  16_2015-01-22_project_n143-fb-sort-disc.bam.bai
│   ├── [ 28M Jan 23  1:31]  16_2015-01-22_project_n143-fb-sort-sr.bam
│   ├── [3.7M Jan 23  1:31]  16_2015-01-22_project_n143-fb-sort-sr.bam.bai
│   └── [4.0K Jan 23  1:34]  tx

# example of dormant node via top
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13437 dwaggott  20   0 15436 1708  944 R  0.3  0.0   0:00.14 top
41568 dwaggott  20   0  103m 1224 1016 S  0.0  0.0   0:00.00 slurm_script
41577 dwaggott  20   0 1044m 107m  16m S  0.0  0.2   0:02.30 python
41880 dwaggott  20   0     0    0    0 Z  0.0  0.0   0:00.00 bash <defunct>
43235 dwaggott  20   0  137m 2316  932 S  0.0  0.0   0:00.70 sshd
43236 dwaggott  20   0  109m 3924 1564 S  0.0  0.0   0:00.28 bash
43311 dwaggott  20   0 57704 1048  404 S  0.0  0.0   0:00.00 ssh-agent
chapmanb commented 9 years ago

Daryl; Sorry about the continued problems. It seems like the process is still being killed/dying, but it's hard to tell exactly what is happening from this information. The SLURM log files, which should be produced in the work directory as bcbio-ipengine*, are the first place to look to try and find more details.
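
For example, a quick scan of those logs for anything suspicious (plain grep, nothing bcbio-specific):

grep -ilE "error|killed|memory" bcbio-ipengine* bcbio-ipcontroller*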

I dug around in the different tools and it looks like @lomereiter has been working on eliminating some memory leaks in sambamba (lomereiter/sambamba#116). I grabbed the latest development version so if you update the tools with:

bcbio_nextgen.py upgrade --tools

maybe this will avoid the underlying issue and keep things running. Sorry about all the problems and please let us know if this doesn't fix it.

dwaggott commented 9 years ago

What a pleasure it is getting code to work on other people's systems! Appreciate the help.

The out logs are empty and the err logs stop reporting at the same time the nodes go idle. Not sure if the few lines that show up after scancel are informative.

2015-02-03 12:51:49.512 [IPEngineApp] Using existing profile dir: u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'
2015-02-03 12:51:49.521 [IPEngineApp] Loading url_file u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython/security/ipcontroller-381cb9e8-94
2015-02-03 12:51:49.526 [IPEngineApp] Registering with controller at tcp://10.210.47.197:50393
2015-02-03 12:51:49.597 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 60010 ms.
2015-02-03 12:51:49.600 [IPEngineApp] Using existing profile dir: u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'
2015-02-03 12:51:49.602 [IPEngineApp] Completed registration with id 4

After scancel:

slurmstepd: *** JOB 1479012 CANCELLED AT 2015-02-04T06:33:25 *** on sh-2-3

slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_261851/job_1479012/step_batch: Device or resource busy
dwaggott commented 9 years ago

Looks like the upgrade had an error with cmake.

log

Running the failed line:

$ export CC=${CC:-`which gcc`} && export CXX=${CXX:-`which g++`} && export PERL5LIB=/home/dwaggott/apps/lib/perl5:${PERL5LIB} && /home/dwaggott/apps/bin/brew install --env=inherit  cmake
==> Downloading http://www.cmake.org/files/v3.1/cmake-3.1.1.tar.gz
Already downloaded: /home/dwaggott/.cache/Homebrew/cmake-3.1.1.tar.gz
==> Downloading https://github.com/Kitware/CMake/commit/c5d9a828.diff
Already downloaded: /home/dwaggott/.cache/Homebrew/cmake--patch-61b15b638c1409233f36e6e3383b98cab514c3bb.diff
==> Patching
patching file Modules/FindOpenSSL.cmake
==> Downloading https://pypi.python.org/packages/source/P/Pygments/Pygments-2.0.2.tar.gz
######################################################################## 100.0%
==> python -c import setuptools... --no-user-cfg install --prefix=/tmp/cmake-XUerJG/cmake-3.1.1/sphinx --single-version-externally-managed --record=installed.txt
   or: -c --help [cmd1 cmd2 ...]
   or: -c --help-commands
   or: -c cmd --help

error: option --no-user-cfg not recognized

READ THIS: https://github.com/Homebrew/linuxbrew/blob/master/share/doc/homebrew/Troubleshooting.md#troubleshooting

These open issues may also help:
cmake reports: CMake Error: Could not find CMAKE_ROOT !!! (https://github.com/Homebrew/homebrew/issues/35888)
cmake builds fail on CLT-only with --env=std (https://github.com/Homebrew/homebrew/issues/29101)
dwaggott commented 9 years ago

Installing cmake without docs avoids the sphinx problem and allows the bcbio upgrade to complete.

/home/dwaggott/apps/bin/brew install -v --without-docs --env=inherit cmake

chapmanb commented 9 years ago

Daryl; Sorry about the re-build issues. It looks like there was a new revision of cmake and we weren't handling it right in CloudBioLinux. I updated it to fix this for future runs. I hope the new sambamba will resolve the underlying issue. I'm totally lost on why you're not getting more information about these failures, but will keep my fingers crossed that this is the underlying issue.

dwaggott commented 9 years ago

Using the latest update seems to have got me over the alignment hump. Thanks.

I've run into a similar problem at another point: the slurm jobs still seem to be running, but the nodes are essentially idle and logging has stopped. Checking the nodes, I can see python and slurm_script processes running under my name but nothing happening.

Attached are some selected sections from the submission err.log.

### notable errors
[2015-02-12T17:25Z] sh-1-7.local: bedtools genomecov: 9 : 6_2015-01-22_project_n143-fb-sort.bam
[2015-02-12T17:25Z] sh-1-7.local: sambamba-view: BAM index file (.bai) must be provided
[2015-02-12T17:25Z] sh-1-7.local: Failed to open BAM file stdin
[2015-02-12T17:25Z] sh-1-7.local: Uncaught exception occurred
Traceback (most recent call last):
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /share/PI/euan/apps/bin/sambamba view -F 'mapping_quality > 0' -L /scratch/PI/euan/projects/athletes/bcbio/align/1_08/6_2015-01-22_project_n143-fb-sort-callable-split/tx/tmpprYtwo/6_2015-01-22_project_n143-fb-sort-9-callable-coverageregi
sambamba-view: BAM index file (.bai) must be provided
Failed to open BAM file stdin
' returned non-zero exit status 1
[2015-02-12T17:25Z] sh-1-7.local: Index BAM file: 6_2015-01-22_project_n143-fb-sort.bam

[2015-02-12T17:29Z] sh-1-7.local: Unexpected error
Traceback (most recent call last):
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 38, in _setup_logging
    yield config
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 91, in postprocess_alignment
    return ipython.zip_args(apply(sample.postprocess_alignment, *args))
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/sample.py", line 149, in postprocess_alignment
    callable.block_regions(data["work_bam"], data["sam_ref"], data["config"])
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/bam/callable.py", line 298, in block_regions
    callable_bed = parallel_callable_loci(in_bam, ref_file, config)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/bam/callable.py", line 46, in parallel_callable_loci
    "callable_bed", ["config"])[0]
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/split.py", line 59, in parallel_split_combine
    split_output = parallel_fn(parallel_name, split_args)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 84, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 660, in __call__
    self.retrieve()
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 543, in retrieve
    raise exception_type(report)
TypeError: __init__() takes at least 3 arguments (2 given)

[2015-02-12T19:31Z] sh-2-33.local: bedtools genomecov: 14 : 45_2015-01-22_project_n143-fb-sort.bam
[2015-02-12T19:31Z] sh-2-33.local: sambamba-view: BAM index file (.bai) must be provided
[2015-02-12T19:31Z] sh-2-33.local: Failed to open BAM file stdin
[2015-02-12T19:31Z] sh-2-33.local: Uncaught exception occurred
Traceback (most recent call last):
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /share/PI/euan/apps/bin/sambamba view -F 'mapping_quality > 0' -L /scratch/PI/euan/projects/athletes/bcbio/align/3_49/45_2015-01-22_project_n143-fb-sort-callable-split/tx/tmp_dsCW9/45_2015-01-22_project_n143-fb-sort-14-callable-coverager
sambamba-view: BAM index file (.bai) must be provided
Failed to open BAM file stdin
' returned non-zero exit status 1

[2015-02-12T19:41Z] sh-2-33.local: Unexpected error
Traceback (most recent call last):
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 38, in _setup_logging
    yield config
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 91, in postprocess_alignment
    return ipython.zip_args(apply(sample.postprocess_alignment, *args))
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/sample.py", line 149, in postprocess_alignment
    callable.block_regions(data["work_bam"], data["sam_ref"], data["config"])
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/bam/callable.py", line 298, in block_regions
    callable_bed = parallel_callable_loci(in_bam, ref_file, config)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/bam/callable.py", line 46, in parallel_callable_loci
    "callable_bed", ["config"])[0]
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/split.py", line 59, in parallel_split_combine
    split_output = parallel_fn(parallel_name, split_args)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 84, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 660, in __call__
    self.retrieve()
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 543, in retrieve
    raise exception_type(report)
TypeError: __init__() takes at least 3 arguments (2 given)

### end of log
2015-02-12 16:54:38.261 [IPClusterStart] Attempting to load config file: ipcluster_dba89f61_7858_41e1_bb65_46eb0d241eff_config.py
2015-02-12 16:54:38.262 [IPClusterStart] Loaded config file: /scratch/PI/euan/projects/athletes/bcbio/log/ipython/ipcluster_config.py
[2015-02-13T00:57Z] sh-1-6.local: Timing: alignment post-processing
[2015-02-13T00:57Z] sh-1-6.local: ipython: piped_bamprep
/home/dwaggott/projects1/athletes/bcbio/call_variants.sh: line 5: 12961 Killed                  bcbio_nextgen.py --timeout 10000 --retries 2 project_n143-fb/config/project_n143-fb.yaml -t ipython -n 160 -q normal -s slurm -r "timelimit=7-00:00:00;qos=long"
slurmstepd: Exceeded job memory limit at some point. Job may have been partially swapped out to disk.
srun: error: sh-1-6: task 0: Exited with exit code 137
slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_261851/job_1555498/step_0: Device or resource busy
slurmstepd: Exceeded job memory limit at some point. Job may have been partially swapped out to disk.
chapmanb commented 9 years ago

Daryl; It looks like you might not have the latest version of sambamba (0.5.1). The syntax for indexing changed and there are a lot of complaints about missing bai files, which indicates there might be earlier issues where sambamba failed to index. Could you try updating the tools with:

bcbio_nextgen.py upgrade --tools

to see if that resolves the issue? Hope this also works to avoid the memory errors you're seeing from SLURM. Thanks much for the patience and help debugging.

dwaggott commented 9 years ago

Getting closer. Any idea what step I should try bumping the memory for?

[2015-02-13T19:27Z] sh-5-8.local: multiprocessing: combine_sample_regions
[2015-02-13T19:27Z] sh-5-8.local: Identified 267 parallel analysis blocks
Block sizes:
  min: 383
  5%: 27985.8
  25%: 4120160.0
  median: 15509980.0
  75%: 15511678.5
  95%: 15537540.2
  99%: 19358684.44
  max: 43304603
Between block sizes:
  min: 123
  5%: 204.15
  25%: 862.75
  median: 1716.5
  75%: 3436.25
  95%: 9433.25
  99%: 17274.35
  max: 22003

[2015-02-13T19:27Z] sh-5-8.local: Timing: coverage
[2015-02-13T19:27Z] sh-5-8.local: Resource requests: freebayes, gatk, picard; memory: 2.00, 3.50, 3.50; cores: 1, 1, 1
[2015-02-13T19:27Z] sh-5-8.local: Configuring 160 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
2015-02-13 11:27:50.016 [IPClusterStart] Config changed:
2015-02-13 11:27:50.016 [IPClusterStart] {'IPClusterEngines': {'early_shutdown': 240}, 'Application': {'log_level': 10}, 'BcbioSLURMControllerLauncher': {'account': u'euan', 'mem': '3.50', 'timelimit': '07-00:00:00', 'tag': '', 'cores': 1, 'resources': 'qos=long'}, 'ProfileDir': {'l
2015-02-13 11:27:50.023 [IPClusterStart] Using existing profile dir: u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'
2015-02-13 11:27:50.023 [IPClusterStart] Searching path [u'/scratch/PI/euan/projects/athletes/bcbio', u'/scratch/PI/euan/projects/athletes/bcbio/log/ipython'] for config files
2015-02-13 11:27:50.023 [IPClusterStart] Attempting to load config file: ipython_config.py
2015-02-13 11:27:50.024 [IPClusterStart] Loaded config file: /scratch/PI/euan/projects/athletes/bcbio/log/ipython/ipython_config.py
2015-02-13 11:27:50.025 [IPClusterStart] Attempting to load config file: ipcluster_b982ab29_c437_470e_b1a8_8b97e9ea763d_config.py
2015-02-13 11:27:50.025 [IPClusterStart] Loaded config file: /scratch/PI/euan/projects/athletes/bcbio/log/ipython/ipcontroller_config.py
2015-02-13 11:27:50.026 [IPClusterStart] Attempting to load config file: ipcluster_b982ab29_c437_470e_b1a8_8b97e9ea763d_config.py
2015-02-13 11:27:50.027 [IPClusterStart] Loaded config file: /scratch/PI/euan/projects/athletes/bcbio/log/ipython/ipengine_config.py
2015-02-13 11:27:50.027 [IPClusterStart] Attempting to load config file: ipcluster_b982ab29_c437_470e_b1a8_8b97e9ea763d_config.py
2015-02-13 11:27:50.028 [IPClusterStart] Loaded config file: /scratch/PI/euan/projects/athletes/bcbio/log/ipython/ipcluster_config.py
[2015-02-13T21:05Z] sh-5-8.local: Timing: alignment post-processing
[2015-02-13T21:06Z] sh-5-8.local: ipython: piped_bamprep
/home/dwaggott/projects1/athletes/bcbio/call_variants.sh: line 5: 15366 Killed                  bcbio_nextgen.py --timeout 10000 --retries 2 project_n143-fb/config/project_n143-fb.yaml -t ipython -n 160 -q normal -s slurm -r "timelimit=7-00:00:00;qos=long"
slurmstepd: Exceeded job memory limit at some point. Job may have been partially swapped out to disk.
slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_261851/job_1562343/step_0: Device or resource busy
srun: error: sh-5-8: task 0: Exited with exit code 137
slurmstepd: Exceeded job memory limit at some point. Job may have been partially swapped out to disk.
chapmanb commented 9 years ago

Daryl; It looks like the main process (bcbio_nextgen.py...) is the one running out of memory. It does some processing itself and manages distribution of jobs over the worker nodes. You'll have to increase the memory allocated to it in the /home/dwaggott/projects1/athletes/bcbio/call_variants.sh submission script. Hope this helps resolve the problem and keep things running.
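
As a sketch (I don't have your call_variants.sh, and the memory value here is only illustrative), the allocation for that main process can be raised with an sbatch directive in the submission script:

#!/bin/bash
#SBATCH --partition=normal
#SBATCH --time=7-00:00:00
#SBATCH --mem=16G    # headroom for the main bcbio_nextgen.py process; 16G is illustrative
bcbio_nextgen.py --timeout 10000 --retries 2 project_n143-fb/config/project_n143-fb.yaml \
  -t ipython -n 160 -q normal -s slurm -r "timelimit=7-00:00:00;qos=long"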

dwaggott commented 9 years ago

The latest log shows some gatk errors. I didn't realize gatk was used in the freebayes joint calling pipeline. Regardless, looks like freebayes is running on the nodes now that I increased the memory on the submission script.

log

chapmanb commented 9 years ago

Daryl; If present, we use GATK to add additional annotations to the FreeBayes calls. This is not required/done if GATK is missing but just provides extra information for filtering downstream.

The error you're seeing is the same as before: the input files are missing BAM indexes. This is going to make FreeBayes really slow, as it'll have to read over the entire BAM files instead of jumping directly to the regions of interest.

I'm confused as to why we're lacking indexes. What version of bcbio and sambamba do you have? It's trying to index the files earlier in the process but fails for some reason. You'll want to resolve that to get reasonable speed out of the calling. Hope this helps.
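
For reference, indexes can also be rebuilt by hand with sambamba while we track down why the automatic step fails (file name taken from your listings above; -t sets the thread count):

sambamba index -t 8 align/3_11/16_2015-01-22_project_n143-fb-sort.bam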

dwaggott commented 9 years ago

$ bcbio_nextgen.py -v
/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/pytz/__init__.py:29: UserWarning: Module pysam was already imported from /share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/pysam/__init__.pyc, but /share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/pysam-0.8.1-py2.7-linux-x86_64.egg is being added to sys.path
  from pkg_resources import resource_stream
0.8.6a

###

$ sambamba -v
sambamba v0.5.1

Usage: sambamba [command] [args...]

    Available commands: 'view', 'index', 'merge', 'sort',
                        'flagstat', 'slice', 'markdup', 'depth', 'mpileup'
    To get help on a particular command, just call it without args.

Leave bug reports and feature requests at 
https://github.com/lomereiter/sambamba/issues

$ which sambamba
/share/PI/euan/apps/bin/sambamba
dwaggott@sherlock-ln01: ~/apps
$ ll /share/PI/euan/apps/bin/sambamba
lrwxrwxrwx 1 dwaggott euan 56 Feb 13 11:23 /share/PI/euan/apps/bin/sambamba -> ../Cellar/sambamba-binary/0.5.2a-2015-02-02/bin/sambamba
dwaggott commented 9 years ago

Is it looking for the *sort.bam indices in the default align directory? I see nonempty indices for all the bams, but there were a few that still had tmp files, e.g.

for i in `find align/ -name "*sort.bam"`; do ls -lhtr $i.bai >> tmp; done;

-rw-r--r-- 1 dwaggott euan 6.3M Feb 14 11:48 align/4_29/111_2015-01-22_project_n143-fb-sort.bam.bai
-rw-r--r-- 1 dwaggott euan 6.4M Feb 14 11:49 align/4_20/95_2015-01-22_project_n143-fb-sort.bam.bai
-rw-r--r-- 1 dwaggott euan 6.6M Feb  3 13:27 align/4_192/tx/tmpq0gimi/89_2015-01-22_project_n143-fb-sort.bam.bai
-rw-r--r-- 1 dwaggott euan 6.6M Feb 14 11:50 align/4_192/89_2015-01-22_project_n143-fb-sort.bam.bai
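
A variant of the loop above that only reports BAMs without an index (skipping the transactional tx directories; sketch only):

for b in $(find align/ -name "*sort.bam" -not -path "*/tx/*"); do
  [ -e "$b.bai" ] || echo "missing index: $b"
done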

From the log, how do I figure out which files are missing?

[2015-02-14T19:52Z] sh-1-2.local: Index BAM file: 64_2015-01-22_project_n143-fb-sort.bam
[2015-02-14T19:52Z] sh-1-2.local: Index BAM file: 64_2015-01-22_project_n143-fb-sort.bam
[2015-02-14T19:52Z] sh-1-2.local: Index BAM file: 64_2015-01-22_project_n143-fb-sort.bam
[2015-02-14T19:52Z] sh-1-2.local: Index BAM file: 97_2015-01-22_project_n143-fb-sort.bam
[2015-02-14T19:53Z] sh-1-2.local: Genotyping with FreeBayes
[2015-02-14T19:53Z] sh-1-2.local: Opened BAM reader without index file, jumping is disabled.
[2015-02-14T19:53Z] sh-1-2.local: Index BAM file: 64_2015-01-22_project_n143-fb-sort.bam
[2015-02-14T19:53Z] sh-1-2.local: Index BAM file: 64_2015-01-22_project_n143-fb-sort.bam
[2015-02-14T19:54Z] sh-1-2.local: Genotyping with FreeBayes
chapmanb commented 9 years ago

Daryl; Sorry about this, I'm confused as to why you have some indexes being built and others not. Is there any kind of heterogeneity in the cluster PATHs, so that some nodes have different versions of sambamba or bcbio? My best suggestion would be to run in a single thread and see if you can reproduce the error on a single machine. This should hopefully give some insight into the BAMs with missing indexes, or why they are failing during index builds. Hope this helps some.
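
For example (a sketch; dropping -t ipython falls back to running everything locally, and -n 1 keeps it to a single core):

bcbio_nextgen.py project_n143-fb/config/project_n143-fb.yaml -n 1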

dwaggott commented 9 years ago

I'll start running a separate test case. The issues might be related to the frequent restarting using different versions of the code.

I doubled the memory for the submission script but it still failed. Based on sacct it looks like it's the controller job that is going over the default 2G. Where do I increase the memory for that step?

chapmanb commented 9 years ago

Daryl; Sorry about the controller memory problem. Out of curiosity, how many samples are you running together? I increased the default to 4Gb and made an option to set this using the resources tag, so you can increase further by adding:

-r conmem=6

to bump it to 6Gb, for instance. You'll need to upgrade with bcbio_nextgen.py upgrade -u development to get the latest ipython-cluster-helper. Hope this works for you and gets things running. Thanks again for all the patience.
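
For example, added to the resource flags from your earlier command (a sketch; assuming -r can be passed more than once):

bcbio_nextgen.py --timeout 10000 --retries 2 project_n143-fb/config/project_n143-fb.yaml \
  -t ipython -n 160 -q normal -s slurm -r "timelimit=7-00:00:00;qos=long" -r conmem=6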

dwaggott commented 9 years ago

Excellent!

Simply restarting the pipeline got it moving again and it's running the freebayes steps. No clue if it's some type of heterogeneity issue. I'll keep testing.

Going forward, if there is anything you can do on the logging side to simplify identifying what job/sample/step is failing, that would be very helpful.

Is the observed behaviour of idle jobs consistent with a killed controller process? I would have thought all the jobs would die, if possible, upon some crucial failure. If not, they just keep running pointlessly.

This was only 143 high-coverage exomes, so not too much. I'm going to try a second batch of 25 and a third batch of 400. I'll set up another issue on how to do some type of final joint calling on separate batches.

dwaggott commented 9 years ago

Things are currently running at the qc step but I'm seeing errors in the log:

...
[2015-02-17T22:06Z] sh-5-17.local: WARNING: The input VCF has no snpEFF annotations. Variant impact will be set to unknown
[2015-02-17T22:06Z] sh-5-17.local: WARNING: The input VCF has no snpEFF annotations. Variant impact will be set to unknown
[2015-02-17T22:06Z] sh-5-17.local: WARNING: The input VCF has no snpEFF annotations. Variant impact will be set to unknown
[2015-02-17T22:06Z] sh-5-17.local: Traceback (most recent call last):
[2015-02-17T22:06Z] sh-5-17.local:   File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
[2015-02-17T22:06Z] sh-5-17.local:     gemini.gemini_main.main()
[2015-02-17T22:06Z] sh-5-17.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
[2015-02-17T22:06Z] sh-5-17.local:     args.func(parser, args)
[2015-02-17T22:06Z] sh-5-17.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 226, in loadchunk_fn
[2015-02-17T22:06Z] sh-5-17.local:     gemini_load_chunk.load(parser, args)
[2015-02-17T22:06Z] sh-5-17.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 681, in load
[2015-02-17T22:06Z] sh-5-17.local:     gemini_loader.populate_from_vcf()
[2015-02-17T22:06Z] sh-5-17.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 129, in populate_from_vcf
[2015-02-17T22:06Z] sh-5-17.local:     os.remove(extra_file)
[2015-02-17T22:06Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk14-extra.json'
[2015-02-17T22:09Z] sh-5-16.local: Traceback (most recent call last):
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
[2015-02-17T22:09Z] sh-5-16.local:     gemini.gemini_main.main()
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
[2015-02-17T22:09Z] sh-5-16.local:     args.func(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 244, in mergechunk_fn
[2015-02-17T22:09Z] sh-5-16.local:     gemini_merge_chunks.merge_chunks(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
[2015-02-17T22:09Z] sh-5-16.local:     merge_db_chunks(args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
[2015-02-17T22:09Z] sh-5-16.local:     append_sample_genotype_counts(main_curr, db)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local:     main_curr.execute(cmd)
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: Traceback (most recent call last):
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
[2015-02-17T22:09Z] sh-5-16.local:     gemini.gemini_main.main()
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
[2015-02-17T22:09Z] sh-5-16.local:     args.func(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 244, in mergechunk_fn
[2015-02-17T22:09Z] sh-5-16.local:     gemini_merge_chunks.merge_chunks(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
[2015-02-17T22:09Z] sh-5-16.local:     merge_db_chunks(args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
[2015-02-17T22:09Z] sh-5-16.local:     append_sample_genotype_counts(main_curr, db)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local:     main_curr.execute(cmd)
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: Traceback (most recent call last):
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
[2015-02-17T22:09Z] sh-5-16.local:     gemini.gemini_main.main()
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
[2015-02-17T22:09Z] sh-5-16.local:     args.func(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 244, in mergechunk_fn
[2015-02-17T22:09Z] sh-5-16.local:     gemini_merge_chunks.merge_chunks(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
[2015-02-17T22:09Z] sh-5-16.local:     merge_db_chunks(args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
[2015-02-17T22:09Z] sh-5-16.local:     append_sample_genotype_counts(main_curr, db)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local:     main_curr.execute(cmd)
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: Traceback (most recent call last):
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
[2015-02-17T22:09Z] sh-5-16.local:     gemini.gemini_main.main()
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
[2015-02-17T22:09Z] sh-5-16.local:     args.func(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 244, in mergechunk_fn
[2015-02-17T22:09Z] sh-5-16.local:     gemini_merge_chunks.merge_chunks(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
[2015-02-17T22:09Z] sh-5-16.local:     merge_db_chunks(args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
[2015-02-17T22:09Z] sh-5-16.local:     append_sample_genotype_counts(main_curr, db)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local:     main_curr.execute(cmd)
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: Traceback (most recent call last):
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
[2015-02-17T22:09Z] sh-5-16.local:     gemini.gemini_main.main()
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
[2015-02-17T22:09Z] sh-5-16.local:     args.func(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 244, in mergechunk_fn
[2015-02-17T22:09Z] sh-5-16.local:     gemini_merge_chunks.merge_chunks(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
[2015-02-17T22:09Z] sh-5-16.local:     merge_db_chunks(args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
[2015-02-17T22:09Z] sh-5-16.local:     append_sample_genotype_counts(main_curr, db)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local:     main_curr.execute(cmd)
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: Traceback (most recent call last):
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
[2015-02-17T22:09Z] sh-5-16.local:     gemini.gemini_main.main()
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
[2015-02-17T22:09Z] sh-5-16.local:     args.func(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 244, in mergechunk_fn
[2015-02-17T22:09Z] sh-5-16.local:     gemini_merge_chunks.merge_chunks(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
[2015-02-17T22:09Z] sh-5-16.local:     merge_db_chunks(args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
[2015-02-17T22:09Z] sh-5-16.local:     append_sample_genotype_counts(main_curr, db)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local:     main_curr.execute(cmd)
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: Traceback (most recent call last):
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
[2015-02-17T22:09Z] sh-5-16.local:     gemini.gemini_main.main()
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
[2015-02-17T22:09Z] sh-5-16.local:     args.func(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 244, in mergechunk_fn
[2015-02-17T22:09Z] sh-5-16.local:     gemini_merge_chunks.merge_chunks(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
[2015-02-17T22:09Z] sh-5-16.local:     merge_db_chunks(args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
[2015-02-17T22:09Z] sh-5-16.local:     append_sample_genotype_counts(main_curr, db)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local:     main_curr.execute(cmd)
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: Traceback (most recent call last):
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
[2015-02-17T22:09Z] sh-5-16.local:     gemini.gemini_main.main()
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
[2015-02-17T22:09Z] sh-5-16.local:     args.func(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 244, in mergechunk_fn
[2015-02-17T22:09Z] sh-5-16.local:     gemini_merge_chunks.merge_chunks(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
[2015-02-17T22:09Z] sh-5-16.local:     merge_db_chunks(args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
[2015-02-17T22:09Z] sh-5-16.local:     append_sample_genotype_counts(main_curr, db)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local:     main_curr.execute(cmd)
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: Traceback (most recent call last):
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
[2015-02-17T22:09Z] sh-5-16.local:     gemini.gemini_main.main()
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
[2015-02-17T22:09Z] sh-5-16.local:     args.func(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 244, in mergechunk_fn
[2015-02-17T22:09Z] sh-5-16.local:     gemini_merge_chunks.merge_chunks(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
[2015-02-17T22:09Z] sh-5-16.local:     merge_db_chunks(args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
[2015-02-17T22:09Z] sh-5-16.local:     append_sample_genotype_counts(main_curr, db)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local:     main_curr.execute(cmd)
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: Indexing /scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.gz with grabix.
[2015-02-17T22:09Z] sh-5-16.local: Loading 3429621 variants.
[2015-02-17T22:09Z] sh-5-16.local: Breaking /scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.gz into 16 chunks.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 0.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 1.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 2.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 3.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 4.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 5.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 6.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 7.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 8.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 9.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 10.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 11.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 12.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 13.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 14.
[2015-02-17T22:09Z] sh-5-16.local: Loading chunk 15.
[2015-02-17T22:09Z] sh-5-16.local: Done loading 3429621 variants in 16 chunks.
[2015-02-17T22:09Z] sh-5-16.local: 2015-02-17 14:08:34 merging 16 chunks.
[2015-02-17T22:09Z] sh-5-16.local: Traceback (most recent call last):
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
[2015-02-17T22:09Z] sh-5-16.local:     gemini.gemini_main.main()
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
[2015-02-17T22:09Z] sh-5-16.local:     args.func(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 136, in load_fn
[2015-02-17T22:09Z] sh-5-16.local:     gemini_load.load(parser, args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 50, in load
[2015-02-17T22:09Z] sh-5-16.local:     load_multicore(args)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 75, in load_multicore
[2015-02-17T22:09Z] sh-5-16.local:     merge_chunks_multicore(chunks, args.db)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 137, in merge_chunks_multicore
[2015-02-17T22:09Z] sh-5-16.local:     cleanup_temp_db_files(chunks)
[2015-02-17T22:09Z] sh-5-16.local:   File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 304, in cleanup_temp_db_files
[2015-02-17T22:09Z] sh-5-16.local:     os.remove(chunk_db)
[2015-02-17T22:09Z] sh-5-16.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk1.db'
[2015-02-17T22:09Z] sh-5-16.local: Uncaught exception occurred
Traceback (most recent call last):
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /share/PI/euan/apps/bin/gemini load  --passonly --skip-gerp-bp  -v /scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.gz -t snpEff --cores 16 /scratch/PI/euan/projects/athletes/bcbio/gemini/tx/tmp1HyQZi/b
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
    merge_db_chunks(args)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
    append_sample_genotype_counts(main_curr, db)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
    main_curr.execute(cmd)
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
Traceback (most recent call last):
  File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
    gemini.gemini_main.main()
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
    args.func(parser, args)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 244, in mergechunk_fn
    gemini_merge_chunks.merge_chunks(parser, args)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
    merge_db_chunks(args)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
    append_sample_genotype_counts(main_curr, db)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
    main_curr.execute(cmd)
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
Traceback (most recent call last):
  File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
    gemini.gemini_main.main()
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
    args.func(parser, args)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 244, in mergechunk_fn
    gemini_merge_chunks.merge_chunks(parser, args)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
    merge_db_chunks(args)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
    append_sample_genotype_counts(main_curr, db)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
    main_curr.execute(cmd)
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
Traceback (most recent call last):
  File "/share/PI/euan/apps/bin/gemini", line 6, in <module>
    gemini.gemini_main.main()
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1105, in main
    args.func(parser, args)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 244, in mergechunk_fn
    gemini_merge_chunks.merge_chunks(parser, args)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 190, in merge_chunks
    merge_db_chunks(args)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 176, in merge_db_chunks
    append_sample_genotype_counts(main_curr, db)
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/gemini/gemini_merge_chunks.py", line 40, in append_sample_genotype_counts
    main_curr.execute(cmd)
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
Indexing /scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.gz with grabix.
Loading 3429621 variants.
Breaking /scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.gz into 16 chunks.
Loading chunk 0.
Loading chunk 1.
Loading chunk 2.
Loading chunk 3.
Loading chunk 4.
Loading chunk 5.
Loading chunk 6.
Loading chunk 7.

...
$ fgrep Error call_variants_fb.err
[2015-02-17T21:44Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk15-extra.json'
[2015-02-17T22:02Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk7-extra.json'
[2015-02-17T22:03Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk4-extra.json'
[2015-02-17T22:03Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk3-extra.json'
[2015-02-17T22:03Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk10-extra.json'
[2015-02-17T22:03Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk2-extra.json'
[2015-02-17T22:03Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk1-extra.json'
[2015-02-17T22:04Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk8-extra.json'
[2015-02-17T22:04Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk5-extra.json'
[2015-02-17T22:04Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk6-extra.json'
[2015-02-17T22:04Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk0-extra.json'
[2015-02-17T22:04Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk11-extra.json'
[2015-02-17T22:06Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk13-extra.json'
[2015-02-17T22:06Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk9-extra.json'
[2015-02-17T22:06Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk12-extra.json'
[2015-02-17T22:06Z] sh-5-17.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk14-extra.json'
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
[2015-02-17T22:09Z] sh-5-16.local: OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk1.db'
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /share/PI/euan/apps/bin/gemini load  --passonly --skip-gerp-bp  -v /scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.gz -t snpEff --cores 16 /scratch/PI/euan/projects/athletes/bcbio/gemini/tx/tmp1HyQZi/batch1-freebayes.db
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk1.db'
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /share/PI/euan/apps/bin/gemini load  --passonly --skip-gerp-bp  -v /scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.gz -t snpEff --cores 16 /scratch/PI/euan/projects/athletes/bcbio/gemini/tx/tmp1HyQZi/batch1-freebayes.db
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
sqlite3.OperationalError: no such table: toMerge.sample_genotype_counts
OSError: [Errno 2] No such file or directory: '/scratch/PI/euan/projects/athletes/bcbio/gemini/batch1-freebayes-nomultiallelic.vcf.chunk1.db'
chapmanb commented 9 years ago

Daryl; Sorry about the GEMINI load problem. I'm confused as to what is going on here. It looks like some sort of race condition where two processes are trying to load the chunk databases at the same time, but I'm not sure how that would get triggered. Is it possible there are two simultaneous bcbio runs happening, or something else? If the GEMINI databases eventually get created correctly and the run proceeds on to QC, it sounds like you are okay, but I wish I knew the underlying reason. Hope things at least finish up okay on this run.
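
If it helps to check, one quick way to spot overlapping runs on SLURM is to list everything queued or running under your user and look for more than one bcbio driver. A minimal sketch, assuming a standard SLURM setup; the format string and grep pattern are only illustrative, so adjust them to your job names:

# List job id, name, state, runtime and node for all of your jobs, then
# filter for bcbio drivers and the IPython engines/controllers they launch.
squeue -u "$USER" -o "%.18i %.30j %.8T %.10M %R" | grep -Ei 'bcbio|ipengine|ipcontroller'

If that shows two driver processes pointed at the same work directory, that would explain the clashing GEMINI chunk databases.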

dwaggott commented 9 years ago

Ah, multiple processes could be the issue.

I have an sbatch script that submits the bcbio job. It looked like that submission job was running out of memory, so I increased its allocation from the default 4G to 8G. Due to the cluster setup this required increasing the number of cores from 1 to 2. After I did that I started seeing multiple bcbio processes in squeue; I just assumed this was a reporting quirk.

My setup is below.

$ cat call_variants.sbatch

#!/bin/bash
#set a job name
#SBATCH --job-name=call_variants_fb
#################  
#a file for job output, you can check job progress
#SBATCH --output=call_variants_fb.out
#################
# a file for errors from the job
#SBATCH --error=call_variants_fb.err
#################
#time you think you need; default is one hour
#format here is days-hours:minutes:seconds
#SBATCH --time=7-00:00:00
#################
#quality of service; think of it as job priority
#SBATCH --qos=long
#################
#number of nodes you are requesting
#SBATCH --nodes=1
#################
# access more resources
#SBATCH --partition=normal
#################
#memory per node; default is 4000 MB per CPU
#SBATCH --mem=8000
#you could use --mem-per-cpu instead; SLURM "CPUs" are what we are calling cores
#################
#tasks to run per node; each node has 16 cores, and a "task" here is effectively a core
#SBATCH --ntasks-per-node=2
#don't use --cpus-per-task on this cluster
#################
#get emailed about job BEGIN, END, and FAIL
##SBATCH --mail-type=ALL
#################
#who to send email to; please change to your email
##SBATCH  --mail-user=dwaggott@stanford.edu
#################
#now run normal batch commands
module load python/2.7.5
module load openmpi/1.6.5
srun /home/dwaggott/projects1/athletes/bcbio/call_variants.sh
$ cat call_variants.sh 
#!/bin/bash
module load python/2.7.5

# call variants
bcbio_nextgen.py  --timeout 10000 --retries 2 project_n143-fb/config/project_n143-fb.yaml  -t ipython -n 160 -q normal -s slurm -r "timelimit=7-00:00:00;qos=long"
dwaggott commented 9 years ago

Reverting to 1 CPU and restarting causes a memory issue on the submission script, although sacct reports it completed without error.
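
For reference, something along these lines is how I checked the accounting record (a sketch; <jobid> is a placeholder for the submission job's id):

# Ask SLURM accounting for the peak memory and exit status of each step of the job.
# MaxRSS shows how much memory was actually used versus what was requested (ReqMem).
sacct -j <jobid> --format=JobID,JobName,MaxRSS,ReqMem,State,ExitCode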

[2015-02-18T22:01Z] sh-2-21.local: Storing in local filesystem: /scratch/PI/euan/projects/athletes/final/3_32/3_32-freebayes.vcf.gz.tbi
[2015-02-18T22:01Z] sh-2-21.local: Storing in local filesystem: /scratch/PI/euan/projects/athletes/final/3_07/3_07-freebayes.vcf.gz.tbi
[2015-02-18T22:01Z] sh-2-21.local: Storing in local filesystem: /scratch/PI/euan/projects/athletes/final/3_57/3_57-freebayes.vcf.gz.tbi
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/threading.py", line 810, in __bootstrap_inner
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/threading.py", line 763, in run
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/logbook/queues.py", line 265, in _target
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/logbook/base.py", line 238, in pop_thread
  File "/share/PI/euan/apps/bcbio/anaconda/lib/python2.7/site-packages/logbook/base.py", line 203, in pop_thread
  File "logbook/_speedups.pyx", line 233, in logbook._speedups.ContextStackManager.pop_thread (logbook/_speedups.c:5839)
  File "logbook/_speedups.pyx", line 236, in logbook._speedups.ContextStackManager.pop_thread (logbook/_speedups.c:5660)
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
slurmstepd: Exceeded job memory limit at some point. oom-killer likely killed a process.
slurmstepd: Exceeded job memory limit at some point. oom-killer likely killed a process.
chapmanb commented 9 years ago

Daryl; Glad it seems to be finishing up, or at least almost finishing. I'm a bit confused about your submission setup and how it might be affecting things. Why is the bcbio call in a separate shell script instead of going directly into your call_variants.sbatch submission script? I'm not enough of a SLURM expert to know how the nested srun or changing --ntasks-per-node influences the run. I normally put everything in one batch script and use --cpus-per-task and --mem to adjust requirements directly. Hope that helps some.
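
For concreteness, here is a rough, untested sketch of the single-script setup I mean, with the bcbio flags copied from your call_variants.sh. The core and memory numbers are placeholders, and your cluster's note about avoiding --cpus-per-task may still apply:

#!/bin/bash
#SBATCH --job-name=call_variants_fb
#SBATCH --output=call_variants_fb.out
#SBATCH --error=call_variants_fb.err
#SBATCH --time=7-00:00:00
#SBATCH --qos=long
#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8000
# Run the driver directly: no wrapper script and no srun, so only one copy
# of bcbio_nextgen.py starts; it then submits its own engine jobs to SLURM.
module load python/2.7.5
bcbio_nextgen.py --timeout 10000 --retries 2 project_n143-fb/config/project_n143-fb.yaml \
  -t ipython -n 160 -q normal -s slurm -r "timelimit=7-00:00:00;qos=long"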

dwaggott commented 9 years ago

I'll rejig the submission script. I was just following some examples. The srun in this case is unnecessary.