bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Repeated execution of bedtools subtract on identical files during structural variation-detect_sv #882

Closed: schelhorn closed this issue 9 years ago

schelhorn commented 9 years ago

In the structural variation-detect_sv step of the variant2 analysis, multiple bedtools subtract operations of the form:

bedtools subtract -b /path/to/bcbio/genomes/Hsapiens/GRCh37/variation/LCR.bed.gz -a /path/to/working/directory/bedprep/Agilent_SureSelect_Human_All_Exon_V4_51mbp_GRCh37_Regions.bed

are executed, probably during the delly runs. This bedtools operation is run once for each sample, which doesn't make much sense since its output is constant. It is interspersed with:

bedtools subtract -b /path/to/bcbio/genomes/Hsapiens/GRCh37/variation/sv_repeat_telomere_centromere.bed -a /local/node/tmp/pybedtools.eIx_WT.tmp

the latter of which may be specific to each sample and is thus probably fine. As another curiosity, these commands seem to be executed serially by the IPython cluster engine on only one worker node, in contrast to the parallel execution that usually takes place. However, I had to restart bcbio, so this may be a consequence of the restart. The IPython controller logs contain no errors (only missing-heartbeat warnings) but are full of hundreds of entries like:

2015-06-10 11:34:01.782 [VMFixIPControllerApp] task::task '8e1ee2c4-5524-4eba-956e-ca680b55db21' arrived on 21
2015-06-10 11:34:01.783 [VMFixIPControllerApp] task::task 'b96c1fe3-60c2-4137-a213-01442974bb62' finished on 21
2015-06-10 11:34:01.833 [VMFixIPControllerApp] task::task 'aae40656-1434-4511-b17e-ea6f699d16df' arrived on 21
2015-06-10 11:34:01.833 [VMFixIPControllerApp] task::task '8e1ee2c4-5524-4eba-956e-ca680b55db21' finished on 21
2015-06-10 11:34:13.467 [VMFixIPControllerApp] task::task 'ec100382-04a1-4c62-b942-ccb9f7071fff' arrived on 21
2015-06-10 11:34:13.468 [VMFixIPControllerApp] task::task 'aae40656-1434-4511-b17e-ea6f699d16df' finished on 21

each of which completes quickly and likely corresponds to one of the bedtools tasks.
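Since the first subtraction's inputs (the LCR regions and the capture BED) are identical for every sample, its result could be computed once and reused. A minimal sketch of that skip-if-exists idea, with a hypothetical `cached_subtract` helper rather than bcbio's actual API:

```python
import os
import subprocess

def cached_subtract(a_bed, b_bed, out_bed, runner=subprocess.check_call):
    """Run `bedtools subtract -a a_bed -b b_bed > out_bed`, skipping the
    call entirely when out_bed is already on disk.  With identical inputs
    for every sample, only the first call does real work.

    `runner` is injectable for testing; by default it shells out.  Real
    pipeline code would also write through a transactional temp file so a
    crashed run cannot leave a partial out_bed behind.
    """
    if os.path.exists(out_bed):
        return out_bed
    with open(out_bed, "w") as out_handle:
        runner(["bedtools", "subtract", "-a", a_bed, "-b", b_bed],
               stdout=out_handle)
    return out_bed
```

Calling this once per sample would then issue a single bedtools process instead of one per sample.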

The last lines of the log are (headnode is the torque submit node):

[2015-06-10T09:24Z] headnode: ipython: split_variants_by_sample
[2015-06-10T09:24Z] headnode: Timing: prepped BAM merging
[2015-06-10T09:24Z] headnode: Timing: validation
[2015-06-10T09:24Z] headnode: ipython: compare_to_rm
[2015-06-10T09:24Z] headnode: Timing: ensemble calling
[2015-06-10T09:24Z] headnode: Timing: validation summary
[2015-06-10T09:24Z] headnode: Timing: structural variation
[2015-06-10T09:24Z] headnode: ipython: detect_sv

If possible, I'd prefer parallel execution of these bedtools steps as well as skipping the unnecessary bedtools commands; of course, these two goals may be interrelated.

schelhorn commented 9 years ago

The bcbio debug log shows the same pattern of short commands seen in the IPython logs:

[2015-06-10T11:56Z] WORKERNODEHOSTNAME: Resource requests: delly; memory: 1.00; cores: 1
[2015-06-10T11:56Z] WORKERNODEHOSTNAME: Configuring 1 jobs to run, using 1 cores each with 1.00g of memory reserved for each job

so the jobs are indeed issued by the delly workflow.

chapmanb commented 9 years ago

Sven-Eric; Sorry about the problems on the re-runs. I checked in a fix which I hope will resolve this issue by properly detecting whether the files already exist before doing some of the pre-processing; I believe that missing check is what leads to your repeated runs. This change slightly alters the file names by including the chromosome name for subparts (the chromosome was already part of the base file name, but re-use later could have caused a conflict), so the first run will re-prep these files and they should then be in place for any subsequent runs.

Either way, these should proceed in parallel independently of each other on a subsequent run. Is it possible you have a checkpoint file (checkpoints_parallel/multicore2.done) present from your other run that avoids parallelizing these steps?

Hope this helps get things running faster and thank you as always for the detailed report.

schelhorn commented 9 years ago

Thanks a lot; multicore2.done had not completed and I re-ran using the latest commit. However, I received an error in structural.run:

Traceback (most recent call last):
  File "/cluster/software/bcbio_dev/app/bin/bcbio_nextgen.py", line 224, in <module>
    main(**kwargs)
  File "/cluster/software/bcbio_dev/app/bin/bcbio_nextgen.py", line 43, in main
    run_main(**kwargs)
  File "/cluster/software/bcbio_dev/lib/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 44, in run_main
    fc_dir, run_info_yaml)
  File "/cluster/software/bcbio_dev/lib/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 79, in _run_toplevel
    for xs in pipeline.run(config, run_info_yaml, parallel, dirs, samples):
  File "/cluster/software/bcbio_dev/lib/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 183, in run
    samples = structural.run(samples, run_parallel)
  File "/cluster/software/bcbio_dev/lib/anaconda/lib/python2.7/site-packages/bcbio/structural/__init__.py", line 105, in run
    processed = run_parallel("detect_sv", ([xs, background, xs[0]["config"]] for xs in to_process.values()))
  File "/cluster/software/bcbio_dev/lib/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipython.py", line 130, in run
    for data in view.map_sync(fn, items, track=False):
  File "/cluster/software/bcbio_dev/lib/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.py", line 355, in map_sync
    return self.map(f,*sequences,**kwargs)
  File "<string>", line 2, in map
  File "/cluster/software/bcbio_dev/lib/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.py", line 55, in sync_results
    ret = f(self, *args, **kwargs)
  File "<string>", line 2, in map
  File "/cluster/software/bcbio_dev/lib/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.py", line 40, in save_ids
    ret = f(self, *args, **kwargs)
  File "/cluster/software/bcbio_dev/lib/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.py", line 1123, in map
    return pf.map(*sequences)
  File "/cluster/software/bcbio_dev/lib/anaconda/lib/python2.7/site-packages/IPython/parallel/client/remotefunction.py", line 271, in map
    ret = self(*sequences)
  File "<string>", line 2, in __call__
  File "/cluster/software/bcbio_dev/lib/anaconda/lib/python2.7/site-packages/IPython/parallel/client/remotefunction.py", line 78, in sync_view_results
    return f(self, *args, **kwargs)
  File "/cluster/software/bcbio_dev/lib/anaconda/lib/python2.7/site-packages/IPython/parallel/client/remotefunction.py", line 254, in __call__
    return r.get()
  File "/cluster/software/bcbio_dev/lib/anaconda/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 104, in get
    raise self._exception
IPython.parallel.error.CompositeError: one or more exceptions from call to method: detect_sv
[15:apply]: AssertionError: Finding job resources but no items to process

I will delete the structural results folder and retry.

chapmanb commented 9 years ago

Sven-Eric; Thanks for testing this and sorry about the additional issues. I'm not sure why you're getting this error exactly but I added a small fix to avoid calculating resources if we don't have any jobs to run. Hopefully this resolves it and you can finish cleanly.
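The fix described amounts to returning early before the resource calculation when the item list is empty, so the "Finding job resources but no items to process" assertion is never reached. A hypothetical sketch of the guard, not bcbio's actual dispatch code:

```python
def run_parallel(fn_name, items, config):
    """Dispatch work items to workers, skipping resource setup entirely
    when there is nothing to run (sketch with hypothetical names)."""
    items = [x for x in items if x is not None]
    if not items:
        # No jobs: bail out before _calc_resources, whose assertion
        # would otherwise fire on an empty batch.
        return []
    resources = _calc_resources(len(items), config)
    return [(fn_name, x, resources) for x in items]

def _calc_resources(n_items, config):
    """Size per-job resources; only meaningful for a non-empty batch."""
    assert n_items > 0, "Finding job resources but no items to process"
    return {"cores": config.get("cores", 1)}
```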

schelhorn commented 9 years ago

Seems to work just fine now. Thanks for the quick fix, Brad. Closing.