bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

"sambamba-view: not enough data in stream" during multiprocessing: piped_bamprep #1985

Closed amizeranschi closed 7 years ago

amizeranschi commented 7 years ago

Hello,

I'm getting the following error during a germline variant calling analysis: "sambamba-view: not enough data in stream"

These are the last lines from bcbio-nextgen.log:

[2017-06-22T22:25Z] Timing: hla typing
[2017-06-22T22:25Z] Timing: alignment post-processing
[2017-06-22T22:25Z] multiprocessing: piped_bamprep
[2017-06-22T22:38Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/mnt/storage-brno3-cerit/nfs4/home/amizeranschi/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 22, in run
    _do_run(cmd, checks, log_stdout, env=env)
  File "/mnt/storage-brno3-cerit/nfs4/home/amizeranschi/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 102, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /mnt/storage-brno3-cerit/nfs4/home/amizeranschi/bcbio/anaconda/bin/sambamba view -f bam -L /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/regions/P10w-comp-noanalysis_blocks.bed /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/bamclean/P10w_1-down-10/P10w_1-down-10-reorder-fixrgs-gatkfilter-dedup.bam | /mnt/storage-brno3-cerit/nfs4/home/amizeranschi/bcbio/anaconda/bin/bedtools intersect -abam - -b /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/regions/P10w-comp-noanalysis_blocks.bed -f 1.0 -nonamecheck> /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/bcbiotx/tmpURqEqi/P10w_1-down-10-reorder-fixrgs-gatkfilter-dedup-noanalysis-prep.bam
sambamba-view: not enough data in stream
' returned non-zero exit status 1

This is the YAML template that I'm using:

details:
  - analysis: variant2
    genome_build: sacCer3
    resources:
      default:
        memory: 4G
        cores: 32
        jvm_opts: ["-Xms3000m", "-Xmx4000m"]
    metadata:
      batch: batch1
    algorithm:
      aligner: bwa
      mark_duplicates: true
      recalibrate: gatk
      realign: gatk
      variantcaller: [gatk-haplotype, freebayes, platypus, samtools]
      ensemble:
        numpass: 2
      ploidy: 1
      tools_off: gemini
      svcaller: [manta]
      variant_regions: ../config/variant_regions.bed

Can someone provide any clues as to why the previously mentioned error is occurring?

Thank you!

amizeranschi commented 7 years ago

Here is some extra information, in case it sheds more light on this issue.

The same issue occurs during a somatic variant calling pipeline, which I ran on the same data:

[2017-06-23T07:38Z] multiprocessing: piped_bamprep
[2017-06-23T07:53Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/mnt/storage-brno3-cerit/nfs4/home/amizeranschi/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 22, in run
    _do_run(cmd, checks, log_stdout, env=env)
  File "/mnt/storage-brno3-cerit/nfs4/home/amizeranschi/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 102, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /mnt/storage-brno3-cerit/nfs4/home/amizeranschi/bcbio/anaconda/bin/sambamba view -f bam -L /scratch/amizeranschi/job_44974.wagap-pro.cerit-sc.cz/allDNA-P10w-down-somatic/work/regions/P10w-comp-10-noanalysis_blocks.bed /scratch/amizeranschi/job_44974.wagap-pro.cerit-sc.cz/allDNA-P10w-down-somatic/work/bamclean/P10-20N-down-10/P10-20N-down-10-reorder-fixrgs-gatkfilter-dedup.bam | /mnt/storage-brno3-cerit/nfs4/home/amizeranschi/bcbio/anaconda/bin/bedtools intersect -abam - -b /scratch/amizeranschi/job_44974.wagap-pro.cerit-sc.cz/allDNA-P10w-down-somatic/work/regions/P10w-comp-10-noanalysis_blocks.bed -f 1.0 -nonamecheck> /scratch/amizeranschi/job_44974.wagap-pro.cerit-sc.cz/allDNA-P10w-down-somatic/work/bcbiotx/tmppZ3gc0/P10-20N-down-10-reorder-fixrgs-gatkfilter-dedup-noanalysis-prep.bam
sambamba-view: not enough data in stream
' returned non-zero exit status 1

My input data consists of downsampled BAM files. I created these by running the alignment step of the pipeline (no recalibrating/realigning/variant calling). I then filtered out unwanted reads and downsampled the remaining reads using either samtools or sambamba (I tried both tools, redoing the downsampling and recreating all BAM files altogether; the error still occurs).

I also tried adding bam_clean: picard, but it did not help with the sambamba error (and took a very long time to run...).

The germline variant calling job (described in the previous post) has crashed (I am using PBS Pro in a shared environment), but I still have access to the temporary files in the scratch directory. I tried manually running the command that generated the error; however, it seems that the temporary bcbio directory that contained the culprit file was deleted (possibly after the error occurred?). So I cannot reproduce the error myself after the bcbio crash. However, I did try running the pipeline multiple times, as I mentioned, and the sambamba error occurs every time.

set -o pipefail; /mnt/storage-brno3-cerit/nfs4/home/amizeranschi/bcbio/anaconda/bin/sambamba view -f bam -L /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/regions/P10w-comp-noanalysis_blocks.bed /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/bamclean/P10w_1-down-10/P10w_1-down-10-reorder-fixrgs-gatkfilter-dedup.bam | /mnt/storage-brno3-cerit/nfs4/home/amizeranschi/bcbio/anaconda/bin/bedtools intersect -abam - -b /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/regions/P10w-comp-noanalysis_blocks.bed -f 1.0 -nonamecheck> /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/bcbiotx/tmpURqEqi/P10w_1-down-10-reorder-fixrgs-gatkfilter-dedup-noanalysis-prep.bam
-bash: /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/bcbiotx/tmpURqEqi/P10w_1-down-10-reorder-fixrgs-gatkfilter-dedup-noanalysis-prep.bam: No such file or directory

ls -la /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/bcbiotx/tmpURqEqi
ls: cannot access /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/bcbiotx/tmpURqEqi: No such file or directory

chapmanb commented 7 years ago

Thanks for the detailed report and apologies about the issue. I'm not sure what exactly is going on here but I'm happy to try to help debug. On a practical note, do you need recalibration and realignment? They're generally not needed and add a lot of extra processing time and file manipulation. If not, removing those from your configuration will skip this step and avoid the problem. The implementation here is not perfect as we're trying to parallelize, and we're hoping to redo it with the new GATK4, where less parallelism may work faster.

If you do need it, a couple of thoughts on debugging:

If you can provide any additional details from that, I'm happy to try to think of fixes or workarounds for the problem. Thanks again for the help debugging.

amizeranschi commented 7 years ago

Hi Brad,

Thank you very much for your answer. I've managed to reproduce the error by running the altered command that you recommended:

/mnt/storage-brno3-cerit/nfs4/home/amizeranschi/bcbio/anaconda/bin/sambamba view -f bam -L /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/regions/P10w-comp-noanalysis_blocks.bed /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/bamclean/P10w_1-down-10/P10w_1-down-10-reorder-fixrgs-gatkfilter-dedup.bam | /mnt/storage-brno3-cerit/nfs4/home/amizeranschi/bcbio/anaconda/bin/bedtools intersect -abam - -b /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/regions/P10w-comp-noanalysis_blocks.bed -f 1.0 -nonamecheck> P10w_1-down-10-reorder-fixrgs-gatkfilter-dedup-noanalysis-prep.bam
sambamba-view: not enough data in stream

However, I am not sure what exactly you meant by the following:

The best way to debug might be to subset this and see if you can reproduce the issue with a few regions to determine what block of the BAM file is causing issues.

Could you please shed more light on this?

This is the content of the BED file:

cat /scratch/amizeranschi/job_44870.wagap-pro.cerit-sc.cz/allDNA-P10w-down-germline/work/regions/P10w-comp-noanalysis_blocks.bed
chrI	162230	163217
chrII	264465	264859
chrII	469725	474216
chrII	810924	812585
chrIII	85947	86241
chrIII	148618	151520
chrIV	517242	517757
chrIV	647358	650790
chrIV	988154	991767
chrIV	1096386	1101056
chrIV	1207301	1208525
chrIV	1526117	1531933
chrV	115949	117044
chrV	444037	448664
chrV	572359	576874
chrVI	138504	143247
chrVII	536442	537742
chrVII	812096	816759
chrVII	1084939	1090940
chrVIII	87685	88548
chrVIII	212972	213546
chrVIII	535338	535952
chrX	473144	474286
chrX	732155	733121
chrXII	215701	216282
chrXII	452088	454152
chrXII	593823	595126
chrXII	979254	980075
chrXII	1068809	1070946
chrXIII	185314	189478
chrXIII	373306	373995
chrXIV	99848	100173
chrXIV	565435	565763
chrXV	119517	122576
chrXV	595527	596729
chrXV	707432	708189
chrXV	1080171	1081083
chrXVI	437502	438845
chrXVI	805257	805981
chrXVI	944915	945525
chrM	0	85779
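Since the command reproduces outside bcbio, one way to get a little more detail is to check which stage of the pipe exits non-zero. The following is a minimal Python sketch of that idea, not part of bcbio: `run_piped` is a hypothetical helper name, and the file names in the usage comment are placeholders for the full paths shown above.

```python
import subprocess


def run_piped(first_cmd, second_cmd, out_path):
    """Run `first_cmd | second_cmd > out_path` and return both exit codes.

    Unlike `set -o pipefail`, which reports a single status for the whole
    pipe, this returns (first_exit_code, second_exit_code).
    """
    with open(out_path, "wb") as out:
        first = subprocess.Popen(first_cmd, stdout=subprocess.PIPE)
        second = subprocess.Popen(second_cmd, stdin=first.stdout, stdout=out)
        # Close our copy of the pipe so the first process sees a broken pipe
        # if the second one exits early.
        first.stdout.close()
        second_code = second.wait()
        first_code = first.wait()
    return first_code, second_code


# Usage with the commands from this thread (placeholder file names):
# codes = run_piped(
#     ["sambamba", "view", "-f", "bam", "-L", "regions.bed", "input.bam"],
#     ["bedtools", "intersect", "-abam", "-", "-b", "regions.bed",
#      "-f", "1.0", "-nonamecheck"],
#     "prep.bam",
# )
```

A non-zero first code with a zero second code would point at the sambamba stage rather than bedtools.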

chapmanb commented 7 years ago

Thanks for following up and for the additional details. It's great you can reproduce outside of bcbio. What I meant about the BED file was that you can subset it to try to identify the genomic region causing the issue, to see if there is anything we can guess from that. So if you split the BED file in half and run the first half and the second, which does it fail on? Then keep subsetting and splitting until you've hopefully identified a single region it fails on, and we can hopefully have a better idea of how to reproduce and what is going on. Sorry to not have a better idea, but I don't have a good way to reproduce, and sambamba is pretty unique in its ability to subset a file based on a BED file using indices, so I'm not sure of a good solution to sub in. Hope this helps.
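The split-and-rerun procedure described above can be sketched in Python roughly as follows. This is an illustration under assumptions rather than anything bcbio provides: `find_failing_region` and `sambamba_fails` are hypothetical helper names, the sambamba flags are the ones from the failing command in this thread, and the sketch assumes the failure shows up in the sambamba stage on its own.

```python
import os
import subprocess
import tempfile


def find_failing_region(regions, fails):
    """Bisect a list of BED regions down to the smallest failing subset.

    `fails(subset)` is a caller-supplied callback that returns True when
    running the command on `subset` reproduces the error. Returns [] if the
    full list does not fail, a one-element list when a single region is
    isolated, or a longer list if neither half fails on its own.
    """
    if not fails(regions):
        return []
    while len(regions) > 1:
        mid = len(regions) // 2
        first, second = regions[:mid], regions[mid:]
        if fails(first):
            regions = first
        elif fails(second):
            regions = second
        else:
            # The failure needs regions from both halves; stop narrowing.
            break
    return regions


def sambamba_fails(bam_path, regions, sambamba="sambamba"):
    """Write `regions` to a temporary BED file and report whether
    `sambamba view -f bam -L <bed> <bam>` exits non-zero."""
    with tempfile.NamedTemporaryFile("w", suffix=".bed", delete=False) as bed:
        for chrom, start, end in regions:
            bed.write(f"{chrom}\t{start}\t{end}\n")
    try:
        result = subprocess.run(
            [sambamba, "view", "-f", "bam", "-L", bed.name, bam_path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
        )
        return result.returncode != 0
    finally:
        os.unlink(bed.name)


# Usage (placeholder BAM path; regions as (chrom, start, end) tuples):
# culprit = find_failing_region(
#     regions, lambda subset: sambamba_fails("input.bam", subset))
```

With the 41 regions in the BED file above, this needs at most a dozen or so runs to isolate a single region.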

amizeranschi commented 7 years ago

Hello, I switched off the realignment and recalibration steps and the pipeline is running now. Thank you very much for your help, I will close this topic.
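For reference, the workaround amounts to removing the `recalibrate` and `realign` keys from the `algorithm` section of each sample, as suggested earlier in the thread. A minimal Python sketch, assuming the sample YAML has already been parsed into a dictionary (the helper name `drop_bam_prep_steps` is hypothetical and not part of bcbio):

```python
import copy


def drop_bam_prep_steps(config):
    """Return a copy of a parsed sample config with the `recalibrate` and
    `realign` keys removed from every sample's `algorithm` section."""
    updated = copy.deepcopy(config)
    for sample in updated.get("details", []):
        algorithm = sample.get("algorithm", {})
        algorithm.pop("recalibrate", None)
        algorithm.pop("realign", None)
    return updated


# Usage with a trimmed version of the template from this thread:
config = {
    "details": [
        {
            "analysis": "variant2",
            "genome_build": "sacCer3",
            "algorithm": {
                "aligner": "bwa",
                "mark_duplicates": True,
                "recalibrate": "gatk",
                "realign": "gatk",
            },
        }
    ]
}
cleaned = drop_bam_prep_steps(config)
```

The original dictionary is left untouched, so the two versions can be compared before writing the new template.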