bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

missing index #809

Closed: dwaggott closed this issue 9 years ago

dwaggott commented 9 years ago

Running the latest code: joint GATK pipeline on n=4 whole genomes on an SGE cluster. Scanning the qacct output, it looks like memory was fine.

[2015-03-31T09:27Z] scg3-1-2.local: 
[2015-03-31T09:27Z] scg3-1-3.local: Warning: bcbio no longer support explicit setting of mark_duplicate algorithm. Using best-practice choice based on input data.
[2015-03-31T09:27Z] scg3-1-1.local: samblaster: Version 0.1.21
[2015-03-31T09:27Z] scg3-1-1.local: samblaster: Inputting from stdin
[2015-03-31T09:27Z] scg3-1-1.local: samblaster: Outputting to stdout
[2015-03-31T09:27Z] scg3-1-1.local: samblaster: Opening /dev/fd/62 for write.
[2015-03-31T09:27Z] scg3-1-1.local: samblaster: Opening /dev/fd/63 for write.
[2015-03-31T09:27Z] scg3-1-3.local: bwa mem alignment from fastq: LP6005693-DNA_D03
[2015-03-31T09:27Z] scg3-1-1.local: [E::bwa_idx_load_from_disk] fail to locate the index files
[2015-03-31T09:27Z] scg3-1-1.local: samblaster: Missing header on input sam file.  Exiting.
[2015-03-31T09:27Z] scg3-1-1.local: 
[2015-03-31T09:27Z] scg3-1-1.local: Uncaught exception occurred
Traceback (most recent call last):
  File "/home/dwaggott/ashley1/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/home/dwaggott/ashley1/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /home/dwaggott/ashley1/apps/bin/bwa mem -M -t 12  -R '@RG\tID:1\tPL:illumina\tPU:1_2015-03-28_tk_gatk_joint\tSM:LP6005535-DNA_E01_1' -v 1 /srv/gs1/projects/ashley/apps/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37 /srv/gs1/projects/ashley/tk/results/bcbio/tk_gatk_joint/work/align_prep/LP6005535-DNA_E01_1-1.fq.gz /srv/gs1/projects/ashley/tk/results/bcbio/tk_gatk_joint/work/align_prep/LP6005535-DNA_E01_1-2.fq.gz | /home/dwaggott/ashley1/apps/bin/samblaster --splitterFile >(/home/dwaggott/ashley1/apps/bin/samtools sort -@ 12 -m 682M -T /home/dwaggott/scratch/bcbiotx/49ea9a18-518b-4cbc-9069-f1e687ee717b/tmp6Ntn0I/1_2015-03-28_tk_gatk_joint-sort-sorttmp-spl -o /home/dwaggott/scratch/bcbiotx/7fa65998-7721-4b5f-b0e6-14b0e2cfa4d5/tmpVFp1oM/1_2015-03-28_tk_gatk_joint-sort-sr.bam /dev/stdin) --discordantFile >(/home/dwaggott/ashley1/apps/bin/samtools sort -@ 12 -m 682M -T /home/dwaggott/scratch/bcbiotx/49ea9a18-518b-4cbc-9069-f1e687ee717b/tmp6Ntn0I/1_2015-03-28_tk_gatk_joint-sort-sorttmp-disc -o /home/dwaggott/scratch/bcbiotx/b5614880-dbab-4de7-8fa2-0c1ba447f66f/tmpRID0TI/1_2015-03-28_tk_gatk_joint-sort-disc.bam /dev/stdin) | samtools view -b -S -u - | sambamba sort -t 12 -m 682M --tmpdir /home/dwaggott/scratch/bcbiotx/49ea9a18-518b-4cbc-9069-f1e687ee717b/tmp6Ntn0I/1_2015-03-28_tk_gatk_joint-sort-sorttmp-full -o /home/dwaggott/scratch/bcbiotx/49ea9a18-518b-4cbc-9069-f1e687ee717b/tmp6Ntn0I/1_2015-03-28_tk_gatk_joint-sort.bam /dev/stdin
samblaster: Version 0.1.21
samblaster: Inputting from stdin
samblaster: Outputting to stdout
samblaster: Opening /dev/fd/62 for write.
samblaster: Opening /dev/fd/63 for write.
[E::bwa_idx_load_from_disk] fail to locate the index files
samblaster: Missing header on input sam file.  Exiting.

' returned non-zero exit status 1
[2015-03-31T09:27Z] scg3-1-3.local: samblaster: Version 0.1.21
[2015-03-31T09:27Z] scg3-1-3.local: samblaster: Inputting from stdin
[2015-03-31T09:27Z] scg3-1-3.local: samblaster: Outputting to stdout
[2015-03-31T09:27Z] scg3-1-3.local: samblaster: Opening /dev/fd/62 for write.
[2015-03-31T09:27Z] scg3-1-3.local: samblaster: Opening /dev/fd/63 for write.

chapmanb commented 9 years ago

Daryl; Sorry about the issue. bwa is complaining because it can't find the pre-created bwa indices for the genome. The command line does look strange, as it points to the seq directory in the bwa mem call:

/srv/gs1/projects/ashley/apps/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37

where it should be pointing to:

/srv/gs1/projects/ashley/apps/bcbio/genomes/Hsapiens/GRCh37/bwa/GRCh37.fa

Is it possible something changed with the installed galaxy *.loc files in /srv/gs1/projects/ashley/apps/bcbio/galaxy/tool-data/bwa_index.loc so they point to the wrong location?
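One way to inspect what the .loc file points to is sketched below. This is a minimal illustration, not part of bcbio: the function names are hypothetical, and it assumes the usual Galaxy .loc layout of tab-separated columns with `#` comment lines and the index path in the last column. Check those assumptions against the actual file before relying on the output.

```python
def read_loc_entries(loc_file):
    """Return the tab-separated fields of each non-comment, non-blank line."""
    entries = []
    with open(loc_file) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            # Skip blank lines and '#' comments (assumed Galaxy .loc convention).
            if not line.strip() or line.startswith("#"):
                continue
            entries.append(line.split("\t"))
    return entries


def index_paths_for(loc_file, genome_build):
    """Return the last column of every entry that mentions genome_build.

    Assumption: the index base path is stored in the final column.
    """
    return [fields[-1]
            for fields in read_loc_entries(loc_file)
            if genome_build in fields]
```

Calling `index_paths_for("/srv/gs1/projects/ashley/apps/bcbio/galaxy/tool-data/bwa_index.loc", "GRCh37")` would list the base paths recorded for GRCh37. An empty list, or a missing file, would be consistent with the bwa indices never having been registered.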

dwaggott commented 9 years ago

I did end up running --genomes GRCh37 a second time after I got a complaint of missing files. I assumed they had failed to install properly. If I accidentally used bcbio_nextgen_install.py instead of bcbio_nextgen.py upgrade, would it overwrite the loc file directory and ultimately ruin the install?

I only see loc files for sam, picard and gatk.

Upgrading a few tools, including bwa, to see if that sorts it out.

chapmanb commented 9 years ago

Daryl; Running bcbio_nextgen_install.py against an existing installation should work, although I'm confused as to what is going on. Do you have a bwa directory of indices in /srv/gs1/projects/ashley/apps/bcbio/genomes/Hsapiens/GRCh37/bwa/? Is it possible you didn't add --aligners bwa to the install/upgrade command line? I'm not sure why you wouldn't have a bwa-based .loc file beyond that. Hope a re-run of the data install adding bwa fixes it.

dwaggott commented 9 years ago

Upgrading with the aligners added fixed it.

dwaggott commented 9 years ago

Wait, it was running, but now it looks like a similar error.

I couldn't find the referenced bwa index. The upgrade didn't report a pure *.fa being unpacked.

$ ll /srv/gs1/projects/ashley/apps/bcbio/genomes/Hsapiens/GRCh37/bwa/
total 5.1G
-rw-rw-r-- 1 dwaggott scgpm-informatics_ashley 2.9G Mar 20  2013 GRCh37.fa.bwt
-rw-rw-r-- 1 dwaggott scgpm-informatics_ashley 740M Mar 20  2013 GRCh37.fa.pac
-rw-rw-r-- 1 dwaggott scgpm-informatics_ashley 6.7K Mar 20  2013 GRCh37.fa.ann
-rw-rw-r-- 1 dwaggott scgpm-informatics_ashley 6.5K Mar 20  2013 GRCh37.fa.amb
-rw-rw-r-- 1 dwaggott scgpm-informatics_ashley 1.5G Mar 20  2013 GRCh37.fa.sa
[2015-04-02T14:31Z] scg1-3-5.local: Uncaught exception occurred
Traceback (most recent call last):
  File "/home/dwaggott/ashley1/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/home/dwaggott/ashley1/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /home/dwaggott/ashley1/apps/bin/bwa mem -M -t 12  -R '@RG\tID:3\tPL:illumina\tPU:3_2015-03-28_tk_gatk_joint\tSM:LP6005692-DNA_C03' -v 1 /srv/gs1/projects/ashley/apps/bcbio/genomes/Hsapiens/GRCh37/bwa/GRCh37.fa /srv/gs1/projects/ashley/tk/results/bcbio/tk_gatk_joint/work/align_prep/LP6005692-DNA_C03-1.fq.gz /srv/gs1/projects/ashley/tk/results/bcbio/tk_gatk_joint/work/align_prep/LP6005692-DNA_C03-2.fq.gz | /home/dwaggott/ashley1/apps/bin/samblaster --splitterFile >(/home/dwaggott/ashley1/apps/bin/samtools sort -@ 12 -m 682M -T /home/dwaggott/scratch/bcbiotx/86be74d6-919f-4450-8aa1-cf0138a67a11/tmpUmyeiQ/3_2015-03-28_tk_gatk_joint-sort-sorttmp-spl -o /home/dwaggott/scratch/bcbiotx/f9481993-8b95-4843-9b32-e8864e97ebac/tmpS3E_DY/3_2015-03-28_tk_gatk_joint-sort-sr.bam /dev/stdin) --discordantFile >(/home/dwaggott/ashley1/apps/bin/samtools sort -@ 12 -m 682M -T /home/dwaggott/scratch/bcbiotx/86be74d6-919f-4450-8aa1-cf0138a67a11/tmpUmyeiQ/3_2015-03-28_tk_gatk_joint-sort-sorttmp-disc -o /home/dwaggott/scratch/bcbiotx/76b5038d-8be9-4630-918c-48ef051a85d1/tmptLQHpM/3_2015-03-28_tk_gatk_joint-sort-disc.bam /dev/stdin) | samtools view -b -S -u - | sambamba sort -t 12 -m 682M --tmpdir /home/dwaggott/scratch/bcbiotx/86be74d6-919f-4450-8aa1-cf0138a67a11/tmpUmyeiQ/3_2015-03-28_tk_gatk_joint-sort-sorttmp-full -o /home/dwaggott/scratch/bcbiotx/86be74d6-919f-4450-8aa1-cf0138a67a11/tmpUmyeiQ/3_2015-03-28_tk_gatk_joint-sort.bam /dev/stdin
[M::mem_pestat] mean and std.dev: (2749.66, 966.65)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 9933)
[M::mem_pestat] skip orientation FF
...
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 7550)
[M::mem_pestat] mean and std.dev: (2863.50, 981.85)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 9334)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RF
[M::mem_pestat] skip orientation RR
samblaster: Can't find first and/or second of pair in sam block of length 1 for id: HS2000-9101_162:8:2114:15015:27353
samblaster: Are you sure the input is sorted by read ids?

' returned non-zero exit status 1
[2015-04-02T14:31Z] scg1-3-5.local: Unexpected error
Traceback (most recent call last):
  File "/home/dwaggott/ashley1/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 38, in _setup_logging
    yield config
  File "/home/dwaggott/ashley1/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 79, in process_alignment
    return ipython.zip_args(apply(sample.process_alignment, *args))
  File "/home/dwaggott/ashley1/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/sample.py", line 98, in process_alignment
    data = align_to_sort_bam(fastq1, fastq2, aligner, data)
  File "/home/dwaggott/ashley1/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/alignment.py", line 64, in align_to_sort_bam
    names, align_dir, data)
  File "/home/dwaggott/ashley1/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/alignment.py", line 98, in _align_from_fastq
    out = align_fn(fastq1, fastq2, align_ref, names, align_dir, data)
  File "/home/dwaggott/ashley1/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/ngsalign/bwa.py", line 110, in align_pipe
    names, rg_info, data)
  File "/home/dwaggott/ashley1/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/ngsalign/bwa.py", line 128, in _align_mem
    [do.file_nonempty(tx_out_file), do.file_reasonable_size(tx_out_file, fastq_file)])
  File "/home/dwaggott/ashley1/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/home/dwaggott/ashley1/apps/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /home/dwaggott/ashley1/apps/bin/bwa mem -M -t 12  -R '@RG\tID:3\tPL:illumina\tPU:3_2015-03-28_tk_gatk_joint\tSM:LP6005692-DNA_C03' -v 1 /srv/gs1/projects/ashley/apps/bcbio/genomes/Hsapiens/GRCh37/bwa/GRCh37.fa /srv/gs1/projects/ashley/tk/results/bcbio/tk_gatk_joint/work/align_prep/LP6005692-DNA_C03-1.fq.gz /srv/gs1/projects/ashley/tk/results/bcbio/tk_gatk_joint/work/align_prep/LP6005692-DNA_C03-2.fq.gz | /home/dwaggott/ashley1/apps/bin/samblaster --splitterFile >(/home/dwaggott/ashley1/apps/bin/samtools sort -@ 12 -m 682M -T /home/dwaggott/scratch/bcbiotx/86be74d6-919f-4450-8aa1-cf0138a67a11/tmpUmyeiQ/3_2015-03-28_tk_gatk_joint-sort-sorttmp-spl -o /home/dwaggott/scratch/bcbiotx/f9481993-8b95-4843-9b32-e8864e97ebac/tmpS3E_DY/3_2015-03-28_tk_gatk_joint-sort-sr.bam /dev/stdin) --discordantFile >(/home/dwaggott/ashley1/apps/bin/samtools sort -@ 12 -m 682M -T /home/dwaggott/scratch/bcbiotx/86be74d6-919f-4450-8aa1-cf0138a67a11/tmpUmyeiQ/3_2015-03-28_tk_gatk_joint-sort-sorttmp-disc -o /home/dwaggott/scratch/bcbiotx/76b5038d-8be9-4630-918c-48ef051a85d1/tmptLQHpM/3_2015-03-28_tk_gatk_joint-sort-disc.bam /dev/stdin) | samtools view -b -S -u - | sambamba sort -t 12 -m 682M --tmpdir /home/dwaggott/scratch/bcbiotx/86be74d6-919f-4450-8aa1-cf0138a67a11/tmpUmyeiQ/3_2015-03-28_tk_gatk_joint-sort-sorttmp-full -o /home/dwaggott/scratch/bcbiotx/86be74d6-919f-4450-8aa1-cf0138a67a11/tmpUmyeiQ/3_2015-03-28_tk_gatk_joint-sort.bam /dev/stdin
[M::mem_pestat] mean and std.dev: (2749.66, 966.65)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 9933)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RF

chapmanb commented 9 years ago

Daryl; Your bwa directory looks right. There is no *.fa file; that name just serves as the base name to find the index files. The command also looks right, but bwa appears to be dying prematurely, which leads to the error from samblaster. I can't diagnose from this error log, but there are probably other errors upstream of this that might be causative if you want to dig into it more. You could also run in single core mode and post the log as a gist, and I might be able to identify the issue. Sorry about the problems but hope this helps.
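To illustrate the base name point, here is a small sketch that checks whether the index files exist next to a given base name. The function name is hypothetical, and the suffix list is taken from the directory listing earlier in this thread rather than from bwa's documentation.

```python
import os

# Suffixes seen next to the GRCh37.fa base name in the listing above.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(prefix):
    """Return the expected index files that are absent for this base name.

    The base name itself (for example GRCh37.fa) does not need to exist.
    """
    return [prefix + suffix
            for suffix in BWA_INDEX_SUFFIXES
            if not os.path.exists(prefix + suffix)]
```

For the directory shown above, `missing_bwa_index_files("/srv/gs1/projects/ashley/apps/bcbio/genomes/Hsapiens/GRCh37/bwa/GRCh37.fa")` should return an empty list, while the earlier `seq/GRCh37` base name would presumably report all five files as missing.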

chapmanb commented 9 years ago

Please re-open if you still run into issues. You should also be able to close your own issues if they are resolved.