bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

net.sf.picard.sam.ReorderSam missing dictionary file for hg19 #336

Closed matanhofree closed 10 years ago

matanhofree commented 10 years ago

It seems Picard is missing a dictionary file for certain tasks. It expects to find this file next to the reference. Running the following solves the issue for me. Consider including the dict file with the download, or running this command automatically when that error is raised.

java -Xms750m -Xmx8500m -jar /external/extra-tools/picard-tools-1.108/CreateSequenceDictionary.jar REFERENCE=/external/bcbio-nextgen/genomes/Hsapiens/hg19/seq/hg19.fa OUTPUT=/external/bcbio-nextgen/genomes/Hsapiens/hg19/seq/hg19.dict
[Thu Mar 06 01:27:18 PST 2014] net.sf.picard.sam.CreateSequenceDictionary REFERENCE=/external/bcbio-nextgen/genomes/Hsapiens/hg19/seq/hg19.fa OUTPUT=/external/bcbio-nextgen/genomes/Hsapiens/hg19/seq/hg19.dict    TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false

Trace:

[2014-03-06 01:21] Uncaught exception occurred
Traceback (most recent call last):
  File "/cellar/users/mhofree/projects/cancer_ngs/external/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 23, in run
    _do_run(cmd, checks, log_stdout)
  File "/cellar/users/mhofree/projects/cancer_ngs/external/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 117, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'java -Xms750m -Xmx2500m -jar /cellar/users/mhofree/projects/cancer_ngs/external/extra-tools/picard-tools-1.108/ReorderSam.jar INPUT=/cellar/users/mhofree/projects/cancer_ngs/results/2014_02_02_ngs_node1/ramdrive/TCGA-CQ-5330/inputData/0a2f92be-8565-40a6-b377-2107b78af047/C495.TCGA-CQ-5330-10A-01D-1683-08.2.bam OUTPUT=/mnt/tmp/TCGA-CQ-5330/work_norealign/bamclean/TCGA-CQ-5330-10A-01D-1683-08/tx/tmpM6Mu1O/C495.TCGA-CQ-5330-10A-01D-1683-08.2-reorder.bam REFERENCE=/cellar/users/mhofree/projects/cancer_ngs/external/bcbio-nextgen/genomes/Hsapiens/hg19/seq/hg19.fa ALLOW_INCOMPLETE_DICT_CONCORDANCE=true TMP_DIR=/mnt/tmp/TCGA-CQ-5330/work_norealign/tmp/tmp4tC8f2 VALIDATION_STRINGENCY=SILENT
[Thu Mar 06 01:21:30 PST 2014] net.sf.picard.sam.ReorderSam INPUT=/cellar/users/mhofree/projects/cancer_ngs/results/2014_02_02_ngs_node1/ramdrive/TCGA-CQ-5330/inputData/0a2f92be-8565-40a6-b377-2107b78af047/C495.TCGA-CQ-5330-10A-01D-1683-08.2.bam OUTPUT=/mnt/tmp/TCGA-CQ-5330/work_norealign/bamclean/TCGA-CQ-5330-10A-01D-1683-08/tx/tmpM6Mu1O/C495.TCGA-CQ-5330-10A-01D-1683-08.2-reorder.bam REFERENCE=/cellar/users/mhofree/projects/cancer_ngs/external/bcbio-nextgen/genomes/Hsapiens/hg19/seq/hg19.fa ALLOW_INCOMPLETE_DICT_CONCORDANCE=true TMP_DIR=[/mnt/tmp/TCGA-CQ-5330/work_norealign/tmp/tmp4tC8f2] VALIDATION_STRINGENCY=SILENT    ALLOW_CONTIG_LENGTH_DISCORDANCE=false VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Thu Mar 06 01:21:30 PST 2014] Executing as mhofree@node1 on Linux 3.8.0-35-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_51-b00; Picard version: 1.108(1695) IntelDeflater
WARNING: BAM index file /cellar/users/mhofree/projects/cancer_ngs/results/2014_02_02_ngs_node1/ramdrive/TCGA-CQ-5330/inputData/0a2f92be-8565-40a6-b377-2107b78af047/C495.TCGA-CQ-5330-10A-01D-1683-08.2.bam.bai is older than BAM /cellar/users/mhofree/projects/cancer_ngs/results/2014_02_02_ngs_node1/ramdrive/TCGA-CQ-5330/inputData/0a2f92be-8565-40a6-b377-2107b78af047/C495.TCGA-CQ-5330-10A-01D-1683-08.2.bam
ERROR   2014-03-06 01:21:30     ReorderSam      No reference sequence dictionary found. Aborting.  You can create a sequence dictionary for the reference fasta using CreateSequenceDictionary.jar.
[Thu Mar 06 01:21:30 PST 2014] net.sf.picard.sam.ReorderSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=753926144
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
' returned non-zero exit status 1
An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (34, 0))

Traceback (most recent call last):
  File "/cellar/users/mhofree/projects/cancer_ngs/external/ngs-tools/bin/bcbio_nextgen.py", line 59, in <module>
    main(**kwargs)
  File "/cellar/users/mhofree/projects/cancer_ngs/external/ngs-tools/bin/bcbio_nextgen.py", line 39, in main
    run_main(**kwargs)
  File "/cellar/users/mhofree/projects/cancer_ngs/external/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 41, in run_main
    fc_dir, run_info_yaml)
  File "/cellar/users/mhofree/projects/cancer_ngs/external/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 88, in _run_toplevel
    for xs in pipeline.run(config, config_file, parallel, dirs, pipeline_items):
  File "/cellar/users/mhofree/projects/cancer_ngs/external/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 304, in run
    samples = run_parallel("process_alignment", samples)
  File "/cellar/users/mhofree/projects/cancer_ngs/external/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel

chapmanb commented 10 years ago

Matan; Thanks much for the report. I'm confused about this because the download does include hg19.dict. Did you install with the bcbio installer or another approach? Is it possible that you deleted the hg19.dict file at some point? Son reported this missing earlier (#297) but I couldn't replicate it, so I'm totally confused about how the downloaded dictionary file is disappearing.

Independent of that, I pushed a fix to ensure the index is present before calling this, which should resolve the issue going forward. It calls CreateSequenceDictionary exactly as you did manually. Thanks again and hope this fixes things.
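
For readers following along, a check of this kind could look roughly like the sketch below. It is not the actual bcbio code: the function names, the `picard_dir` argument and the injectable `runner` are hypothetical, and the memory flags are simply copied from the log above. It assumes a Picard 1.x layout with one jar per tool and a reference ending in a single extension such as `.fa`.

```python
import os
import subprocess


def dict_path_for(ref_fasta):
    """Return the .dict path that sits next to the reference FASTA."""
    # hg19.fa -> hg19.dict (a double extension like .fa.gz is not handled here)
    return os.path.splitext(ref_fasta)[0] + ".dict"


def create_dict_cmd(picard_dir, ref_fasta, dict_file):
    """Build the CreateSequenceDictionary command shown earlier in this thread."""
    jar = os.path.join(picard_dir, "CreateSequenceDictionary.jar")
    # Memory settings are taken from the ReorderSam log above; adjust as needed.
    return ["java", "-Xms750m", "-Xmx2500m", "-jar", jar,
            "REFERENCE=%s" % ref_fasta, "OUTPUT=%s" % dict_file]


def ensure_sequence_dict(ref_fasta, picard_dir, runner=subprocess.check_call):
    """Create the sequence dictionary only if it is missing; return its path."""
    dict_file = dict_path_for(ref_fasta)
    if not os.path.exists(dict_file):
        runner(create_dict_cmd(picard_dir, ref_fasta, dict_file))
    return dict_file
```

Calling `ensure_sequence_dict(ref_fasta, picard_dir)` before ReorderSam would then be a no-op when hg19.dict is already in place, and would regenerate it otherwise.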

matanhofree commented 10 years ago

Seems like my fix only solved the immediate problem. Re-running breaks with the following error:

ValueError: No database found in /external/ngs-tools/share/java/snpeff-3_4/data for hg19

I suspect this might be an installer issue. I actually tried to add hg19 just today using the following command:

bcbio_nextgen.py upgrade -u stable --genomes hg19 --data

Should that have installed the snpEff databases and hg19.dict?

chapmanb commented 10 years ago

Matan; My apologies, the snpEff issue is a bug in the latest release. The automated installer is not correctly identifying the databases that need to be downloaded. If you upgrade to development and re-get the data:

bcbio_nextgen.py upgrade -u development --genomes hg19 --data

It should get the snpEff databases. Thanks for the report and sorry again about the problem.