bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

RNA-seq default pipeline yields errors #283

Closed kspham closed 10 years ago

kspham commented 10 years ago

ValueError: Cannot detect which reference version /usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37/bowtie2/GRCh37 is. Should end in either .ebwt (bowtie) or .bt2 (bowtie2).

It's clear that the default installation forgot to download the bowtie indices. S.

lpantano commented 10 years ago

Yep, the same thing happened to me.

kspham commented 10 years ago

And how did you resume it? From tophat2, skipping the trimming steps that have already been done?

roryk commented 10 years ago

Hi Son and Lorena,

Sorry for the trouble-- in the default minimal installation we weren't installing bowtie2. I added that to the installer here: 28417f2. Son, if you do:

bcbio_nextgen.py upgrade --aligners bowtie2

it should install the bowtie2 indices for GRCh37.

The pipeline will automatically pick up where it left off, so you should be all good after installing the missing bowtie2 index.

Thanks for the report!

kspham commented 10 years ago

Very strange:

$ bcbio_nextgen.py upgrade --aligners bowtie2
usage: bcbio_nextgen.py upgrade [-h] [--tooldir TOOLDIR] [--tools]
                                [-u {stable,development,system,skip}]
                                [--toolplus {protected,data}]
                                [--genomes {GRCh37,hg19,mm10,mm9,rn5,canFam3}]
                                [--aligners {bowtie,bowtie2,bwa,novoalign,star,ucsc}]
                                [--data] [--nosudo] [--isolate]
                                [--tooldist {minimal,full}]
                                [--distribution {ubuntu,debian,centos,scientificlinux,macosx}]

optional arguments:
  -h, --help            show this help message and exit
  --tooldir TOOLDIR     Directory to install 3rd party software tools. Leave unspecified for no tools
  --tools               Boolean argument specifying upgrade of tools. Uses previously saved install directory
  -u {stable,development,system,skip}, --upgrade {stable,development,system,skip}
                        Code version to upgrade
  --toolplus {protected,data}
                        Specify additional tool categories to install
  --genomes {GRCh37,hg19,mm10,mm9,rn5,canFam3}
                        Genomes to download
  --aligners {bowtie,bowtie2,bwa,novoalign,star,ucsc}
                        Aligner indexes to download
  --data                Upgrade data dependencies
  --nosudo              Specify we cannot use sudo for commands
  --isolate             Created an isolated installation without PATH updates
  --tooldist {minimal,full}
                        Type of tool distribution to install. Defaults to a minimum install.
  --distribution {ubuntu,debian,centos,scientificlinux,macosx}
                        Operating system distribution

roryk commented 10 years ago

Oops. How about:

bcbio_nextgen.py upgrade --aligners bowtie2 --genomes GRCh37 --data

kspham commented 10 years ago

It works. But the pipeline doesn't automatically pick up where it left off! It's unzipping the read files again

roryk commented 10 years ago

Hi Son,

@porterjamesj just fixed that issue here: https://github.com/chapmanb/bcbio-nextgen/pull/270. If you upgrade to the development version you should pull in his patch:

bcbio_nextgen.py upgrade -u development

kspham commented 10 years ago

The upgrade helps somewhat (it bypasses the unzipping stage), but it seems to use the original fastq files rather than the trimmed reads :(

roryk commented 10 years ago

Hi Son,

I'm really sorry to make you run around like this. It looks like the trimming got accidentally dropped from the RNA-seq pipeline in the development version; we've been retooling some of the infrastructure and I missed this one. I restored it here: a1ed4064b2d46d1e. I reopened the issue; could you let me know if it ends up running okay? If you run the upgrade again you will get the fix.

kspham commented 10 years ago

Works -- Please close!

aminmg commented 10 years ago

Hi All, I have followed the instructions above, but still get the error message:

$ /data/aminData/bcbio-nextgen/anaconda/bin/python bcbio_nextgen.py /data/aminData/bcbio-nextgen/galaxy/bcbio_system.yaml data/aminData/example/rnaseqExample/config/smallRNA.yaml
[2014-03-13 11:19] Using input YAML configuration: /data/aminData/example/rnaseqExample/config/smallRNA.yaml
[2014-03-13 11:19] Checking sample YAML configuration: /data/aminData/example/rnaseqExample/config/smallRNA.yaml
[2014-03-13 11:19] Testing minimum versions of installed programs
[2014-03-13 11:19] Resource requests: picard; memory: 2.5; cores: 1
[2014-03-13 11:19] Configuring 1 jobs to run, using 1 cores each with 2.8g of memory reserved for each job
[2014-03-13 11:19] run local -- checkpoint passed: trimming
[2014-03-13 11:19] multiprocessing: process_lane
[2014-03-13 11:19] Preparing 1_070113_control_experiment_small_COLO
[2014-03-13 11:19] multiprocessing: trim_lane
[2014-03-13 11:19] Trimming low quality ends and read through adapter sequence from /data/aminData/example/rnaseqExample/input/small_COLO_R1.fastq, /data/aminData/example/rnaseqExample/input/small_COLO_R2.fastq.
[2014-03-13 11:19] Resource requests: tophat2; memory: 1.0; cores: 16
[2014-03-13 11:19] Configuring 1 jobs to run, using 1 cores each with 1.2g of memory reserved for each job
[2014-03-13 11:19] multiprocessing: process_alignment
[2014-03-13 11:19] Aligning lane 1_070113_control_experiment_small_COLO with tophat2 aligner
Traceback (most recent call last):
  File "/data/aminData/bcbio-nextgen/anaconda/bin/bcbio_nextgen.py", line 59, in <module>
    main(**kwargs)
  File "/data/aminData/bcbio-nextgen/anaconda/bin/bcbio_nextgen.py", line 39, in main
    run_main(**kwargs)
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 40, in run_main
    fc_dir, run_info_yaml)
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 87, in _run_toplevel
    for xs in pipeline.run(config, config_file, parallel, dirs, pipeline_items):
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 396, in run
    samples = run_parallel("process_alignment", samples)
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 82, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 644, in __call__
    self.dispatch(function, args, kwargs)
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 391, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 129, in __init__
    self.results = func(*args, **kwargs)
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 47, in wrapper
    return apply(f, *args, **kwargs)
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 21, in process_alignment
    return lane.process_alignment(*args)
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/lane.py", line 105, in process_alignment
    data = align_to_sort_bam(fastq1, fastq2, aligner, data)
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/alignment.py", line 64, in align_to_sort_bam
    names, align_dir, data)
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/alignment.py", line 96, in _align_from_fastq
    out = align_fn(fastq1, fastq2, align_ref, names, align_dir, data)
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/ngsalign/tophat.py", line 258, in align
    align_dir, data, names)
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/ngsalign/tophat.py", line 114, in tophat_align
    if _ref_version(ref_file) == 1 or options.get("fusion-search", False):
  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/ngsalign/tophat.py", line 368, in _ref_version
    "(bowtie2)." % (ref_file))
ValueError: Cannot detect which reference version /data/aminData/bcbio-nextgen/genomes/Hsapiens/GRCh37/bowtie2 is. Should end in either .ebwt (bowtie) or .bt2 (bowtie2).

roryk commented 10 years ago

Hi Amin,

Sorry for the trouble. Hmmm. The value of ref_file there should look like this (with the trailing GRCh37):

/data/aminData/bcbio-nextgen/genomes/Hsapiens/GRCh37/bowtie2/GRCh37

In your /data/aminData/bcbio-nextgen/galaxy/tool-data/bowtie2_indices.loc, is the entry for GRCh37 missing the trailing GRCh37?
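
To illustrate why the trailing GRCh37 matters: a Bowtie index is addressed by a file prefix rather than a directory, and the index files sit next to that prefix with names such as GRCh37.1.bt2 (bowtie2) or GRCh37.1.ebwt (bowtie). The following is a minimal sketch of that kind of check, not bcbio's actual `_ref_version` code; the function name `detect_bowtie_version` is hypothetical.

```python
import glob


def detect_bowtie_version(ref_prefix):
    """Return 2 if bowtie2 (.bt2) index files exist for the prefix, 1 for bowtie (.ebwt).

    Hypothetical helper for illustration only. It looks for files named
    <ref_prefix>.*.bt2 or <ref_prefix>.*.ebwt and raises ValueError when
    neither kind is found, e.g. when a directory is passed instead of a prefix.
    """
    escaped = glob.escape(ref_prefix)  # protect any glob metacharacters in the path
    if glob.glob(escaped + ".*.bt2"):
        return 2
    if glob.glob(escaped + ".*.ebwt"):
        return 1
    raise ValueError(
        "No .ebwt or .bt2 index files found for prefix %s" % ref_prefix
    )
```

With a prefix ending in bowtie2/GRCh37 this sketch finds the index files; with the bare bowtie2 directory it raises, which mirrors the error reported above.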

aminmg commented 10 years ago

Hi Rory, thanks a million for your reply. For some reason my bowtie2_indices.loc had two entries:

$ more bowtie2_indices.loc
GRCh37 GRCh37 Human (GRCh37) /data/aminData/bcbio-nextgen/genomes/Hsapiens/GRCh37/bowtie2
GRCh37 GRCh37 Human (GRCh37) /data/aminData/bcbio-nextgen/genomes/Hsapiens/GRCh37/bowtie2/GRCh37

Probably the first one came from the automated installer and the second one from the upgrade? Removing the first one solved the problem. Thanks -A
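
For anyone checking their own .loc file for the same problem, here is a minimal sketch that groups entries by genome key and reports keys listed more than once. It assumes tab-separated lines with the key in the first column and the index path in the last; `find_duplicate_keys` is a hypothetical helper, not part of bcbio or Galaxy.

```python
from collections import defaultdict


def find_duplicate_keys(loc_lines):
    """Map each genome key to its index paths, keeping only keys listed more than once.

    Assumes tab-separated lines with the key first and the index path last.
    Blank lines and '#' comment lines are skipped.
    """
    paths_by_key = defaultdict(list)
    for line in loc_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        paths_by_key[fields[0]].append(fields[-1])
    return {key: paths for key, paths in paths_by_key.items() if len(paths) > 1}
```

Passing the lines of the file (for example via `open(path)`) returns a dictionary such as `{"GRCh37": [path_one, path_two]}` when a key appears twice, which makes it easy to see which entry lacks the trailing prefix.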

roryk commented 10 years ago

Great Amin, glad to hear it. I'm not sure where that second one came from. Hmm.

aminmg commented 10 years ago

Oops: After the fix above, it went quite far but failed:

[2014-03-13 23:03] Traceback (most recent call last):
  File "/usr/local/bin/bcbio_nextgen.py", line 59, in <module>
    main(**kwargs)
  File "/usr/local/bin/bcbio_nextgen.py", line 39, in main
    run_main(**kwargs)
  File "/usr/local/lib/python2.7/dist-packages/bcbio/pipeline/main.py", line 40, in run_main
    fc_dir, run_info_yaml)
  File "/usr/local/lib/python2.7/dist-packages/bcbio/pipeline/main.py", line 87, in _run_toplevel
    for xs in pipeline.run(config, config_file, parallel, dirs, pipeline_items):
  File "/usr/local/lib/python2.7/dist-packages/bcbio/pipeline/main.py", line 401, in run
    samples = rnaseq.estimate_expression(samples, run_parallel)
  File "/usr/local/lib/python2.7/dist-packages/bcbio/pipeline/rnaseq.py", line 9, in estimate_expression
    samples = run_parallel("generate_transcript_counts", samples)
  File "/usr/local/lib/python2.7/dist-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/usr/local/lib/python2.7/dist-packages/bcbio/distributed/multi.py", line 82, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 644, in __call__
    self.dispatch(function, args, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 391, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 129, in __init__
    self.results = func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/bcbio/utils.py", line 47, in wrapper
    return apply(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/bcbio/distributed/multitasks.py", line 69, in generate_transcript_counts
    return rnaseq.generate_transcript_counts(*args)
  File "/usr/local/lib/python2.7/dist-packages/bcbio/pipeline/rnaseq.py", line 15, in generate_transcript_counts
    data["count_file"] = featureCounts.count(data)
  File "/usr/local/lib/python2.7/dist-packages/bcbio/rnaseq/featureCounts.py", line 52, in count
    fixed_count_file = _format_count_file(count_file)
  File "/usr/local/lib/python2.7/dist-packages/bcbio/rnaseq/featureCounts.py", line 68, in _format_count_file
    df = pd.io.parsers.read_table(count_file, sep="\t", index_col=0, header=1)
AttributeError: 'NoneType' object has no attribute 'io'

roryk commented 10 years ago

Hi Amin,

Bummer-- it looks like the Python library pandas isn't installed. If you run bcbio_nextgen.py upgrade -u development, does it resolve the issue?
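
For context, an error like 'NoneType' object has no attribute 'io' is what one would expect when a module guards an optional import and falls back to None. The sketch below shows that general pattern with an earlier, clearer failure; it is an assumption about how such an import might be guarded, not a copy of featureCounts.py, and `optional_import` and `read_counts` are hypothetical names.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# pd is None when pandas is missing, so any later pd.<attr> access would
# raise AttributeError on NoneType unless it is checked first.
pd = optional_import("pandas")


def read_counts(count_file):
    """Parse a tab-separated count file, failing early if pandas is absent."""
    if pd is None:
        raise ImportError("pandas is required to parse %s" % count_file)
    return pd.read_csv(count_file, sep="\t", index_col=0, header=1)
```

Checking for None at the point of use turns a confusing AttributeError into a message that names the missing library.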

Was bcbio-nextgen installed with the installer? It looks like the bcbio-nextgen that is getting called is in the systemwide python directory.

Thanks a lot for the report!

aminmg commented 10 years ago

Thanks a lot, Rory. It worked. Could you also point me to how I can run ReorderSam? Now it complains that:

  File "/data/aminData/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 117, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command 'set -o pipefail; java -jar -Xms750m -Xmx20g /data/aminData/tools/share/java/RNA-SeQC/RNA-SeQC_v1.1.7.jar -n 1000 -s /data/aminData/example/rnaseqExample/work/qc/Control_rep1_COLO/tx/tmpH1ln9i/rnaseqc/sample_file.txt -t /data/aminData/bcbio-nextgen/genomes/Hsapiens/GRCh37/rnaseq/ref-transcripts.gtf -r /data/aminData/bcbio-nextgen/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -o /data/aminData/example/rnaseqExample/work/qc/Control_rep1_COLO/tx/tmpH1ln9i/rnaseqc -BWArRNA /data/aminData/bcbio-nextgen/genomes/Hsapiens/GRCh37/rnaseq/rRNA.fa -bwa /data/aminData/tools/bin/bwa -ttype 2
RNA-SeQC v1.1.7 05/14/12
Retriving contig names from reference
contig names in reference: 84
Loading GTF for Read Counting
Converting to refGene
Transcript objects to RefGen format: 2 s
Running IntronicExpressionReadBlock Walker ....
Arguments: [-T, IntronicExpressionReadBlock, --outfile_metrics, /data/aminData/example/rnaseqExample/work/qc/Control_rep1_COLO/tx/tmpH1ln9i/rnaseqc/Control_rep1_COLO/Control_rep1_COLO.metrics.tmp.txt, -R, /data/aminData/bcbio-nextgen/genomes/Hsapiens/GRCh37/seq/GRCh37.fa, -I, /data/aminData/example/rnaseqExample/work/align/Control_rep1_COLO/1_070113_control_experiment_small_COLO_tophat/1_070113_control_experiment_small_COLO.bam, -refseq, /data/aminData/example/rnaseqExample/work/qc/Control_rep1_COLO/tx/tmpH1ln9i/rnaseqc/refGene.txt, -l, ERROR]
org.broadinstitute.sting.utils.exceptions.UserException$LexicographicallySortedSequenceDictionary: Lexicographically sorted human genome sequence detected in reads. For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs. This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files. You can use the ReorderSam utility to fix this problem: http://www.broadinstitute.org/gsa/wiki/index.php/ReorderSam
reads contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, GL000191.1, GL000192.1, GL000193.1, GL000194.1, GL000195.1, GL000196.1, GL000197.1, GL000198.1, GL000199.1, GL000200.1, GL000201.1, GL000202.1, GL000203.1, GL000204.1, GL000205.1, GL000206.1, GL000207.1, GL000208.1, GL000209.1, GL000210.1, GL000211.1, GL000212.1, GL000213.1, GL000214.1, GL000215.1, GL000216.1, GL000217.1, GL000218.1, GL000219.1, GL000220.1, GL000221.1, GL000222.1, GL000223.1, GL000224.1, GL000225.1, GL000226.1, GL000227.1, GL000228.1, GL000229.1, GL000230.1, GL000231.1, GL000232.1, GL000233.1, GL000234.1, GL000235.1, GL000236.1, GL000237.1, GL000238.1, GL000239.1, GL000240.1, GL000241.1, GL000242.1, GL000243.1, GL000244.1, GL000245.1, GL000246.1, GL000247.1, GL000248.1, GL000249.1, MT, X, Y]
    at org.broadinstitute.sting.utils.SequenceDictionaryUtils.validateDictionaries(SequenceDictionaryUtils.java:128)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.validateSourcesAgainstReference(GenomeAnalysisEngine.java:730)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:809)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:672)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:227)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
    at org.broadinstitute.cga.rnaseq.gatk.GATKTools.runIntronReadCount(GATKTools.java:226)
    at org.broadinstitute.cga.rnaseq.ReadCountMetrics.runRegionCounting(ReadCountMetrics.java:243)
    at org.broadinstitute.cga.rnaseq.ReadCountMetrics.runReadCountMetrics(ReadCountMetrics.java:58)
    at org.broadinstitute.cga.rnaseq.RNASeqMetrics.runMetrics(RNASeqMetrics.java:220)
    at org.broadinstitute.cga.rnaseq.RNASeqMetrics.execute(RNASeqMetrics.java:166)
    at org.broadinstitute.cga.rnaseq.RNASeqMetrics.main(RNASeqMetrics.java:135)
RNA-SeQC Total Runtime: 0 min
' returned non-zero exit status 3
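
For reference, the difference between the lexicographic contig order reported above and the karyotypic order GATK asks for can be expressed as a sort key. This is only an illustration of the ordering, not a replacement for ReorderSam; `karyotypic_key` is a hypothetical helper, and placing MT and the GL contigs after Y is an assumption.

```python
def karyotypic_key(contig):
    """Sort key: numbered contigs numerically, then X, Y, MT, then everything else by name."""
    special = {"X": 0, "Y": 1, "MT": 2, "M": 2}
    if contig.isdigit():
        return (0, int(contig), "")
    if contig in special:
        return (1, special[contig], "")
    return (2, 0, contig)


# A shortened version of the lexicographically sorted list from the error message.
lexicographic = ["1", "10", "11", "2", "20", "3", "GL000191.1", "MT", "X", "Y"]
print(sorted(lexicographic, key=karyotypic_key))
# ['1', '2', '3', '10', '11', '20', 'X', 'Y', 'MT', 'GL000191.1']
```

The key compares numbered contigs as integers, which is what puts 2 ahead of 10; a plain string sort compares character by character and produces the order GATK rejects.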

roryk commented 10 years ago

Hi Amin,

Sorry for all of the issues. Did you use Tophat to align the samples? I fixed that bug here: 18dea9adb0850cb, so if you update your bcbio_nextgen installation to the development version it should fix that if you used Tophat.

bcbio_nextgen.py upgrade -u development

aminmg commented 10 years ago

Hi Rory, thanks a lot. It works.

roryk commented 10 years ago

Great, thanks for following up! Let us know if you run into any more issues.