bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

argument change in GATK #460

Closed kspham closed 10 years ago

kspham commented 10 years ago

Hi Brad, I ran bcbio-nextgen on paired-end Illumina data and got this error:

ERROR MESSAGE: SAM/BAM file SAMFileReader{/media/proj1/mutech/2s2ns/work/bamclean/SID38973/2ns-reorder-fixrgs.bam} is malformed: the BAM file has a read with no stored bases (i.e. it uses '*') which is not supported in the GATK; see the --filter_bases_not_stored argument. Offender: @;>
ERROR ------------------------------------------------------------------------------------------

GATK said that this error can be resolved by setting --filter_bases_not_stored, but is it feasible with bcbio-nextgen? http://gatkforums.broadinstitute.org/discussion/3597/did-anyone-solved-the-incompatibility-of-bam-files-by-tmap-with-gatk

Is it common for Illumina platforms to output reads with no stored bases?

Many thanks! Son.

chapmanb commented 10 years ago

Son; Thanks for the report. I don't have much experience with IonTorrent data and mappers but added in these flags for filtering pre-aligned input BAMs. If you update to the latest development and remove your bamclean directory, you should be able to re-run the problem samples with this filter applied. Hope this helps.

kspham commented 10 years ago

Thank you, Brad. I also find this strange, because these are Illumina samples, not IonTorrent data.

chapmanb commented 10 years ago

Son -- you might want to look at your alignment file to be sure it was prepared correctly or just re-run the alignment as part of the bcbio pipeline.

kspham commented 10 years ago

I suspected so too, but all the BAM files terminate correctly (I checked their EOF signatures). I probably need to merge all the lanes into 2 files (paired-end) and run bcbio-nextgen with bwa again. I always try to avoid this since that step takes a lot of time. As far as I understand, bcbio-nextgen still doesn't handle multiple lanes?

Thanks again, Brad! Son.
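For readers following along, here is a minimal sketch of the EOF-signature check mentioned above. It assumes the standard 28-byte empty BGZF block that the SAM/BAM specification defines as the end-of-file marker; the function name `has_bgzf_eof` is made up for illustration. Passing this check only shows that the file was not truncated, so a BAM can pass it and still contain records that GATK rejects.

```python
import os

# 28-byte empty BGZF block used as the BAM end-of-file marker (per the SAM/BAM spec).
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff0600 42430200 1b000300 00000000 00000000"
)


def has_bgzf_eof(path):
    """Return True if the file at `path` ends with the BGZF EOF marker block."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        # Jump to the last 28 bytes and compare them with the marker.
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

Calling `has_bgzf_eof("2ns-reorder-fixrgs.bam")` on each lane would flag truncated files, but it says nothing about whether individual reads have stored bases.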

chapmanb commented 10 years ago

That's right, there isn't anything to handle these. The easiest way to merge is to use bedtools bamtofastq to convert them all into fastq then cat them together. Hope this helps.
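A rough sketch of that merge approach, in case it helps. The helper names and file names are hypothetical, and the sketch assumes each lane BAM has already been grouped by read name (for example with `samtools sort -n`), which bedtools needs in order to write paired FASTQ output. The command list can be passed to `subprocess.check_call`; the concatenation step is the Python equivalent of `cat`.

```python
import shutil


def bamtofastq_cmd(name_sorted_bam, fq1, fq2):
    """Build a bedtools bamtofastq command for one paired-end lane BAM."""
    return ["bedtools", "bamtofastq", "-i", name_sorted_bam,
            "-fq", fq1, "-fq2", fq2]


def concat_files(parts, out_path):
    """Concatenate files in the given order, like `cat part1 part2 > out`."""
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return out_path
```

The lanes have to be concatenated in the same order for the first and second read files, otherwise the mates in the two merged FASTQ files no longer line up.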

Lisa44 commented 9 years ago

Hi,

I am struggling with the same problem. I performed amplicon sequencing using the Illumina MiSeq. I am using the fastq files from the MiSeq, creating sam files using bwa, then converting the sam files to bam files using samtools, and attempting to realign the bam files using GATK. GATK gives me the following error:

ERROR MESSAGE: SAM/BAM file SAMFileReader{Sample1.S1.sorted.bam} is malformed: the BAM file has a read with no stored bases (i.e. it uses '*') which is not supported in the GATK; see the --filter_bases_not_stored argument. Offender: (null)

Did you find out what was the cause of this error?

Regards, Lisa

drmjc commented 9 years ago

For the record, I just had this error, issued by GATK BaseRecalibrator. I tracked it down to running bwa mem with -a, i.e. "-a: Output all found alignments for single-end or unpaired paired-end reads. These alignments will be flagged as secondary alignments."

This can be resolved either by recreating the BAM without '-a', or by passing --filter_bases_not_stored to the next GATK tool you need to run, which makes the MalformedReadFilter less sensitive and simply discards the bad data.
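To check whether a BAM is affected in this way, something like the sketch below may help. It reads SAM text (for example piped from `samtools view -h in.bam`) and reports records whose SEQ column is `*`, along with whether the secondary-alignment flag (0x100) is set. The second helper appends the flag named in the GATK error message to an argument list. Both function names are invented for this example, and this is not how bcbio itself applies the filter.

```python
def reads_without_bases(sam_lines):
    """Yield (read name, flag, is_secondary) for records whose SEQ is '*'."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if fields[9] == "*":
            flag = int(fields[1])
            yield fields[0], flag, bool(flag & 0x100)


def add_bases_filter(gatk_cmd):
    """Return a copy of a GATK 3 argument list with the no-bases filter added once."""
    flag = "--filter_bases_not_stored"
    cmd = list(gatk_cmd)
    if flag not in cmd:
        cmd.append(flag)
    return cmd
```

If every reported record has `is_secondary` set, the `-a` explanation fits; if primary records also show up, the input FASTQ is worth checking for empty reads.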

chapmanb commented 9 years ago

Mark; Thanks for following up here with the solution -- great to have this discussion be useful. Much appreciated.

deber1980 commented 7 years ago

Hello,

I'm running into this issue myself. I tracked it down to an empty read. I'm not sure how I can pass bcbio the relevant flags to get past these problematic reads.

` GGGGGGGGGGIIIGIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIGGIIIGIIIIIIIGGGGGGGIGIIGGIIIIGGIIGGGIGGIIGIIIGIIGIIGIIIG
@D00166:565:HWFZ6BCXX:1:2302:18025:97411 1:N:0:TGGTGGT
CTTGTGGGGTGACTGAGGAACCCCAGCGACTCTTTTATGGTGAGTGCTCTCAGCCTCAAGACTCCTCCCATAGAGACTGGGGGAAAAGAGGGGACTTTACC
+
AAGAGGGAAGGAGGGAAGA..AGGI.AGAGG.GGAAGGIIGGGGGGGGIGIGGGIGGGAGGIGGGG.GG<.GAAGGGGGGIGA.GGGGGGGGGGGGGGGAG
@D00166:565:HWFZ6BCXX:1:2302:18159:97423 1:N:0:TGGTGGT

+

@D00166:565:HWFZ6BCXX:1:2302:18080:97449 1:N:0:TGGTGGT
ATAAGCGTTAGTTCTTGAAACCAAGGCATTTGGGCAAATATTATACATTTTTATTTTATTAATTTTCCAGAACCCGTTTGAACCATGAAGCCATTTGTGC`

And this is the log:

[2016-10-17T07:14Z] INFO 07:14:05,256 ProgressMeter - done 4741.0 1.0 s 4.7 m 99.7% 1.0 s 0.0 s
[2016-10-17T07:14Z] INFO 07:14:05,256 ProgressMeter - Total runtime 1.34 secs, 0.02 min, 0.00 hours
[2016-10-17T07:14Z] INFO 07:14:05,257 MicroScheduler - 2 reads were filtered out during the traversal out of approximately 4743 total reads (0.04%)
[2016-10-17T07:14Z] INFO 07:14:05,257 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
[2016-10-17T07:14Z] INFO 07:14:05,257 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
[2016-10-17T07:14Z] INFO 07:14:05,258 MicroScheduler - -> 2 reads (0.04% of total) failing NotPrimaryAlignmentFilter
[2016-10-17T07:14Z] INFO 07:14:05,846 GATKRunReport - Uploaded run statistics report to AWS S3
[2016-10-17T07:14Z] Index BAM file: 1_2016-10-16_germline_pipeline-sort-chrUn_gl000215_0_172545-prep.bam
[2016-10-17T07:14Z] GATK: realign ('chrUn_gl000216', 0, 172294) : 224433
[2016-10-17T07:14Z] INFO 07:14:06,285 GATKRunReport - Uploaded run statistics report to AWS S3
[2016-10-17T07:14Z] ##### ERROR ------------------------------------------------------------------------------------------
[2016-10-17T07:14Z] ##### ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
[2016-10-17T07:14Z] ##### ERROR
[2016-10-17T07:14Z] ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
[2016-10-17T07:14Z] ##### ERROR The error message below tells you what is the problem.
[2016-10-17T07:14Z] ##### ERROR
[2016-10-17T07:14Z] ##### ERROR If the problem is an invalid argument, please check the online documentation guide
[2016-10-17T07:14Z] ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
[2016-10-17T07:14Z] ##### ERROR
[2016-10-17T07:14Z] ##### ERROR Visit our website and forum for extensive documentation and answers to
[2016-10-17T07:14Z] ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
[2016-10-17T07:14Z] ##### ERROR
[2016-10-17T07:14Z] ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
[2016-10-17T07:14Z] ##### ERROR
[2016-10-17T07:14Z] ##### ERROR MESSAGE: SAM/BAM/CRAM file htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter@4a8fa358 is malformed. Please see http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-input-files-for-sequence-read-data-bam-cramfor more information. Error details: the BAM file has a read with no stored bases (i.e. it uses '*') which is not supported in the GATK; see the --filter_bases_not_stored argument. Offender: D00155:466:HWF7YBCXX:1:2205:18159:97423
[2016-10-17T07:14Z] ##### ERROR ------------------------------------------------------------------------------------------
[2016-10-17T07:14Z] Uncaught exception occurred
Traceback (most recent call last):
File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
_do_run(cmd, checks, log_stdout)
File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; java -Xms166m -Xmx583m -XX:+UseSerialGC -Djava.io.tmpdir=/mnt/work/tx/tmpqUEvZa -jar /mnt/work/inputs/jars/gatk/GenomeAnalysisTK.jar -T PrintReads -L chr1:109827209-142541836 -R /mnt/work/inputs/data/genomes/hg19/seq/hg19.fa -I /mnt/work/align/224433/1_2016-10-16_germline_pipeline-sort.bam -BQSR /mnt/work/align/224433/1_2016-10-16_germline_pipeline-sort.grp -U LENIENT_VCF_PROCESSING --read_filter BadCigar --readfilter NotPrimaryAlignment -o /mnt/work/bamprep/224433/chr1/tx/tmp1KuGJ/1_2016-10-16_germline_pipeline-sort-chr1_109827208_142541836-prep.bam
INFO 07:12:35,635 HelpFormatter - --------------------------------------------------------------------------------
INFO 07:12:35,637 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO 07:12:35,637 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 07:12:35,637 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 07:12:35,640 HelpFormatter - Program Args: -T PrintReads -L chr1:109827209-142541836 -R /mnt/work/inputs/data/genomes/hg19/seq/hg19.fa -I /mnt/work/align/224433/1_2016-10-16_germline_pipeline-sort.bam -BQSR /mnt/work/align/224433/1_2016-10-16_germline_pipeline-sort.grp -U LENIENT_VCF_PROCESSING --read_filter BadCigar --readfilter NotPrimaryAlignment -o /mnt/work/bamprep/224433/chr1/tx/tmp1KuGJ/1_2016-10-16_germline_pipeline-sort-chr1_109827208_142541836-prep.bam
INFO 07:12:35,645 HelpFormatter - Executing as ubuntu@frontend001 on Linux 3.13.0-98-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_95-b00.
INFO 07:12:35,646 HelpFormatter - Date/Time: 2016/10/17 07:12:35
INFO 07:12:35,646 HelpFormatter - --------------------------------------------------------------------------------
INFO 07:12:35,646 HelpFormatter - --------------------------------------------------------------------------------
INFO 07:12:35,816 GenomeAnalysisEngine - Strictness is SILENT
INFO 07:12:36,351 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 07:12:36,383 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 07:12:36,390 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 07:12:36,420 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
INFO 07:12:36,445 IntervalUtils - Processing 32714628 bp from intervals
INFO 07:12:36,664 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 07:12:36,883 GenomeAnalysisEngine - Done preparing for traversal
INFO 07:12:36,884 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 07:12:36,885 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 07:12:36,886 ProgressMeter - Location | reads | elapsed | reads | completed | runtime | runtime
INFO 07:12:36,901 ReadShardBalancer$1 - Loading BAM index data
INFO 07:12:37,149 ReadShardBalancer$1 - Done loading BAM index data
INFO 07:13:07,579 ProgressMeter - chr1:110469275 0.0 30.0 s 50.8 w 2.0% 25.5 m 25.0 m
INFO 07:13:38,006 ProgressMeter - chr1:111924380 200005.0 61.0 s 5.1 m 6.4% 15.9 m 14.8 m
INFO 07:14:06,285 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: SAM/BAM/CRAM file htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter@4a8fa358 is malformed. Please see http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-input-files-for-sequence-read-data-bam-cramfor more information. Error details: the BAM file has a read with no stored bases (i.e. it uses '*') which is not supported in the GATK; see the --filter_bases_not_stored argument. Offender: D00155:466:HWF7YBCXX:1:2205:18159:97423
ERROR ------------------------------------------------------------------------------------------

' returned non-zero exit status 1
Traceback (most recent call last):
File "/usr/local/bin/bcbio_nextgen.py", line 226, in <module>
main(**kwargs)
File "/usr/local/bin/bcbio_nextgen.py", line 43, in main
run_main(**kwargs)
File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 39, in run_main
fc_dir, run_info_yaml)
File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 82, in _run_toplevel
for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 147, in variant2pipeline
samples = region.parallel_prep_region(samples, run_parallel)
File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/region.py", line 139, in parallel_prep_region
"piped_bamprep", _add_combine_info, file_key, ["config"])
File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/split.py", line 59, in parallel_split_combine
split_output = parallel_fn(parallel_name, split_args)
File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
return run_multicore(fn, items, config, parallel=parallel)
File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 812, in __call__
self.retrieve()
File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 749, in retrieve
exception = exception_type(report)
TypeError: __init__() takes at least 3 arguments (2 given)
' returned non-zero exit status 1
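One possible workaround for the empty record in the FASTQ excerpt above (a header followed by blank sequence and quality lines) is to drop such reads before alignment. The sketch below is only an illustration under two assumptions: records are strictly four lines each, and the two paired files list mates in the same order. The function names are hypothetical and this is not something bcbio does for you.

```python
def fastq_records(lines):
    """Group an iterable of FASTQ lines into 4-tuples (header, seq, plus, qual)."""
    record = []
    for line in lines:
        record.append(line.rstrip("\r\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []


def drop_empty_pairs(r1_lines, r2_lines):
    """Yield mate pairs in which both reads have a non-empty sequence line."""
    pairs = zip(fastq_records(r1_lines), fastq_records(r2_lines))
    for rec1, rec2 in pairs:
        # Drop the whole pair so the two output files stay in sync.
        if rec1[1] and rec2[1]:
            yield rec1, rec2
```

Writing the surviving pairs back out as two FASTQ files and re-running the alignment should avoid producing reads with no stored bases from this cause.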

chapmanb commented 7 years ago

Following up on this in #1603