chapmanb / bcbio.variation

Toolkit to analyze genomic variation data, built on the GATK with Clojure
66 stars 15 forks source link

variant-ensemble error with normalized freebayes file #31

Closed stsmall closed 8 years ago

stsmall commented 8 years ago

Hi Brad, Thanks for writing bcbio.variation. I am attempting to use the variant-ensemble methods to produce a concordant set of SNP/indel calls from GATK, samtools, and freebayes. In a previous correspondence you mentioned that I should normalize freebayes output to decompose MNPs and complex types. I followed the recommended normalization pipeline as given in bcbio-nextgen (https://github.com/chapmanb/bcbio-nextgen/blob/cf66cea237037a6d2d98851ce46d821abc965fd8/bcbio/variation/freebayes.py#L109). When I run variant-ensemble on the 3 files, now with the normalized freebayes, I get the following error (below). If I use the freebayes file without normalization, the program complete without error, but the concordance of Indels is very low. Any ideas? thanks, scott

$ nice java -jar ~/programs_that_work/bcbio.variation.recall/bcbio.variation-0.2.6-standalone.jar variant-ensemble combine-callers.yaml /SerreDLab/smalls/bowtie2_index/Wb-PNG_Genome_assembly-pt22.spades.ragoutrep.gapfill.mt.fasta test.ensemble.vcf Haiti1007-3.HC.vcf.gz Haiti1007-3.norm.fb.vcf.gz Haiti1007-3.smt.vcf.gz INFO 17:21:05,250 HelpFormatter - --------------------------------------------------------------------------------- INFO 17:21:05,254 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-12-g034abb9, Compiled 2014/07/16 15:53:02 INFO 17:21:05,254 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 17:21:05,255 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 17:21:05,261 HelpFormatter - Program Args: -T CombineVariants --unsafe LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -R /SerreDLab/smalls/bowtie2_index/Wb-PNG_Genome_assembly-pt22.spades.ragoutrep.gapfill.mt.fasta -o /data/smalls/wuchereria/Wb_MF_swga_analysis/vcfs/test.ensemble-work/prep/txtmp3047152843532184609/Haiti1007-3-combo-mincombine.vcf --rod_priority_list combo,gatk,freebayes,samtools --variant:combo /data/smalls/wuchereria/Wb_MF_swga_analysis/vcfs/test.ensemble-work/prep/Haiti1007-3-combo.vcf --variant:gatk /data/smalls/wuchereria/Wb_MF_swga_analysis/vcfs/test.ensemble-work/prep/Haiti1007-3-gatk.vcf --variant:freebayes /data/smalls/wuchereria/Wb_MF_swga_analysis/vcfs/test.ensemble-work/prep/Haiti1007-3-freebayes.vcf --variant:samtools /data/smalls/wuchereria/Wb_MF_swga_analysis/vcfs/test.ensemble-work/prep/Haiti1007-3-samtools.vcf --sites_only --minimalVCF INFO 17:21:05,261 HelpFormatter - Executing as smalls@tequila on Linux 2.6.18-348.12.1.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19. INFO 17:21:05,262 HelpFormatter - Date/Time: 2016/05/04 17:21:05 INFO 17:21:05,262 HelpFormatter - --------------------------------------------------------------------------------- INFO 17:21:05,262 HelpFormatter - --------------------------------------------------------------------------------- INFO 17:21:06,679 GenomeAnalysisEngine - Strictness is SILENT INFO 17:21:06,955 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 17:21:08,486 RMDTrackBuilder - Writing Tribble index to disk for file /data/smalls/wuchereria/Wb_MF_swga_analysis/vcfs/test.ensemble-work/prep/Haiti1007-3-combo.vcf.idx INFO 17:21:10,342 RMDTrackBuilder - Writing Tribble index to disk for file /data/smalls/wuchereria/Wb_MF_swga_analysis/vcfs/test.ensemble-work/prep/Haiti1007-3-gatk.vcf.idx INFO 17:21:13,617 RMDTrackBuilder - Writing Tribble index to disk for file /data/smalls/wuchereria/Wb_MF_swga_analysis/vcfs/test.ensemble-work/prep/Haiti1007-3-freebayes.vcf.idx INFO 17:21:15,234 RMDTrackBuilder - Writing Tribble index to disk for file /data/smalls/wuchereria/Wb_MF_swga_analysis/vcfs/test.ensemble-work/prep/Haiti1007-3-samtools.vcf.idx INFO 17:21:15,893 GenomeAnalysisEngine - Preparing for traversal INFO 17:21:15,944 GenomeAnalysisEngine - Done preparing for traversal INFO 17:21:15,944 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:21:15,945 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 17:21:15,945 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 17:21:15,946 CombineVariants - Pre-stripping genotypes for performance Exception in thread "main" java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsInt(CommonInfo.java:242) at htsjdk.variant.variantcontext.VariantContext.getAttributeAsInt(VariantContext.java:703) at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.simpleMerge(GATKVariantContextUtils.java:946) at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:309) at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:117) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at bcbio.run.broad$run_gatk$fn2391.invoke(broad.clj:32) at bcbio.run.broad$run_gatk.invoke(broad.clj:31) at bcbio.variation.combine$combine_variants.doInvoke(combine.clj:73) at clojure.lang.RestFn.invoke(RestFn.java:1557) at bcbio.variation.recall$get_min_merged.invoke(recall.clj:170) at bcbio.variation.recall$fn7711.invoke(recall.clj:187) at clojure.lang.MultiFn.invoke(MultiFn.java:251) at bcbio.variation.recall$create_merged$fn7716.invoke(recall.clj:201) at clojure.core$map$fn4553.invoke(core.clj:2624) at clojure.lang.LazySeq.sval(LazySeq.java:40) at clojure.lang.LazySeq.seq(LazySeq.java:49) at clojure.lang.RT.seq(RT.java:507) at clojure.core$seq4128.invoke(core.clj:137) at clojure.core$map$fn4560.invoke(core.clj:2633) at clojure.lang.LazySeq.sval(LazySeq.java:40) at clojure.lang.LazySeq.seq(LazySeq.java:49) at clojure.lang.RT.seq(RT.java:507) at clojure.core$seq4128.invoke(core.clj:137) at clojure.core$map$fn4553.invoke(core.clj:2616) at clojure.lang.LazySeq.sval(LazySeq.java:40) at clojure.lang.LazySeq.seq(LazySeq.java:49) at clojure.lang.RT.seq(RT.java:507) at clojure.core$seq4128.invoke(core.clj:137) at clojure.core$map$fn4557.invoke(core.clj:2627) at clojure.lang.LazySeq.sval(LazySeq.java:40) at clojure.lang.LazySeq.seq(LazySeq.java:49) at clojure.lang.RT.seq(RT.java:507) at clojure.core$seq4128.invoke(core.clj:137) at clojure.core$map$fn4553.invoke(core.clj:2616) at clojure.lang.LazySeq.sval(LazySeq.java:40) at clojure.lang.LazySeq.seq(LazySeq.java:49) at clojure.lang.RT.seq(RT.java:507) at clojure.core$seq4128.invoke(core.clj:137) at clojure.core$map$fn4560.invoke(core.clj:2633) at clojure.lang.LazySeq.sval(LazySeq.java:40) at clojure.lang.LazySeq.seq(LazySeq.java:49) at clojure.lang.RT.seq(RT.java:507) at clojure.core$seq4128.invoke(core.clj:137) at clojure.core$map$fn4553.invoke(core.clj:2616) at clojure.lang.LazySeq.sval(LazySeq.java:40) at clojure.lang.LazySeq.seq(LazySeq.java:49) at clojure.lang.RT.seq(RT.java:507) at clojure.core$seq4128.invoke(core.clj:137) at clojure.core$reduce1.invoke(core.clj:903) at clojure.core$reverse.invoke(core.clj:917) at clojure.math.combinatorics$combinations.invoke(combinatorics.clj:190) at bcbio.variation.compare$variant_comparison_from_config$iter82578261$fn8262.invoke(compare.clj:255) at clojure.lang.LazySeq.sval(LazySeq.java:40) at clojure.lang.LazySeq.seq(LazySeq.java:49) at clojure.lang.RT.seq(RT.java:507) at clojure.core$seq4128.invoke(core.clj:137) at clojure.core$tree_seq$walk5045$fn5046.invoke(core.clj:4739) at clojure.lang.LazySeq.sval(LazySeq.java:40) at clojure.lang.LazySeq.seq(LazySeq.java:49) at clojure.lang.LazySeq.more(LazySeq.java:85) at clojure.lang.RT.more(RT.java:683) at clojure.core$rest4114.invoke(core.clj:73) at clojure.core$flatten.invoke(core.clj:6851) at bcbio.variation.compare$variant_comparison_from_config.invoke(compare.clj:253) at bcbio.variation.ensemble$consensus_calls.invoke(ensemble.clj:113) at bcbio.variation.ensemble$_main.doInvoke(ensemble.clj:132) at clojure.lang.RestFn.applyTo(RestFn.java:137) at clojure.core$apply.invoke(core.clj:630) at bcbio.variation.core$_main.doInvoke(core.clj:35) at clojure.lang.RestFn.applyTo(RestFn.java:137) at bcbio.variation.core.main(Unknown Source)

chapmanb commented 8 years ago

Scott; Thanks for the report and sorry about the issues. vcfallelicprimitives can leave some fields out of sync with the header, which it sounds like is happening here. If you can afford to drop the genotype annotations from FreeBayes, running without the --keep-geno argument might give you a cleaner output that will work with the ensemble method. I'd also recommend trying bcbio.variation.recall ensemble calling (https://github.com/chapmanb/bcbio.variation.recall#ensemble) which is a bit simpler so won't be as strict about the FreeBayes formatting. Hope this helps get your analysis finished.

stsmall commented 8 years ago

Hi Brad, I did as you suggested and it works fine now. I also tried vt decompose_blocksub in place of vcfallelicprimitives and that worked as well and allowed me to retain the info fields. thanks for your help, scott