Closed a113n closed 9 years ago
If I manually changed the Type to Float, followed by bgzip and tabix, the following error would still be present. This is strange because bcbio-variation-recall ensemble
is still referring to the old file where the type of FREQ is string:
$ bcbio-variation-recall ensemble --cores=16 --numpass 2 --nofiltered /expt/ensemble/Case_S1/Case_074_S1-ensemble.vcf.gz /data/genomes/Hsapiens/GRCh37/seq/GRCh37.fa /expt/mutect/Case_S1-ploidyfix.vcf.gz /expt/varscan/Case_S1-ploidyfix.vcf.gz /expt/vardict/Case_S1-ploidyfix.vcf.gz
java.lang.IllegalStateException: Incompatible header types, collision between these two types: FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency"> FORMAT=<ID=FREQ,Number=A,Type=Float,Description="Allele fraction of the alternate allele with regard to reference">
htsjdk.variant.vcf.VCFUtils.smartMergeHeaders VCFUtils.java: 91
bcbio.variation.variantcontext/merge-headers/fn variantcontext.clj: 148
bcbio.variation.variantcontext/write-vcf-w-template variantcontext.clj: 174
clojure.lang.RestFn.invoke RestFn.java: 494
bcbio.variation.ensemble.intersect/ensemble-vcfs/fn intersect.clj: 81
bcbio.variation.ensemble.intersect/ensemble-vcfs intersect.clj: 77
bcbio.variation.ensemble.intersect/-main intersect.clj: 124
clojure.lang.RestFn.applyTo RestFn.java: 137
clojure.core/apply core.clj: 624
bcbio.variation.recall.main/-main main.clj: 33
clojure.lang.RestFn.applyTo RestFn.java: 137
bcbio.variation.recall.main.main :
$ zgrep "^#.*FREQ" /expt/mutect/Case_S1-ploidyfix.vcf.gz |head
##FORMAT=<ID=FREQ,Number=A,Type=Float,Description="Allele fraction of the alternate allele with regard to reference">
$ zgrep "^#.*FREQ" /expt/varscan/Case_S1-ploidyfix.vcf.gz |head
##FORMAT=<ID=FREQ,Number=1,Type=Float,Description="Variant allele frequency">
$ zgrep "^#.*FREQ" /expt/vardict/Case_S1-ploidyfix.vcf.gz |head
Thanks for the report and sorry about the problem. You're right in your assessment of the issue but I'm having trouble reproducing this with the version of VarScan in bcbio (2.3.7). Wiith this version VarScan outputs type as float:
##FORMAT=<ID=FREQ,Number=1,Type=Float,Description="Variant allele frequency">
This is still slightly different than the MuTect definition, but the GATK header merging handlles this.
Are you using a different version of VarScan or have some other differences that might explain the origin of the Type=String
output? I'm trying to reproduce and understand the issue so we can hopefully put a fix in place. Thanks much for the help debugging.
Thanks for looking into the issue. I ran the following command (with slight modification for the sake of brevity) according to bcbio-nextgen-commands.log, and indeed VarScan output FREQ as String.
$ /expt/bcbio/bin/samtools mpileup -f /expt/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -d 1000 -L 1000 -l /expt/bcbio/varscan/1/Case_S1-1_0_15765000-raw-regions.bed /expt/bamprep/Case_S1_T/1/8_2015-10-02_TargetCapture-sort-1_0_15765000-prep.bam | grep -v -P ' 0 $' > test.mpileup
$ cat test.mpileup | java -Xms681m -Xmx1818m -jar /expt/bcbio/share/java/varscan/VarScan.v2.3.7.jar mpileup2cns --min-coverage 5 --p-value 0.98 --vcf-sample-list sample_list.txt --output-vcf --variants | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $4) } {print}' | vcfuniqalleles > test.vcf
$ grep "^#.*FREQ" test.vcf
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
This issue is also reproducible using the latest VarScan (2.4.0) downloaded from github. Here attached a minimal example that could aid the debugging process: https://www.dropbox.com/s/53e9litbmcynxp6/test.mpileup?dl=0
Thanks so much for the additional information, this helps a lot. We did have a fix for this in place but it was only being applied to paired runs, not single sample VarScan calling. I pushed a set of fixes that resolves this and also tries to clean up VarScan calling to be more consistent and use pipes instead of writing mpileup fixes to disk. If you update with:
bcbio_nextgen.py upgrade -u development --tools
I hope the latest version will work cleanly for you. On your run, you'll need to remove the varscan and ensemble directories so those get re-generated with the fixes.
Thanks again for all the help debugging and please let us know if you run into anything else.
Thanks for the cleanup, Brad. My clusters will love the disk being hit less. ;)
Hi,
I am using bcbio-nextgen stable version 0.9.3 for somatic variant calling, for some reason this version is unable to perform
bcbio-variation-recall ensemble
successfully.subprocess.CalledProcessError: Command 'bcbio-variation-recall ensemble --cores=16 --numpass 2 --nofiltered /expt/ensemble/Case_S1/Case_074_S1-ensemble.vcf.gz /data/genomes/Hsapiens/GRCh37/seq/GRCh37.fa /expt/mutect/Case_S1-ploidyfix.vcf.gz /expt/varscan/Case_S1-ploidyfix.vcf.gz /expt/vardict/Case_S1-ploidyfix.vcf.gz [1;31mjava.lang.IllegalStateExceptionESC[m: ESC[3mIncompatible header types, coll ision between these two types: FORMAT=<ID=FREQ,Number=1,Type=String,Description="Var iant allele frequency"> FORMAT=<ID=FREQ,Number=A,Type=Float,Description="Allele frac tion of the alternate allele with regard to reference">
I checked the three constituent variant call files, and it seems the issue originates from the varscan VCF:
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
Note: The type of FREQ is incorrectly set to StringThe field is correctly set in VCF files from mutect and vardict. May I know how I could fix the issue? Thanks!