bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 354 forks source link

Issues of VarScan header FREQ field #1062

Closed a113n closed 9 years ago

a113n commented 9 years ago

Hi,

I am using bcbio-nextgen stable version 0.9.3 for somatic variant calling, for some reason this version is unable to perform bcbio-variation-recall ensemble successfully.

subprocess.CalledProcessError: Command 'bcbio-variation-recall ensemble --cores=16 --numpass 2 --nofiltered /expt/ensemble/Case_S1/Case_074_S1-ensemble.vcf.gz /data/genomes/Hsapiens/GRCh37/seq/GRCh37.fa /expt/mutect/Case_S1-ploidyfix.vcf.gz /expt/varscan/Case_S1-ploidyfix.vcf.gz /expt/vardict/Case_S1-ploidyfix.vcf.gz [1;31mjava.lang.IllegalStateExceptionESC[m: ESC[3mIncompatible header types, coll ision between these two types: FORMAT=<ID=FREQ,Number=1,Type=String,Description="Var iant allele frequency"> FORMAT=<ID=FREQ,Number=A,Type=Float,Description="Allele frac tion of the alternate allele with regard to reference">

I checked the three constituent variant call files, and it seems the issue originates from the varscan VCF: ##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency"> Note: The type of FREQ is incorrectly set to String

The field is correctly set in VCF files from mutect and vardict. May I know how I could fix the issue? Thanks!

a113n commented 9 years ago

If I manually changed the Type to Float, followed by bgzip and tabix, the following error would still be present. This is strange because bcbio-variation-recall ensemble is still referring to the old file where the type of FREQ is string:

$ bcbio-variation-recall ensemble --cores=16 --numpass 2 --nofiltered /expt/ensemble/Case_S1/Case_074_S1-ensemble.vcf.gz /data/genomes/Hsapiens/GRCh37/seq/GRCh37.fa /expt/mutect/Case_S1-ploidyfix.vcf.gz /expt/varscan/Case_S1-ploidyfix.vcf.gz /expt/vardict/Case_S1-ploidyfix.vcf.gz

java.lang.IllegalStateException: Incompatible header types, collision between these two types: FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency"> FORMAT=<ID=FREQ,Number=A,Type=Float,Description="Allele fraction of the alternate allele with regard to reference">
      htsjdk.variant.vcf.VCFUtils.smartMergeHeaders       VCFUtils.java:  91
    bcbio.variation.variantcontext/merge-headers/fn  variantcontext.clj: 148
bcbio.variation.variantcontext/write-vcf-w-template  variantcontext.clj: 174
                         clojure.lang.RestFn.invoke         RestFn.java: 494
bcbio.variation.ensemble.intersect/ensemble-vcfs/fn       intersect.clj:  81
   bcbio.variation.ensemble.intersect/ensemble-vcfs       intersect.clj:  77
           bcbio.variation.ensemble.intersect/-main       intersect.clj: 124
                        clojure.lang.RestFn.applyTo         RestFn.java: 137
                                 clojure.core/apply            core.clj: 624
                  bcbio.variation.recall.main/-main            main.clj:  33
                        clojure.lang.RestFn.applyTo         RestFn.java: 137
                   bcbio.variation.recall.main.main                    :   

$ zgrep "^#.*FREQ" /expt/mutect/Case_S1-ploidyfix.vcf.gz |head
##FORMAT=<ID=FREQ,Number=A,Type=Float,Description="Allele fraction of the alternate allele with regard to reference">
$ zgrep "^#.*FREQ" /expt/varscan/Case_S1-ploidyfix.vcf.gz |head
##FORMAT=<ID=FREQ,Number=1,Type=Float,Description="Variant allele frequency">
$ zgrep "^#.*FREQ" /expt/vardict/Case_S1-ploidyfix.vcf.gz |head
chapmanb commented 9 years ago

Thanks for the report and sorry about the problem. You're right in your assessment of the issue but I'm having trouble reproducing this with the version of VarScan in bcbio (2.3.7). Wiith this version VarScan outputs type as float:

##FORMAT=<ID=FREQ,Number=1,Type=Float,Description="Variant allele frequency">

This is still slightly different than the MuTect definition, but the GATK header merging handlles this.

Are you using a different version of VarScan or have some other differences that might explain the origin of the Type=String output? I'm trying to reproduce and understand the issue so we can hopefully put a fix in place. Thanks much for the help debugging.

a113n commented 9 years ago

Thanks for looking into the issue. I ran the following command (with slight modification for the sake of brevity) according to bcbio-nextgen-commands.log, and indeed VarScan output FREQ as String.

$ /expt/bcbio/bin/samtools mpileup -f /expt/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -d 1000 -L 1000 -l /expt/bcbio/varscan/1/Case_S1-1_0_15765000-raw-regions.bed /expt/bamprep/Case_S1_T/1/8_2015-10-02_TargetCapture-sort-1_0_15765000-prep.bam | grep -v -P ' 0               $' > test.mpileup
$ cat test.mpileup | java -Xms681m -Xmx1818m  -jar /expt/bcbio/share/java/varscan/VarScan.v2.3.7.jar mpileup2cns --min-coverage 5 --p-value 0.98   --vcf-sample-list sample_list.txt --output-vcf --variants | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $4) } {print}' | vcfuniqalleles > test.vcf
$ grep "^#.*FREQ" test.vcf
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">

This issue is also reproducible using the latest VarScan (2.4.0) downloaded from github. Here attached a minimal example that could aid the debugging process: https://www.dropbox.com/s/53e9litbmcynxp6/test.mpileup?dl=0

chapmanb commented 9 years ago

Thanks so much for the additional information, this helps a lot. We did have a fix for this in place but it was only being applied to paired runs, not single sample VarScan calling. I pushed a set of fixes that resolves this and also tries to clean up VarScan calling to be more consistent and use pipes instead of writing mpileup fixes to disk. If you update with:

bcbio_nextgen.py upgrade -u development --tools

I hope the latest version will work cleanly for you. On your run, you'll need to remove the varscan and ensemble directories so those get re-generated with the fixes.

Thanks again for all the help debugging and please let us know if you run into anything else.

lbeltrame commented 9 years ago

Thanks for the cleanup, Brad. My clusters will love the disk being hit less. ;)