chapmanb / bcbio.variation

Toolkit to analyze genomic variation data, built on the GATK with Clojure

poor concordance for indels but not SNPs #30

Closed: stsmall closed this issue 8 years ago

stsmall commented 8 years ago

Hi Brad, Thanks for writing bcbio.variation! I am attempting to use ensemble to create a SNP/indel set from GATK HaplotypeCaller and freebayes VCFs. Using the example config.yaml you provide in the README, I received the following statistics:

I also tried running a normalization pipeline (vcflib > vt normalize) on the freebayes VCF before using ensemble. This produced the same output as above. I also tried altering the classifiers from [AD, PL, DP] to include [AO, GQ, DPR, FS]. This produced an error:

Filter VCF with {{:variant-type :snp, :attr-key :all, :zygosity :hom} nil, {:variant-type :snp, :attr-key :all, :zygosity :het} nil, {:variant-type :complex, :attr-key :all, :zygosity :hom} nil, {:variant-type :complex, :attr-key :all, :zygosity :het} nil}
Exception in thread "main" java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String

What would you suggest I do? Should I include more detailed classifiers, like "balance" and "calling", in addition to "general"? Thanks, Scott

chapmanb commented 8 years ago

Scott; Thanks for trying out the ensemble calling and for the detailed question. Ensemble calling is not very effective without 3 or more calling inputs. The primary driver of the outputs is the selection of x out of n samples, and with only 2 it's hard to resolve beyond doing a union (which tends to be too liberal) or intersection (which tends to be too conservative). We found the extra work of training a classifier helps some but also is hard to generalize to lots of datasets.

We have a simpler method that only implements the x out of n approach which you could also use if you run into problems with this:

https://github.com/chapmanb/bcbio.variation.recall#ensemble
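
To make the x out of n idea concrete, here is a minimal Python sketch. It is not the bcbio.variation.recall implementation: the function name `ensemble_calls`, the `(chrom, pos, ref, alt)` keys, and the example variants are all made up for illustration, and it assumes the variants have already been normalized so that identical calls compare equal. With two callers the only choices are x=1 (union) and x=2 (intersection); a third caller adds the 2 out of 3 middle ground.

```python
from collections import Counter


def ensemble_calls(callsets, min_callers):
    """Keep variants reported by at least `min_callers` of the input call sets.

    Each call set is an iterable of hypothetical (chrom, pos, ref, alt) keys.
    """
    counts = Counter()
    for calls in callsets:
        # Wrap in set() so a caller contributes at most one vote per variant.
        counts.update(set(calls))
    return {variant for variant, n in counts.items() if n >= min_callers}


# Made-up example calls, not real data.
gatk = {("chr1", 100, "A", "G"), ("chr1", 200, "AT", "A")}
freebayes = {("chr1", 100, "A", "G"), ("chr1", 300, "T", "TA")}
third_caller = {("chr1", 100, "A", "G"), ("chr1", 200, "AT", "A")}

# Two callers: x=1 is the union, x=2 is the intersection.
union = ensemble_calls([gatk, freebayes], min_callers=1)
intersection = ensemble_calls([gatk, freebayes], min_callers=2)

# Three callers: x=2 keeps anything a majority agrees on.
majority = ensemble_calls([gatk, freebayes, third_caller], min_callers=2)
```

In this toy case the indel at position 200 is dropped by the two-caller intersection but recovered by 2 out of 3 voting, which is the kind of resolution that only becomes available with a third input.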

Hope this helps.