freebayes / freebayes

Bayesian haplotype-based genetic polymorphism discovery and genotyping.
http://arxiv.org/abs/1207.3907
MIT License
784 stars 264 forks source link

GQ weirdness #122

Closed travc closed 9 years ago

travc commented 9 years ago

I posted this to the mailing list since it might end up being more of a "I just don't get it" issue instead of a problem with the code (or docs). https://groups.google.com/forum/#!topic/freebayes/fZIjPPuVEkM

I'm having difficulties making sense of GQ.

I'm using freebayes to do multi-sample calling (with the oddity of assigning each sample to it's own pop) to do some population genetics analysis. The metrics are computed by doing pairwise comparisons between samples, so no-call genotypes are fine... in fact, missing genotypes are very much preferred to removing a variant site (vcf line) entirely. Also, the reference allele isn't a very good prior, so calling 0/0 only when it is actually supported is important.

What I want to do is generate a reasonably permissive set of calls, then filter that down to just a set of high-confidence calls. Seems like filtering on GQ would be the obvious way to go. Unfortunately, the GQ values I'm seeing make very little sense to me. Here is a small example (a simple biallelic SNP):

GT:GQ:DP:RO:QR:AO:QA:GL 1) 0/0:8.95264:24:24:791:0:0:0,-7.22472,-71.5196 2) 0/0:140.745:44:44:1507:0:0:0,-13.2453,-135.973 3) 0/0:140.745:2:2:70:0:0:0,-0.60206,-6.65 4) 0/0:140.745:12:12:408:0:0:0,-3.61236,-37.06

1 only has a GQ of 9 despite having 24 reads all being ref. For comparison, 4 has 12 reads and a GQ of 141. 3 has a GQ of 141 with 2 reads. That can't be right, can it? The RQ values pretty indicate the MQ and BQ are comparable across samples, so it isn't that. GL scores also look sensible, unlike GQ.

Maybe there is strand bias or something else which isn't reported per genotype call making the differences... but shouldn't that be reflected in GL? Anyway, I'm at a loss and need some help to get my analysis moving along.

Is there some other approach (other than GQ) which is preferred for setting low quality GT calls to no-call?

ETA: For comparison, here is the same variant calls generated by bcftools call -m: (it doesn't output GQ for hom ref, and it has slightly different quality filters... on the other hand, it has DP4 which shows some strand bias.) GT:DP:DV:SP:DP4:DPR:PL:GP:GQ 0/0:20:0:0:16,4,0,0:20,.:.:.:. 0/0:43:0:0:25,18,0,0:43,.:.:.:. 0/0:2:0:0:1,1,0,0:2,.:.:.:. 0/0:12:0:0:11,1,0,0:12,.:.:.:.

travc commented 9 years ago

Erik has addressed this well in the mailing list... so look there is you have a similar issue ;) https://groups.google.com/forum/#!topic/freebayes/fZIjPPuVEkM

ekg commented 9 years ago

I'm considering moving things to GitHub... Although following here requires everyone to have a GitHub account. Maybe that's not too high a bar? On Dec 7, 2014 8:59 PM, "Travis Collier" notifications@github.com wrote:

Closed #122 https://github.com/ekg/freebayes/issues/122.

— Reply to this email directly or view it on GitHub https://github.com/ekg/freebayes/issues/122#event-203904890.

travc commented 9 years ago

I'd vote for putting everything on github... though I don't know if there is a nice way of separating usage questions from technical/bug issues. I'd also vote for setting up a wiki for freebayes/vcflib here.

Thanks again for being so responsive. I'm trying to put together a workflow I more-or-less understand and trust, which means I'm in 'skeptical curmudgeon' mode and being especially annoying to everyone.