dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

QUAL score correlating to number of variant samples? #224

Open ghost opened 4 years ago

ghost commented 4 years ago

Hello, we have an in-lab discussion about this: basically, the claim is that the QUAL value is somehow directly related to the number of samples carrying the variant.

In practice, though, that doesn't seem to be the case, as I have many instances of an allele shared by many samples but with a poor QUAL score.

Could you elaborate a bit on how exactly QUAL is computed? If I understand correctly, if 9 of 10 samples are homozygous reference and one has a SNP that is well supported in its original VCF, the whole line will get a relatively high QUAL score, right?

In other words, let's say I am interested in SNPs occurring in only 1 of my 11 samples: should I then not filter on QUAL?

I made a quick plot of QUAL against the number of samples sharing the variant (1 means the variant is private to a single sample, 2 means the variant is shared by 2 samples, etc.).

[Plot: QUAL_shared_Variant (QUAL versus number of samples sharing the variant)]

It seems to me that not filtering on QUAL would drastically enrich for poor-quality calls, while not really increasing the number of private variants.
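For readers who want to reproduce a tally like the one plotted above, here is a minimal sketch in plain Python. It assumes an uncompressed, tab-delimited multi-sample VCF with a GT field in FORMAT; the function name `qual_by_carrier_count` and the toy records are hypothetical and are not taken from the original post.

```python
from collections import defaultdict


def qual_by_carrier_count(vcf_lines):
    """Group site QUAL values by the number of samples carrying a non-reference allele."""
    groups = defaultdict(list)
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]
        if qual == ".":
            continue  # no QUAL reported for this site
        gt_index = fields[8].split(":").index("GT")
        carriers = 0
        for sample in fields[9:]:
            gt = sample.split(":")[gt_index]
            alleles = gt.replace("|", "/").split("/")
            # A carrier has at least one allele that is neither REF ("0") nor missing (".").
            if any(a not in ("0", ".") for a in alleles):
                carriers += 1
        if carriers:
            groups[carriers].append(float(qual))
    return dict(groups)


# Toy records, made up for illustration only.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\ts3",
    "chr1\t100\t.\tA\tG\t45\t.\t.\tGT:GQ\t0/0:30\t0/1:40\t0/0:20",
    "chr1\t200\t.\tC\tT\t12\t.\t.\tGT:GQ\t0/1:5\t1/1:8\t./.:0",
]
print(qual_by_carrier_count(toy_vcf))
```

Key 1 collects QUAL for private variants, key 2 for variants shared by two samples, and so on; each list can then be fed to a box plot or histogram.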

mlin commented 4 years ago

Cool plot! There's a per-(allele,individual) quantity called the Allele Quality (defined here). The GLnexus site-level QUAL is simply the maximum of these AQ values across the alleles and individuals at the site. So you'd expect some correlation with allele frequency due to the extreme value effect, but less than if QUAL were a true "joint" probability value (which would be more like summing the AQ values rather than taking the max). The idea was to quickly calculate something conservative (biased to underestimate quality).
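To make the max-versus-sum contrast concrete, here is a small sketch. The AQ numbers, sample names, and function names are made up for illustration, and the "sum" variant is only a rough stand-in for a joint-style score, not something GLnexus computes.

```python
def site_qual_max(aq):
    """Site QUAL as described above: the maximum AQ over all (allele, sample) pairs."""
    return max(aq.values())


def site_qual_sum(aq):
    """Rough stand-in for a 'joint' score: the sum of AQ values (illustration only)."""
    return sum(aq.values())


# Hypothetical phred-scaled AQ values keyed by (allele, sample).
aq_singleton = {("G", "s1"): 45}
aq_shared = {("G", "s1"): 12, ("G", "s2"): 15, ("G", "s3"): 9}

for name, aq in [("singleton", aq_singleton), ("shared", aq_shared)]:
    print(name, "max:", site_qual_max(aq), "sum:", site_qual_sum(aq))
```

With these toy numbers the well-supported singleton gets a QUAL of 45, while the allele seen weakly in three samples gets only 15 under the max rule, even though a sum would give 36. That is one way an allele shared by several samples can still end up with a poor QUAL.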

> I have many instances of an allele shared by many samples but with a poor QUAL score.

I'm not sure what's "many" based on the plot, but certainly these happen and they should be inspected closely. I suspect you'll find a lot of indels in difficult regions (lower-complexity etc.), which may make the aforementioned bias seem pretty good. Seeing an allele called recurrently in numerous samples is helpful of course, but there are still numerous artifactual ways to get there (especially when you focus on the tail of a quality value).

> In other words, let's say I am interested in SNPs occurring in only 1 of my 11 samples: should I then not filter on QUAL?

The idea to apply less stringent filters to singletons is counterintuitive to me & I'm not sure I follow the argument to get there. They're lacking the supporting evidence (albeit not proof) of recurrent observations in other individuals, so if anything we should raise the bar on them?

ghost commented 4 years ago

Interesting! What's the difference between AQ and GQ? Is it simply that AQ is a ratio?

I realise something was left out of my explanation. Actually, stringent filtering with QUAL >= 30 revealed that most candidates are false positives and the true ones are very few. Now, it could be that the organism I am working with has a lower-than-expected mutation rate.

I am just trying to increase the sensitivity (I don't have that many candidates, so inspecting and removing loads of FPs is not too problematic).

mlin commented 4 years ago

They're both (log likelihood) ratios derived from the genotype likelihoods. GQ is that between the first and second most likely genotypes, AQ is between genotypes including or excluding the allele in question.
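A minimal sketch of the two ratios, assuming diploid genotypes and phred-scaled genotype likelihoods (PL) in standard VCF order. The function names are hypothetical, and flooring AQ at 0 is an assumption made here for illustration; the exact GLnexus definition may differ in its details.

```python
def diploid_genotypes(n_alleles):
    """Diploid genotypes in VCF PL order: (0,0), (0,1), (1,1), (0,2), ..."""
    return [(a, b) for b in range(n_alleles) for a in range(b + 1)]


def gq_from_pl(pl):
    """Phred-scaled gap between the most likely and second most likely genotypes."""
    best, second = sorted(pl)[:2]
    return second - best


def aq_from_pl(pl, n_alleles, allele):
    """Phred-scaled gap between the best genotype excluding `allele` and the
    best genotype including it (floored at 0, an assumption of this sketch)."""
    gts = diploid_genotypes(n_alleles)
    with_allele = min(p for p, gt in zip(pl, gts) if allele in gt)
    without_allele = min(p for p, gt in zip(pl, gts) if allele not in gt)
    return max(0, without_allele - with_allele)


# Made-up PLs for genotypes 0/0, 0/1, 1/1 at a biallelic site.
pl = [40, 0, 5]
print("GQ:", gq_from_pl(pl))
print("AQ for ALT allele 1:", aq_from_pl(pl, n_alleles=2, allele=1))
```

In this toy case the caller cannot separate 0/1 from 1/1 well, so GQ is only 5, but both plausible genotypes contain the ALT allele, so AQ for that allele is 40.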

The application sounds interesting and not what we're used to. Did you find the "*_unfiltered" configurations to turn off all of GLnexus' default quality filters? You can find some other issues here discussing the pros and cons of that mode.

ghost commented 4 years ago

No, I hadn't delved into that yet.

But thanks for the insights, really helpful

ghost commented 4 years ago

There is no Deep_variant_unfiltered option; I only have it for GATK, atlas and Wecall.