broadinstitute / gatk-protected

Obsolete/Legacy GATK repository -- go to https://github.com/broadinstitute/gatk instead
BSD 3-Clause "New" or "Revised" License
33 stars 20 forks source link

Reduce scope in ExactAFCalculator uses a poor criterion to choose what alternative alleles to keep; we can do better than that. #950

Closed vdauwera closed 7 years ago

vdauwera commented 7 years ago

@vruano commented on Wed May 18 2016

Bug Report

Affected tool(s)

All tools that use AFCalculators (HC, GenotypeGVCFs etc).

Affected version(s)

When a alternative allele number reduction is needed (e.g. when the number of alt.alleles is larger than --maxAltAlleles). ExactAFCalculator descendants reduce the number of alt alleles using the reduceScope method. Different implementation of the exact-af-calculator may differ slightly on this but for the most part all of them apply the following algorithm:

For each sample, get its PLs, get the best genotype based on those. Then for the alleles included in that genotype increase their "best allele score" by the GQ of the genotype.

Then we chose those alleles that have the highest scores.

I guess often this is ok when we are dealing with many samples and the "good" alleles are present in the top genotypes with high confidence in a few samples and the "bad" alleles are not. However one can see how this criterion fails when either we are working with just a few samples (e.g. 1 sample in HC GVCF mode) or with low coverage data.

For example with a single sample only the alleles in the best genotype may have a score different than 0. All the rest have the same probability of been picked up if maxAltAlleles gives us room for more. Despite that the likelihoods of other genotypes may indicate which ones are a better choice amongst the "loosers" we throw that info away.

Proposed solution.

Simply do a quick and dirty AF estimation and choose the alleles with the larger frequencies. This estimate should use all the genotype likelihoods rather than just the top genotype giving a nominal score for all the alleles that would allow us to sort them all and make a better and less arbitrary selection.


@vdauwera commented on Mon Nov 14 2016

I believe this was done by @vruano and @SHuang-Broad already -- right guys? Can we close this?


@SHuang-Broad commented on Tue Nov 15 2016

My understanding of the current state is that there are several possible places with alt allele reduction in HC, in order:

  1. The fix I put in, to prevent the calculator from becoming too slow or blow up, so downstream steps won't even include these alleles in their likelihood calculations;
  2. The fix Valentine put in, which happens after the read likelihoods are calculated (and optionally down-sampled).

The relevant code is in HaplotypCallerGenotypingEngine.java around lines 267-279.

This is the state in GATK3, in GATK4 the second possibility is not ported yet.

Regarding alt allele reduction in AF calculator, has this been ported back to GATK3?


@SHuang-Broad commented on Tue Nov 15 2016

By "possible place" I mean they don't always remove alt alleles, just when certain conditions are met, and are independent.


@vdauwera commented on Tue Nov 15 2016

I don't think https://github.com/broadinstitute/gatk/pull/1918 has been backported, no.


@SHuang-Broad commented on Mon Dec 19 2016

Looking more closely, isn't this done already in #1377 ? @vruano ?


@vdauwera commented on Mon Mar 20 2017

@SHuang-Broad @vruano Status update on this?


@vruano commented on Mon Mar 20 2017

@SHuang-Broad this is not fixed by #1377 as this makes reference to the selection executed by the AFCalculator... it might be that @davidbenjamin now AF calculator addressed the issue, but is also possible that he avoid it it entirely and just focused in the new QUAL calculation.


@vruano commented on Mon Mar 20 2017

Looking at the code I made reference in GATK3, it seem that it is still faulty... I guess we need to take a look on whether in GATK4 has been fixed and then back-ported if people are interested.


@vdauwera commented on Mon Mar 20 2017

Alright, thanks for the update. At this point we don't care too much about fixing it in GATK3; we're all about moving forward with GATK4. Do you want me to move this issue to GATK or do you already have an issue for this there?

davidbenjamin commented 7 years ago

The new qual score doesn't subset alleles at all because it doesn't need to. AlleleSubsettingUtils handles this upstream of the new qual. We're waiting on the HaplotypeCaller tie-out to eliminate the old qual from GATK 4, however.

vdauwera commented 7 years ago

Ah, do I understand correctly that if the new qual checks out and the old one can be eliminated, this issue no longer applies?

davidbenjamin commented 7 years ago

Well, it's possible that people aren't satisfied with allele reduction in general (@vruano and @SHuang-Broad have thoughts on this, I believe), but it won't have anything to do with the AFCalculator.

davidbenjamin commented 7 years ago

@vdauwera And just to be clear, if I understood correctly everyone was satisfied in November with the new qual and we are just waiting for HaplotypeCaller to check out in GATK 4 before making any changes like pulling the plug on the old qual. The reason is that we're looking for GATK 4 results to match as exactly as possible and subbing in the new qual would complicate that.

vdauwera commented 7 years ago

Ah yes my bad -- we are indeed just waiting on the tie outs to get rid of the old qual and make the new one default. I commented before the full context had resurfaced in my memory... Which means it's bedtime for my brain cells.

davidbenjamin commented 7 years ago

Seeing as I just pressed the "close" button by mistake my brain cells could benefit from the same.

droazen commented 7 years ago

Issue moved to broadinstitute/gatk #2958 via ZenHub