Open freeseek opened 5 years ago
I have just realized that this is indeed a regression from GATK 4.0 to GATK 4.1:
wget https://github.com/broadinstitute/gatk/releases/download/4.0.12.0/gatk-4.0.12.0.zip
unzip gatk-4.0.12.0.zip
Running this:
gatk-4.0.12.0/gatk \
Mutect2 \
-R GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
-I input.bam \
-O output.vcf.gz \
-L chr1:233443225-233443225 \
--tumor SM
Generates a VCF file with one variant:
GT:AD:AF:DP:F1R2:F2R1:SAAF:SAPP
0/1:3,15:0.799:18:3,7:0,8:0.818,0.818,0.833:0.028,0.025,0.948
So that the AD counts make sense with the previous version.
@freeseek This is a regrettable but temporary regression done for the sake of making Mutect2 much more principled ultimately. Let me try to explain with a timeline
ReadLikelihoods
matrix forces mates to support the same haplotype and the entire likelihood framework is rewritten to allow pairs (or indeed, arbitrary groups of reads) as the atomic unit of data.@davidbenjamin thank you for your work and thank you for the clarification. Is this going to be code shared by the HaplotypeCaller as well? A lot of the analyses I do are dependent on the variance of the AD counts to be properly modeled, so I look forward to this improvement being implemented.
@davidbenjamin Perhaps as a stopgap measure while we wait for the ultimate fix we could add an opt-in argument to M2/HC to toggle the old behavior (throw away one read whenever mates overlap)?
@droazen I'm loathe to do that because the new behavior of Mutect2 is objectively more correct as far as calls are concerned. It's just that for now it doesn't play nicely with the annotation engine. I can make the annotations happen a lot faster than 1-2 months if it's a priority for you, @freeseek.
Is this going to be code shared by the HaplotypeCaller as well?
This and other recent improvements are currently just for Mutect2 but I have tried to keep the software engineering respectable enough to transfer them trivially to HaplotypeCaller given the green light to do so.
My main issue now is that I need both a fix (or workaround) for https://github.com/broadinstitute/gatk/issues/6045 and an annotation engine that counts fragments rather than reads. What I am afraid of is that even if vruano finds a fix for the other bug, I would still need to have it backported to GATK 4.0.12.0 which is the latest release currently capable of counting fragments.
I would be okay with just dropping evidence if necessary. Is there a way to drop one of two pair ended reads (regardless of whether they overlap or not) on the fly?
@davidbenjamin Pinging you on this one -- any thoughts on how we could resolve this?
@droazen How hard would it be to have a plug-in framework for fragment-based likelihoods parallel to what we already have for read likelihoods? It would be nice to specify all annotations with -A
. Barring that, it won't be hard to write some fragment-based likelihoods and hard-code them into M2 and HC, but it seems clumsy to do so.
@davidbenjamin If it's just a matter of adjusting what's passed in to the annotation framework, it shouldn't be that hard, but you probably have a better sense than I do of what needs to be done here.
It ought to be totally parallel to the current framework, but with an AlleleLikelihoods<Fragment, Allele>
replacing AlleleLikelihoods<GATKRead, Allelle>
. I don't see any other change to the interface. It would be nice to do this without duplicating the class hierarchy of INFO
and FORMAT
annotations, however.
Can I ask what the current status of this bug is?
@davidbenjamin and/or @jamesemery, could you please comment?
After digging around a little I see that a FragmentDepthPerAlleleBySample
annotator now exists which produces an FAD
format annotation for fragment depth. Running this on @freeseek 's example, adding -A FragmentDepthPerAlleleBySample
, gives
GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:6,21:0.808:27:3,6:0,8:3,16:2,4,12,9
With fragment allelic depths of 3 for the reference and 16 for the alt. This could be the desired result if one of the 4 reference fragments was uninformative -- checking the bamout I see that one fragment has one read that supports the ref with low quality but the other read supports the alt with higher quality.
I obtain this reproducible issue with gatk 4.1.3.0:
First of all, you have to download the mini input.bam file from this dropbox link: https://www.dropbox.com/sh/78rz5wrhu9zkfzh/AACW9ZPhl4WnD-wmAkKcdHT3a?dl=0
Then setup a GATK working environment:
Now if I run Mutect2:
This will generate a VCF file with one variant:
With an allelic depth of six supporting the reference.
However, there are only four fragments supporting the reference. If I remove those for fragments from the BAM file:
And I run Mutect2 again:
It will generate a VCF with the same variant:
With an allelic depth of zero supporting the reference.
The same problem exists with the HaplotypeCaller.
I believe this was not the intended behavior and it is really messing up with my use case. Short of removing half of the reads from my BAM file, is there a solution for this?