broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.68k stars 587 forks source link

Mutect2 reports a SNP that should be included in a nearby MNP #5513

Open meganshand opened 5 years ago

meganshand commented 5 years ago

Bug Report

Affected tool(s) or class(es)

Mutect2

Affected version(s)

Description

The output vcf for a few samples looks like this:

chrM    151    .    CT    TC    .    PASS    DP=3420;ECNT=23;POP_AF=4.000e-03;P_CONTAM=0.00;TLOD=14304.21    GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB:SB    0/1:27,3242:0.992:3269:13,1545:14,1697:30,30:317,335:60:26:0:0.990,0.990,0.992:0.045,0.015,0.940:9,18,1541,1701
chrM    152    .    T    C    .    chimeric_original_alignment    DP=3358;ECNT=23;POP_AF=4.000e-03;P_CONTAM=0.00;TLOD=46.40    GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:ORIGINAL_CONTIG_MISMATCH:PGT:PID:POTENTIAL_POLYMORPHIC_NUMT:SA_MAP_AF:SA_POST_PROB:SB    0/1:250,25:0.099:275:119,13:131,12:30,30:336,317:60:29:25:0|1:8660_C_T:true:0.091,0.061,0.091:2.722e-03,0.035,0.962:112,138,7,18
chrM    151    .    CT    TC    .    PASS    DP=1867;ECNT=20;POP_AF=4.000e-03;P_CONTAM=0.00;TLOD=6145.34    GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB:SB    0/1:13,1792:0.993:1805:7,879:6,913:30,30:441,442:60:39:0:0.990,0.990,0.993:0.026,0.024,0.950:5,8,745,1047
chrM    152    .    T    C    .    PASS    DP=1847;ECNT=20;POP_AF=4.000e-03;P_CONTAM=0.00;TLOD=12.96    GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB:SB    0/1:0,1755:0.999:1755:0,862:0,893:0,30:0,442:60:39:6:0.990,0.990,1.00:0.027,0.027,0.946:0,0,726,1029

Note that site 152 is a T->C that is also captured in the MNP at site 151 CT->TC. In one case site 152 is filtered, but in the other it passes, but in both cases the MNP passes.

Steps to reproduce

@klaricch Could you please post the input BAMs into the Mutect task as well as the output VCFs from that task? Could you also post the "script" generated by Cromwell that will show what command Cromwell actually ran at this point? Thanks!

Expected behavior

I'm not sure what should happen in this case, but the two options would be to include just the MNP or just the separate SNPs (with their separate filter status/annotations).

Actual behavior

Both the MNP and the SNP are included making it unclear what the final call is for this site.

meganshand commented 5 years ago

We have access to data that should reproduce this issue, but can't post it here because it's not publicly shareable BAMs. Please let me know how I can share the data if someone decides to look at this issue. Thanks!

davidbenjamin commented 5 years ago

Are we positive that there is a problem with this call? It's certainly possible that some haplotypes carry both the C->T and T->C while some haplotypes only carry the latter.

ldgauthier commented 5 years ago

This is why I was so loathe to output MNPs for so long. I believe David is right - especially since in the first sample the SNP is phased with another SNP on the other side of the control region (which might be a little confusing to the first time MT analyst, but we put a pin in that), but not the other MNP position. The implication here is that there are three alleles: CT, TC and CC, but the representation is showing two different positions. It would be good to confirm that the ADs make sense. (I haven't looked at the bams.)

meganshand commented 5 years ago

Ah I see, I think you're both right: we have three alleles CT, TC and CC. The ADs don't really make sense to me though.

sample CT allele depth at site 151 TC allele depth at site 151 *T allele depth at site 152 *C allele depth at site 152
sample 1 27 3242 250 25
sample 2 13 1792 0 1755

There isn't a huge drop in coverage between the two sites, so I'm assuming that the ADs for sample 1 at site 152 shouldn't add up to the total depth there because most of the reads are accounted for at site 151? Except that's not happening for sample 2, which has depth around 1700X at both sites according to IGV. @ldgauthier does that make any sense?

I'm confused because these two samples look virtually identical at this site in IGV, but the AF for sample 1 at site 152 is .1 while the AF for sample 2 at site 152 is .999.

davidbenjamin commented 5 years ago

When the engine "marginalizes" haplotype likelihoods into allele likelihoods it avoids double-counting of both the MNP at 151 and the SNP at 152. That is, the CT->TC MNP haplotype is consistent at 152 with the T->C SNP, but it has a different start position and therefore is not marginalized into the evidence for the SNP. So the fact that the DP in sample1 is much less at 152 than at 151 makes sense.

I am also confused about sample2. If we're both still confused tomorrow, let's take a look in IGV. Might even need IntelliJ. Might even be a bug -- if you look in AssemblyBasedCallerGenotypingEngine.createAlleleMapper, you'll see that the overlapping event logic assumes we're dealing with upstream spanning deletions. Maybe MNPs need to be treated differently.

cwhelan commented 5 years ago

As the party responsible for recently re-writing AssemblyBasedCallerGenotypingEngine.createAlleleMapper feel free to loop me in to see if I introduced a bug for MNPs here, and help figure out how to fix it.

davidbenjamin commented 5 years ago

I doubt it's your fault, but it could easily be mine for blithely giving M2 a GGA mode and figuring this kind of thing would just work out somehow.

cwhelan commented 5 years ago

If there is a bug (I haven't quite figured that out yet), it seems like it could also affect HaplotypeCaller in GGA mode if there are MNPs in the given alleles file.

davidbenjamin commented 5 years ago

Although HaplotypeCaller's GGA mode overrides discovered alleles, whereas Mutect2's GGA mode adds to them.