Open meganshand opened 5 years ago
We have access to data that should reproduce this issue, but can't post it here because it's not publicly shareable BAMs. Please let me know how I can share the data if someone decides to look at this issue. Thanks!
Are we positive that there is a problem with this call? It's certainly possible that some haplotypes carry both the C->T and T->C while some haplotypes only carry the latter.
This is why I was so loath to output MNPs for so long. I believe David is right - especially since in the first sample the SNP is phased with another SNP on the other side of the control region (which might be a little confusing to the first-time MT analyst, but we put a pin in that), but not the other MNP position. The implication here is that there are three alleles: CT, TC and CC, but the representation is showing two different positions. It would be good to confirm that the ADs make sense. (I haven't looked at the BAMs.)
Ah I see, I think you're both right: we have three alleles CT, TC and CC. The ADs don't really make sense to me though.
sample | CT allele depth at site 151 | TC allele depth at site 151 | *T allele depth at site 152 | *C allele depth at site 152 |
---|---|---|---|---|
sample 1 | 27 | 3242 | 250 | 25 |
sample 2 | 13 | 1792 | 0 | 1755 |
There isn't a huge drop in coverage between the two sites, so I'm assuming that the ADs for sample 1 at site 152 shouldn't add up to the total depth there because most of the reads are accounted for at site 151? Except that's not happening for sample 2, which has depth around 1700X at both sites according to IGV. @ldgauthier does that make any sense?
I'm confused because these two samples look virtually identical at this site in IGV, but the AF for sample 1 at site 152 is .1 while the AF for sample 2 at site 152 is .999.
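To make the discrepancy concrete, here is a small sketch that sums the ADs from the table above and takes the raw alt/total ratio at each site. It assumes the first AD column at each site is the reference allele and the second is the alt, and the raw ratio is only a rough stand-in for the AF that Mutect2 reports (which is a model estimate, not a simple ratio).

```python
# Allele depths copied from the table above: (ref AD, alt AD) per site.
# Assumption: CT and *T are the reference alleles, TC and *C the alts.
ads = {
    "sample 1": {151: (27, 3242), 152: (250, 25)},
    "sample 2": {151: (13, 1792), 152: (0, 1755)},
}

def summarize(ads):
    """Return {(sample, site): (AD sum, raw alt fraction)}."""
    rows = {}
    for sample, sites in ads.items():
        for site, (ref_ad, alt_ad) in sites.items():
            total = ref_ad + alt_ad
            rows[(sample, site)] = (total, alt_ad / total)
    return rows

summary = summarize(ads)
for (sample, site), (total, frac) in summary.items():
    print(f"{sample} site {site}: AD sum = {total}, raw alt fraction = {frac:.3f}")
```

Under these assumptions the ADs at site 152 sum to 275 for sample 1 but 1755 for sample 2, while the sums at site 151 are 3269 and 1805. The raw alt fractions at 152 (about 0.09 and 1.0) line up with the reported AFs of .1 and .999, so the AFs follow from the ADs; it is the ADs themselves that differ between the two samples.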
When the engine "marginalizes" haplotype likelihoods into allele likelihoods it avoids double-counting of both the MNP at 151 and the SNP at 152. That is, the CT->TC MNP haplotype is consistent at 152 with the T->C SNP, but it has a different start position and therefore is not marginalized into the evidence for the SNP. So the fact that the DP in sample1 is much less at 152 than at 151 makes sense.
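A toy illustration of that start-position rule, not the actual GATK code: the event tuples, function names, and read counts below are all made up for the example. Each read is assigned to one haplotype, and a haplotype is just the set of events it carries.

```python
from collections import Counter

# Hypothetical events, written as (start, ref, alt).
MNP = (151, "CT", "TC")
SNP = (152, "T", "C")

def count_support_by_start(read_haplotypes, locus):
    """Count reads per event, using only events that START at `locus`."""
    counts = Counter()
    for haplotype in read_haplotypes:
        for event in haplotype:
            if event[0] == locus:
                counts[event] += 1
    return counts

def count_support_by_overlap(read_haplotypes, locus):
    """Naive alternative: count reads with any event covering `locus`."""
    total = 0
    for haplotype in read_haplotypes:
        if any(start <= locus < start + len(ref) for start, ref, _ in haplotype):
            total += 1
    return total

# Made-up reads: 9 on the MNP haplotype, 1 on a SNP-only haplotype, 2 reference.
reads = [(MNP,)] * 9 + [(SNP,)] * 1 + [()] * 2

at_151 = count_support_by_start(reads, 151)
at_152 = count_support_by_start(reads, 152)
naive_152 = count_support_by_overlap(reads, 152)
print(at_151[MNP], at_152[SNP], naive_152)
```

In this sketch the nine MNP reads contribute only at 151, so the SNP at 152 gets a single supporting read, whereas the overlap-based count would credit ten reads to the change at 152. Sample 1's ADs look like the first behaviour; sample 2's look closer to the second.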
I am also confused about sample2. If we're both still confused tomorrow, let's take a look in IGV. Might even need IntelliJ. Might even be a bug -- if you look in `AssemblyBasedCallerGenotypingEngine.createAlleleMapper`, you'll see that the overlapping event logic assumes we're dealing with upstream spanning deletions. Maybe MNPs need to be treated differently.
As the party responsible for recently re-writing `AssemblyBasedCallerGenotypingEngine.createAlleleMapper`, feel free to loop me in to see if I introduced a bug for MNPs here, and help figure out how to fix it.
I doubt it's your fault, but it could easily be mine for blithely giving M2 a GGA mode and figuring this kind of thing would just work out somehow.
If there is a bug (I haven't quite figured that out yet), it seems like it could also affect HaplotypeCaller in GGA mode if there are MNPs in the given alleles file.
Although HaplotypeCaller's GGA mode overrides discovered alleles, whereas Mutect2's GGA mode adds to them.
Bug Report
Affected tool(s) or class(es)
Mutect2
Affected version(s)
Description
The output vcf for a few samples looks like this:
Note that site 152 is a T->C that is also captured in the MNP at site 151 (CT->TC). In one sample site 152 is filtered and in the other it passes, while the MNP passes in both.
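For anyone wanting to find other sites like this, here is a rough sketch of how such pairs could be flagged. It works on plain `(pos, ref, alt)` tuples rather than a real VCF parser, and the function name is hypothetical. A flagged pair is not necessarily an error, since a SNP-only haplotype can genuinely coexist with the MNP haplotype.

```python
def snps_contained_in_mnps(variants):
    """Return (snp, mnp) pairs where the SNP restates one base change of the MNP.

    `variants` is a list of (pos, ref, alt) tuples with single alt alleles.
    """
    mnps = [v for v in variants if len(v[1]) == len(v[2]) and len(v[1]) > 1]
    snps = [v for v in variants if len(v[1]) == 1 and len(v[2]) == 1]
    pairs = []
    for snp in snps:
        snp_pos, snp_ref, snp_alt = snp
        for mnp in mnps:
            mnp_pos, mnp_ref, mnp_alt = mnp
            offset = snp_pos - mnp_pos
            # The SNP must fall inside the MNP and match its ref/alt base there.
            if (0 <= offset < len(mnp_ref)
                    and mnp_ref[offset] == snp_ref
                    and mnp_alt[offset] == snp_alt):
                pairs.append((snp, mnp))
    return pairs

# The two records described in this report.
records = [(151, "CT", "TC"), (152, "T", "C")]
overlaps = snps_contained_in_mnps(records)
print(overlaps)
```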
Steps to reproduce
@klaricch Could you please post the input BAMs into the Mutect task as well as the output VCFs from that task? Could you also post the "script" generated by Cromwell that will show what command Cromwell actually ran at this point? Thanks!
Expected behavior
I'm not sure what should happen in this case, but the two options would be to include just the MNP or just the separate SNPs (with their separate filter status/annotations).
Actual behavior
Both the MNP and the SNP are included, making it unclear what the final call is for this site.