broadinstitute / gatk-protected

Obsolete/Legacy GATK repository -- go to https://github.com/broadinstitute/gatk instead
BSD 3-Clause "New" or "Revised" License
33 stars 20 forks source link

MuTect2 misses Sanger validated deletion #789

Closed vdauwera closed 7 years ago

vdauwera commented 7 years ago

@chandrans commented on Mon May 23 2016

Bug Report

Affected tool(s)

MuTect2

Affected version(s)

3.5 nightly build

Description

A user has reported MuTect2 is not detecting a Sanger validated deletion. The major issue is that the SNPs 1 base and 3 bases away from the deletion are called. However, the deletion is not reported. Here is the bamout of the three sites. It looks like the deletion should be called.

screen shot 2016-05-23 at 5 29 51 pm

Also, the user says the issue may be in: "SomaticGenotypingEngine.java, after it outputs Tumor LOD at to the log. (This log line still seems to contain the deletion; see my first post.) There, more precisely below the comment //reconcile multiple alts, if applicable, it iterates through all detected alternate alleles and sets lodInd to the highest index of an alternate allele that passes the INITIAL_TUMOR_LOD_THRESHOLD. A few lines below that, only this alternate allele seems to be used to create the call; the line reads: tumorAlleles.add(mergedVC.getAlternateAllele(lodInd));. All alternate alleles for the same start position with an index <lodInd seem to be lost. In my case, the lower lodIndex (presumably, the missing deletion) even has a higher tumor LOD than the allele used in the call (TLOD=161.87 versus TLOD=157.07; see first post)."

He also proposes a fix: "I think the easiest fix would be to cut all code between line 278 (final double tumorLod = tumorLods[lodInd];) and line 358 (returnCalls.add( annotatedCall );) and paste it below line 275 (lodInd = altInd;) to make use of the existing for loop for altInd. (Then, we could also use altInd directly, remove the lodInd variable and consolidate the two redundant >= MTAC.INITIAL_*_LOD_THRESHOLD checks.)"

Hopefully his insight is helpful.

Steps to reproduce

Test data and commands below.

Original Forum Post


@chandrans commented on Mon May 23 2016

Test files here: /humgen/gsa-scr1/schandra/Michael7_MuTect2MissesDeletion

Command: java -jar /humgen/gsa-hpprojects/GATK/private_unstable_builds/GenomeAnalysisTK_latest_unstable.jar -T MuTect2 -R /humgen/gsa-scr1/schandra/Michael7_MuTect2MissesDeletion/a\=MuTect2\ reference\ input/Homo_sapiens.GRCh38.dna.primary_assembly.fa -I:tumor /humgen/gsa-scr1/schandra/Michael7_MuTect2MissesDeletion/b\=MuTect2\ sample\ input/9-3.sorted.markedDuplicates.realignedIndels.BQSR.snippet4Broad.bam -L 1:23558000-23560000 -o Sheila.MuTect2.vcf --artifact_detection_mode -bamout Sheila.MuTect2.bam


@ldgauthier commented on Tue May 24 2016

First off, the triallelic site, just refers to that single site at 1:23559237. It shows some G reads, some A reads, and a few deletion reads, so it's mildly dubious. Might be contamination or sequencing error, but we just don't know what to believe. Still, that's a red herring not related to the deletion not being output.

That being said, I'm not sure why the deletion isn't called. It shows up in a lot of the haplotypes, although no haplotype has all three events, but that's okay because no read does either. In fact the called SNP and the uncalled deletion phase perfectly, so I would expect them both to be called. I'll dig in more on Friday.


@chandrans commented on Tue May 24 2016

Thanks! Fixed in the report.


@chandrans commented on Tue Jun 14 2016

Hi Laura. The user is asking whether his team should wait until this fix is in or just continue on. He says: "We are currently waiting with the DNA-seq analysis of about 100 samples for which we validated our analysis pipeline based on previous Sanger results for three genes. Besides this missed deletion, all Sanger variants were rediscovered by the pipeline and by MuTect2. However, since this bug seems to systematically lose variants that start at the same base pair position, we decided to wait before we apply the pipeline to all measured genes. Based on your experience, how long will this bug fix still take? Should we continue to wait? I mean, how severe is this bug in your estimation? Many thanks, Michael."

What do you think?


@chandrans commented on Thu Jun 16 2016

Turns out, he submitted a fix himself! Hey Laura @ldgauthier , I pointed him to this article: http://gatkforums.broadinstitute.org/wdl/discussion/1267/how-can-i-submit-a-patch-to-the-gatk-codebase because he submitted a zip file on the forum. I assume it is okay for him to submit the patch as described in the article even though it says the article is retired?


@ldgauthier commented on Fri Jun 17 2016

What is the fix? I don't want to turn on --allowNonUniqueKmers by default because it introduced a lot of false positives in the benchmarking data I looked at.


@ldgauthier commented on Fri Jun 17 2016

Oops, I'm thinking of a different issue. Let me give your notes another read.


@chandrans commented on Fri Jun 17 2016

Oh, yes. Different issue. I think the best thing to do is read the thread, Laura @ldgauthier . The user had some useful insights in the posts.


@ldgauthier commented on Fri Jun 17 2016

So part of what this comes down to is that we make the assumption that there will not be two real mutations at any given site. That's why the triallelic filter is being applied to his site. Admittedly there will be cases (maybe specific tumor types even) when there will be recurrent mutations at the same site, but given that we weren't going to call anything with two real mutations I guess we skipped over some of the details of the output.

Would the user be satisfied if both alleles were output but the site was still filtered?


@vdauwera commented on Fri Jun 17 2016

Yes that sounds like a good solution


@ldgauthier commented on Tue Jun 21 2016

@takutosato While you're refactoring SomaticGenotypingEngine, maybe you could give this a look?


@takutosato commented on Tue Jun 21 2016

Yup I'm on it


@takutosato commented on Thu Jun 23 2016

@ldgauthier and I discussed this issue. We need to think more carefully about how to handle the multi-allelic site in M2. But in the meantime I updated the code such that at a multi-allelic site we take the ALT allele with the highest tumor LOD. This means that, in Michael's example, the INDEL makes it in the vcf (although filtered), but the SNP at the same locus does not.

I plan to further modify the code in the next month or so such that both alternate alleles make it in the vcf (but will probably only show TLOD and NLOD of the allele with highest TLOD).


@vdauwera commented on Mon Nov 14 2016

@takutosato Did you ever implement the further code modifications you were thinking of? If so I'd like to notify the user that his issue has been resolved. If not, do we need to move this issue to the GATK4 repo?

davidbenjamin commented 7 years ago

No longer an issue -- as of PR #816 we emit all alleles.