broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Mutect2: drop in SNV specificity/precision between v.4.1.8.1 and 4.1.9.0 #7921

Open ddrichel opened 2 years ago

ddrichel commented 2 years ago

We noticed a strong drop in precision for SNVs (~10% for tumor-normal mode, ~20% in tumor-only mode) between releases 4.1.7.0 and 4.2.6.0. With more testing using the HCC1395 somatic benchmark (https://pubmed.ncbi.nlm.nih.gov/34504347/) and sequencing data provided by the Somatic Mutation Working Group (Fudan University WES tumor-normal data set, 2 x 100x coverage), the drop in performance can be traced to changes between 4.1.8.1 and 4.1.9.0. Here are the performance metrics for selected GATK releases:

[Results: FD WES tumor-normal, filtered calls, for GATK 4.1.7.0, 4.1.8.1, 4.1.9.0, 4.2.0.0, and 4.2.6.0]

The calling was done with essentially default parameters: tools/gatk-${version}/gatk Mutect2 --normal-sample WES_FD_N --output $outvcf --intervals $wesbed --interval-padding 0 --input $inbam_t --input $inbam_n --reference $ref

tools/gatk-${version}/gatk FilterMutectCalls --output ${outvcf%.vcf}_filtered.vcf --variant $outvcf --intervals $wesbed --reference $ref --stats ${outvcf}.stats --threshold-strategy OPTIMAL_F_SCORE --f-score-beta 1.0

som.py was used for calculating performance metrics.
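A sketch of such a som.py invocation (from Illumina's hap.py package) is below; the file names are placeholders rather than our exact paths, and only the options relevant here are shown.

    # Hedged sketch, not a verbatim copy of our command: compare filtered Mutect2
    # calls against the HCC1395 truth set, restricted to high-confidence regions.
    # --keep-scratch retains intermediate files such as fp.vcf.gz for later inspection.
    som.py truth_sSNV_sINDEL_v1.2.vcf.gz WES_FD_TN_4260_filtered.vcf \
        -f High-Confidence_Regions_v1.2.bed \
        -r "$ref" \
        -o som_py_out/WES_FD_TN_4260 \
        --keep-scratch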

Curiously, we do not observe such a substantial drop in precision in WGS data, in either tumor-only or tumor-normal mode. In the following, our "v04" corresponds to GATK 4.1.7.0 and our "v05" corresponds to GATK 4.2.6.0:

Tumor-normal:

[Results: WGS FD tumor-normal, reference workflow v04 vs. v05]

Tumor-only:

[Results: WGS FD tumor-only, reference workflow v04 vs. v05]

In my opinion, the small gains in recall between 4.1.8.1 and 4.1.9.0 do not justify the drop in precision. This, together with the fact that WES data is affected while WGS data is not, suggests that this might be a bug rather than a feature.

Any suggestions on how to get the WES precision back to v4.1.8.1 levels are appreciated.

Thanks in advance,

Dmitriy

schelhorn commented 2 years ago

From what I have heard from Dmitriy, this loss in Mutect2 precision between v4.1.8.1 and v4.1.9.0 seems to be perfectly reproducible using default parameters on the HCC1395 gold-standard somatic benchmark from the Sequencing Quality Control Phase II Consortium. The drop in performance seems to persist in the current Mutect2 version.

If true, this could have had dramatic effects on Mutect2 clinical variant calling quality from v4.1.9.0 until now. Isolating the root cause of this drop in performance therefore seems highly important for maintaining the clinical validity of Mutect2.

Quite a number of Mutect2 code details seem to have changed between v4.1.8.1 and v4.1.9.0, for instance this commit concerning quality filtering, or this one concerning soft-clipped bases. Perhaps the circumstance that the drop in precision affects WES but not WGS data may point to the culprit?

Also, may I ask whether the GATK team runs real-world regression tests on gold-standard variant call sets between version releases, and whether these tests passed between v4.1.8.1 and v4.1.9.0?

crutching commented 2 years ago

Glad to see we're not the only ones noticing this. We have been using 4.0.11.x for quite a while with success. During a recent revalidation effort, we have been working with 4.2.x versions and missing quite a few calls we consider "truth", while other variant callers still make these calls. We have since verified that v4.1.8.0 performs acceptably, so our data supports the issue arising with the 4.1.9.0 update.

schelhorn commented 2 years ago

Same here, @jhl667 - good to get additional confirmation. May I ask whether you used the same gold-standard call set or another one, and whether yours was WGS or WES?

I’m linking @droazen here as well since he acted as the release manager of the affected releases. I think it would be good to get clarity soon, since probably tens of thousands of clinical samples have been processed with the affected versions in the last two years, and a reduction in precision of 10-20% (absolute) may have been relevant for clinical decisions in at least some of these cases.

Also, I know how hard it is to avoid such things in what is essentially research software (and such a great one to boot), so this is not about blaming anyone but about fixing the root cause as soon as possible. If it turns out that the issue only affects this particular reference call set and no patients (for some arcane reason), then all the better in my view.

crutching commented 2 years ago

@schelhorn We have a large clinical "truth" set we utilize during workflow validations. We also utilize spike-in samples from SeraCare and perform dilutions using a couple of the common Coriell cell lines. We noticed the Mutect2 calling inconsistency while validating a small targeted panel.

ddrichel commented 2 years ago

To clarify our experimental setup: for the benchmark we are using the latest release (v.1.2) somatic "ground truth" of the HCC1395 cell line from

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/release/latest/

It contains ~40k SNVs and ~2k INDELs in ~2.4 Gb of high-confidence regions. In the high-confidence regions intersected with the WES BED, we still have ~1.1k SNVs and ~100 INDELs.

Therefore, even in the WES analysis scenario, the SNV counts should be high enough to draw reliable conclusions when comparing performance between different callers and releases. For WES INDELs, the counts are indeed rather low and results should be interpreted with caution.
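For reference, a sketch of how such an intersection and the resulting truth-set counts can be derived (placeholder file names; our actual commands may have differed):

    # Intersect the high-confidence regions with the WES target BED, then count
    # truth SNVs and INDELs inside the intersection.
    bedtools intersect -a High-Confidence_Regions_v1.2.bed -b "$wesbed" > HC_regions_WES.bed

    bcftools view -R HC_regions_WES.bed -v snps   truth_sSNV_sINDEL_v1.2.vcf.gz | grep -vc "^#"
    bcftools view -R HC_regions_WES.bed -v indels truth_sSNV_sINDEL_v1.2.vcf.gz | grep -vc "^#"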

xiucz commented 2 years ago

This issue is very important to me. When I deal with CAP data (high-coverage, capture-based cfDNA data), GATK 4.1.0.8's performance is much better than GATK 4.2's ...

droazen commented 2 years ago

Thanks for the report! I am contacting the lead developer of Mutect2, who will reply here.

schelhorn commented 2 years ago

Hi @droazen and @davidbenjamin, could you please give us an estimate about when you will be able to look at this issue in more detail to identify the root cause? Thank you.

droazen commented 2 years ago

@davidbenjamin is away at the moment, but should be able to look into this issue when he returns next week

schelhorn commented 2 years ago

Hi @droazen, is there perhaps already any news on the matter? It’s been a month, and we are wondering about when this may be looked into. Thanks!

droazen commented 2 years ago

@schelhorn @davidbenjamin has been out for most of this past month, but he's promised to look into this report as soon as he can find the time. I anticipate an initial reply from him next week. Thanks for your patience.

I'll add that much of our recent development work on the somatic variation front has gone into a new ML-based tool we're calling "Mutect3" (but which may end up with a different name once it's released). This new tool is already giving much better results than Mutect2, but is not quite ready for prime time. I expect it to come out this year, however. A side effect of the focus on "Mutect3" development work has been that some reports of issues in Mutect2 have taken longer than usual to address, unfortunately.

davidbenjamin commented 2 years ago

@ddrichel @jhl667 @schelhorn @xiucz I have a theory that I may pursue further, depending on the answers to the following questions.

  1. What are you using for a panel of normals in both WES and WGS?
  2. What reference assembly are you using?
  3. Were the new false positives present in the unfiltered Mutect2 VCF before version 4.1.9.0 (and then flagged by FilterMutectCalls) or absent entirely?
ddrichel commented 2 years ago

Hi David,

  1. Here, we must distinguish between the WES minimal example (default parameters, command lines in my original bug report) and our "production pipeline".

In the minimal example, we do not use a PON.

In the production pipeline, we use the exome PON from gs://gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz, as recommended here: https://gatk.broadinstitute.org/hc/en-us/articles/360035890631-Panel-of-Normals-PON-. In the production pipeline, we also use gnomAD 3.1.2, filtered for AF>0 and FILTER=PASS, as the "germline resource".
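A sketch of how these resources are wired into the calling step (placeholder file names; the actual pipeline differs in detail):

    # Restrict gnomAD 3.1.2 to PASS sites with AF>0 for use as the germline resource.
    bcftools view -f PASS -i 'INFO/AF>0' -Oz -o gnomad_3.1.2_pass_af_gt0.vcf.gz \
        gnomad.genomes.v3.1.2.sites.vcf.gz
    bcftools index -t gnomad_3.1.2_pass_af_gt0.vcf.gz

    # Mutect2 call with the PON and germline resource added to the otherwise default command.
    gatk Mutect2 --reference "$ref" --intervals "$wesbed" \
        --input "$inbam_t" --input "$inbam_n" --normal-sample WES_FD_N \
        --germline-resource gnomad_3.1.2_pass_af_gt0.vcf.gz \
        --panel-of-normals 1000g_pon.hg38.vcf.gz \
        --output "$outvcf"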

Some more data from the production pipeline, just in case it helps. Since the PON is exome-only, we were not sure whether including it would improve the performance of our WGS workflow in the production pipeline with GATK 4.1.7.0, so we ran experiments:

The results:

[Results: WGS FD tumor-normal, reference workflow v04 with PON and gnomAD, with gnomAD only, and with neither PON nor gnomAD]

We also conducted experiments with a subset of the WES FD sample (FD_1, 1/3 of the full ~100x FD dataset):

The results (please note that the labeling conventions are now different compared to WGS experiments, apologies for the inconvenience):

[Results: FD_1 WES tumor-normal Mutect2 without PON/gnomAD, with PON, with gnomAD, and with PON and gnomAD]

  2. The reference is GRCh38.primary_assembly.genome.fa. The benchmark ("gold standard") call set contains only variants on chr1-chr22, which AFAIK are identical or almost identical between the different b38 versions.

  3. In my minimal WES example, most of the new FPs (150/158) from v4.1.9.0 are not present in the raw, unfiltered calls from v4.1.8.1.

This was found the following way. Compare FPs by CHROM, POS, REF, ALT using the scratch output of som.py (run with --keep-scratch):

comm -23 <(bcftools query -f "%CHROM %POS %REF %ALT\n" WES_FD_TN_4190_filter_som_py/fp.vcf.gz | sort) <(bcftools query -f "%CHROM %POS %REF %ALT\n" WES_FD_TN_4181_filter_som_py/fp.vcf.gz | sort) | sed 's/ /\t/g' > new_FPs.txt

wc -l new_FPs.txt
158 new_FPs.txt

Find new FPs present in the raw output of v.4.1.8.1, again matching by CHROM, POS, REF, ALT:

awk 'NR==FNR{a[$1"_"$2"_"$3"_"$4]=1; next;}{if(substr($0,1,1)!="#" && a[$1"_"$2"_"$4"_"$5]==1) print $0}' new_FPs.txt WES_FD_TN_4181.vcf | wc -l
8

For the 8 new FPs that were detected but filtered in v4.1.8.1 (all of which are SNVs), find the FILTER classification:

awk 'NR==FNR{a[$1"_"$2"_"$3"_"$4]=1; next;}{if(substr($0,1,1)!="#" && a[$1"_"$2"_"$4"_"$5]==1) print $0}' new_FPs.txt WES_FD_TN_4181_filtered.vcf | cut -f7 | sort | uniq -c
   2 clustered_events
   1 normal_artifact
   1 strand_bias
   4 weak_evidence

Hope this helps,

Dmitriy

davidbenjamin commented 2 years ago

Thank you @ddrichel! That's really helpful and seems to support my suspicion.

The most relevant changes to Mutect2 between versions 4.1.8.1 and 4.1.9.0 are #6520, #6696, and #6821, which have in common the property that they are bug fixes restoring sensitivity in edge cases. I don't think that any of these are regressions because the job of Mutect2 is to output any candidate variant. Filtering is the responsibility of FilterMutectCalls. Because our panel of normals is generated by running Mutect2, a panel of normals generated before 4.1.9.0 is blind to artifacts that occur in these edge cases. I suspect that re-generating our panels of normals using a more recent release of Mutect2 will solve the problem.
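(For reference, re-generating a panel just means re-running the standard three-step PON workflow with a recent Mutect2; sketched below with placeholder sample names, intervals, and paths.)

    # Call each normal in tumor-only mode with a >= 4.1.9.0 Mutect2.
    for n in normal1 normal2 normal3; do
        gatk Mutect2 -R "$ref" -I "${n}.bam" --max-mnp-distance 0 -O "${n}.vcf.gz"
    done

    # Combine the per-sample calls and emit the panel.
    gatk GenomicsDBImport -R "$ref" -L "$intervals" \
        --genomicsdb-workspace-path pon_db \
        -V normal1.vcf.gz -V normal2.vcf.gz -V normal3.vcf.gz

    gatk CreateSomaticPanelOfNormals -R "$ref" -V gendb://pon_db -O pon.vcf.gz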

I'm going to need to see how long it will take to gather the samples to re-generate our hg38 panel of normals (it might not take long at all) and triage that against the timetable for Mutect3, which does away with the panel of normals entirely.

schelhorn commented 2 years ago

Excellent, thank you @davidbenjamin and @ddrichel!

  1. Since it has been three weeks, is there any news on whether the regenerated PoN works, and where it will be available from?
  2. Also, since there are likely many people (some of whom have spoken up in this thread) who have been using the affected releases throughout the years, possibly affecting patients' diagnoses, does the Broad have a communication plan to notify Mutect2 users of this serious performance degradation in the specific cases you outlined?
  3. Last, are there plans to incorporate fully functional benchmark regression tests such as @ddrichel's into the Mutect2 (and later Mutect3) CI pipeline, or at least into the build process for each version that is released to the public?
crutching commented 2 years ago
  • What are you using for a panel of normals in both WES and WGS?
  • What reference assembly are you using?
  • Were the new false positives present in the unfiltered Mutect2 VCF before version 4.1.9.0 (and then flagged by FilterMutectCalls) or absent entirely?

Hi, my apologies for not answering these questions sooner. We are utilizing a panel of normals that was created in house from about 40 samples. The reference assembly is still hg19. In 4.1.9.0, we noticed multiple high-confidence variants that were no longer being called using equivalent parameter sets. So in our case, we are dealing with a false-negative problem.

schelhorn commented 2 years ago

Hi @davidbenjamin,

may I ask whether there has been any progress on addressing the bug and/or generating the new PoN? Thank you.

schelhorn commented 1 year ago

Sorry for not letting this go, but this is a severe bug with repercussions for clinical diagnostics that has been neither communicated to other users nor fixed for more than three months.

@droazen, is there anything else that we could do to hasten this along? Is there any workaround other than downgrading Mutect2 to the version from more than two years ago?

davidbenjamin commented 1 year ago

I am going to be able to identify samples for a new panel of normals this week, after which generating and validating the panel will take another few days.

schelhorn commented 1 year ago

Excellent news, thank you @davidbenjamin!

schelhorn commented 1 year ago

Happy New Year, @davidbenjamin! Coming back to your comment from October, have you perhaps had a chance to look at the new panel of normals and whether it solves the issue? It has been almost exactly half a year since we reported the bug, and given the severity of the issue for clinical applications, it would be important to get a better understanding of whether the updated PON fixes it. Thanks a bunch!

crutching commented 1 year ago

@schelhorn We also develop clinical workflows and came to the conclusion that we would hard-stop at 4.1.8.0 and wait for Mutect3 to be ready for prime time. My only comment is that it would be strange to me if a PON update solved this, considering we use a custom in-house PON...

schelhorn commented 1 year ago

@jhl667, yep, I understand. Still, this issue is too big to be left alone. I think the Broad has to act here, since patients' lives are at stake and we haven't received any actionable response for half a year. Mutect2 has a good reputation, and the Broad profits from that, but it is also medical software and should be treated as such. Mutect2 will continue to be used for some time, and this has to be fixed.

@droazen, is there anything else you people can do? Is the wider GATK dev team aware of the issue? Is this something I should escalate to someone else so that you get support from your management to fix it? We have pharma collabs with the Broad in place, I could try doing it that way, or via the genomics/pharma community. I'm not trying to be threatening or anything, just thinking out loud how we can help you to get the resources to solve this.

droazen commented 1 year ago

@schelhorn Sorry for the long wait on this issue, which we do take seriously! Unfortunately as mentioned above we have extremely limited resources for Mutect2 maintenance at the moment due to our focus on Mutect3, which we see as the long-term solution to this and other issues. You should definitely continue to run 4.1.8.0 until this is resolved, if at all possible. Having said that, we are generating a new M2 PON now, and should know soon whether it resolves this issue as @davidbenjamin hopes. Stay tuned for an update on the outcome of this evaluation within the next week or two.

ury commented 1 year ago

@droazen +1 for being affected by this issue in production. As this is in production (same as @schelhorn, with big pharma, which has very strict security requirements), and as 4.1.8.0 contains critical security vulnerabilities that were mitigated in subsequent releases, we are in a serious pickle here.

@jhl667 how did you conclude that 4.1.8.0 performs better than newer versions? What we see is that it emits more variants, but after filtering and intersecting with other callers (e.g. Strelka), we get more variants and a "better" result (we can't really define "better"; it's merely an observation) with 4.2.4.1.

ddrichel commented 1 year ago

@ury the performance evaluation is based on the HCC1395 benchmark, as described in my original report (see above). More variants in newer versions is in line with our analysis, but the new variants are likely to be mostly false positives.

ury commented 1 year ago

@ddrichel so in our case it seems to be the other way around: 4.1.8.0 produces many more raw variants than 4.2.4.1, but after filtering and intersection the ordering flips.

droazen commented 1 year ago

Update: @davidbenjamin has generated a new panel of normals, and is in the process of running some analysis on it.

davidbenjamin commented 1 year ago

Okay, I have a new panel for hg38 here: gs://broad-dsde-methods-davidben/mutect2-2023-panel-of-normals/mutect2-hg38-pon.vcf.gz

It has all the variants of the old panel, plus more that arose in more recent versions of Mutect2. It is also generally somewhat more conservative, with a greater bias toward precision than the previous one.

This panel is intended to be used at your own risk. I can vouch that it doesn't wreck the results of our own validations but I do not have time to vet it thoroughly enough to put it in the best practices google bucket. Likewise, I cannot promise that it will improve specificity in any particular set of samples.
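A sketch of dropping it into an existing command (other arguments as in your current pipelines; if no .tbi index is available next to the VCF, index it locally first):

    gsutil cp gs://broad-dsde-methods-davidben/mutect2-2023-panel-of-normals/mutect2-hg38-pon.vcf.gz .
    gatk IndexFeatureFile -I mutect2-hg38-pon.vcf.gz

    gatk Mutect2 -R "$ref" -I "$inbam_t" -I "$inbam_n" --normal-sample WES_FD_N \
        --panel-of-normals mutect2-hg38-pon.vcf.gz \
        -O "$outvcf"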

Within several months I hope we are all running the next version of Mutect and never need to see a panel of normals again.

ddrichel commented 1 year ago

@davidbenjamin @droazen unfortunately the new PON does not make up for the precision loss introduced in v4.1.9.0: in v4.4.0.0 we get just 2 fewer FP SNVs in our performance evaluation, compared to the old PON (WES tumor-normal mode, HCC1395 benchmark).

Any chance to get this issue fixed? With Mutect3 not being available and v4.1.8.1 being affected by the log4j vulnerability, it is quite regrettable to be stuck with inferior precision.

Extended methods, code, and data to reproduce the issue are here: https://github.com/ddrichel/Mutect2_calling_performance_bug

schelhorn commented 1 year ago

Thanks a lot, @ddrichel.

Given that the current Mutect2 release is still broken on both tumor-normal and tumor-only WES data, and a downgrade is not possible on production systems due to the log4j vulnerability: is there any path forward for users who care about both accuracy and security, @davidbenjamin and @droazen?

I fear waiting for Mutect3 isn't an option, since even when it is finished there won't be independent benchmarks available for it for quite a while. Also, I suspect that (like any other software product) the new version will have bugs of its own until it has matured in production.

Therefore I'd suggest that identifying, understanding and fixing the bug in the current Mutect2 release would be the wisest path forward - do you agree?

droazen commented 1 year ago

@schelhorn @ddrichel Thanks for evaluating the new PON, and sorry that it didn't resolve the precision issue! I agree that in an ideal world where grant-imposed deadlines didn't exist, and we had unlimited developer resources, fixing the issue in M2 would be the best path forward. Since we unfortunately don't live in that world, and are unlikely to have developer bandwidth to work on this issue in the near future, let me suggest an alternate path:

Since you are satisfied with the output of 4.1.8.1, and are only prevented from running that version by the log4j issue, I think your best option for now is to run a build of 4.1.8.1 with the log4j vulnerability patched out. This is very simple to create, and just involves changing the log4j version in our build file and rebuilding GATK. If you'd like to pursue this option, we'd be happy to create such a build for you, or provide instructions for creating it yourself if you prefer.

schelhorn commented 1 year ago

Thanks, @droazen.

While I understand the effects of the funding landscape on academic resources, this seems to me like a full capitulation by the GATK developer team in the face of a serious bug, especially given that the team apparently has enough resources to continue working on Mutect3.

Mutect2 has been one of the best-performing variant callers of recent years and is a major reason for the Broad's good reputation in the oncology bioinformatics field. GATK and Mutect2 are used by hundreds of institutions in clinical practice, affecting thousands of real patients' lives. Almost all of these institutions are likely to use clinical WES assays for cost reasons and will thus have been directly affected by this issue for the last three years. Also, almost all of these institutions will never learn of this bug, since they likely trusted the developers to have proper functional regression tests in place.

If this is indeed the best the Broad can do as an institution, then I will take your offer of providing a build of Mutect2 4.1.8.1 with the log4j vulnerability patched out - thank you. The one thing that I am asking for in addition (for the sake of the overall oncology bioinformatics community), however, is that you make a best effort to notify organizations (universities, hospitals, and biotechs/pharmaceuticals that you know are using Mutect2) and best-practice workflow owners (Nextflow, Snakemake, WDL, CWL, etc. that include Mutect2) of the forced downgrade. Also, I think it makes sense to include a very prominent warning in the Mutect2 READMEs and the GATK best-practice documentation and guides.

I know that this is work, too, but with success comes responsibility, and I can only hope that providing proper warnings uses less developer bandwidth than applying binary search to find out which of the 10 commits between 4.1.8.1 and 4.1.9.0 that touch variant filtering (see below) broke your flagship product badly enough for it to be abandoned.

(For anyone looking at this issue later: these are the commits I think are most likely to be related to this issue, and which I would propose systematically leaving out of the 4.1.9.0 build to test whether variant calling specificity is restored. Assuming the 10 commits are independent and leaving each one out in turn produces a working build, this would mean producing 10 Mutect2 builds for functional regression testing; @ddrichel could run the tests if we received the 10 builds from the GATK team.)

  1. https://github.com/broadinstitute/gatk/commit/a304725a60f5000ec6381040137043a557fc3dc1
  2. https://github.com/broadinstitute/gatk/commit/4982c2fa60e89f699a81150116d058aeac2f7573
  3. https://github.com/broadinstitute/gatk/commit/07aed754995717c02408517f8def57a8b8713ed7
  4. https://github.com/broadinstitute/gatk/commit/3502d4484e994c2d6154db78784a5ff7beafc9e9
  5. https://github.com/broadinstitute/gatk/commit/a269d063e245bf44846e0e54dfaab708b9116920
  6. https://github.com/broadinstitute/gatk/commit/d1d979fc535ac7b5075deb888c34e3a6512160b6
  7. https://github.com/broadinstitute/gatk/commit/650c2b390bbe45853ae8b4c18243fbca3c771a7b
  8. https://github.com/broadinstitute/gatk/commit/22460dafe45fda48b23c629200cc94dbdcfca7ca
  9. https://github.com/broadinstitute/gatk/commit/5a14c7cc72b86d81e3076fb8199da30c7303df2f
  10. https://github.com/broadinstitute/gatk/commit/6a0853d823813d96759088a18021912ccb9bd237
ddrichel commented 1 year ago

@droazen @schelhorn Let me try to identify the problematic commit with git bisect. I have tried building commit-specific versions of GATK, and it seems to work.
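The bisect loop looks roughly like this (the benchmark step is our own chr22 Mutect2/FilterMutectCalls/som.py evaluation, indicated only as a comment):

    git clone https://github.com/broadinstitute/gatk.git && cd gatk
    git bisect start
    git bisect bad  4.1.9.0
    git bisect good 4.1.8.1

    # at each commit suggested by bisect:
    ./gradlew clean bundle -Drelease=true
    # ... run the chr22 benchmark with this build and compare FP counts ...
    git bisect good    # or "git bisect bad", depending on whether the extra FPs appear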

ddrichel commented 1 year ago

Done, the problematic commit was https://github.com/broadinstitute/gatk/commit/a304725a60f5000ec6381040137043a557fc3dc1

Below are the results for the versions suggested by git bisect. Calling performance was evaluated with chr22 only in the query VCF against the full truth VCF, so the recall values are invalid (but irrelevant for this purpose):

[Results: FD tumor-normal chr22 for 4.1.8.1, 4.1.8.1-28-g6d8cdfc, 4.1.8.1-42-g851c840, 4.1.8.1-49-g4e8b73e, 4.1.8.1-50-ga304725, 4.1.8.1-51-g66570bf, 4.1.8.1-52-g1238565, and 4.1.9.0]

As stated in the pull request https://github.com/broadinstitute/gatk/pull/6821, the change was evaluated with the DREAM3 benchmark and made sense at the time. To be fair, in the HCC1395 benchmark there is still a gain in recall (<2%, see the first figure in my original report above), which I assume to be mostly due to this commit, but the recall/precision tradeoff looks different now with the better benchmark callset.

I have not figured out exactly what the code in the commit does; maybe we can fine-tune the changes, @davidbenjamin?

droazen commented 1 year ago

@schelhorn / @ddrichel: I've tagged a patched version of 4.1.8.1 with the log4j fix applied: https://github.com/broadinstitute/gatk/releases/tag/4.1.8.1.log4jfix

A build of this version is available here: https://github.com/broadinstitute/gatk/releases/download/4.1.8.1/gatk-4.1.8.1.log4jfix.zip

You may also build it yourself by running git fetch && git checkout -f 4.1.8.1.log4jfix && ./gradlew clean bundle -Drelease=true in a clone of this repository. Note that since this is an older version of GATK, it requires Java 8 to build. Please let us know if you have any issues running this build.

I've forwarded your git bisect findings to @davidbenjamin -- thank you for sharing them. As I said earlier, due to other commitments we'll be unable to dig further into this issue in the short-term, but longer-term it will be addressed one way or another, either via Mutect3 or via a patch to Mutect2. I'll discuss with David the best way to communicate this issue to other users.

In the meantime, if you wish to propose a patch to Mutect2, we'd be happy to consider it. Or if continued Mutect2 development is important enough to your organization that you would consider funding its development, we can certainly discuss opportunities for collaboration.

ddrichel commented 1 year ago

Thanks @droazen for the patched 4.1.8.1, the effort is appreciated.

FYI, I ran an experiment setting ONE_THIRD_QUAL_CORRECTION (introduced as private static final int ONE_THIRD_QUAL_CORRECTION = 5; in https://github.com/broadinstitute/gatk/commit/a304725a60f5000ec6381040137043a557fc3dc1) to 0, which should effectively revert the commit.
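Roughly, the setup was as follows (the file holding the constant is not named here, hence the grep; the build command is the one @droazen gives above):

    git checkout 4.4.0.0
    grep -rln "ONE_THIRD_QUAL_CORRECTION" src/main/java/
    # edit the file reported by grep so that the constant reads:
    #     private static final int ONE_THIRD_QUAL_CORRECTION = 0;
    ./gradlew clean bundle -Drelease=true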

In v4.4.0.0, for SNVs, this results in a ~2% loss of recall and a striking ~13% gain in precision (from 0.8286 to 0.9609), a ~4% improvement over v4.1.8.1.

[Results: FD WES tumor-normal for 4.1.8.1, 4.1.9.0, 4.4.0.0, and 4.4.0.0-18-g7518611 (ONE_THIRD_QUAL_CORRECTION = 0)]

davidbenjamin commented 1 year ago

@ddrichel It is regrettable that the tradeoff between precision and recall wasn't beneficial in your data, but in other data it has been. I can't call it a bug because mathematically speaking it is an improvement. At worst it can be considered an arbitrary movement along the ROC curve. Regardless, we have always developed Mutect2 and its successor under the principle that theoretical soundness is the best long-term policy. That is not to deny or disregard that some changes harm performance on some data.

There is a chance that adjusting --f-score-beta, which controls the relative contribution of recall and precision to the weighted F score that FilterMutectCalls seeks to optimize, will reverse this shift and give up the gained recall in exchange for the lost precision. I can't promise that this will help, but fortunately you have the patched 4.1.8.1 that @droazen has kindly produced.
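For example, something along these lines, where 0.5 is an arbitrary illustration favoring precision rather than a tuned value:

    # Re-run only the filtering step with beta < 1 to weight precision more heavily than recall.
    gatk FilterMutectCalls \
        --variant "$outvcf" --reference "$ref" --intervals "$wesbed" \
        --stats "${outvcf}.stats" \
        --threshold-strategy OPTIMAL_F_SCORE --f-score-beta 0.5 \
        --output "${outvcf%.vcf}_beta0.5_filtered.vcf"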

ddrichel commented 1 year ago

@davidbenjamin the benchmark data is, again, not ours; it is the current somatic "gold standard" for HCC1395 from https://www.nature.com/articles/s41587-021-00993-6, based on data from multiple WGS sequencing runs with a combined 1,500x coverage, and so on.

Apart from that, we are on the same page here. Now that the change leading to the apparent differences in performance is found, and holds up to scrutiny so far (thanks for looking into it), it is a more real possibility than ever that this is not a bug but a case of Mutect2 outperforming the "gold standard".

Let's keep the issue open for now, I am still investigating and would like to have a place to report my progress.

schelhorn commented 1 year ago

Hey @davidbenjamin,

just a heads-up that we have generated some additional insights on this issue relating to the effects of sample prep on false-positive rates under the changed error model. In short, it seems as if the error model is now much more (perhaps overly?) sensitive to the sample-prep noise typical of real-world clinical sequencing data, leading to a 2-3x increase in false positives in such WES (and, we assume, other high-coverage) samples. This may have been an undesired side effect of overfitting the error model to the synthetic (and thus much cleaner) DREAM3 benchmark two years ago.

We are presenting the results today at ISMB in Lyon, the world bioinformatics conference. Let me know if any of you guys are participating so that we can have a chat over a glass of wine, if you like!

schelhorn commented 1 year ago

Dear @davidbenjamin and @droazen,

please find here the poster that we presented this Monday at ISMB 2023 in Lyon. It sheds some additional light on this issue and explains why we believe that fine-tuning of the error model introduced in Mutect2 4.1.9.0 led to overcalling of variants in samples with increased (but not pathologic) DNA degradation. Presentation of the poster led to several interesting discussions with other users of Mutect2 at the conference.

Given these results, I would propose that you expose the constant newly set in commit a304725 (private static final int ONE_THIRD_QUAL_CORRECTION = 5;) as a user-facing command-line parameter. In this manner, you could leave the parameter at 5 by default but allow users who work on clinical (especially FFPE) samples to change it to 0 instead, thus effectively restoring the behaviour shown (and benchmarked) prior to 4.1.9.0.

Would that be an acceptable compromise?

droazen commented 1 year ago

@schelhorn Exposing this parameter (perhaps as an @Advanced argument) seems like a reasonable solution to me -- would you or someone on your team be able to submit a PR with this change, and confirm that it works for your use case?

davidbenjamin commented 1 year ago

@schelhorn @droazen That sounds extremely reasonable to me!

droazen commented 1 year ago

@ldgauthier has volunteered to open a PR to expose this parameter; it should be part of the next GATK release.