broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Are our default SW parameters reasonable? #5564

Open davidbenjamin opened 5 years ago

davidbenjamin commented 5 years ago

We recently updated (PR #4858) the default Smith-Waterman parameters for realigning reads to their best haplotype. Although there is no reason the haplotype-to-reference alignment has to use the same parameters, it seems we are similarly favoring indels too much there as well. There is a forum discussion to this effect:

https://gatkforums.broadinstitute.org/gatk/discussion/23230/gatk-haplotypecaller-mnp-output-problem

Here are the parameters we use: [parameter table not rendered in this export]

These parameters, which are essentially a prior on biological variation, prefer an indel, with a cost of 260, to a SNP, with a cost of 350. This does not seem correct. It almost never comes up because the correct alignment is usually unambiguous, but when it does, shouldn't we break the tie in favor of the SNP?
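
To make the arithmetic explicit, here is a minimal sketch of how those two costs fall out of the scoring parameters. The values below are what I believe the current defaults (NEW_SW_PARAMETERS) to be, so treat them as an assumption, and the convention that gap-open covers the first gap base may not match our implementation exactly:

```python
# Minimal sketch (not GATK code) of how the quoted costs arise from the SW scores.
# Assumed values for the current defaults (believed to be NEW_SW_PARAMETERS):
MATCH, MISMATCH, GAP_OPEN, GAP_EXTEND = 200, -150, -260, -11

# Penalty for explaining a difference as a SNP: lose one match, pay one mismatch.
snp_penalty = MATCH - MISMATCH                     # 200 + 150 = 350

# Penalty for a length-L indel, assuming gap-open covers the first gap base.
def indel_penalty(length):
    return -(GAP_OPEN + (length - 1) * GAP_EXTEND)

print(snp_penalty)        # 350
print(indel_penalty(1))   # 260 -> cheaper than a SNP
print(indel_penalty(9))   # 348 -> even a 9 bp indel is priced below one SNP
```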

davidbenjamin commented 5 years ago

@ldgauthier what do you think?

ldgauthier commented 5 years ago

I've always thought these were weird. I'm told they were an empirical fit to NA12878 data, but I'm not sure I believe that 100%. They're also an order of magnitude larger than other implementations I've seen, though that probably doesn't matter.

davidbenjamin commented 5 years ago

The thing about empirical fits is that 99.99% of the time it won't matter because trying to align a SNP read to an indel haplotype will give you a long run of mismatches and a vanishingly small likelihood, so there's hardly any signal pushing the parameters in the right direction.

davidbenjamin commented 5 years ago

Update: there's another example on that thread now. If no one objects I'm going to do this.

ldgauthier commented 5 years ago

Let's do it. There are a lot of weirdly represented complex events in the gnomAD browser that have come up and I expect this should fix 99% of them.


davidbenjamin commented 5 years ago

So, I tried this out and it's harder than it seems. Some calls get better, some get worse. There probably exists a simple set of heuristics to make reasonable alignments, but I don't think it can be represented as a single set of SW parameters. We might need some kind of post-processing to judge whether a better alignment than the Smith-Waterman one exists.

samuelklee commented 4 years ago

Consolidated with #2498.

Now that #6885 is done, I'm going to kick off a Bayesian optimization (using the pipeline-optimizer from https://github.com/broadinstitute/gatk-evaluation/tree/master/pipeline-optimizer I presented on long ago) over all 3 sets of SW parameters using unfiltered HaplotypeCaller -> vcfeval F1 on NA12878 chr22 (with F1 optimized on GQ threshold and enabling decomposition of variants) as the target. This is probably not what we want to ultimately do in practice---we may want to more heavily weight sensitivity after calling, hook up variant filtering, stratify on high/low confidence regions or variant characteristics, etc.---but I'm just curious to see what happens.
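
To make the planned setup concrete, here is a rough sketch of the optimization loop. This is purely illustrative: the actual runs will go through the pipeline-optimizer/Advisor stack driving Cromwell, scikit-optimize below is just a stand-in, and run_hc_and_vcfeval is a hypothetical helper, not an existing function.

```python
# Illustrative sketch only, using scikit-optimize as a stand-in for the actual
# Advisor/Cromwell-based pipeline-optimizer. run_hc_and_vcfeval() is hypothetical:
# it would run unfiltered HaplotypeCaller with the given SW parameters on NA12878
# chr22 and return the best F1 reported by vcfeval.
from skopt import gp_minimize
from skopt.space import Integer

# One (match, mismatch magnitude, gap-open magnitude) triple per SW parameter set;
# the set names are placeholders, and gap-extend penalties are held fixed here.
space = [Integer(1, 300, name=f"{sw_set}_{param}")
         for sw_set in ("sw_set_1", "sw_set_2", "sw_set_3")
         for param in ("match", "mismatch_mag", "gap_open_mag")]

def objective(x):
    f1 = run_hc_and_vcfeval(x)  # hypothetical: launch the workflow, parse the vcfeval summary
    return -f1                  # gp_minimize minimizes, so negate F1

result = gp_minimize(objective, space, n_calls=50, random_state=0)
print("best parameters:", result.x, "best F1:", -result.fun)
```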

I see two potentially useful outcomes: 1) we demonstrate that parameters don't have much of an effect and can be consolidated, or 2) we find more optimal sets of parameters. Potentially we could also show that 3) our parameters are already optimal (I'd say this would be by pure dumb luck), in which case we could at least demonstrate and document some justification for them.

If the parameters don't have much of an impact on NA12878, I'm curious to see whether this holds for low coverage or messier data---and ultimately, in malaria. Just starting with NA12878 because of the availability of truth and the potential impact for the primary use case of calling in human data.

Some preliminary results: I ran the aforementioned comparison on chr22 with 1) 4.1.8.1 master and 2) 4.1.8.1 with haplotype-to-reference SW parameters changed from NEW_SW_PARAMETERS to STANDARD_NGS on two replicates of NA12878 (O1D1 and O2D2 from the 2018 NovaSeq snapshot experiment). On each replicate, 2) demonstrated slightly lower performance, but it was well within the sample-to-sample variation between these two replicates. Here are the corresponding vcfeval summaries:

::::::::::::::
NA12878/O1D1/4.1.8.1/summary.txt
::::::::::::::
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   84.000              40778          40780      35116       1412     0.5373       0.9665     0.6907
     None              41994          41994      43760        196     0.4897       0.9954     0.6564
::::::::::::::
NA12878/O1D1/STANDARD_NGS/summary.txt
::::::::::::::
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   84.000              40743          40745      35255       1447     0.5361       0.9657     0.6895
     None              41955          41954      43903        235     0.4886       0.9944     0.6553
::::::::::::::
NA12878/O1D2/4.1.8.1/summary.txt
::::::::::::::
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   99.000              40951          40954      35877       1239     0.5330       0.9706     0.6882
     None              42036          42039      45223        154     0.4818       0.9963     0.6495
::::::::::::::
NA12878/O1D2/STANDARD_NGS/summary.txt
::::::::::::::
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   99.000              40914          40919      36005       1276     0.5319       0.9698     0.6870
     None              41995          41998      45340        195     0.4809       0.9954     0.6485

Probably shouldn't draw too many conclusions from this limited comparison, but it's something.
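
For reference, the derived columns in these summaries follow directly from the count columns; here is a quick check against the first row of the O1D1/4.1.8.1 table:

```python
# Reproduce the derived columns of the vcfeval summary from the raw counts
# (first row of the NA12878/O1D1/4.1.8.1 table above).
tp_baseline, tp_call, fp, fn = 40778, 40780, 35116, 1412

precision   = tp_call / (tp_call + fp)                                  # 0.5373
sensitivity = tp_baseline / (tp_baseline + fn)                          # 0.9665
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)   # 0.6907

print(f"{precision:.4f} {sensitivity:.4f} {f_measure:.4f}")
```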

samuelklee commented 4 years ago

Now that I look at https://github.com/broadinstitute/gatk-evaluation/tree/master/pipeline-optimizer, it will take a bit of work to get up to date (it was tied to a particular version of cromwell-tools, and Advisor, which is python2-based, has not been actively maintained---exactly why I wanted to move to a better-supported solution...).

If it's not too much trouble, I'll try to get everything running locally on this older stuff for the first round of optimization, but this problem is probably a perfect candidate for the Neptune-based solution that @dalessioluca recently finished at https://github.com/dalessioluca/cromwell_for_ML.

samuelklee commented 4 years ago

I ran a few Bayesian optimizations, mostly focusing on various subsets/ranges of the read-to-haplotype and haplotype-to-reference SW parameters. I also varied whether I used part/all of chr22 and chose either F1/sensitivity as the optimization target.

These experiments show that varying the SW parameters can definitely move the needle in terms of performance, even after variant normalization. For example, here are the SNP precision/sensitivity curves for an optimization of F1 over about half of chr22:

[Figure: SNP precision/sensitivity curves across the parameter sets explored in the F1 optimization (snp_f1)]

And the same for indels:

[Figure: indel (non-SNP) precision/sensitivity curves for the same runs (non_snp_f1)]

This all raises the question of what the best optimization strategy should be. It looks like the SW parameters aren't the limiting factor for sensitivity (I would guess that might be MQ or other read filters) but can probably help improve precision. So we may want to relax some filters or optimize them jointly. We may also want to look at optimizations that include filtering, as I mention above.

samuelklee commented 4 years ago

One observation that illustrates the need for care when optimizing metrics: for a few of the F1 optimizations, the haplotype-to-reference match-value parameter gets driven to its minimal value (1). Not 100% sure, but I'm guessing this might effectively boost precision by somehow cutting down on the complexity of proposed haplotypes---it depends on what the exact behavior of our SW algorithm is for negative scores. @davidbenjamin any thoughts on this behavior?

Something I don't quite understand yet is if we can impose some effective constraints on the parameters or otherwise reduce the number of independent dimensions. For example, it seems reasonable to me to fix the gap-extend penalties to -1 and let all other parameters be defined w.r.t. them. But perhaps we can also fix the match values similarly?
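
For what it's worth, here is a sketch of the normalization I have in mind, assuming overall scale really is inconsequential (see the question I raise further down): each set can be rescaled so that its gap-extend penalty is -1, removing one degree of freedom per set. If the scores must remain integers, only integer multiples of the normalized set reproduce the original exactly.

```python
# Sketch: rescale a (match, mismatch, gap open, gap extend) set so gap-extend = -1.
# The example values below are assumed/illustrative.
def normalize(match, mismatch, gap_open, gap_extend):
    scale = -1.0 / gap_extend
    return tuple(round(x * scale, 2) for x in (match, mismatch, gap_open, gap_extend))

print(normalize(200, -150, -260, -11))  # (18.18, -13.64, -23.64, -1.0)
```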

fleharty commented 4 years ago

@samuelklee Very interesting, what are the different colors? Different samples?

samuelklee commented 4 years ago

@fleharty sorry, each curve is a different set of parameters run on the same sample, and I'm plotting all curves generated over the course of optimizing those parameters.

fleharty commented 4 years ago

@samuelklee How do these curves compare with the current default parameters?

samuelklee commented 4 years ago

You can probably eyeball it from the 4.1.8.1 results above, which provide two points on the respective curves (although note those numbers are not broken down by SNP/indel). I'll generate nicer plots once things are finalized; this is just what rtg rocplot spits out.

There are a few regions of parameter space where precision seems to improve at the expense of a bit of sensitivity. Again, this might not be optimal if we are going to filter afterwards.

I guess the main takeaway at this stage is that changing the parameters can result in non-negligible performance changes, probably beyond the level of sample-to-sample variation, which wasn't obvious from the first few runs I posted.

dalessioluca commented 4 years ago

It seems to me that producing some parallel coordinate plots is the way to go to gain intuition about how the SW and filter parameters affect the area under the precision/recall curve. I assume that, overall, we have about 10 parameters and 2-3 metrics (AUC for SNPs, AUC for short indels, AUC for long indels). This is small enough that we should be able to gain intuition by looking at parallel coordinate plots.

This is a nice visualization package to produce parallel coordinate plots: https://facebookresearch.github.io/hiplot/index.html
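
A minimal sketch of what this would look like for the SW-parameter trials (the field names and values below are made-up placeholders):

```python
# Minimal HiPlot sketch: one dict per optimization trial, combining the varied SW
# parameters with the resulting metrics. Field names and values are placeholders.
import hiplot as hip

trials = [
    {"h2r_match": 4, "h2r_mismatch": -10, "h2r_gap_open": -12,
     "r2h_match": 2, "r2h_mismatch": -6, "r2h_gap_open": -8,
     "snp_auc": 0.95, "indel_auc": 0.91},
    {"h2r_match": 3, "h2r_mismatch": -9, "h2r_gap_open": -15,
     "r2h_match": 2, "r2h_mismatch": -5, "r2h_gap_open": -4,
     "snp_auc": 0.94, "indel_auc": 0.90},
    # ... one entry per trial in the optimization history
]

hip.Experiment.from_iterable(trials).to_html("sw_parameter_trials.html")
# or .display() when running inside a Jupyter notebook
```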

samuelklee commented 4 years ago

Hmm, guess I didn't realize how important it was to subset to high-confidence regions. Most of the false positives above are coming from the low-confidence regions, so lower precision may actually indicate higher sensitivity to real variants that are present there but missing from the truth set.

When optimizing parameters in the past, have we always just focused on the high confidence regions and called it a day?

Also, do we typically run HC/M2 identically in low/high confidence regions? I’m noticing that it’s taking much longer to get through the low confidence regions.

ldgauthier commented 4 years ago

To be perfectly clear, do you mean high/low confidence or high/low complexity?

samuelklee commented 4 years ago

Either is fine for the purposes of the point I’m trying to make, as I would guess there’s decent overlap. But to be precise, I have started to stratify the above runs on GIAB NA12878 high/low confidence regions, which prompted the question.

Also note that I haven't run any formal benchmarks of runtime---these are just informal observations---but the point probably still stands regardless of that as well.

ldgauthier commented 4 years ago

We don't ever look at calls outside the GiaB high confidence region, so there's no reason to spend compute there. I believe there are no truth set variants outside the high confidence region, so there's not a lot to be gained by looking at our calls there. There are a variety of reasons sites are excluded from the high confidence regions.

We don't have different parameters for the high versus low complexity regions, but the new DragSTR model will effectively do that.

samuelklee commented 4 years ago

Do you mean in production? In the ~2.92Gbp of the WGS calling regions, I see all ~2.44Gbp of the GIAB HCR, as well as ~485Mbp / ~778Mbp of the GIAB "LCR" (naively, just the complement of the HCR). Or did you mean in another context?

I'm kicking off runs with the CHM now. Hopefully, optimizations of sensitivity outside the CHM HCR will be more meaningful, since I see a decent amount of calls outside of it.

ldgauthier commented 4 years ago

For evaluations, we don't look at calls outside the high confidence regions. For production we deliver calls over the whole genome (except for Y in WGS, which is another story).

Even for CHM, I wouldn't optimize outside the high confidence regions. For example, seg dupes are excluded from the high confidence regions. There likely will be calls there, but there's no guarantee that even a real variant should map to that copy of the segment.

Of course if either @davidbenjamin or @fleharty has a differing opinion I'm open to discussion.

fleharty commented 4 years ago

@ldgauthier I'm in complete agreement, I don't think we should optimize outside the high confidence regions.

samuelklee commented 4 years ago

@takutosato had a good suggestion: to stratify to low-complexity regions in the high-confidence regions. Not sure how many variants are there, but will take a look. EDIT: looks like it's ~5k / ~54k on chr22 in CHM.

More generally, there is obviously no unique answer to the question of which loss function to optimize when setting "default" parameter values. The problem is also made a little more complicated by our current strategy of sensitive calling + non-trivial filtering. But it would be great to come up with some hard constraints (e.g., we never want runtime/cost to exceed X, we always want to maintain Y metrics in these regions on these samples) and general procedures, then apply them as equitably as possible across all method/parameter changes.

Also generally, I'm a bit wary of focusing too hard on the high-confidence regions, as this might lead to overfitting or could understate the potential of method/parameter changes in more difficult regions. But probably we'll have to downweight the loss or do more manual checks in low-confidence regions until we improve truth resources there.

One naive question, just want to double check: is it correct that the overall scaling of each set of SW parameters is inconsequential? E.g., if I multiply each by a constant, should I expect the same results? I would expect this to be the case (unless my hazy recollection of the details of SW scoring is off) and simple experiments bear this out, but I'm not sure if there are some edge cases or idiosyncrasies in our implementation or use of the scores that I might be missing.
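
For completeness, this is the kind of simple experiment I mean (a toy scorer, not our implementation): every alignment's total score is a sum of per-event scores, so multiplying all of the parameters by the same positive constant rescales every candidate alignment's score identically and the ranking, and hence the optimal alignment, cannot change. Integer rounding of non-integer scale factors, or any downstream code that interprets absolute score magnitudes, would be the places where this could break down.

```python
# Toy check (not GATK's SW implementation) that positively rescaling all scores
# preserves which candidate alignment scores best. Each alignment is written as a
# string of ops: 'M' match, 'X' mismatch, 'O' gap open, 'E' gap extend.
def score(ops, match, mismatch, gap_open, gap_extend):
    per_op = {"M": match, "X": mismatch, "O": gap_open, "E": gap_extend}
    return sum(per_op[op] for op in ops)

candidates = {            # a few made-up candidate alignments (op strings)
    "A": "MMMMXMMMM",     # one mismatch
    "B": "MMMMOMMMM",     # one 1 bp gap
    "C": "MMOEEMMXM",     # a 3 bp gap plus a mismatch
}

base   = dict(match=200, mismatch=-150, gap_open=-260, gap_extend=-11)  # assumed values
scaled = {k: 7 * v for k, v in base.items()}

for params in (base, scaled):
    ranked = sorted(candidates, key=lambda name: score(candidates[name], **params), reverse=True)
    print(ranked, [score(candidates[name], **params) for name in ranked])
# Both parameter sets rank the candidates identically; only the magnitudes change.
```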

davidbenjamin commented 4 years ago

@samuelklee I would run these experiments with the -linked-de-bruijn-graph option, which will reduce the number of haplotypes in a more principled way.

samuelklee commented 4 years ago

Thanks @davidbenjamin, I can try that out. Any other parameters or modes that you feel might be gating any of these metrics/optimizations, which should be explored jointly with the SW parameters?

I guess the same question applies for linked-de-bruijn-graph, which is currently marked as experimental: what would be the procedure/criteria for changing the default behavior? Hopefully, we can answer this question for the case of a binary parameter before tackling 12 parameters! In general, I'm interested in establishing clear processes so it's easier for anybody to propose improvements.

If there's no clear answer just yet, I'm happy to stop at exposing these parameters, perhaps consolidating defaults to one of the current sets if that is not too disagreeable (which is just slightly more complicated than a binary decision). Don't want to rabbit hole if there's no need. Hopefully, at the least, the blog-like documentation above will provide useful pointers to anyone that might want to tackle similar efforts in the future.

samuelklee commented 4 years ago

[Figure: precision/sensitivity curves from the sensitivity optimization described below (weighted)]

Here is the result of optimizing for sensitivity in the high-confidence, low-complexity region of chr22 in CHM, allowing the haplotype-to-reference and read-to-haplotype (match, mismatch, gap open) parameters to range over ([1, 20], [-20, -1], [-20, -1]) and fixing the gap-extend penalties to -1.

The optimal (match, mismatch, gap open) parameters found in this run appear to be:

haplotype-to-reference: 2, -8, -19
read-to-haplotype: 1, -4, -3

I wouldn't put much stock in interpreting these parameters or their exact values for now, but it does appear that the match values and the haplotype-to-reference gap-open penalty might be saturating the bounds of the search. Plots of the type suggested by @dalessioluca might be more illuminating.

Compare with default performance:

Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    9.000               4003           4019        494       1036     0.8905       0.7944     0.8397
     None               4009           4025        511       1030     0.8873       0.7956     0.8390

That the corresponding curve with a precision/sensitivity endpoint of (0.8873, 0.7956) above isn't at the top of the pack means that we could squeeze out some extra calls by varying the SW parameters.

Of course, this doesn't account for negative impact elsewhere. One could imagine writing a loss where this sensitivity is optimized while putting minimum constraints on precision, sensitivity, and/or F1 in the high-confidence, high-complexity regions (the assumption being the truth set is complete in those regions), or some weightings/variations thereof.
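
For concreteness, one way such a loss might be written (the floors, the penalty weight, and the evaluate helper below are placeholders rather than proposed values):

```python
# Hypothetical constrained loss: maximize sensitivity in the high-confidence,
# low-complexity stratum subject to floors on precision and F1 in the
# high-confidence, high-complexity stratum. evaluate() is a placeholder that would
# run HaplotypeCaller + vcfeval for the given SW parameters and return per-stratum
# metrics, e.g. {"hc_lc": {...}, "hc_hc": {...}}.
PRECISION_FLOOR = 0.96   # illustrative values, not proposals
F1_FLOOR = 0.96
PENALTY = 1.0e3          # weight for soft constraint violations

def loss(sw_params):
    metrics = evaluate(sw_params)                       # hypothetical helper
    objective = -metrics["hc_lc"]["sensitivity"]        # maximize sensitivity
    # soft-penalize constraint violations so the optimizer can still compare points
    objective += PENALTY * max(0.0, PRECISION_FLOOR - metrics["hc_hc"]["precision"])
    objective += PENALTY * max(0.0, F1_FLOOR - metrics["hc_hc"]["f1"])
    return objective
```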

EDIT: Actually, looks like overall performance in the high-confidence region improves:

Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000              49955          49955       1932       2065     0.9628       0.9603     0.9615
     None              49988          49988       1994       2032     0.9616       0.9609     0.9613

vs. defaults:

Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000              49837          49865       2012       2183     0.9612       0.9580     0.9596
     None              49870          49898       2077       2150     0.9600       0.9587     0.9594