broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Read filtering based on MQ is not a substitute for "mappability" filtering #4558

Closed: mbabadi closed this issue 5 years ago

mbabadi commented 6 years ago

As a step toward productionizing gCNV, I am exploring the idea of aggressively filtering reads (or targets, or bins) based on known annotations, such as genome mappability, complexity, SINEs, LINEs, repeats, etc.

This post is about mappability filtering, in particular for WGS samples. Let us consider coverage data collected by CollectFragmentCounts (default tool args: well-formed, paired, FR orientation, MQ > 30) for several WGS samples (hg38, BWA-MEM). Here's what the coverage distribution looks like for a random sample: image

Notice the main peak at ~ 90 fragments/bin, the diffuse counts below 90, and the peak at 0.

To proceed, I calculated the average mappability score of each bin using the recently published multi-read Umap track. It is not an ideal mappability scoring scheme (it does not consider Illumina-like errors or paired-end reads), but it is the best I could find for hg38. I filtered out all bins with a mappability score short of 100% (very aggressive). Here's what the filtered coverage distribution looks like: image

Evidently, the diffuse low-coverage part of the distribution has completely disappeared, along with the peak at 0.
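For concreteness, here is a minimal sketch of this bin-filtering step (file names and column layouts are hypothetical, and the Umap track is assumed to have been pre-averaged to one score per 1 kb bin, e.g. with bedtools map):

```python
import pandas as pd

# Hypothetical inputs: per-bin mean mappability (pre-averaged from the Umap
# track) and per-bin fragment counts, both keyed by (contig, start, end).
bins = pd.read_csv("bins_with_mean_mappability.tsv", sep="\t",
                   names=["contig", "start", "end", "mean_mappability"])
counts = pd.read_csv("fragment_counts.tsv", sep="\t",
                     names=["contig", "start", "end", "count"])
merged = counts.merge(bins, on=["contig", "start", "end"])

# Very aggressive cutoff: keep only bins that are 100% uniquely mappable.
filtered = merged[merged["mean_mappability"] >= 1.0]
print(f"retained {len(filtered) / len(merged):.1%} of bins")
```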

Here's a plot of the number of retained bins after filtering with various mean mappability cutoffs: image

Except for chrY, even the most aggressive filtering only gets rid of ~ 20% of the genome.

Let us fit different distributions to the filtered coverage data: image

As expected, the negative binomial is almost a perfect fit! This is very reassuring, because the gCNV model is built on the negative binomial (a choice originally made on purely theoretical grounds).
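A method-of-moments NB fit is enough to sanity-check this (a sketch; `filtered` is the filtered bin table from the snippet above):

```python
import numpy as np
from scipy import stats

counts = filtered["count"].to_numpy()

# Method-of-moments fit: in scipy's nbinom(n, p) parameterization,
# mean = n(1 - p)/p and var = n(1 - p)/p**2, so p = mean/var.
m, v = counts.mean(), counts.var()
p = m / v
n = m * p / (1 - p)

# Compare the fitted pmf against the empirical count histogram.
x = np.arange(int(counts.max()) + 1)
empirical = np.bincount(counts.astype(int)) / counts.size
fitted = stats.nbinom.pmf(x, n, p)
```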

Now, let us look at the unfiltered coverage and regress coverage against mappability. Here's what it looks like for 3 different contigs and 5 different samples: image image image image image

Clearly, there is a strong correlation here. In particular, low-mappability bins show two distinct modes: one at the main coverage peak (~ 100 fragments/bin), and one that bifurcates toward lower values.

I conjecture that the bimodality results from heterogeneity of mappability scores across positions within the same bin. The bins are 1 kb wide, and it is plausible that some positions are highly mappable while others are not. This conjecture can be tested by collecting coverage on smaller bins and checking whether the bimodality weakens. If it does, I suggest filtering based on read position, similar to Genome STRiP, as opposed to filtering whole bins.

Also, there is little sample-to-sample variation in the coverage-mappability scatter plots (as opposed to, say, GC). Therefore, there is no reason to treat mappability as a per-sample bias covariate; the mappability coverage bias can be captured by a cohort-wide mean bias.

Finally, let us study the NB overdispersion of different samples for different contigs: image

There's a clear structure here: some samples have higher overdispersion than others. This could be due to degraded samples, a less even GC curve, different chemistry, etc. In any event, we can regress the residual variance $\psi_{sj}$ (for sample $s$ and contig $j$) against a contig-level baseline $\psi_j$ with a linear model:

$\psi_{sj} \sim \mathcal{N}(a_s \, \psi_j + b_s, \beta)$

Here's what the regression looks like: image

Pretty much everything is explained by the linear model. This supports our choice of the linear-NB model in gCNV.
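A sketch of this per-sample regression (assuming $\psi_j$ is taken as the cohort median overdispersion of contig $j$; the input file is hypothetical):

```python
import numpy as np

# psi[s, j]: NB overdispersion of sample s on contig j, estimated elsewhere.
psi = np.loadtxt("overdispersion_matrix.tsv")

# Take psi_j as the cohort median per contig and fit a_s, b_s per sample.
psi_j = np.median(psi, axis=0)
coeffs = np.array([np.polyfit(psi_j, psi[s], deg=1) for s in range(psi.shape[0])])
a_s, b_s = coeffs[:, 0], coeffs[:, 1]

# The residual scatter around the per-sample fits estimates the noise scale beta.
residuals = psi - (a_s[:, None] * psi_j[None, :] + b_s[:, None])
beta = residuals.std()
```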

Finally, let us examine whether there is a correlation between $a_s$, $b_s$, and depth of coverage:

image image

There is absolutely no correlation, which is again expected.

mbabadi commented 6 years ago

@samuelklee mappability filtering is highly likely to get rid of a lot of the deletion calls in the ModelSegments pipeline.

samuelklee commented 6 years ago

A few comments to add on to what we discussed in person:

I think the coverage distribution is indeed the correct summary statistic to model for this problem. Total coverage just doesn't provide enough information, but subsampling bins or fitting a per-bin bias model is overkill.

However, I think a straightforward, self-contained modeling or masking approach (which need not rely on a mappability track) within the ploidy-determination tool is still quite feasible. If we can easily solve the problem without requiring a mappability track, we should try to do so, since such a track is a relatively expensive resource to create.

For example, some very naive hard filtering (red) of the histogram yields a peak that is easily fit by a negative binomial (green)---even a Poisson fit does not appear to bias the depth estimates, and certainly does not result in incorrect ploidy estimates:

[image: masked_fit]

(Incidentally, it is helpful to plot on a log scale when checking the fit of these distributions.)
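To illustrate the idea (the filter window and the Poisson fit below are placeholders for the actual heuristics, not the tool's implementation):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# hist[c] = number of bins on a contig with count c (hypothetical input,
# e.g. produced by np.bincount over the per-bin counts).
hist = np.loadtxt("contig_count_histogram.tsv")
c = np.arange(hist.size)

# Naive hard filter: keep a window around the main nonzero peak.
peak = hist[1:].argmax() + 1
keep = (c > 0.5 * peak) & (c < 1.5 * peak)

# Poisson fit to the retained mass; a log scale makes any tail misfit visible.
lam = np.average(c[keep], weights=hist[keep])
plt.semilogy(c, hist / hist.sum(), ".", label="data")
plt.semilogy(c, stats.poisson.pmf(c, lam), label=f"Poisson, lambda = {lam:.1f}")
plt.legend()
plt.show()
```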

This strategy also gives us a way to ignore low-level mosaicism or large germline events, which filtering on mappability may not address:

[image: mosaic]

So let's try to encapsulate changes to the ploidy tool. I agree that the histogram creation can easily be done on the Java side, to save on intermediate file writing. We can probably just cap the maximum count at k and pass a samples x contigs TSV where each entry is a vector with k + 1 elements.
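In other words, the intended TSV entries would be something like this sketch (the cap k = 250 is an arbitrary placeholder):

```python
import numpy as np

K = 250  # placeholder cap on per-bin counts

def capped_histogram(bin_counts, k=K):
    """Histogram of per-bin counts, with everything above k collapsed into
    the last element so that each row is a fixed-length (k + 1)-vector."""
    return np.bincount(np.minimum(bin_counts, k), minlength=k + 1)

# One such (k + 1)-vector per (sample, contig) entry of the TSV.
```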

I agree that there is still a lot of important work to be done in exploring our best practices for coverage collection, and I know that you have been interested in improving them for a while. Ultimately, we may want to consider incorporating mappability or other informative metadata, as we've discussed.

However, this will require some non-trivial investment in method/tool development time. Since our preliminary evaluations show that even with the current, naive strategies the tool is performing reasonably well, I am prioritizing cutting a release and improving/automating the evaluations. As we discussed, this will both allow users to start using the tool (which will hopefully result in useful feedback) and establish a baseline for us. This will ultimately provide the necessary foundation for future exploratory work and method development---which always takes more time than we think it will!

samuelklee commented 6 years ago

A few more plots. These show the histograms across all contigs for various samples. The filtered data looks quite straightforward (and I've purposely chosen samples that exhibit a few interesting quirks due to mosaicism, etc.):

[images: xy, xx, weird, weird2]

samuelklee commented 6 years ago

Also, I think that being able to express joint contig-contig ploidy priors will help immensely. That is, rather than giving X a prior that evenly weights CN = 1 and CN = 2 and Y a separate prior that evenly weights CN = 0 and CN = 1, we'd want to be able to specify that the combinations XX and XY should be evenly weighted (otherwise we are equally likely to get X and XXY).
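For instance (purely illustrative), the prior could be expressed over karyotype combinations directly, so that unlisted combinations like X0 or XXY receive only a small floor probability:

```python
import math

# Illustrative joint prior over (CN of X, CN of Y); weights are examples.
KARYOTYPE_PRIOR = {
    (2, 0): 0.5,  # XX
    (1, 1): 0.5,  # XY
}

def joint_log_prior(cn_x: int, cn_y: int, eps: float = 1e-10) -> float:
    # Under independent per-contig priors, X0 and XXY would each get the
    # same weight as XX and XY; here they only get the floor eps.
    return math.log(KARYOTYPE_PRIOR.get((cn_x, cn_y), 0.0) + eps)
```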

mbabadi commented 6 years ago

Thanks for sharing your analysis.

What is your criterion for choosing the main peak? I have seen quite a few WGS samples which, before mappability filtering, have their coverage peak at 0, and/or a median count significantly away from the main peak (due to the abundance of low-mappability bins with small counts).

Also, the second peak of chrX coverage in the XY samples you show above is not a large germline event -- it is simply a poorly mappable PAR-like region that borrows reads from chrY. Here's what the chrX coverage distribution looks like for an XY sample after mappability filtering (which removes most of the approximate homologies): [image: chrx]

The second spurious peak is gone, and the range of NB-like behavior is pretty much perfect. Without mappability filtering, all of the bins in the second mode will show up as CN = 2 events (in fact, if you look at gCNV calls on a typical XY sample, there are tons of CN = 2 calls).

Most, if not all, of the non-NB-like coverage before/after the main peak in your plots comes from reads in unmappable regions, many of which show up as real CNV events if we do not filter them (reads in these regions do not follow the coverage model, and we are at the mercy of BWA). I strongly believe Genome STRiP has achieved ~ 99% experimental validation accuracy because of aggressive filtering, not because of a superior model (it is an elementary Gaussian mixture). Garbage in, garbage out.

Anyhow, I am not at all comfortable with cutting a non-Beta release without taking care of:

  1. Mappability-based bin/read filtering (for WGS), and
  2. Trying out and evaluating a bait-based coverage collection (for WES), so that the raw coverage distribution is more NB-like to begin with.

These are both perfectly achievable goals before May 15. I'd be happy to leave things such as different coverage collection strategies (e.g. base-call coverage) and fragment-based per-sample GC content estimation for later; these are other areas where significant improvements will come from.

For the record -- I am working full steam on evaluations, as we discussed.

mbabadi commented 6 years ago

Just to clarify -- your proposed approach is certainly useful for filtering large germline events, etc., and must be part of the solution (for more robustness). What I'm saying here is that some elementary mappability filtering gives us great mileage, both in terms of our peace of mind in determining the hard-filtering region in the ploidy tool, and for reducing false calls.

samuelklee commented 6 years ago

Thanks for kicking those off! A few more comments:

Just to summarize so far---I am arguing that we probably don't need mappability filtering for ploidy calling, as naive bin filtering suffices to clean up the relevant histograms. And, as you showed, we can hopefully rely on per-bin bias modeling to at least partially account for mappability in gCNV calling (and we certainly wouldn't want to filter out a significant fraction of the genome, in any case). Do we agree?

To answer your first question: the criterion for choosing the peak is quite hacky at the moment, but I found that filtering low-count bins to first check for the presence of a high peak, and then falling back to the peak at zero, works perfectly fine in practice.
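Roughly along these lines (the thresholds below are illustrative placeholders, not the actual values):

```python
import numpy as np

def choose_main_peak(hist, min_count=10, presence_fraction=0.01):
    """hist[c] = number of bins with count c on a contig."""
    high = hist[min_count:]
    # Check for a high-count peak holding a minimal share of all bins;
    # otherwise, fall back to the peak at zero.
    if high.size and high.max() >= presence_fraction * hist.sum():
        return min_count + int(high.argmax())
    return 0
```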

We can certainly try to do something smarter, since, as you say, bin filtering may be desirable---even if we implement mappability filtering---to remove large germline events (it's true that the "example" I showed above is indeed from the PAR-like region on X, as you point out, but this is roughly how a large arm-level event would appear even after mappability filtering.) Although the model I fit above, which is simply a sparse mixture of NBs with regularly-spaced means (modulo some sample-specific and contig-specific jitter), could conceivably capture such events as well, we want to avoid models where a single NB might try to capture two or more peaks.

Also, just to clarify, the weird mosaic examples are the bottom two plots out of the four above---you can see the shifted (non-X, in one of the examples) single peaks. However, it's interesting that the PARs are still showing up in XY---I'm pretty sure I used the blacklist you provided, although I will double check. Did that only include the "official" PARs, or also the additional ones you found?

In any case, are we comfortable calling in those regions (here I'm talking about gCNV, not ploidy)? As I show above, I don't think we need mappability to nail the baseline ploidy. Can we then rely on the per-bin bias to account for these regions in gCNV (pinning them back to the correct CN) without mappability filtering? And with mappability filtering, how substantial is the hit to coverage in these regions? Should we blacklist them for the time being?

To summarize, I think the order of events I'd like to see is this:

  1. Cut an initial Beta release that incorporates CollectReadCounts, streamline evaluations for the AACR poster, do a bit of tuning, and establish a baseline. Hopefully the current ploidy calls suffice; if not, maybe issue a quick PR that implements the naive bin filtering (or whatever is necessary to get good ploidy calls). At the same time, get preliminary feedback from some users running on small test cohorts once we have some parameter recommendations.
  2. Do a round of method/model improvement. Start with quick and dirty fixes (e.g., blacklisting PARs) and work our way to more non-trivial changes. This will include many of the suggestions you have brought up, but we should also review user feedback and prioritize accordingly---they may find something that is not even on our radar. Demonstrate improvement (hopefully substantial!) over baseline, cut second Beta release.
  3. Run on larger cohorts, iron out remaining minor issues, and then productionize. By this time, @asmirnov239 will have hopefully made some progress on the PoN clustering front as well. When we are ready, then we will take gCNV out of Beta. With our current staffing situation, I do not expect this to happen before May 15, but I do enjoy pleasant surprises. :)
  4. Run on gnomAD, world domination, etc.

Again, getting an initial Beta release and some reasonable parameters to users is a high priority, so thanks for kicking off the evaluations, and thanks for your willingness to discuss our options. Let me know if you agree with the rest of the plan!