broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Consider adding CollectReadCounts and further evaluate count-collection strategies. #4519

samuelklee closed this issue 6 years ago

samuelklee commented 6 years ago

In the process of unifying CalculateTargetCoverage / SparkGenomeReadCounts for the rewrite of the CNV pipeline, we decided to experiment with switching over to fragment-based counts due to a request from CGA. For each fragment, CollectFragmentCounts adds a count to the bin that overlaps the fragment center. We filter to properly-paired, first-of-pair reads in order to have well-formed fragments and avoid double counting. We also filter out duplicates.

In contrast, CalculateTargetCoverage added a count to all bins that overlapped with a read and SparkGenomeReadCounts added a count to the bin that contained the read start. These tools kept duplicates.
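To make the strategies concrete, here is a minimal Python sketch (not the actual tool code, which is Java) of the three bin-assignment rules, assuming fixed-width bins on a single contig and 0-based, half-open coordinates:

```python
# Illustrative sketch only; names and coordinate conventions are assumptions.
BIN_SIZE = 100

def bin_index(pos):
    """0-based index of the fixed-width bin containing a 0-based position."""
    return pos // BIN_SIZE

def fragment_center_bins(frag_start, frag_end):
    """CollectFragmentCounts-style: one count, in the bin overlapping the fragment center."""
    center = (frag_start + frag_end) // 2
    return [bin_index(center)]

def read_start_bins(read_start):
    """CollectReadCounts / SparkGenomeReadCounts-style: one count, in the bin containing the read start."""
    return [bin_index(read_start)]

def overlap_bins(start, end):
    """CalculateTargetCoverage-style: one count in every bin the interval overlaps."""
    return list(range(bin_index(start), bin_index(end - 1) + 1))
```

Note that the overlap rule is the only one that can add more than one count per fragment or read, which is the source of the bin-to-bin correlations discussed later in this thread.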

However, none of these collection strategies have been rigorously evaluated.

Using a small set of WGS SV tandem-duplication calls from @mwalker174 as a truth set, I did some experimenting with changing the count-collection strategy. (We initially thought we were missing some of these simply due to over-denoising/filtering by the PoN, but as we'll see below, the count-collection strategy plays a non-trivial role.)

Subsetting to chr3, I built a small PoN of 12 normals (including the case normal) at 100bp and denoised using bin medians only (i.e., --number-of-eigensamples 0) to avoid denoising away common events. In chr3, the case sample had three events:

chr3    8559423     8560126
chr3    64547471    64549936
chr3    90414457    90415989
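For reference, denoising with bin medians only (--number-of-eigensamples 0) amounts to standardizing each bin by its PoN median. A rough sketch, with normalization details simplified relative to the actual tool:

```python
import numpy as np

# Hedged sketch of median-only denoising: copy ratio per bin =
# (normalized case count) / (PoN bin median), reported in log2.
# Array shapes and normalization are illustrative, not GATK's exact math.
def denoise_with_bin_medians(case_counts, pon_counts):
    case = np.asarray(case_counts, dtype=float)
    pon = np.asarray(pon_counts, dtype=float)         # shape: (samples, bins)
    case = case / case.sum()                          # normalize to fractional coverage
    pon = pon / pon.sum(axis=1, keepdims=True)
    medians = np.median(pon, axis=0)                  # per-bin PoN medians
    return np.log2(case / medians)
```

With zero eigensamples, no systematic variation is projected out, so common germline events in the PoN samples are not denoised away.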

I tried the following, running ModelSegments using fairly sensitive parameters (--number-of-changepoints-penalty-factor 0.1 --maximum-number-of-segments-per-chromosome 10000 --window-size 16 --window-size 32 --maximum-number-of-smoothing-iterations 0) in copy-ratio-only mode:

1. CollectFragmentCounts. This only recovered event 2.
2. CollectReadCounts: same as CollectFragmentCounts, but removing the properly-paired and first-of-pair filters and adding a count for each read to the bin containing its start. This recovered all 3 events.
3. CollectFragmentOverlaps: same filters as CollectFragmentCounts, but adding counts to all bins overlapping each fragment. Note that we need to implement a filter on maximum fragment length; otherwise, we get some strange artifacts from (incorrectly mapped?) extremely long fragments. I arbitrarily chose a cutoff of 10000bp. This recovered events 1 and 2.
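The maximum-fragment-length filter mentioned in (3) could be as simple as a cutoff on the BAM template length. A sketch, using the (arbitrary) 10000bp cutoff from the experiment above:

```python
# Illustrative filter, not the tool's implementation. The cutoff of 10000bp
# was chosen arbitrarily in the experiment described above.
MAX_FRAGMENT_LENGTH = 10_000

def passes_fragment_length_filter(template_length):
    """template_length is the BAM TLEN field; negative for the rightmost mate,
    zero when unavailable. Reject unset or anomalously long fragments."""
    return 0 < abs(template_length) <= MAX_FRAGMENT_LENGTH
```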

Event 3 seemed to be the most difficult to recover. Plotting the copy ratios surrounding this event (which spans ~15 100bp bins) yields some insights:

CollectFragmentCounts: image

CollectReadCounts: image

CollectFragmentOverlaps: image

The increased statistical noise in the CollectFragmentCounts result (due to the lower overall count because of the pairing of reads) probably causes us to miss this event. Also, although CollectFragmentOverlaps initially looks pretty good, I think the bin-to-bin correlations that are evident here negatively affect segmentation.
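The noise argument is just Poisson counting statistics: the coefficient of variation of a count with mean N is 1/sqrt(N), so halving the per-bin count (fragments vs. individual reads) inflates the relative noise by sqrt(2). A back-of-envelope check, with an illustrative depth:

```python
import math

# For Poisson counts with mean N, the coefficient of variation is 1/sqrt(N).
def coefficient_of_variation(mean_count):
    return 1.0 / math.sqrt(mean_count)

reads_per_bin = 30                      # illustrative WGS depth at 100bp bins
fragments_per_bin = reads_per_bin / 2   # pairing roughly halves the count

# Relative-noise inflation from switching to fragment-based counts:
noise_ratio = coefficient_of_variation(fragments_per_bin) / coefficient_of_variation(reads_per_bin)
```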

This is not an extremely rigorous evaluation, but it suggests that we should consider switching over to a CollectReadCounts-like strategy. To appease CGA (in case they still feel strongly, which they might not), we can make this a separate tool, or perhaps just have a single tool called CollectCounts that can toggle between the two (we just have to be careful about filters in the latter case). We should evaluate further once more rigorous automatic evaluations are in place.

@mbabadi @LeeTL1220 @asmirnov239 @sooheelee might find this of interest. Also, I have a question for the engine team (@droazen): is there a read filter I should be using for fragment length, and if not, can we add one (or is there already a preferred way to do this type of filtering)?

mbabadi commented 6 years ago

Thank you for looking into this and sharing the analysis. I have always believed there is a lot to be gained from a good coverage-collection strategy. For example, I have seen Genome STRiP cleanly resolve cases that are essentially impossible to resolve from our raw data. Perhaps we should:

(1) Include genome mappability analysis tracks as a filtering strategy, with or without MQ-based filtering. We can download fairly accurate mappability data, based on noisy Illumina-like paired-end reads, from https://sourceforge.net/p/gma-bio/wiki/Home/. They have a decent publication too.
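A mappability-based filter could be as simple as dropping bins whose score falls below a threshold. A hypothetical sketch (the threshold, track representation, and function name are assumptions, not existing tool behavior):

```python
# Hypothetical bin filter using a mappability track (e.g. per-bin GMS scores).
# The 0.9 default threshold is an illustrative assumption.
def filter_bins_by_mappability(bins, mappability_by_bin, min_score=0.9):
    """bins: list of (contig, start, end) tuples;
    mappability_by_bin: parallel list of per-bin mappability scores in [0, 1]."""
    return [b for b, score in zip(bins, mappability_by_bin) if score >= min_score]
```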

(2) While a simple fragment-based coverage collection has major pitfalls, I am not quite convinced that one must throw away fragment information altogether. By theoretically considering various SV events (tandem duplication, dispersed duplication, deletion, inversion, inter- and intra-contig translocation, etc.) and studying, case by case, how paired-end reads coming from various parts of such SVs would align to the reference, we can come up with a heuristic counting strategy that gives the most consistent signal for downstream tools. This analysis requires taking into account basic summary statistics, such as the read- and fragment-length distributions, in order to resolve anomalous fragments to putative SV events. I have worked out a few cases, and this is fairly doable.

mbabadi commented 6 years ago

(1) I am studying GMS mappability scores. To the best of my knowledge, it is the only such analysis that considers both paired-end reads and base calling error rate characteristic of Illumina machines. We could feed the GMS score as a feature file to the coverage collector tool for filtering.

(2) I am also working on the "optimal strategy" for different SV types.

(3) @samuelklee, do we get the same wavy pattern in other samples in the same region? In other words, is it sample-specific or region-specific?

(4) While fragment-based GC correction is difficult (and probably unnecessary) to perform without keeping a full index of aligned reads (like GS does), it might be worthwhile to at least collect per-sample, per-interval fragment-based average GC content (perhaps along with other summaries such as average fragment length, MQ, etc.). It is easy to show that the difference between full fragment-based GC correction and correction using only the observed average fragment GC content for the pile-up is of the order of the curvature of the GC curve, which is presumably small.
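The argument in (4) is a Taylor expansion: for a GC-bias curve f, E[f(gc)] - f(mean gc) = f''(mean gc) Var(gc) / 2 plus higher-order terms, so correcting with the average fragment GC is off only by a curvature-sized term. A small numerical demonstration, with a made-up quadratic bias curve (for which the second-order relation is exact):

```python
import numpy as np

# Hypothetical quadratic GC-bias curve, peaked at 50% GC; for illustration only.
def gc_bias(g):
    return 1.0 - 2.0 * (g - 0.5) ** 2      # f''(g) = -4 everywhere

rng = np.random.default_rng(0)
gc = rng.normal(0.45, 0.05, size=100_000)  # simulated per-fragment GC content

full_correction = gc_bias(gc).mean()       # average of per-fragment corrections
average_gc_correction = gc_bias(gc.mean()) # correction from the pile-up's mean GC
curvature_term = 0.5 * (-4.0) * gc.var()   # f''/2 * Var(gc): the predicted gap
```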

We could collect these statistics either on the fly during coverage collection or from the sparse counts table, as you suggested before (the most sensible approach, once we figure out a way to represent sparse tensors).

samuelklee commented 6 years ago

After discussion with @mwalker174 @mbabadi @SHuang-Broad I think we can stick with the CollectReadCounts strategy for now, perhaps treating soft clips a little more carefully as @mwalker174 suggested. I am currently re-running the CHM-SV cohort at 250bp with CollectReadCounts and will also run a version with a tweak for soft clips so we can compare.
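One simple way to treat soft clips more carefully, as a hypothetical sketch (not what the tool currently does): shift the counted start position left by any leading soft clip parsed from the CIGAR, so that clipped bases at breakpoints influence which bin receives the count.

```python
import re

# Simplified CIGAR parsing; operator letters per the SAM specification.
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def leading_soft_clip(cigar):
    """Length of a soft clip at the start of the alignment, else 0."""
    ops = _CIGAR_RE.findall(cigar)
    if ops and ops[0][1] == "S":
        return int(ops[0][0])
    return 0

def clipped_start(aligned_start, cigar):
    """Counted start position, shifted left to include leading soft-clipped bases."""
    return aligned_start - leading_soft_clip(cigar)
```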

@mbabadi Can we also look at some of the GS events you mentioned? It might also be good to look at some examples of regions with low mappability to see if we can identify any common undesired behavior.

samuelklee commented 6 years ago

IGV screenshot of the above event, sorted by start position, viewed as pairs, no MQ threshold (note the genomic range is not exactly the same as the plots above): igv_panel

SHuang-Broad commented 6 years ago

I'm not sure that the whole range indicated by the outer boundaries of the outtie pairs is duplicated.

I suspect it is the "left region" that is duplicated and inserted at the "right region": when looking at the coverage in the region that the outties appear to link, it doesn't seem to be elevated. In addition, there don't appear to be any clipped reads in the "right region", as there are in the "left region".

Assuming this is indeed what happened at this site (it looks like the duplicated region would be about 200-300 bp long), I'm not sure how this relates to your experiment and decision.

SHuang-Broad commented 6 years ago

A PacBio call set would be good for verifying whether my suspicion is correct. @samuelklee @mwalker174

mwalker174 commented 6 years ago

Yes, this is one of those borderline cases, but there is PacBio support for this call, and it is present in the father. Also, there are assembled breakpoints at either end and no large innie fragments that would be consistent with a dispersed duplication.


samuelklee commented 6 years ago

Hmm, looks like we lose events 1 and 3 with CollectReadCounts at 250bp using analogous ModelSegments parameters. However, I experimented with tweaking the segmentation to work on the copy ratios (rather than the log2 copy ratios), which seems to recover them.
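For reference, the idealized segment means that the segmenter must separate on each scale (diploid case; real data are noisier and shifted by purity and ploidy):

```python
import math

# Idealized copy-ratio means for simple diploid events, and their log2
# counterparts. Event names are illustrative labels, not tool output.
events = {
    "homozygous deletion": 0.01,      # small positive value to keep log2 finite
    "heterozygous deletion": 0.5,
    "neutral": 1.0,
    "heterozygous duplication": 1.5,
    "homozygous duplication": 2.0,
}
log2_events = {name: math.log2(cr) for name, cr in events.items()}
```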

Although one of the goals of having evaluations backed by SV truth sets is to tune such parameters/methods, I'm beginning to think that SV integration might benefit from using the CNV tools in a more customized pipeline, especially if the goal is maximizing sensitivity at resolutions of ~100bp jointly with breakpoint evidence. For example, you might imagine a tool that directly uses CNV backend code to collect coverage over regions specified by -L, builds a PoN, denoises, and segments on the fly. Or we can put together a custom WDL optimized for sensitivity. Let's discuss in person?