broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.71k stars 594 forks source link

GATK CNV on WGS. #2858

Closed droazen closed 6 years ago

droazen commented 7 years ago

@LeeTL1220 commented on Mon May 23 2016

Need to show some measure of performance for the WGS evaluation (GATK CNV only. Not ACNV)

samuelklee commented 7 years ago

Reviving this. This will essentially be a major refactor/rewrite of CreatePanelOfNormals to make it scalable enough to handle WGS.

samuelklee commented 7 years ago

@vruano may also have some insight on the PCOV vs RAW issue.

samuelklee commented 7 years ago

FYI @sooheelee I pointed you at the docs recently, but realized they're slightly out of date. CreatePanelOfNormals currently expects proportional coverage, upon which it initially performs a series of filtering steps. The docs state that these steps are performed on integer counts, which is incorrect. The fact that filtering yields different post-filter coverage for the two types of input ultimately results in slightly different denoising. Not a big deal, but we should decide what the actual method should be doing and clarify in the code/docs.

sooheelee commented 7 years ago

I'll keep an eye out for the "integer counts" passage in the docs and change this to reflect our recommendations.

samuelklee commented 7 years ago

To be specific, I'm referring to Sec. IIB in https://github.com/broadinstitute/gatk-protected/blob/master/docs/CNVs/CNV-methods.pdf.

samuelklee commented 7 years ago

Can someone explain this pseudoinverse business to me? We standardize the counts to give a matrix C, calculate the SVD to get the left-singular matrix U and the pseudoinverse of C, but then perform another SVD on U to get the pseudoinverse of U. Why do we need these pseudoinverses? @LeeTL1220 @davidbenjamin?

samuelklee commented 7 years ago

The memory footprint used by ReadCountCollectionUtils.parse is extremely large compared to the size of the file. I suspect there may be memory leaks in TableReader.

Speed is also an issue; pandas is 4x faster (taking ~10s on my desktop) on a 300bp-bin WGS read-count file with ~11.5 million rows if all columns are read, and 10x faster if the Target name field (which is pretty much useless) is ignored.

(Also, for some reason, the speed difference is much greater when running within my Ubuntu VM on my Broad laptop; what takes pandas ~15s to load takes ReadCountCollectionUtils.parse so long that I just end up killing it...not sure why this is?)

vruano commented 7 years ago

TableReader is a dumb-dumb one-record at a time reader so it shouldn't suffer from memory leaks.

In contrast the parser() method uses a incremental "buffer" that accumulates the counts until the end when the actual returned table is created... The reason for this is to keep the ReadCountsCollection class constant.

So at some point you need at least twice the amount of memory as compare to a solution that would simply use the returned object as the accumulator.

samuelklee commented 7 years ago

You might be right---and I think it's worse, in that the ReadCountCollection argument validation (for uniqueness of targets, etc.) also adds to the overhead.

vruano commented 7 years ago

By default we aim for correctness over efficiency... for what you're saying about the target being useless it sounds that perhaps you should consider to generalize the ReadCountCollection so that in some sub implementations targets are implicit based on their index in the collection and so for those tools that don't care about target-names this operation would go much faster.

So RCC could be an interface rather than a concrete class and some child class would implement the current RCC s behavior, whereas some new child class would implement a more efficient solution for WG- fix size interval collections.

vruano commented 7 years ago

As for TableReader.... we could make it a bit more efficient by reusing DataLine instances. Currently it creates one per each input line, but same instance could be reused loading each new line data onto it before calling createRecord.

We are delegating to a external library to parse the lines into String[] arrays (one element per cell) .... we could save on that by implementing it ourselves more efficiently but of course that would be take one of our some of his/her precious development time...

In any case I don't know what the gain would be considering that these operations are done close to I/O that typically should be dominating the time-cost.

The DataLine reuse may save some memory churning and wouldn't take long to code.

samuelklee commented 7 years ago

First cut at a rewrite seems to be working fine and is much, much leaner.

Building a small PoN with 4 simulated normals with 5M bins each, CombineReadCounts took ~1 min, CreatePanelOfNormals (with no QC) took ~4.5 minutes (although ~1 minute of this is writing target weights, which I haven't added to the new version yet) and generated a 2.7GB PoN, and NormalizeSomaticReadCounts took ~8 minutes (~7.5 minutes of which was spent composing/writing results, thanks to overhead from ReadCountCollection).

In comparison, the new CreateReadCountPanelOfNormals took ~1 minute (which includes combining read-count files, which takes ~30s of I/O) and generated a 520MB PoN, and DenoiseReadCounts took ~30s (~10s of which was composing/writing results, as we are still forced to generate two ReadCountCollections).

Resulting PTN and TN copy ratios were identical down to 1E-16 levels. Differences are only due to removing the unnecessary pseudoinverse computation. Results after filtering and before SVD are identical, despite the code being rewritten from scratch to be more memory efficient (e.g., filtering is performed in place)---phew!

If we stored read counts as HDF5 instead of as plain text, this would make things much faster. Perhaps it would be best to make TSV an optional output of the new coverage collection tool. As a bonus, it would then only take a little bit more code to allow previous PoNs to be provided via -I as an additional source of read counts.

Remaining TODO's:

Evaluation:

samuelklee commented 7 years ago

I created a panel of normals from 90 WGS TCGA samples with 250bp (~11.5M) bins, which took ~57 minutes total and produced an 11GB PoN (this file includes all of the input read counts---which take up 20GB as a combined TSV file and a whopping 63GB as individual TSV files---as well as the eigenvectors, filtering results, etc.). The run can be found in /dsde/working/slee/wgs-pon-test/tieout/no-gc. It completed successfully with -Xmx32G (in comparison, CreatePanelOfNormals crashed after 40 minutes with -Xmx128G). The runtime breakdown was as follows:

Clearly making I/O faster and more space efficient is the highest priority. Luckily it's also low hanging fruit. The 8643028 x 30 matrix of eigenvectors takes <2 minutes to read from HDF5 when the WGS PoN is used in DenoiseReadCounts, which gives us a rough idea of how long it should take to read in the original ~11.5M x 90 counts from HDF5. So once #3349 is in, then I think that a ~15 minute single-core WGS PoN could easily be viable.

I believe that a PoN on the order of this size will be all that is required for WGS denoising, if it is not already overkill. To go bigger by more than an order of magnitude, we'll have to go out of core, which will require more substantial changes to the code. But since the real culprit responsible for hypersegmentation is CBS, rather than insufficient denoising, I'd rather focus on finding a viable segmentation alternative.

samuelklee commented 7 years ago

With HDF5 read-count file input enabled by #3349, a WGS PoN with 39 samples x ~1M bins (i.e., 3kb) took <1 minute to build. Thanks @LeeTL1220!

LeeTL1220 commented 7 years ago

No problem. What was original timing, roughly?

On Jul 27, 2017 12:26 PM, "samuelklee" notifications@github.com wrote:

With HDF5 read-count file input enabled by #3349 https://github.com/broadinstitute/gatk/issues/3349, a WGS PoN with 39 samples x ~1M bins (i.e., 3kb) took <1 minute to build. Thanks @LeeTL1220 https://github.com/leetl1220!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gatk/issues/2858#issuecomment-318414214, or mute the thread https://github.com/notifications/unsubscribe-auth/ACDXkzOHyLu0ARxwy8dOWqgJSWCO7St5ks5sSLo0gaJpZM4NwWt_ .

LeeTL1220 commented 7 years ago

Sorry did not see previous comment

samuelklee commented 7 years ago

Note the number of samples and bin size are different from that comment.

LeeTL1220 commented 7 years ago

@samuelklee I certainly agree about focusing on CBS alternative.

samuelklee commented 7 years ago

I've implemented the Gaussian-kernel binary-segmentation algorithm from this paper: https://hal.inria.fr/hal-01413230/document This method uses a low-rank approximation to the kernel to obtain an approximate segmentation in linear complexity in time and space. In practice, performance is actually quite impressive!

The implementation is relatively straightforward, clocking in at ~100 lines of python. Time complexity is O(log(maximum number of segments) number of data points) and space complexity is O(number of data points dimension of the kernel approximation), which makes use for WGS feasible.

Segmentation of 10^6 simulated points with 100 segments takes about a minute and tends to recover segments accurately. Compare this with CBS, where segmentation of a WGS sample with ~700k points takes ~10 minutes---and note that these ~700k points are split up amongst ~20 chromosomes to start!

There are a small number of parameters that can affect the segmentation, but we can probably find good defaults in practice. What's also nice is that this method can find changepoints in moments of the distribution other than the mean, which means that it can straightforwardly be used for alternate-allele fraction segmentation. For example, all segments were recovered in the following simulated multimodal data, even though all of the segments have zero mean:

baf

Replacing the SNP segmentation in ACNV (which performs expensive maximum-likelihood estimation of the allele-fraction model) with this method would give a significant speedup there.

Joint segmentation is straightforward and is simply given by addition of the kernels. However, complete data is still required.

Given such a fast heuristic, I'm more amenable to augmenting this method with additional heuristics to clean up or improve the segmentation if necessary. We can also use it to initialize our more sophisticated HMM models, as well.

@LeeTL1220 @mbabadi @davidbenjamin I'd be interested to hear your thoughts, if you have any.

LeeTL1220 commented 7 years ago

@samuelklee This looks really good and the paper is a great find. I am unclear on a couple of things.

mbabadi commented 7 years ago

@samuelklee this looks awesome!

samuelklee commented 7 years ago

How accurate was CBS on the 700k points? 100 ground-truth segments became how many in CBS?

The 700k came from a real WGS sample (as opposed to the simulated data with 100 segments and 1M points), so there is no ground truth there. I was just trying to get an idea of real-world runtime for CBS.

When I run CBS on the simulated data, the results are pretty much the same (both miss a changepoint where the adjacent segment means are close by chance; the runtime is also ~1.2 minutes in both cases):

CBS: 1m-cbs

Kernel segmentation 1m

However, compare how CBS does on the simulated zero-mean multimodal data (terrible, as expected!): baf-cbs

I'm a little unclear on how we would derive a segment mean for this approach. If I am missing something obvious, then I'd ask: How were the "segment means" in this approach vs. CBS?

I'm pretty sure that the segment means in CBS are literally just that. So if the breakpoints are good, then the segment means are the same.

Would it be easier to make a version that only addressed total copy ratio segments (GATK CNV) and then implement joint segmentation later?

Sure, but joint segmentation is much cheaper and easier with this method than our previous probabilistic approaches. Even SNP segmentation will be much cheaper.

What is the name of this approach? "KernSeg"?

Not sure...I couldn't find an R package, although an R/C implementation is mentioned in the paper. But the python implementation is straightforward and a pure Java implementation should not be so bad. There are some cythonized numpy methods that my python implementation used, but I think equivalent implementations of these methods should be relatively fast in pure Java as well.

What variant of the algorithm did you implement? the paper lists several.

I implemented what they call ApproxKSeg. It's an approximate version that combines binary segmentation with the low-rank approximation to the Gaussian kernel.

I haven't read the paper in detail yet, but is it possible to choose a conservatively large number of possible break points and then filter bad break points, possibly based on the rapid decline of the change point probability? i.e. does the algorithm naturally produce change point probabilities?

Yes, you can oversegment and then choose which breakpoints to retain. However, there are no proper changepoint probabilities, only changepoint costs. Adding a penalty term based on the number of changepoints seems to perform relatively well in simple tests, but one could certainly devise other ways to filter changepoints (some of which could yield probabilities, if you are willing to assume a probabilistic model). I think we should just think of this as a fast, heuristic, non-parametric method for finding breakpoints in multidimensional data.

Is it possible to throw in additional change points incrementally, without doing extra work, until a certain criterion is met? (see above)

The version I implemented adds changepoints via binary segmentation. The time complexity required to split a segment is linear in the number of points contained in the segment, although some care must be taken in the implementation to ensure this.

samuelklee commented 7 years ago

I naively cythonized part of my python implementation to use memoryviews, resulting in a ~60x speedup!

The simulated data with 100 segments and 1M points ran in 1.3 seconds!!

THIS IS BANANAS!!!

mbabadi commented 7 years ago

WOW! this is unacceptably too fast!

samuelklee commented 7 years ago

After more experimentation, one issue I was running into with the ApproxKernSeg method was failure on small and "epidemic" events. This is because 1) the segment cost function used in that paper is extensive (growing with the number of points in a segment), and 2) binary segmentation is a global, greedy algorithm. These both cause long events to be preferred over short events, and thus the first changepoints found (and retained after applying the penalty) may not include those for small, obvious events. For example, see performance on this simulated data, which includes events of size 10, 20, 30, and 40 within 100,000 points at S/N ratio 3:1 in addition to sine waves of various frequency at S/N ratio 1:2 (to roughly simulate GC waves). Changepoints arising from the sine waves will be found first, since these give rise to longer segments:

wave-kern-no-local

CBS similarly finds many false positive breakpoints:

wave-cbs

However, when we tune down the sine waves to 1:10, ApproxKernSeg still gets tripped up, but CBS looks better:

wave-kern-no-local-small-waves wave-cbs-small-waves

To improve ApproxKernSeg, we can 1) make the cost function intensive, by simply dividing by the number of points in a segment, and 2) add to the cost function a local term, given by the cost of making each point a changepoint within a local window of a determined size. This local term was inspired by methods such as SaRa (http://c2s2.yale.edu/software/sara/). The reasoning is that with events at higher S/N ratio, we typically don't need to perform a global test to see whether any given point is a suitable changepoint; using the data locally surrounding the point typically suffices.

With these modifications, ApproxKernSeg can handle both scenarios: wave-kern wave-kern-small-waves

This local window approach is still linear in time, so runtime is still ~1s for the above (about ~10x faster than CBS).

One issue still remains, which is that even this improved approach tends to find directly adjacent possible changepoints around a true changepoint before moving on to another true changepoint. We can probably clean this up with some simple postprocessing.

samuelklee commented 7 years ago

I've explored some more refinements to the method and have settled on the following procedure:

Assume we have N data points: 1 To find segments, we: 1) Select Cmax, the maximum number of changepoints to discover. In practice, Cmax = 100 per chromosome should more than suffice. 2) Select a kernel (linear for sensitivity to changes in the distribution mean, Gaussian with a specified variance σ2 for multimodal data, etc.) and a subsample of p points to approximate it using SVD. 3) Select window sizes wj for which to compute local costs at each point. To be precise, we compute the cost of a changepoint at the point with index i, assuming adjacent segments containing the points with indices [i - wj + 1, i] and [i + 1, i + wj]. Selecting a minimum window size and then doubling up to relevant length scales (noting that longer window lengths allow for more subtle changepoints to be detected) works well in practice. For example, here are what the cost functions look like for window sizes of 8, 16, 32, and 64:

2

3

4

5 4) For each of these cost functions, find (up to) the Cmax most significant local minima. The problem of finding local minima of a noisy function can be solved by using topological persistence (e.g., https://people.mpi-inf.mpg.de/~weinkauf/notes/persistence1d.html and http://www2.iap.fr/users/sousbie/web/html/indexd3dd.html?post/Persistence-and-simplification). A straightforward watershed algorithm can sort all local minima by persistence in linear time after an initial sort of the data. 5) These sets of local minima from all window sizes together provide the pool of candidate changepoints (some of which may overlap exactly or approximately). We perform backwards selection using the global segmentation cost. That is, we compute the global segmentation cost given all the candidate changepoints, calculate the cost change for removing each of the changepoints individually, remove the changepoint with the minimum cost change, and repeat. This gives the global cost as a function of the number of changepoints C. 6) Add a penalty a C + b C log(N / C) to the global cost and find the minimum to determine the number of changepoints. For the above simulated data, a = 2 and b = 2 works well, recovering all of the changepoints in the above example with no false positives: 6

7

In contrast, CBS produces two false positives (around the third and seventh of the true changepoints):

8

We can change the penalty factor to smooth out less significant segments (which may be due to systematic noise, GC waves, etc.). Setting a = 10, b = 10 gives: 9 10

(Note that the DNAcopy implementation of CBS does not allow for such simple control of the "false-positive rate," as even setting the relevant p-value thresholds to zero still yields segments.)

Although the above procedure has a number of parameters that need to be chosen, in practice they are all straightforward and relatively easy to understand. Being a combination of local and global methods, it allows for multiscale sensitivity to small events while still allowing for sensible control of the final number of segments via the BIC-like penalty on the global cost. All algorithms used are linear complexity and are straightforward to implement.

samuelklee commented 7 years ago

@LeeTL1220 I have a fast python implementation of the above. It'll take a little bit of additional code to make it output segment files. I can add that and start running some validation data, or I can just go ahead and start coding up the Java implementation, depending on how long you think it'll take to put together some validation runs up through DenoiseReadCounts.

Do we want to improve the ReCapSegCaller behind CallSegments while we're at it? @davidbenjamin perhaps you can briefly remind me of the idea behind your initial caller and of the issues it had.

davidbenjamin commented 7 years ago

@samuelklee The idea of the caller was just to be similar to ReCapSeg. I would support throwing it out in favor of anything you want.

samuelklee commented 7 years ago

@davidbenjamin I thought you had implemented something a little more sophisticated initially, but then reverted to the ReCapSeg caller for some reason?

Anything that is relatively simple to implement yet sufficiently more principled than the ReCapSeg caller would be reasonable for this rewrite. Thought you might've had something that fit the bill originally, but maybe I'm remembering wrong. If so, then we can try leaving it as is for now.

sooheelee commented 7 years ago

This all looks rather exciting. Especially the bananas bit.

samuelklee commented 7 years ago

Java implementation of segmentation is now in the sl_wgs_segmentation dev branch, with a few simple unit tests. I'll expand on these and add tests for denoising in the future, but for now we have a working revised pipeline up through segmentation.

The CLI is simply named ModelSegments (since my thinking is that it could eventually replace ACNV). I ran it on some old denoised exomes. Runtime is <10s, comparable to CBS. Here's a particularly noisy exome:

CBS found 1398 segments: cbs

Kernel segmentation with a penalty given by a = 1, b = 0 found 1018 segments: kern

Kernel segmentation with a penalty given by a = b = 1 (which is probably a reasonable default penalty, at least based on asymptotic theoretical arguments) reduced this to 270 segments : kern-smooth

The number of segments can similarly be controlled in WGS. WGS runtime is ~7min for 250bp bins, ~30s of which is TSV reading, and there is one more spot in my implementation that could stand a bit of optimization, which might bring the runtime down. In contrast, I kicked off CBS 45 minutes ago, and it's still running...

@LeeTL1220 this is probably ready to hand off to you for some WDL writing and preliminary evaluation. Although I can't guarantee that there aren't bugs, I ran about ~80 exomes with no problem. We can talk later today.

samuelklee commented 7 years ago

The WGS 250bp CBS run just finished after ~65 min. It found 9201 segments in a normal sample!

In contrast, kernel segmentation found 476, 264, and 74 segments for a = b = 1, a = b = 2, and a = b = 10, respectively, each run taking <7 min.

samuelklee commented 7 years ago

Bringing the panel, pair, and copy-ratio WDLs up to date. Made some changes to how defaults are represented and some other stylistic changes; we'll need to bring the germline and allele-fraction WDLs in line at at some point.

samuelklee commented 7 years ago

Somatic WDLs should be up to date (they're passing on Travis, at least) and ready for running validations in the sl_wgs_segmentation branch @LeeTL1220.

samuelklee commented 7 years ago

Whoops, just a heads up @LeeTL1220, I tried to change how task-level defaults are expressed in the somatic WDLs, but I'm not sure if this works properly. For example, if I have Int? mem = 4 defined in a task named SomeTask in SomeWorkflow, it appears that setting SomeWorkflow.SomeTask.mem in the json doesn't actually change this value.

Not sure if this is intended Cromwell/WDL behavior, but I can change things back to the way they were tomorrow. I just found the ${default="4" mem} expressions everywhere to be a bit messy...

(EDIT: I think this is consistent with https://github.com/broadinstitute/cromwell/issues/2446.)

samuelklee commented 7 years ago

@LeeTL1220 I've restored the defaults.

samuelklee commented 7 years ago

The new pipeline is in a complete state. Nearly all tools and scripts were rewritten, many from scratch. I've tried to minimize interaction with old tools/exome code (notably, ReadCountCollection and SegmentUtils). There are still some improvements that can be made (especially in segment union and the subsequent modeling), but we should be ready to go for a first validation. Some notes:

WDL:

Denoising:

Segmentation/modeling:

Calling:

Plotting:

Other:

Documentation:

sooheelee commented 7 years ago

Yes, I'd like to test out the new tools in the workflow. All of these changes seem major and I'm excited to see how the workflow performs. I'll stop by to chat with you.

samuelklee commented 7 years ago

The HCC1143 WES sample from the tutorial data runs in < 1.5 minutes and looks pretty good!

Copy-ratio only (using a PoN with 40 samples): hcc1143_t_clean no-ac modeled

Matched-normal mode, allele-fraction only (note that there are not very many hets in this data, since the common sites list we used was relatively small): hcc1143_t_clean matched-normal no-cr modeled

Matched-normal mode (copy ratio + allele fraction): hcc1143_t_clean matched-normal modeled

sooheelee commented 7 years ago

Thanks @samuelklee!

samuelklee commented 7 years ago

TCGA WGS at 250bp (~8.5M copy ratios after filtering during the denoising step from ~11.5M read counts, ~840k hets after filtering on count totals <30 and homozygosity from ~5.5M allelic counts) took ~27 minutes to run in matched-normal mode. Results look very reasonable. You can see the result at /dsde/working/slee/CNV-1.5_TCGA_WGS/TCGA-05-4389-01A-01D-1931-08.matched-normal.modeled.png

sooheelee commented 7 years ago

Here's a first pass draft of the figure for the ASHG poster. For the good/bad PoN illustration at bottom, I use figures I had originally generated with what is now the older workflow.


somatic_cnv

samuelklee commented 6 years ago

Just noticed an updated version of the kernel segmentation paper: https://hal.inria.fr/hal-01413230v2/document The original version referenced in this issue can be found at: https://hal.inria.fr/hal-01413230v1/document

A couple of noteworthy changes include the form of the penalty function and the fact that they use an estimate of the variance to normalize the data and set the kernel variance to unity (they also symmetrize the BAF, which I guess is required for the latter). I don't think either change will make a huge difference to the final segmentation.