ldgauthier opened 6 years ago:

After extensive QC, Ryan Collins with the Talkowski lab has a set of ~20 samples that he believes to have sex chromosome genotypes that are not XX or XY. It would be great to run our tool on them and see what it predicts.

Normals could probably come from any of the same projects: G100345, G68758, G81032, G94818, etc. Case data has already been copied to gs://broad-dsde-methods/testdata/aneuploidy_samples/
@ldgauthier Excellent, these will definitely be a good test set.
The "truth data" is here: /humgen/gsa-hpprojects/dev/gauthier/gCNVploidyTests/sexChrAneuploidyCalls.txt Pretty sure the ID is `
OK, the first test run I tried was with 1kb bins and no additional normals. Coverage takes about an hour to collect per BAM and ploidy inference takes about 10 minutes. A few things:
1) Looks like we are concordant with the truth CN on X for all but 3/40 of the samples. The GQs for these discordant calls are low (~3, 23, and 25 compared with ~400 for most of the others).
2) However, we are striking out on over half of the samples on Y. We mostly call 1 copy when the truth calls 0. Mehrtash thinks this is because a) I didn't mask out any PARs or otherwise troublesome regions on Y and b) I didn't include any other normals. I'll try rerunning with a mask first, then with other normals, and then with both. Hopefully this should clear up with just the mask.
3) There are a few samples where we strike out because the truth calls 2 copies on Y and we call 1. Mehrtash pointed out that this is most likely because the prior table we put together assumes Y can have at most 1 copy. So hopefully these are trivially recovered once we relax this.
4) The GQs are weirdly high on 1, X, and Y compared to the rest of the autosomes. @ldgauthier any idea why this might be? If there's no reason, then something funny is going on within the tool.
I haven't gotten a chance to plot any of the counts data yet, either, which may make things more obvious. I'll do this today.
I could probably make up something hand-wavy about X and Y, but I have no clue about chr1.
Still rerunning, but had a chance to plot up some basic summary statistics of the counts.
1) On the 3/40 that disagree on X, we call CN = 2 and the truth calls CN = 1. These samples all have a mean count on X that is ~75% of that on the autosomes (i.e., roughly halfway between the ~50% expected for CN = 1 and the 100% expected for CN = 2). Most likely some mosaicism?
2) A significant fraction of the samples look to be expressing some sort of mosaicism on Y as well, which is probably throwing off the model (along with the misspecified priors).
Still not sure what is up with GQs on chr1; @mbabadi will look into it.
@ldgauthier What would you like to do with mosaic samples in general? Ideally the GQ would reflect that something funny is going on, as it does for the calls above on X. However, you could imagine a tool that estimates the level of mosaicism, etc. (Eventually, such samples would just be relatively easy cases for the full-blown tumor-heterogeneity model.)
CRAMs for 20 normals are in gs://broad-dsde-methods/testdata/aneuploidy_samples_panel. These are all from the same project (G100345), so it will be interesting to see if there are any batch effects that show up in the GQs.
Built a PoN with those normals. Some tuning of parameters was required to get reasonable results. After discussion with @mbabadi, we decided that the current model has a little too much freedom and can probably be made simpler (negative binomial -> Poisson). This should result in more robust results and decrease the amount of tuning needed.
Also note that normal sample 8007540251 has something going on in chr12.
Just some notes before I forget:
Using these test samples, I made some tweaks to the ploidy model that made it less prone to incorrect ploidy calls, and I added simple modeling of mosaicism:
# (Context: Gamma, Uniform, Beta, HalfNormal, and DensityDist are pymc3
# distributions; tt is theano.tensor; ploidy_workspace, t_j, q_ploidy_sjk,
# n_sj, logsumexp, and commons.negative_binomial_logp come from the
# surrounding gcnvkernel code.)

# per-contig bias
bias_j = Gamma('bias_j',
               alpha=100.0,
               beta=100.0,
               shape=(ploidy_workspace.num_contigs,))
norm_bias_j = bias_j / tt.mean(bias_j)

# per-sample depth
depth_s = Uniform('depth_s',
                  lower=0.0,
                  upper=10000.0,
                  shape=(ploidy_workspace.num_samples,))

# per-sample probability of mosaicism
pi_mosaicism_s = Beta(name='pi_mosaicism_s',
                      alpha=1.0,
                      beta=50.0,
                      shape=(ploidy_workspace.num_samples,))

# per-sample-and-contig mosaicism factor, normalized by the per-sample max
f_mosaicism_sj = Beta(name='f_mosaicism_sj',
                      alpha=10.0,
                      beta=1.0,
                      shape=(ploidy_workspace.num_samples, ploidy_workspace.num_contigs,))
norm_f_mosaicism_sj = f_mosaicism_sj / tt.max(f_mosaicism_sj, axis=1).dimshuffle(0, 'x')

# per-contig mapping error
eps_j = HalfNormal('eps_j', sd=0.01, shape=(ploidy_workspace.num_contigs,))

# negative-binomial means for the non-mosaic and mosaic components
mu_sjk = depth_s.dimshuffle(0, 'x', 'x') * t_j.dimshuffle('x', 0, 'x') * norm_bias_j.dimshuffle('x', 0, 'x') * \
         (ploidy_workspace.int_ploidy_values_k.dimshuffle('x', 'x', 0) + eps_j.dimshuffle('x', 0, 'x'))
mu_mosaic_sjk = norm_f_mosaicism_sj.dimshuffle(0, 1, 'x') * mu_sjk

# "unexplained variance"
psi = Uniform(name='psi', upper=10.0)

# convert "unexplained variance" to negative-binomial over-dispersion
alpha = tt.inv(tt.exp(psi) - 1.0)

# mixture of non-mosaic and mosaic negative-binomial likelihoods
def _get_logp_sjk(_n_sj):
    _logp_sjk = logsumexp(
        [tt.log(1 - pi_mosaicism_s.dimshuffle(0, 'x', 'x'))
         + commons.negative_binomial_logp(mu_sjk, alpha.dimshuffle('x', 'x', 'x'), _n_sj.dimshuffle(0, 1, 'x')),
         tt.log(pi_mosaicism_s.dimshuffle(0, 'x', 'x'))
         + commons.negative_binomial_logp(mu_mosaic_sjk, alpha.dimshuffle('x', 'x', 'x'), _n_sj.dimshuffle(0, 1, 'x'))],
        axis=0)[0]
    return _logp_sjk

# observed counts, with the per-ploidy-state log-likelihood weighted by q_ploidy_sjk
DensityDist(name='n_sj_obs',
            logp=lambda _n_sj: tt.sum(q_ploidy_sjk * _get_logp_sjk(_n_sj), axis=2),
            observed=n_sj)
Briefly, the model includes 1) per-contig bias (normalized to unit mean for identifiability), 2) per-sample depth, 3) per-sample probability of mosaicism, 4) per-sample-and-contig mosaicism factor f_sj (in [0, 1], normalized by the per-sample max for identifiability), and 5) per-contig mapping error. The likelihood is then a negative-binomial mixture of non-mosaic and mosaic contigs, where the latter have their mean count depressed by the corresponding factor f_sj.
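In equation form (my transcription of the snippet above, with norm_bias_j and norm_f_mosaicism_sj written as \tilde{b}_j and \tilde{f}_{sj}), the count likelihood for sample s, contig j, and ploidy k is

p(n_sj | k) = (1 - \pi_s) NB(n_sj; \mu_sjk, \alpha) + \pi_s NB(n_sj; \tilde{f}_{sj} \mu_sjk, \alpha), with \mu_sjk = d_s t_j \tilde{b}_j (k + \epsilon_j),

and the observed per-sample-and-contig log-likelihood is the sum over k of log p(n_sj | k), weighted by the ploidy posteriors q_sjk.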
This model still requires some tuning of priors (which are currently hard coded above), but seems to correctly capture most of the mosaicism in the test samples. Also, I found that it was better to run the aneuploid samples as a cohort or to run them in combination with the 20 panel samples as a cohort, rather than to run them in case mode against the panel.
We don't necessarily have to emit anything on the mosaicism inferences for the first revision of this model (or we may end up stripping those parts of the model out for now), but I thought it would be good to record this version of the model for posterity. However, note that this model differs from the one currently in master in the treatment of depth. I think the treatment here is quite natural and may be more robust than the current treatment.
@mbabadi is going to take over tuning and tweaking the model from this point in the sl_simple_ploidy branch. Note that I haven't cleaned up some of the code and comments yet, but hopefully the changes are relatively clear. I believe I rebased on one of your other branches, so you should remove the corresponding commit and rebase.
@samuelklee does that mean you would recommend including some mosaic samples in the PoN or do you want to see if Mehrtash's tuning improves the situation?
I think @mbabadi may first want to run with a larger and more diverse PoN to see if that resolves the issues I saw running in case mode. We will typically use more than 20 samples, and the ones I used were all from the same project. If not, then some more tweaking may be required.
We may also want to consider using more than just the total count per contig, to address some of the concerns about large events. Perhaps a stochastic subsampling of bins would be a good approach. Hopefully, GermlineCNVCaller will be able to handle such events (although if they are ever larger than a shard, we may run into other issues related to the treatment of depth alluded to above). Might be good to get a few of those samples at some point.
Motivated by some of the discussion and work by @mbabadi in #4558, I took another quick pass at revising the ploidy model. The key difference is that we now use the per-contig coverage histogram (rather than just the per-contig total coverage). This histogram conveys a lot more information and, with some naive filtering (see more discussion in #4558), provides relatively easy peaks to fit. I think this is a better solution than subsampling intervals and fitting a model that would require modeling per-interval bias.
Another key change was to provide per-genotype priors, rather than per-contig priors. For the autosomes this is immaterial, but it's extremely useful for the allosomes. That is, we currently provide per-contig priors like so:
CONTIG  PLOIDY_0  PLOIDY_1  PLOIDY_2  PLOIDY_3
1       0.0       0.0       1.0       0.0
...
X       0.01      0.48      0.48      0.01
Y       0.48      0.48      0.01      0.01
However, note that this implies that X and XXY are just as probable as XX and XY! It's much more powerful to be able to specify per-genotype priors (although this requires a bit more bookkeeping when translating to implications for per-contig histograms):
CONTIG_SET  PLOIDY_STATE  RELATIVE_PROBABILITY
(1)         (2)           1.0
...
(X,Y)       (2,0)         1.0
(X,Y)       (1,1)         1.0
(X,Y)       (1,0)         0.01
(X,Y)       (2,1)         0.01
(X,Y)       (1,2)         0.01
(X,Y)       (3,0)         0.01
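To illustrate the bookkeeping, here is a minimal sketch (hypothetical names, not code from the branch) of how per-genotype priors marginalize down to per-contig ploidy priors; note that the marginals discard exactly the X/Y correlation that makes the genotype priors powerful:

```python
from collections import defaultdict

# (contig_set, ploidy_state) -> relative probability, as in the table above
genotype_priors = {
    (('X', 'Y'), (2, 0)): 1.0,
    (('X', 'Y'), (1, 1)): 1.0,
    (('X', 'Y'), (1, 0)): 0.01,
    (('X', 'Y'), (2, 1)): 0.01,
    (('X', 'Y'), (1, 2)): 0.01,
    (('X', 'Y'), (3, 0)): 0.01,
}

def marginal_contig_priors(genotype_priors):
    """Collapse per-genotype priors to per-contig ploidy priors."""
    marginals = defaultdict(lambda: defaultdict(float))
    for (contig_set, ploidy_state), prob in genotype_priors.items():
        for contig, ploidy in zip(contig_set, ploidy_state):
            marginals[contig][ploidy] += prob
    # normalize each contig's distribution
    return {contig: {ploidy: prob / sum(dist.values()) for ploidy, prob in dist.items()}
            for contig, dist in marginals.items()}

# yields roughly X: {1: 0.50, 2: 0.50, 3: 0.005} and Y: {0: 0.50, 1: 0.50, 2: 0.005},
# i.e., close to the per-contig table above, but the joint X/Y information is lost
print(marginal_contig_priors(genotype_priors))
```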
We then fit the per-contig coverage histograms across all samples with the appropriate negative-binomial distributions corresponding to a sparse mixture of genotypes, while accounting for 1) sample-specific depth (which determines the means of the negative-binomial distributions), 2) multiplicative contig-specific bias (which is mild, at least for WGS), and 3) additive sample-contig-specific mosaicism or bias (note that the above genotype priors imply that mosaicism/bias on top of a baseline of CN = 2 is the only deviation allowed for the autosomes, which is somewhat restrictive but greatly aids convergence).
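Concretely, in the notation of the prototype code below, the negative-binomial mean for sample s, contig j, and ploidy state k works out to (my transcription of the code, not an equation from the branch)

mu_j_sk = d_s \tilde{b}_j max(ploidy_jk + f_sj [ploidy_jk > 0], e_sj),

i.e., the additive mosaicism/bias term f_sj is only applied to contigs with nonzero baseline ploidy, and e_sj provides an error floor for nullisomic contigs.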
I put together a pure PyMC3 prototype that seems to work relatively well. Here are the per-contig coverage histograms (unfiltered bins in blue, bins retained after filtering in red, and negative-binomial fit in green) and a heatmap of per-contig ploidy probabilities; both the panel (first 20) and case (remaining) samples are shown.
Although the prototype model is clearly a good fit to the filtered data, some care in choosing the optimizer and its learning parameters is required to achieve convergence to the correct solution, because the problem is inherently multimodal and has many local minima. I found that using AdaMax with a naive strategy of warm restarts (to help kick us out of local minima) worked decently; we can achieve convergence in <10 minutes for 60 samples x 24 contigs x 250 count bins.
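The restart loop was nothing fancy; a rough sketch of the sort of thing I mean (assuming pymc3 and a model like the prototype below; the number of restarts, step counts, and learning rate here are placeholders):

```python
import numpy as np
import pymc3 as pm
from pymc3.variational.updates import adamax

def fit_with_warm_restarts(model, num_restarts=5, steps_per_round=2000):
    """Run several rounds of AdaMax-driven ADVI and keep the best fit."""
    best_approx, best_loss = None, np.inf
    for seed in range(num_restarts):
        # a fresh optimizer state (and stochastic-gradient seed) each round
        # gives each restart a chance to land in a different mode
        approx = pm.fit(n=steps_per_round, method='advi', model=model,
                        random_seed=seed,
                        obj_optimizer=adamax(learning_rate=0.05))
        loss = approx.hist[-1]  # pm.fit records the per-step loss (negative ELBO)
        if loss < best_loss:
            best_loss, best_approx = loss, approx
    return best_approx
```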
I expect that @mbabadi's annealing implementation in the gcnvkernel package will handle the local minima much better.
The course of action to implement this model should be as follows:
1) Alter the Java code to emit per-contig histograms. Change the python code to consume histograms, perform filtering, and fit using the above model (or some variation).
2) Choose learning parameters appropriate for annealing and check that results are still good.
3) Update the gCNV model to properly consume the depth emitted by this model, if necessary, and rerun evaluations.
Other improvements enabled by mappability filtering (as discussed in #4558) or coverage collection can follow this initial model revision.
In the meantime, we will continue the first round of evaluations using the old ploidy model, spot checking genotype calls as necessary. This will allow us to tune gCNV parameters (which will hopefully be largely unaffected by any changes to the ploidy model).
How does this sound, @ldgauthier @mbabadi @asmirnov239?
For the record, here is the (uncommented and raw) code for the prototype model:
import numpy as np
import pymc3 as pm
import theano.tensor as tt

# (hist_sjm, contig_sets, ploidy_states_ik, ploidy_priors_ik, ploidy_jk,
#  ploidy_concentration_scale, num_samples, num_contigs, min_count, max_count,
#  eps, and construct_mask are assumed to be defined in the surrounding context)
mask_sjm, _ = construct_mask(hist_sjm)

d_s = pm.Uniform('d_s', upper=1000., shape=num_samples, testval=40.)
b_j = pm.Bound(pm.Gamma, lower=0.8, upper=1.2)('b_j', alpha=100, beta=100, shape=num_contigs)
b_j_norm = pm.Deterministic('b_j_norm', var=b_j / tt.mean(b_j))
f_js = pm.Bound(pm.Normal, lower=-0.9, upper=0.5)('f_js', sd=0.01, shape=(num_contigs, num_samples))
pi_i_sk = [pm.Dirichlet('pi_%d_sk' % i, a=ploidy_concentration_scale * ploidy_priors_ik[i],
                        shape=(num_samples, len(ploidy_states_ik[i])),
                        transform=pm.distributions.transforms.t_stick_breaking(eps))
           if len(contig_set) > 1
           else pm.Deterministic('pi_%d_sk' % i, var=tt.ones((num_samples, 1)))
           for i, contig_set in enumerate(contig_sets)]
e_js = pm.Uniform('e_js', lower=0., upper=0.1, shape=(num_contigs, num_samples))
mu_j_sk = [d_s.dimshuffle(0, 'x') * b_j_norm[j] *
           tt.maximum(ploidy_jk[j][np.newaxis, :] + f_js[j].dimshuffle(0, 'x') * (ploidy_jk[j][np.newaxis, :] > 0),
                      e_js[j].dimshuffle(0, 'x'))
           for j in range(num_contigs)]
alpha_js = pm.Uniform('alpha_js', upper=10000., shape=(num_contigs, num_samples))
p_j_skm = [tt.exp(pm.NegativeBinomial.dist(mu=mu_j_sk[j].dimshuffle(0, 1, 'x') + eps,
                                           alpha=alpha_js[j].dimshuffle(0, 'x', 'x'))
                  .logp(tt.arange(min_count, max_count + 1).dimshuffle('x', 'x', 0)))
           for j in range(num_contigs)]

def _logp_sjm(_hist_sjm):
    num_occurrences_tot_sj = tt.sum(_hist_sjm * mask_sjm, axis=2)
    logp_j_skm = [pm.Poisson.dist(mu=num_occurrences_tot_sj[:, j].dimshuffle(0, 'x', 'x') * p_j_skm[j] + eps)
                  .logp(_hist_sjm[:, j, :].dimshuffle(0, 'x', 1))
                  for j in range(num_contigs)]
    return tt.sum(
        [pm.math.logsumexp(
            mask_sjm[:, j, np.newaxis, :] * (tt.log(pi_i_sk[i][:, :, np.newaxis] + eps) + logp_j_skm[j]),
            axis=1)
         for i, contig_set in enumerate(contig_sets) for j in contig_set])

pm.DensityDist(name='hist_sjm', logp=_logp_sjm, observed=hist_sjm)
We can probably tidy this up in several ways (in terms of both code and model cleanliness), but it works. Some of the priors can probably be cleaned up---note especially the awkward bounded prior for the additive mosaic factor f_js. Also note that I used list comprehensions because the number of ploidy states differs across contig sets, which results in a ragged tensor; this does seem to add some overhead in theano, so we can probably pad the tensor and vectorize instead (see the sketch below).
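For instance, a minimal sketch of the padding approach (a hypothetical helper, not code from the prototype):

```python
import numpy as np

def pad_ragged(rows, pad_value=0.):
    """Pad a list of 1-D arrays of differing lengths (e.g., per-contig-set
    ploidy-state priors) into a dense matrix plus a boolean validity mask."""
    max_k = max(len(row) for row in rows)
    dense = np.full((len(rows), max_k), pad_value)
    mask = np.zeros((len(rows), max_k), dtype=bool)
    for i, row in enumerate(rows):
        dense[i, :len(row)] = row
        mask[i, :len(row)] = True
    return dense, mask

# e.g., pad_ragged([[1.], [1., 1., .01, .01, .01, .01]]) gives a 2 x 6 matrix;
# padded entries can be excluded from the logsumexp by setting their log-prior
# contributions to -inf via the mask, so all contig sets vectorize together
```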
Looks great!
One quick note: I don't get the idea behind the Poisson -- shouldn't we simply use negative binomials w/ modeled mu_sj and alpha_sj, evaluated at the observed counts (tt.arange(min_count, max_count + 1)) and weighted with the number of bins for each count (_hist_sjm)? I.e., if one observes an empirical distribution P_obs(x) rather than individual draws x, then the appropriate maximum-likelihood objective function is \sum_x P_obs(x) \log P_model(x | \theta). Perhaps this is exactly what you've done and I don't get it.
Another quick note: what I had in mind was either modeling mu_sj at quantized ploidy states, or letting the ploidy state be unrestricted w/ a penalty via a Bernoulli process (possibly w/ different per-contig penalties to account for, e.g., the higher rate of X/Y loss). We have enough samples in the cohort to select the quantized model (and those samples pin down the per-contig biases b_j). The samples that do not conform to quantized ploidy states can then choose whatever (variable) ploidy state they wish by paying a (hefty) price. We would also need to mask contigs that have variable ploidy calls from gCNV.
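In formulas, something like the following (a rough sketch of the idea, not committed model code): draw z_sj ~ Bernoulli(\pi_j) per sample and contig; z_sj = 0 restricts contig j to the quantized ploidy states, while z_sj = 1 lets its ploidy float freely, so each non-conforming contig pays a penalty of roughly \log \pi_j (hefty for small \pi_j) in the log-likelihood.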
A technical note: as an alternative to equal-sized binning of the coverage distribution, we could use a non-uniform binning strategy that utilizes the available number of bins to represent the variations in the CDF in the most accurate way. Here's what I had written earlier:
import numpy as np

def get_counts_summary(counts, lo_cutoff=0.01, hi_cutoff=0.99, num_divisions=50):
    sorted_counts = np.sort(counts)
    num_points = len(counts)
    lo_index = int(np.floor(lo_cutoff * num_points))
    hi_index = min(int(np.ceil(hi_cutoff * num_points)), num_points - 1)
    lo_count = sorted_counts[lo_index]
    hi_count = sorted_counts[hi_index]
    abscissa_indices = np.round(np.linspace(lo_index, hi_index, num=num_divisions + 1)).astype(int)
    abscissa = np.asarray([sorted_counts[idx] for idx in abscissa_indices])
    abscissa_counts = abscissa_indices[1:] - abscissa_indices[0:-1]
    collapsed_abscissa, collapsed_abscissa_counts = collapse_abscissa_triplets(abscissa, abscissa_counts)
    return collapsed_abscissa, collapsed_abscissa_counts

def collapse_abscissa_triplets(abscissa, abscissa_counts):
    if len(abscissa) < 3:
        return abscissa, abscissa_counts
    else:
        pos = 0
        collapsed_abscissa = [abscissa[pos]]
        collapsed_abscissa_counts = []
        while pos < len(abscissa) - 1:
            first = abscissa[pos]
            last = abscissa[pos + 1]
            count = abscissa_counts[pos]
            if first != last:
                collapsed_abscissa += [last]
                collapsed_abscissa_counts += [count]
                pos += 1
            else:
                j = 1
                while pos + j + 1 < len(abscissa):
                    if abscissa[pos + j + 1] == last:
                        count += abscissa_counts[pos + j]
                        j += 1
                    else:
                        break
                collapsed_abscissa += [last]
                collapsed_abscissa_counts += [count]
                pos += j
        return np.asarray(collapsed_abscissa), np.asarray(collapsed_abscissa_counts)
Here, get_counts_summary returns a tuple of (abscissa, occurrences). The summary is interpreted as follows: there are occurrences[m] bins with counts >= abscissa[m] and <= abscissa[m+1], for m = 0, ..., up to num_divisions (but this could be smaller if some of the redundant abscissa are collapsed). In the PyMC3 code, we could evaluate the NegativeBinomial at the midpoint of abscissa[m] and abscissa[m+1].
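For example, a toy usage (assuming numpy and the functions above; the counts here are simulated stand-ins for per-contig coverage):

```python
import numpy as np

# toy counts: Poisson with mean 40, standing in for per-contig coverage
counts = np.random.default_rng(0).poisson(40., size=10000)
abscissa, occurrences = get_counts_summary(counts, num_divisions=20)

# occurrences[m] bins fall in [abscissa[m], abscissa[m + 1]]; the count
# likelihood could then be evaluated at the bin midpoints
midpoints = 0.5 * (abscissa[:-1] + abscissa[1:])
```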
The Poisson arises because we want our model to generate the occurrences (rather than the counts themselves), assuming that each count bin provides equal weight. As usual, modeling each bin as Poisson is close enough to modeling all bins as multinomial for our purposes. If we directly use the NB likelihood and simply weight the count likelihood by occurrences, occurrences in the peak will strongly affect the result, adversely so if the count likelihood is actually misspecified there. As an example, consider trying to fit a Poisson to data that is actually zero-inflated Poisson---fitting the histogram will actually result in a more robust estimate for the mean. Another benefit is that truncated data (as we have here) is straightforwardly handled in an unbiased way. In the special case of complete, trivially-binned data, the full, unbinned likelihood is recovered. I think this sort of histogram fitting is pretty standard in the astro/particle community.
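To make the zero-inflated example concrete, here is a small self-contained toy (my own illustration; it combines the histogram likelihood with the kind of naive bin filtering used above):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(0)
counts = rng.poisson(10., size=10000)
counts[rng.random(10000) < 0.2] = 0        # misspecification: inflate the zero bin

mu_direct = counts.mean()                  # direct ML fit is dragged down to ~8

hist = np.bincount(counts)
support = np.arange(len(hist))
mask = support > 0                         # naive filtering of the bad bin
n_tot = hist[mask].sum()                   # cf. num_occurrences_tot_sj above

def neg_logp(mu):
    # occurrences in each retained bin are modeled as Poisson with mean n_tot * p_model;
    # dropping the zero bin needs no renormalization since the bins are independent
    expected = n_tot * poisson.pmf(support[mask], mu)
    return -np.sum(poisson.logpmf(hist[mask], expected + 1e-10))

mu_hist = minimize_scalar(neg_logp, bounds=(1., 20.), method='bounded').x  # ~10
```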
We can certainly change up the model to include strictly quantized + free-floating states as you describe (rather than the "fuzzily quantized" states I use here), but I just wanted to avoid having another level of mixtures/logsumexps for this quick prototype. However, note that modeling mosaicism on the autosomes is desirable, but there we also want the strong diploid prior to nail down the depth and per-contig bias. So we will have to be a little careful about how we introduce free-floating states.
Also, since I was not using gcnvkernel, I had to integrate out all discrete parameters. It may be that we can write down a nice model with discrete parameters if we use your inference framework.
Finally, I did not further bin the counts here (or rather, the bin size is 1), which already yields the maximum information, but I did use a maximum-count cutoff. If we use the same cutoff for all samples, this allows us to simply pass a non-ragged matrix from Java (with dimensions of samples x contigs x maximum count) as a TSV.
However, we may run into trouble if we hit a case sample with very high depth. So some sort of sparse representation of the histogram might indeed be desirable, but I think it should be an exact representation of the full histogram. This would require us to sync up code to emit and consume the representation in both Java and python, so I'd like to avoid it if possible---I think I'd prefer just emitting the ragged matrix, in that case.
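For concreteness, a hypothetical sketch of consuming such a non-ragged histogram matrix on the python side (the file layout and names are placeholders, not an agreed format):

```python
import numpy as np

# hypothetical layout: one row per (sample, contig) pair, one column per count
# 0..max_count, with the sample/contig order fixed by accompanying metadata
def read_hist_sjm(tsv_file, num_samples, num_contigs):
    flat = np.loadtxt(tsv_file, dtype=np.int64)
    assert flat.shape[0] == num_samples * num_contigs
    return flat.reshape(num_samples, num_contigs, -1)
```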
Beautiful results @samuelklee -- and some of that data was really squirrelly too!
I think continuing with the current round of evaluations makes sense. Those unusual sex genotype samples are pretty rare so spot checking should be fine. I'm excited about having a robust set of learning parameters so we can get this tool to users. We can work on catching the tricky and interesting but rare cases as a second phase.
@ldgauthier @mbabadi Revisiting this in detail in preparation for a Methods meeting presentation, I discovered that much of the initial poor model performance, as well as the issue with the strange GQs on chr1, was actually fixed in #4335 via a seemingly innocuous change.
Prior to the change, the ploidy model used lexicographical sorting to determine contig order (see #4374), but the contig order of the per-contig counts matrix was determined by the sequence dictionary. This understandably leads to shenanigans, since the model needs to know things like the number of intervals on each contig. After the change, the sequence-dictionary order is properly used everywhere and most of the bad behavior seems to be resolved (not all of the unusual sex genotypes are resolved, but this may be partly due to mosaicism in those samples). (@mbabadi, not sure if we actually realized this before? I don't recall, nor do I see it discussed elsewhere, but if so, then consider this a note of it!)
So the problems with this model are not as severe as we initially thought, and thus we can keep this issue at relatively low priority. Nevertheless, the improved model should still yield additional benefits, such as better depth estimates and ploidy qualities, as well as more robustness to unusual sex genotypes.
Great news! I'm looking forward to seeing your detailed results tomorrow.
This can be low priority unless it's a dramatic improvement over what's in the current Talkowski Lab pipeline. (I remember when we met with Ryan and Harrison a while ago that their approach wasn't as sophisticated, but it seems to get the job done.) I'll talk to Harrison about it tomorrow.
Hi guys, I am very interested in this discussion.
It would be great to know:
Thank you so much in advance!
Hi @cccnrc, glad you found this discussion interesting and apologies for the very late reply. Unfortunately, there hasn't been much movement on this front, as I think we decided that the current model sufficed.
I think we are moving in a direction that will obviate the need for a separate step to fit global (e.g., depth) and per-contig (e.g., contig ploidy) parameters for each sample. Recall that this separate step is only necessary because each gCNV genomic shard needs these quantities to perform inference, but cannot infer them from the data it is responsible for fitting (which typically covers less than a contig).
We are looking to reimplement gCNV in a more modern inference framework that could allow us to do away with such sharding entirely. We could thus fit global/per-contig quantities of interest jointly with the rest of the gCNV model. The timeframe for this work may be relatively long (~year), but I think it'll be worth it.
That said, I think a key takeaway from this work is that genotype priors can be more powerful for breaking degeneracies than contig-ploidy priors, so we will probably try to incorporate that insight in future work.
To answer your questions: 1) There are no additional results much further beyond what is shown above. 2) You can see snippets/comparisons of the genotype and contig-ploidy prior file above. If you're just looking for information about the contig-ploidy prior table used in the current pipeline, see an example at https://gatk.broadinstitute.org/hc/en-us/articles/360051304711-DetermineGermlineContigPloidy and feel free to modify it to be more/less strict as desired. 3) Unfortunately I believe the samples discussed above are not publicly available.
If you are having difficulties with the current model, it would probably be useful to hear about them!