References useful for this analysis:

- Common pitfalls in (de novo and otherwise) mutational signature identification: https://www.nature.com/articles/s41467-019-11037-8 (recommends using the `MutationalPatterns` R package method and `deconstructSigs` jointly and comparing results)
- Currently used: `deconstructSigs` paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0893-4
- `MutationalPatterns` paper: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-018-0539-0 (built on `NMF`, which can also be used to estimate the optimal number of different mutational signatures that can be extracted from the data)
- `signeR`: an empirical Bayesian approach to mutational signature discovery; `signeR` seems pretty darn promising
- `pmsignature`
We used `deconstructSigs` in our initial approach, which is now `analyses/mutational-signatures/01-known_signatures.Rmd` (and is also what we described in the README because I neglected to update it 😬). The published signatures we used are the COSMIC signatures and the Alexandrov et al. 2013 signatures.
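For readers unfamiliar with `deconstructSigs`, fitting known signatures to a sample generally looks like the sketch below. This is a minimal illustration, not the module's actual code; `maf_df` is a placeholder mutation data frame and the column names follow MAF conventions rather than necessarily matching ours.

```r
# Minimal sketch of fitting known signatures with deconstructSigs.
# `maf_df` is a placeholder data frame with one row per mutation.
library(deconstructSigs)

# Build the 96 trinucleotide-context counts per sample (defaults to hg19)
sigs_input <- mut.to.sigs.input(
  mut.ref   = maf_df,
  sample.id = "Tumor_Sample_Barcode",
  chr       = "Chromosome",
  pos       = "Start_Position",
  ref       = "Reference_Allele",
  alt       = "Tumor_Seq_Allele2"
)

# Estimate each known signature's weight in a single sample;
# deconstructSigs ships signatures.cosmic and signatures.nature2013
# (the Alexandrov et al. 2013 signatures)
fit <- whichSignatures(
  tumor.ref       = sigs_input,
  signatures.ref  = signatures.cosmic,
  sample.id       = rownames(sigs_input)[1],
  contexts.needed = TRUE
)
fit$weights
```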
@jaclyn-taroni It seems to me after some lit review that we really need to compare this approach with some of the newer probabilistic methods that are more suitable for small sample sizes, aka anything less than 1000 specimens. This approach is the gold standard, but it may not be right for us, so I will explore!
> It seems to me after some lit review that we really need to compare this approach with some of the newer probabilistic methods that are more suitable for small sample sizes, aka anything less than 1000 specimens.
When you say "this approach," which approach are you referencing? `deconstructSigs`, or the method we currently use for de novo mutational signatures (`sigfit`)?
Here's where `sigfit` gets applied currently: `analyses/mutational-signatures/scripts/de_novo_signature_extraction.R`. The `sigfit` reference, which I neglected to link to in this issue: Gori and Baez-Ortega, bioRxiv.
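For context, a de novo extraction with `sigfit` looks roughly like the sketch below; `mut_counts`, the range of k, and the iteration counts are placeholders rather than the script's actual settings.

```r
# Minimal sketch of de novo signature extraction with sigfit.
# `mut_counts` is a placeholder samples x 96 trinucleotide-context count matrix;
# the nsignatures range and iter values are illustrative only.
library(sigfit)

extraction <- extract_signatures(
  counts      = mut_counts,
  nsignatures = 2:6,   # range of k to evaluate
  iter        = 1000,
  seed        = 2020
)
# plot_gof(extraction) can then be used to inspect goodness of fit across k
# (see the sigfit documentation)

# For a fit at a single k, the signature matrix can be pulled out with:
fit_k4  <- extract_signatures(counts = mut_counts, nsignatures = 4, iter = 1000)
sigs_k4 <- retrieve_pars(fit_k4, par = "signatures")
```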
"This approach" = a probabilistic approach. I haven't yet dug through the code associated with this analysis, so it sounds like sigfit
is one of those! Sounds like we're already doing it; very nice to see when my thoughts match up with what we're doing. May end up comparing to signeR
once I start really digging into analyzing.
Update on the analysis, very much before filing any PRs:

`sigfit` runs across a range of `k` require significant memory, depending on how many values of k you test and how many iterations are performed. With 50 GB of RAM allocated to evaluate goodness of fit for `k = 3:15` with 2500 iterations each, the script crashes every time, and I don't have much more RAM over here to spare. The limiting factor seems to be the number of `k`'s one tests at a time, so we are somewhat limited in how much benchmarking across `k` we can perform; 30 GB of RAM for testing `k` in `3:8` with 2500 iterations is about right, or 20 GB for 1000 iterations.
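One way to keep memory bounded is to fit one value of k per call. This is a sketch only, assuming we call `sigfit::extract_signatures()` directly from R; object names, paths, and iteration counts are placeholders, and the goodness-of-fit comparison across k would then be assembled afterwards from the saved fits.

```r
# Sketch: fit one k at a time so only one Stan model is held in memory at once.
# `mut_counts` and the output directory are placeholders.
library(sigfit)

dir.create("results/de_novo_fits", recursive = TRUE, showWarnings = FALSE)

for (k in 3:8) {
  fit <- extract_signatures(
    counts      = mut_counts,
    nsignatures = k,
    iter        = 1000,
    seed        = 2020 + k
  )
  saveRDS(fit, file.path("results/de_novo_fits", paste0("k", k, ".rds")))
  rm(fit)
  gc()  # encourage R to release memory between fits
}
```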
The selected `k` is not stable between runs, likely due to MCMC convergence problems. I have run both `sigfit` models (`poisson`, comparable to EMu, and `multinomial`, the default, which is analogous to NMF). Results are all over the place: the goodness-of-fit elbow plot (and the `sigfit` output message) yields one of k = 3, 4, or 5. I haven't dug into this further to see whether the same signatures are being identified every time.

The `stan` convergence problems are relayed by consistent messages such as:
```
There were 43 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
8: Examine the pairs() plot to diagnose sampling problems
```
Convergence issues are observed even with as many as 5000 iterations and 1000 warmup (burn-in) iterations. Do we have any `stan` experts on board here? Dealing with this is a two-fold issue: 1) specifying `stan` parameters to `sigfit` (this I got), and 2) extracting convergence plots and other information we can use to assess convergence directly from `sigfit` objects (this I do not got). That said, I am concerned that fighting with the sampler will yield diminishing returns, since I suspect our mutation data is just too sparse and/or that by grouping all molecular subtypes together we are "blending" too many faint distinct signals to identify any overarching shared patterns.
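For point 2, one possible route is to lean on `rstan`'s standard diagnostics. This is a sketch under two assumptions I have not verified: that extra arguments to `extract_signatures()` are forwarded to `rstan::sampling()`, and that the fitted object exposes the underlying `stanfit`, guessed here as `$result`.

```r
# Sketch: Stan-level diagnostics for a sigfit extraction.
# ASSUMPTIONS (check the sigfit docs/object structure before relying on these):
#   1. `warmup` and `control` are forwarded to rstan::sampling()
#   2. the underlying stanfit lives in an element such as `$result`
library(sigfit)
library(rstan)

fit <- extract_signatures(
  counts      = mut_counts,   # placeholder count matrix
  nsignatures = 4,
  iter        = 3000,
  warmup      = 1000,
  control     = list(adapt_delta = 0.95, max_treedepth = 12)
)

stanfit <- fit$result                                  # assumed location of the stanfit
check_hmc_diagnostics(stanfit)                         # divergences, treedepth, energy
max(summary(stanfit)$summary[, "Rhat"], na.rm = TRUE)  # worst R-hat across parameters
# traceplot(stanfit, pars = ...) for trace plots of specific parameters
```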
`stan` offers search algorithms besides MCMC but, as it turns out, these are either not well-suited for convenient goodness-of-fit analyses (the maximum a posteriori approach) or are experimental (variational Bayes) and just error out because, well, it's still experimental.
So, how to proceed?

One option would be to extract signatures within molecular subtypes, but by far the most common `molecular_subtype` value is `NA`, which suggests this is not the right approach..?
```r
metadata %>% count(molecular_subtype) %>% arrange(-n)
# A tibble: 16 x 2
   molecular_subtype            n
   <chr>                    <int>
 1 NA                        2340
 2 DMG, H3 K28                141
 3 Group4                     118
 4 HGG, H3 wildtype            58
 5 SHH                         54
 6 CNS Embryonal, NOS          33
 7 Group3                      26
 8 WNT                         21
 9 HGG, H3 G35                  9
10 BRAF V600E                   8
11 ETMR, C19MC-altered          8
12 CNS NB-FOXR2                 5
13 DMG, H3 K28, BRAF V600E      5
14 HGG, IDH                     5
15 CNS HGNET-MN1                1
16 ETMR, NOS                    1
```
Another option is to try the `signeR` package I had previously found. I don't see a compelling reason not to do this? `signeR` came out in 2016, and `sigfit` is currently on bioRxiv but does not cite `signeR` at all, so it's not clear whether these methods have previously been compared.

Closed with PRs:
I am filing this issue to replace #636 with more detailed steps required for completing the de novo mutational signatures analysis and to surface some discussion on #799 (or other, already merged PRs). Note that this issue is too expansive in scope to be completed with a single pull request. It likely should be broken up into smaller issues with more detail, but in an effort to reduce the cognitive burden associated with tracking it exclusively in my head and to get some feedback, I am filing this one large issue to start.
The current state of mutational signatures
Right now, the de novo part of the `mutational-signatures` module extracts signatures from the WGS samples only, for a range of numbers of signatures (k), using a low number of iterations. There is a script, `analyses/mutational-signatures/scripts/de_novo_signature_extraction.R`, that has command line options for the value(s) of k to use during extraction and for the number of iterations.

What needs to happen next
Of the things I am currently aware of 😅:

- Select the number of signatures to extract (k). This will be a notebook where we include goodness-of-fit plots from `sigfit` and examine the mutational spectra. We will use this thread of discussion as our guide: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/799/files#r502923369
- Assess the stability/reproducibility of the extracted signatures after selecting k, by running with a higher number of iterations, using multiple seeds, and measuring average silhouette width (see the sketch after this list): https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/799/files#r503268952
  - How should we handle the seed in `sigfit::extract_signatures()`? Part of how we assess this is with reproducibility.
- Add an `nsignatures` option to `analyses/mutational-signatures/scripts/de_novo_signature_extraction.R` per https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/806#discussion_r504933953
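A minimal sketch of the silhouette-width idea, under stated assumptions (the fits come from `sigfit` runs with different seeds, `retrieve_pars()` returns a `$mean` signature matrix, and `fits` is a placeholder list): pool the signature matrices across runs, cluster them on cosine distance, and take the average silhouette width. Higher average silhouette width would indicate the extracted signatures are stable for that k.

```r
# Sketch: stability of extracted signatures across seeds via average silhouette width.
# `fits` is a placeholder list of sigfit extractions at the same k, run with
# different seeds; each is assumed to yield a k x 96 signature matrix.
library(sigfit)
library(cluster)

sig_list <- lapply(fits, function(fit) {
  as.matrix(retrieve_pars(fit, par = "signatures")$mean)
})
all_sigs <- do.call(rbind, sig_list)   # (runs * k) x 96 matrix

# Cosine distance between all pooled signatures
norms       <- sqrt(rowSums(all_sigs^2))
cosine_sim  <- (all_sigs %*% t(all_sigs)) / (norms %o% norms)
cosine_dist <- as.dist(1 - cosine_sim)

# Cluster the pooled signatures into k groups and compute average silhouette width
k        <- nrow(sig_list[[1]])
clusters <- cutree(hclust(cosine_dist, method = "average"), k = k)
mean(silhouette(clusters, cosine_dist)[, "sil_width"])
```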
Once we have a set of signatures to move forward with, we must do the following:

- Match the de novo signatures to known/reference signatures (`03-match_de_novo` in #799). Potential gotcha: comparing to known/reference signatures can aid in selecting k: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/799/files#r503205836
- Examine the de novo signatures in individual samples (`04-de_novo_per_sample` in #799).