References useful for this analysis:

- Common pitfalls in (de novo and otherwise) mutational signature identification: https://www.nature.com/articles/s41467-019-11037-8 (recommends using the `MutationalPatterns` R package method and `deconstructSigs` jointly and comparing results)
- Currently used: `deconstructSigs` paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0893-4
- `MutationalPatterns` paper: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-018-0539-0 (built on `NMF`, which can also be used to estimate the optimal number of different mutational signatures that can be extracted from the data)
- `signeR`: an empirical Bayesian approach to mutational signature discovery; `signeR` seems pretty darn promising
- `pmsignature`
We used `deconstructSigs` in our initial approach, which is now `analyses/mutational-signatures/01-known_signatures.Rmd` (and is also what we described in the README because I neglected to update it 😬). The published signatures we used are the COSMIC signatures and the Alexandrov et al. 2013 signatures.
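For readers unfamiliar with `deconstructSigs`, fitting known signatures to a sample generally looks like the sketch below. This is a minimal illustration, not the module's actual code; `maf_df` is a placeholder mutation data frame and the column names follow MAF conventions rather than necessarily matching ours.

```r
# Minimal sketch of fitting known signatures with deconstructSigs.
# `maf_df` is a placeholder data frame with one row per mutation.
library(deconstructSigs)

# Build the 96 trinucleotide-context counts per sample (defaults to hg19)
sigs_input <- mut.to.sigs.input(
  mut.ref   = maf_df,
  sample.id = "Tumor_Sample_Barcode",
  chr       = "Chromosome",
  pos       = "Start_Position",
  ref       = "Reference_Allele",
  alt       = "Tumor_Seq_Allele2"
)

# Estimate each known signature's weight in a single sample;
# deconstructSigs ships signatures.cosmic and signatures.nature2013
# (the Alexandrov et al. 2013 signatures)
fit <- whichSignatures(
  tumor.ref       = sigs_input,
  signatures.ref  = signatures.cosmic,
  sample.id       = rownames(sigs_input)[1],
  contexts.needed = TRUE
)
fit$weights
```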
@jaclyn-taroni It seems to me after some lit review that we really need to compare this approach with some of the newer probabilistic methods that are more suitable for small sample sizes, aka anything less than 1000 specimens. This approach is the gold standard, but it may not be right for us, so I will explore!
> It seems to me after some lit review that we really need to compare this approach with some of the newer probabilistic methods that are more suitable for small sample sizes, aka anything less than 1000 specimens.
When you say "this approach," which approach are you referencing? `deconstructSigs`, or the method we currently use for de novo mutational signatures (`sigfit`)?
Here's where `sigfit` gets applied currently: `analyses/mutational-signatures/scripts/de_novo_signature_extraction.R`. The `sigfit` reference, which I neglected to link to in this issue: Gori and Baez-Ortega, bioRxiv.
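For context, a de novo extraction with `sigfit` looks roughly like the sketch below; `mut_counts`, the range of k, and the iteration counts are placeholders rather than the script's actual settings.

```r
# Minimal sketch of de novo signature extraction with sigfit.
# `mut_counts` is a placeholder samples x 96 trinucleotide-context count matrix;
# the nsignatures range and iter values are illustrative only.
library(sigfit)

extraction <- extract_signatures(
  counts      = mut_counts,
  nsignatures = 2:6,   # range of k to evaluate
  iter        = 1000,
  seed        = 2020
)
# plot_gof(extraction) can then be used to inspect goodness of fit across k
# (see the sigfit documentation)

# For a fit at a single k, the signature matrix can be pulled out with:
fit_k4  <- extract_signatures(counts = mut_counts, nsignatures = 4, iter = 1000)
sigs_k4 <- retrieve_pars(fit_k4, par = "signatures")
```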
"This approach" = a probabilistic approach. I haven't yet dug through the code associated with this analysis, so it sounds like sigfit
is one of those! Sounds like we're already doing it; very nice to see when my thoughts match up with what we're doing. May end up comparing to signeR
once I start really digging into analyzing.
Update on the analysis, very much before filing any PRs:

`sigfit` runs across a range of `k` require significant memory, depending on how many values of k you test and how many iterations are performed. With 50 GB of RAM allocated to evaluate goodness of fit for `k = 3:15` with 2500 iterations each, the script crashes every time, and I don't have much more RAM over here to spare. The limiting factor seems to be the number of `k`'s one tests at a time, so we are somewhat limited in how much benchmarking across `k` we can perform; 30 GB of RAM for testing `k` in `3:8` with 2500 iterations is about right, or 20 GB for 1000 iterations.
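One way to keep memory bounded is to fit one value of k per call. This is a sketch only, assuming we call `sigfit::extract_signatures()` directly from R; object names, paths, and iteration counts are placeholders, and the goodness-of-fit comparison across k would then be assembled afterwards from the saved fits.

```r
# Sketch: fit one k at a time so only one Stan model is held in memory at once.
# `mut_counts` and the output directory are placeholders.
library(sigfit)

dir.create("results/de_novo_fits", recursive = TRUE, showWarnings = FALSE)

for (k in 3:8) {
  fit <- extract_signatures(
    counts      = mut_counts,
    nsignatures = k,
    iter        = 1000,
    seed        = 2020 + k
  )
  saveRDS(fit, file.path("results/de_novo_fits", paste0("k", k, ".rds")))
  rm(fit)
  gc()  # encourage R to release memory between fits
}
```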
The selected `k` is not stable between runs, likely due to MCMC convergence problems. I have run both `sigfit` models (`poisson`, comparable to EMu, and `multinomial`, the default, which is analogous to NMF). Results are all over the place: the goodness-of-fit elbow plot (and the `sigfit` output message) yields one of k = 3, 4, or 5. I haven't dug into this further to see whether the same signatures are being identified every time.

The `stan` convergence problems are relayed by consistent messages such as:
```
There were 43 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
8: Examine the pairs() plot to diagnose sampling problems
```
Convergence issues are observed even with as many as 5000 iterations and 1000 warmup (burn-in) iterations. Do we have any `stan` experts on board here? Dealing with this is a two-fold issue: 1) specifying `stan` parameters to `sigfit` (this I got), and 2) extracting convergence plots and other information we can use to assess convergence directly from `sigfit` objects (this I do not got). That said, I am concerned that fighting with the sampler will yield diminishing returns, since I suspect our mutation data is just too sparse and/or that by grouping all molecular subtypes together we are "blending" too many faint distinct signals to identify any overarching shared patterns.
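For point 2, one possible route is to lean on `rstan`'s standard diagnostics. This is a sketch under two assumptions I have not verified: that extra arguments to `extract_signatures()` are forwarded to `rstan::sampling()`, and that the fitted object exposes the underlying `stanfit`, guessed here as `$result`.

```r
# Sketch: Stan-level diagnostics for a sigfit extraction.
# ASSUMPTIONS (check the sigfit docs/object structure before relying on these):
#   1. `warmup` and `control` are forwarded to rstan::sampling()
#   2. the underlying stanfit lives in an element such as `$result`
library(sigfit)
library(rstan)

fit <- extract_signatures(
  counts      = mut_counts,   # placeholder count matrix
  nsignatures = 4,
  iter        = 3000,
  warmup      = 1000,
  control     = list(adapt_delta = 0.95, max_treedepth = 12)
)

stanfit <- fit$result                                  # assumed location of the stanfit
check_hmc_diagnostics(stanfit)                         # divergences, treedepth, energy
max(summary(stanfit)$summary[, "Rhat"], na.rm = TRUE)  # worst R-hat across parameters
# traceplot(stanfit, pars = ...) for trace plots of specific parameters
```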
`stan` offers search algorithms besides MCMC but, as it turns out, these are either not well-suited for convenient goodness-of-fit analyses (the maximum a posteriori approach) or are experimental (variational Bayes) and just error out because, well, it's still experimental.
So, how to proceed?

One option would be to extract signatures within molecular subtypes, but by far the most common `molecular_subtype` value is `NA`, which suggests this is not the right approach..?
```r
metadata %>% count(molecular_subtype) %>% arrange(-n)
# A tibble: 16 x 2
   molecular_subtype            n
   <chr>                    <int>
 1 NA                        2340
 2 DMG, H3 K28                141
 3 Group4                     118
 4 HGG, H3 wildtype            58
 5 SHH                         54
 6 CNS Embryonal, NOS          33
 7 Group3                      26
 8 WNT                         21
 9 HGG, H3 G35                  9
10 BRAF V600E                   8
11 ETMR, C19MC-altered          8
12 CNS NB-FOXR2                 5
13 DMG, H3 K28, BRAF V600E      5
14 HGG, IDH                     5
15 CNS HGNET-MN1                1
16 ETMR, NOS                    1
```
Another option is to try the `signeR` package I had previously found. I don't see a compelling reason not to do this? `signeR` came out in 2016, and `sigfit` is currently on bioRxiv but does not cite `signeR` at all, so it's not clear whether these methods have previously been compared.

Closed with PRs:
I am filing this issue to replace #636 with more detailed steps required for completing the de novo mutational signatures analysis and to surface some discussion on #799 (or other, already merged PRs). Note that this issue is too expansive in scope to be completed with a single pull request. It likely should be broken up into smaller issues with more detail, but in an effort to reduce the cognitive burden associated with tracking it exclusively in my head and to get some feedback, I am filing this one large issue to start.
The current state of mutational signatures
Right now, the de novo part of the `mutational-signatures` module extracts signatures from the WGS samples only, for a range of numbers of signatures (k), using a low number of iterations. There is a script, `analyses/mutational-signatures/scripts/de_novo_signature_extraction.R`, that has command line options for the value(s) of k to use during extraction and for the number of iterations.

What needs to happen next
Of the things I am currently aware of 😅:

- Select the number of signatures to extract (k). This will be a notebook where we include goodness-of-fit plots from `sigfit` and examine the mutational spectra. We will use this thread of discussion as our guide: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/799/files#r502923369
- Assess the stability/reproducibility of the extracted signatures after selecting k, by running with a higher number of iterations, using multiple seeds, and measuring average silhouette width (see the sketch after this list): https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/799/files#r503268952
  - How should we handle the seed in `sigfit::extract_signatures()`? Part of how we assess this is with reproducibility.
- Add an `nsignatures` option to `analyses/mutational-signatures/scripts/de_novo_signature_extraction.R` per https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/806#discussion_r504933953
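A minimal sketch of the silhouette-width idea, under stated assumptions (the fits come from `sigfit` runs with different seeds, `retrieve_pars()` returns a `$mean` signature matrix, and `fits` is a placeholder list): pool the signature matrices across runs, cluster them on cosine distance, and take the average silhouette width. Higher average silhouette width would indicate the extracted signatures are stable for that k.

```r
# Sketch: stability of extracted signatures across seeds via average silhouette width.
# `fits` is a placeholder list of sigfit extractions at the same k, run with
# different seeds; each is assumed to yield a k x 96 signature matrix.
library(sigfit)
library(cluster)

sig_list <- lapply(fits, function(fit) {
  as.matrix(retrieve_pars(fit, par = "signatures")$mean)
})
all_sigs <- do.call(rbind, sig_list)   # (runs * k) x 96 matrix

# Cosine distance between all pooled signatures
norms       <- sqrt(rowSums(all_sigs^2))
cosine_sim  <- (all_sigs %*% t(all_sigs)) / (norms %o% norms)
cosine_dist <- as.dist(1 - cosine_sim)

# Cluster the pooled signatures into k groups and compute average silhouette width
k        <- nrow(sig_list[[1]])
clusters <- cutree(hclust(cosine_dist, method = "average"), k = k)
mean(silhouette(clusters, cosine_dist)[, "sil_width"])
```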
Once we have a set of signatures to move forward with, we must do the following:

- Match the de novo signatures to known/reference signatures (`03-match_de_novo` in #799). Potential gotcha: comparing to known/reference signatures can aid in selecting k: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/799/files#r503205836
- Examine the de novo signatures in individual samples (`04-de_novo_per_sample` in #799).