dami82 / mutSignatures

mutSignatures R package - updated (dev) version - 2.1.4
http://www.mutsignatures.org
14 stars 0 forks source link

How to set num of signatures? #15

Open eibol1 opened 2 years ago

eibol1 commented 2 years ago

Hi there,

I do not know if I understand but you can create as many signatures as you set in the parameter num.sign. I am following the tutorial described in https://cran.r-project.org/web/packages/mutSignatures/vignettes/get_sarted_with_mutSignatures.html where there are 50 samples and num.sig is 4. I do not understand that. I would have thought that you need as many signatures as samples you have, but some samples may share similar signatures.

How can I solve this? There is a maximum number of signatures? How can I know the optimal number of this parameter?

Thanks

dami82 commented 2 years ago

Unfortunately, selecting a reasonable number of signatures to extract is no trivial task. NMF (NMF wiki page) works differently than PCA, so I do NOT recommend extracting as many signatures as number of samples (a W matrix [i.e. these signatures] is NOT a list of eigenvectors). Typically, a mutational signature analysis starts with a set of cancer samples from one (or few) indications and the assumption is that a small number k of mutational processes may be active in those samples and contribute to the accumulation of DNA mutations. Now, back to your question... What is a reasonable k to select for your de novo extraction analysis? Unfortunately there is no easy answer, nor a definitive answer to this question. Some algorithms in this space can select the best k automatically, but a few independent papers suggested that often the automatic selection does not pick the expected number. My suggestion is to proceed by trial and error. You should try a few different reasonable k values and see which one works best. You should select a k-value range based on: 1) the size of the dataset, 2) the extent of inter-sample heterogeneity, 3) preliminary information about similar cancer types (see https://doi.org/10.1038/nature12477), and 4) critical evaluation of results from each iteration, including: i) how similar / different are the profiles of de novo-extracted mutational signatures and ii) stability results (silhouette plots) from each run. The mutSignatures library provides a function to perform a preliminary assessment of the deconvolution error (% of max error on the training set) with respect to the number of signatures being extracted. This function may provide some insights into selecting an optimal range of k values to extract.

For example, look at the code below:

# prelimProcessAssess (this may take a while)
test <- prelimProcessAssess(input = my_mut_counts, maxProcess = 10, plot = FALSE)

# Build a plot
ggplot(test, aes(x=numProcess, y=percentErr)) +
  geom_point() + geom_line() +
  scale_x_continuous(breaks = seq(1, 12, by = 1)) +
  theme_bw() + xlab('number of signatures') + ylab('% Error')

image

In the previous case, I ended selecting the range (3...7). Turns out that k=5 was the best number. My sIlhouette plots (with k=5) looked rather stable (see below) and the profiles of the extracted signatures were NOT overlapping.

image

I hope this helps / makes sense.

eibol1 commented 2 years ago

Thanks for the reply and the explanation.

I tried this on my data. Firstly, with mut.sign <- 4, I got this Silhouette Plot:

image

(I think is not a well-stable Silhouette Plot, but correct me)

Then, I run the PrelimProcessAsses as you said:

image

I think that mut.sign >= 10 is the best option to get the less error, isn't it? The Silhouette Plot is this one with mut.sign = 10:

image

I think is more stable than before, but not sure. What do you think? Should I increase the signatures number and check the Silhouette Plots? Try and error the best to find my mut.sign (?)

dami82 commented 2 years ago

Here are my comments:

  1. The first silhouette plot with k=4 is actually not that great (I agree with your assessment, no profiles get close to the value of +1), so the configuration looks sub-optimal.
  2. Note: you can run PrelimProcessAsses() setting maxProcess = 14 (or any value bigger than 10) if you think that the curve did not reach a minimum (by k=10); of course the profile of that curve may never get to a minimum (for any value of k in the selected range), depending on several factors, such as the number of samples and complexity of the dataset.
  3. I think that the results you got from PrelimProcessAsses() suggest that you may want to test k values bigger than k=4 (as you did). Selecting k=10 was reasonable, but I would recommend testing a range of k values, for example between k=7 and k=11. This is exactly what I meant by "trial and error". You may want to test multiple k's: k=7, k=8, k=9 ... and so on, review the results, and pick the value of k that looks the best. Just keep the number of bootrapping iterations small (50-100) at this stage.
  4. The silhouette plot you obtained using k=10 has a few clusters (i.e., signatures) looking good (high silhouette values); however, three clusters look pretty weak. This suggests that you may want to try a smaller k. I think you may want to try k=7 or k=8 next.

How to review results from each cycle with respect to k.

  1. At the end of each cycle (k=7, k=8 ...), you should inspect the silhouette plots (as you did)
  2. you should also look the profiles of the mutational signatures you just extracted. Overlapping profiles raise red flags.

I used the following code at the end of each cycle (using a toy dataset) and obtained the results shown below. The number of samples was n=30, the optimal k was k=5.

# increase the number of signatures, k=5, k=6, ... at each cycle
tmp_analysis_ki <- decipherMutationalProcesses(input = my_counts,
                                               params = my_params_ki)

my_sigs_ki <- tmp_analysis_ki$Results$signatures
my_silhouette_ki <- tmp_analysis_ki$Supplementary$silhouette
match_my_sigs_ki <- matchSignatures(my_sigs_ki, my_sigs_ki)
match_my_sigs_ki$plot

ggplot(my_silhouette_ki, aes(y = silhouette_value, x = id, fill = signature)) +
  geom_bar(stat = 'identity', width = 1) + ylim(c(-1, 1)) + 
  coord_flip() + theme_bw() 

msigPlot(my_sigs_ki, signature = 1, ylim = c(0, 0.15))
msigPlot(my_sigs_ki, signature = 2, ylim = c(0, 0.15))
msigPlot(my_sigs_ki, signature = 3, ylim = c(0, 0.15))
# ...

image


The first silhouette plot (k=8) shows a few weak clusters (top panel). Signature matching (top-right heatmap) revealed high similarity between certain pairs of signatures, such as 4 and 7, or 3 and 6. A manual inspection confirmed the similarity of those profiles (right panel). Takeaway: reduce k. We tested a smaller k value (k=6; left, mid panel) and results suggested that one signature (6) was weak and possibly similar to signature 4. Takeaway: reduce k. When we tested a very small k value (left, bottom panel) we noticed no signature overlap, but only one of the signatures had a very good silhouette. Takeaway: increase k.

Final Note Unfortunately, it takes some work to find the best k value (as I anticipated), and you may be left with a judgement call in the end (e.g. is k=8 better than k=9 for my dataset?)... The tests I recommended in my reply can help, but often there is also an arbitrary component to the final decision. If possible, try to complement the results from these tests with the understanding of the underlying biology.

I hope this helps.

eibol1 commented 2 years ago

Wow thank you for the masterclass @dami82 it is very useful for me.

I have now understood how it works so I will try now and find the best for my analysis.

Again, thank you for the quick answer and time dedicated. I will report if anything goes wrong or good.