Closed: arpoe closed this issue 2 years ago
Hi @arpoe, thanks for filing this! I am pinging @cansavvy who initially did the analysis.
I also wanted to direct you to the current version of the mutational signature analysis so you can take a closer look: https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/5f2468daf2756a62c7d4615a7660b0a52e5ad135/analyses/mutational-signatures
When you say:
The "flat signatures" tend to have problems in this regard, of which Signature 3 is the most relevant one here.
You are referring to the previously defined/derived signatures (e.g., COSMIC, Nature) that tend to be flat across datasets, is that correct?
May I raise a discussion of alternative strategies?
Yes please!
I would suggest to perform de novo calling on the samples at hand, so that only signatures that are present in the dataset to a certain amount are actually used for fitting.
I had a couple questions regarding this point that I'd love to hear your thoughts on:
Thanks for the link to the analysis. Here are the responses to your questions:
The "flat signatures" tend to have problems in this regard, of which Signature 3 is the most relevant one here.
You are referring to the previously defined/derived signatures (e.g., COSMIC, Nature) that tend to be flat across datasets, is that correct?
By "flat signatures" I mean that the nucleotide changes are not distinct, so the signature plot does not have any spikes. For example, Signature 3 is flat, whereas the APOBEC signatures have distinct nucleotide changes. Flat signatures are (for mathematical reasons) more susceptible to overfitting and to being misassigned to what is actually background, or to the sum of several signatures that cannot be separated.
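To make the flat-versus-spiky distinction concrete, here is a small numerical sketch (purely illustrative, with made-up signature vectors, not real COSMIC profiles): a flat signature is nearly indistinguishable from a uniform background by cosine similarity, which is one way the misassignment risk shows up.

```python
# Illustrative sketch: why "flat" signatures are easy to misassign.
# A flat signature is close to a uniform background in cosine-similarity
# terms; a spiky signature is clearly distinct from it.
import numpy as np

def cosine(a, b):
    """Cosine similarity between two non-negative vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

n_channels = 96  # trinucleotide mutation contexts
background = np.full(n_channels, 1.0 / n_channels)

# Hypothetical "flat" signature: small random wiggles around uniform.
rng = np.random.default_rng(0)
flat_sig = background + rng.uniform(0, 0.002, n_channels)
flat_sig /= flat_sig.sum()

# Hypothetical "spiky" signature: most weight on a few channels.
spiky_sig = np.full(n_channels, 1e-4)
spiky_sig[[10, 25, 40]] = 1.0
spiky_sig /= spiky_sig.sum()

print(cosine(flat_sig, background))   # very close to 1: hard to tell from background
print(cosine(spiky_sig, background))  # much lower: clearly distinct
```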
A signature that is present in most tissues (like #1, #5, #18, etc.) would be called ubiquitous. Sorry for the jargon.
May I raise a discussion of alternative strategies?
Yes please!
I would suggest to perform de novo calling on the samples at hand, so that only signatures that are present in the dataset to a certain amount are actually used for fitting.
I had a couple questions regarding this point that I'd love to hear your thoughts on:
- Part of our rationale for using COSMIC signatures is that they seem to be often used in other datasets/analyses and that would perhaps facilitate some comparison to other datasets (e.g., adult datasets, related to #551). I am not an expert in this area, so it's possible that this line of thinking is misguided. If one were to perform de novo calling on pediatric and adult datasets, what would be the approach for comparison?
Yes, this is the benefit of fitting the existing signatures, but it is difficult to do without causing artefacts for some datasets, especially those with lower mutation rates like pediatric cancers. I understand that it would be great to compare things this way, but I am afraid that this will be difficult here.
When using de novo calling, the identity of the signatures is typically linked back to the known signatures by determining the cosine similarity to the closest known signature. With this approach, some signatures may be split and others may not be separated, which makes direct comparisons difficult. So this is clearly the downside of this approach.
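The cosine-similarity linking step described above could be sketched like this (illustrative Python, not the actual OpenPBTA code; the 0.8 threshold is an arbitrary choice for the example):

```python
# Sketch of assigning de novo signatures to the closest known signature by
# cosine similarity; signatures whose best match falls below a chosen
# threshold are left unassigned (potentially novel or unresolved).
import numpy as np

def assign_signatures(de_novo, known, names, threshold=0.8):
    """de_novo: (k, n_channels) array; known: (m, n_channels) array;
    names: m labels for the known signatures."""
    de_novo = de_novo / np.linalg.norm(de_novo, axis=1, keepdims=True)
    known = known / np.linalg.norm(known, axis=1, keepdims=True)
    sims = de_novo @ known.T  # (k, m) cosine similarities
    assignments = []
    for row in sims:
        best = int(np.argmax(row))
        if row[best] >= threshold:
            assignments.append((names[best], float(row[best])))
        else:
            assignments.append((None, float(row[best])))  # no confident match
    return assignments
```

A de novo signature that is a blend of several known processes will often land below the threshold against every individual known signature, which is exactly the "cannot be separated" situation described above.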
- Is there any floor to the number of samples that are required for that? On a related note, if you have an (unbalanced) mix of disease types like we do in OpenPBTA, do you generally analyze disease types separately?
The minimal number of samples depends on the heterogeneity of mutagenic mechanisms, the number of mutations per sample, and the number of distinct signatures one wants to obtain. When looking at heterogeneous disease types, I like to separate the samples by tissue and call the signatures de novo, while trying not to let the sample numbers get too low. So, for example, I call the signatures for all blood cancers together but don't separate out the lineages. For pediatric brain tumours, several hundred samples are necessary to be confident about the signatures. The way I approach it is largely shaped by Serena Nik-Zainal's opinions, so I would recommend this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7048622/
So in summary, I would suggest to call de novo organ-wise, which in this case means pooling all the pediatric brain tumours, and to use cosine similarity to link the found signatures back to known signatures.
I am also generally considering completely different approaches that ignore tissues, but I don't think this is relevant here, as these approaches are at an early stage.
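For readers unfamiliar with de novo extraction, the core idea discussed above (factorising a sample-by-context count matrix into non-negative exposures and signatures) can be sketched as follows. This is a bare-bones NMF illustration on simulated counts, not the method used in the analysis; real extraction should be done with dedicated tooling that adds model selection, bootstrapping, and convergence diagnostics.

```python
# Minimal de novo signature extraction via non-negative matrix factorisation,
# using the classic Lee & Seung multiplicative updates (illustrative only).
import numpy as np

def extract_signatures(counts, k, n_iter=500, seed=0):
    """Factorise counts (samples x contexts) as exposures @ signatures,
    keeping all factors non-negative. Returns (exposures, signatures)."""
    rng = np.random.default_rng(seed)
    n, m = counts.shape
    W = rng.uniform(0.1, 1.0, (n, k))   # per-sample exposures
    H = rng.uniform(0.1, 1.0, (k, m))   # signatures (rows)
    eps = 1e-9
    for _ in range(n_iter):
        H *= (W.T @ counts) / (W.T @ W @ H + eps)
        W *= (counts @ H.T) / (W @ H @ H.T + eps)
    H_norm = H / H.sum(axis=1, keepdims=True)  # each signature sums to 1
    W_scaled = W * H.sum(axis=1)               # rescale so W @ H is unchanged
    return W_scaled, H_norm
```

Choosing `k` (the number of signatures to extract) is exactly where the sample-size and heterogeneity concerns above bite: with too few samples or mutations, the factorisation cannot reliably support many components.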
Thanks for the explanation @arpoe, makes sense!
So in summary, I would suggest to call de novo organ-wise, which in this case means pooling all the pediatric brain tumours, and to use cosine similarity to link the found signatures back to known signatures.
Is this something you would be able to contribute? If so, the steps would be 1) filing a pull request (or likely multiple pull requests) adding the new results, and 2) where possible, using the existing code for the bubble plots you've referenced, so that the figure that will eventually go in the manuscript (per #571) will automatically reflect your new results. Let me know if you have any questions about the process!
Thanks, yes, I will make the pull request and have a look at this next week. I am currently working on genome evolution of the coronavirus, which has to take priority at the moment.
I am still planning to work on it this week, either starting today or over the weekend. I have been busy with mutagenesis in the coronavirus...
Yes, sounds good @arpoe! I tagged you over on #646 about testing mutational signatures for significance because I thought you may have some ideas about approaching that problem.
I am currently running de novo calling of signatures. It takes a while (about a day, I hope), because I am somewhat limited by the infrastructure. I have implemented it so that the calling does not have to be redone when the output file is provided. I will then assign the resulting signatures to the known signatures using cosine similarity. For this I will take the signature lists that are already implemented in the existing script, and possibly others like the CNS set from the tissue-specific signature lists. The assigned signatures, and any de novo signatures that cannot be directly assigned to a preexisting signature, will be visualised the same way as is already done in the fitting approach (bubble plots, etc.). I will also attempt to interpret the de novo signatures.
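The "does not have to be redone when the output file is provided" idea is a standard caching pattern; a minimal sketch (hypothetical file name and helper, not the actual implementation in the repo):

```python
# Cache pattern for an expensive step: if a previous output file exists,
# load it; otherwise run the slow computation and save the result.
import os
import pickle

def extract_or_load(counts, cache_path="denovo_signatures.pkl", extractor=None):
    """Load cached extraction results if present; else compute and cache them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    result = extractor(counts)  # the slow step, e.g. de novo extraction
    with open(cache_path, "wb") as fh:
        pickle.dump(result, fh)
    return result
```

One design caveat: a cache like this silently goes stale if the input data change, so in a pipeline it is worth also checking input file hashes or timestamps before trusting the cached output.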
Thanks for the update @arpoe! Let us know if you need anything!
I am sorry for the delay. I have now looked more closely at the current analysis, and the fitting is indeed causing some issues. The newest COSMIC set (COSMIC 3) was not yet included; however, including it is likely to make things even worse, because the algorithm will have more signatures to choose from. As an alternative I have implemented de novo calling of signatures, which reveals 10 signatures. I have connected them back to known signatures and implemented the analysis to output the same kind of plots as before. Of course this is not a perfect solution either, as not all signatures that may be present are picked up, so it is a much more conservative approach than fitting. I was also thinking about fitting a published CNS-specific signature set as a compromise; however, this has less resolution than the de novo calling approach. Please let me know what you think. I am also happy to polish this analysis further if you think it is worth proceeding in this direction.
Just thought to reopen this discussion of optimizing the mutational signature analysis with respect to OpenPBTA.
A benchmarking paper last year highlighted more favorable performance when utilizing Bayesian NMF approaches as opposed to the original NMF modeling (though I am not sure which de novo method was leveraged above).
Further, sigfit provides a framework to execute the above, enabling optimal selection of mutational signatures based on a Bayesian HPD/confidence interval.
Given the Nik-Zainal paper linked earlier in this thread, it seems an optimal approach, if the primary goal is to assess the exposure of known signatures, would be:
Though an alternative approach leveraging a de novo method in sigfit could be (incorporating some suggestions from Sebastian Waszak of NCMM):
sigfit using CNS-specific signatures. Thoughts?
Hi @aadamk, this is almost precisely what I have done :-) There were some problems reintegrating the branch into the pipeline, because I had separated out the signature calling itself from the pipeline; it generates large files and is quite compute-intensive. This caused some issues that we still have to solve. @jaclyn-taroni, have there been any developments? Is there something I can do?
I see, thanks for the clarification @arpoe!
@aadamk you may find taking a look at the corresponding PR (#678) helpful! @arpoe I had prioritized subtyping issues that are blocking the next data release for this project, but since someone else has now picked up #509 I am hoping to revisit this in the next few days. (I've started a branch: https://github.com/jaclyn-taroni/OpenPBTA-analysis/tree/jaclyn-taroni/de-novo-mut-sig.)
For further discussion of mutational signatures, see these issues: #1173, #1220, #1248
Several PRs were merged to close this out. In the end, there remained convergence problems with de novo extraction, so we instead fit 8 known adult CNS signatures to the data.
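The fitting approach that was ultimately merged, with the signature matrix held fixed and only exposures estimated, can be sketched as follows (illustrative Python; the actual analysis uses the R tooling in the OpenPBTA repo, and the signature matrix below is simulated, not the real 8 CNS signatures):

```python
# Fitting fixed, known signatures: estimate non-negative per-sample exposures
# E minimising ||counts - E @ signatures||, with signatures held constant.
import numpy as np

def fit_exposures(counts, signatures, n_iter=1000, seed=0):
    """counts: (samples, contexts); signatures: (k, contexts), held fixed.
    Returns non-negative exposures E of shape (samples, k)."""
    rng = np.random.default_rng(seed)
    E = rng.uniform(0.1, 1.0, (counts.shape[0], signatures.shape[0]))
    eps = 1e-9
    StS = signatures @ signatures.T  # (k, k), precomputed once
    for _ in range(n_iter):
        # Multiplicative update for the convex least-squares problem in E.
        E *= (counts @ signatures.T) / (E @ StS + eps)
    return E
```

Unlike de novo extraction, this problem is convex in the exposures, which is part of why fitting a small, curated signature set is the more stable (if less exploratory) endpoint the thread converged on.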
The current figure for mutational signatures is very nice. I am however worried that there may be problems with fitting. The "flat signatures" tend to have problems in this regard, of which Signature 3 is the most relevant one here. With the current methodology it is likely to be overestimated. I am also concerned that the smoking and UV signatures are picked up in non-negligible levels. May I raise a discussion of alternative strategies? I would suggest to perform de novo calling on the samples at hand, so that only signatures that are present in the dataset to a certain amount are actually used for fitting. Of course this does not solve all problems, but it reduces misassignment of signatures.