AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Mutational Signatures #636

Closed: arpoe closed this issue 2 years ago

arpoe commented 4 years ago

The current figure for mutational signatures is very nice. I am, however, worried that there may be problems with fitting. The "flat signatures" tend to have problems in this regard, of which Signature 3 is the most relevant one here. With the current methodology it is likely to be overestimated. I am also concerned that the smoking and UV signatures are picked up at non-negligible levels. May I raise a discussion of alternative strategies? I would suggest performing de novo calling on the samples at hand, so that only signatures that are present in the dataset at a certain level are actually used for fitting. Of course this does not solve all problems, but it reduces the misassignment of signatures.

jaclyn-taroni commented 4 years ago

Hi @arpoe, thanks for filing this! I am pinging @cansavvy who initially did the analysis.

I also wanted to direct you to the current version of the mutational signature analysis so you can take a closer look: https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/5f2468daf2756a62c7d4615a7660b0a52e5ad135/analyses/mutational-signatures

When you say:

The "flat signatures" tend to have problems in this regard, of which Signature 3 is the most relevant one here.

You are referring to the previously defined/derived signatures (e.g., COSMIC, Nature) that tend to be flat across datasets, is that correct?

May I raise a discussion of alternative strategies?

Yes please!

I would suggest performing de novo calling on the samples at hand, so that only signatures that are present in the dataset at a certain level are actually used for fitting.

I had a couple of questions regarding this point that I'd love to hear your thoughts on:

  • Part of our rationale for using COSMIC signatures is that they seem to be often used in other datasets/analyses and that would perhaps facilitate some comparison to other datasets (e.g., adult datasets, related to #551). I am not an expert in this area, so it's possible that this line of thinking is misguided. If one were to perform de novo calling on pediatric and adult datasets, what would be the approach for comparison?

  • Is there any floor to the number of samples that are required for that? On a related note, if you have an (unbalanced) mix of disease types like we do in OpenPBTA, do you generally analyze disease types separately?

arpoe commented 4 years ago

Thanks for the link to the analysis. Here are the responses to your questions:

The "flat signatures" tend to have problems in this regard, of which Signature 3 is the most relevant one here.

You are referring to the previously defined/derived signatures (e.g., COSMIC, Nature) that tend to be flat across datasets, is that correct?

By "flat signatures" I mean that the nucleotide changes are not distinct, so the signature plot does not have any spikes. For example, Signature 3 is flat, whereas the APOBEC signatures have distinct nucleotide changes. Flat signatures are (for mathematical reasons) more susceptible to overfitting and to being misassigned to what is actually background, or the sum of several signatures that cannot be separated.

A signature that is present in most tissues (like #1, #5, #18, etc.) would be called ubiquitous. Sorry for the jargon.
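To make the misassignment risk concrete, here is a toy numerical illustration (not part of the OpenPBTA analysis; all values are invented): by cosine similarity, the average of many unrelated signatures is nearly indistinguishable from a perfectly flat signature, while a single spiky signature is clearly distinct.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two signature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# A perfectly flat signature over the 96 trinucleotide channels.
flat = np.ones(96) / 96

# A "spiky" signature: almost all weight on a few channels.
spiky = np.zeros(96)
spiky[[10, 40, 70]] = [0.5, 0.3, 0.2]

# A background formed by averaging 50 random, unrelated signatures.
background = rng.random((50, 96))
background = (background / background.sum(axis=1, keepdims=True)).mean(axis=0)

print(cosine(flat, spiky))       # low: a spiky signature is distinct
print(cosine(flat, background))  # near 1: the flat signature mimics a mixture
```

This is why a fitting algorithm offered a flat signature like Signature 3 can absorb background mutations that actually come from a mixture of other processes.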

May I raise a discussion of alternative strategies?

Yes please!

I would suggest performing de novo calling on the samples at hand, so that only signatures that are present in the dataset at a certain level are actually used for fitting.

I had a couple of questions regarding this point that I'd love to hear your thoughts on:

  • Part of our rationale for using COSMIC signatures is that they seem to be often used in other datasets/analyses and that would perhaps facilitate some comparison to other datasets (e.g., adult datasets, related to #551). I am not an expert in this area, so it's possible that this line of thinking is misguided. If one were to perform de novo calling on pediatric and adult datasets, what would be the approach for comparison?

Yes, this is the benefit of fitting the existing signatures, but doing so without causing artefacts is difficult for some datasets, especially those with lower mutation rates like pediatric cancers. I understand that it would be great to compare things this way, but I am afraid it will be difficult here.

With de novo calling, the identity of the signatures is typically linked back to the known signatures by determining the cosine similarity to the closest known signature. With this approach, some signatures may be split and others may not be separated, which makes direct comparisons difficult. This is clearly the downside of the approach.
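As a minimal sketch of this linking step (illustrative Python with invented toy signatures and names; not the repository's actual code):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two signature vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_to_known(de_novo, known, threshold=0.9):
    """Link each de novo signature to its closest known signature by cosine
    similarity; below the threshold it stays unassigned (None)."""
    matches = {}
    for name, sig in de_novo.items():
        scored = {ref: cosine(sig, ref_sig) for ref, ref_sig in known.items()}
        best = max(scored, key=scored.get)
        matches[name] = best if scored[best] >= threshold else None
    return matches

# Toy 4-channel example (real signatures have 96 channels).
known = {"SBS_A": [0.7, 0.1, 0.1, 0.1], "SBS_B": [0.1, 0.1, 0.1, 0.7]}
de_novo = {"DN1": [0.65, 0.15, 0.1, 0.1],    # close to SBS_A
           "DN2": [0.25, 0.25, 0.25, 0.25]}  # flat: matches nothing well
print(match_to_known(de_novo, known))  # → {'DN1': 'SBS_A', 'DN2': None}
```

The unassigned (`None`) case is exactly the scenario described above: a de novo signature that is a split or an unseparated mixture has no single close match in the reference set.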

  • Is there any floor to the number of samples that are required for that? On a related note, if you have an (unbalanced) mix of disease types like we do in OpenPBTA, do you generally analyze disease types separately?

The minimal number of samples depends on the heterogeneity of mutagenic mechanisms, the number of mutations per sample, and the number of distinct signatures one wants to obtain. When looking at heterogeneous disease types, I like to separate the samples by tissue and call the signatures de novo, while trying not to let the sample numbers get too low. So, for example, I call the signatures for all blood cancers together, but don't separate out the lineages. For pediatric brain tumours, several hundred samples are necessary to be confident about the signatures. The way I do this is largely shaped by Serena Nik-Zainal's views, so I would recommend this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7048622/

So in summary, I would suggest calling signatures de novo organ-wise, which in this case means pooling all the pediatric brain tumours, and using cosine similarity to link the signatures found back to known signatures.

I am also generally considering completely different approaches that ignore tissues, but I don't think they are relevant here, as those approaches are at an early stage.

jaclyn-taroni commented 4 years ago

Thanks for the explanation @arpoe, makes sense!

So in summary, I would suggest calling signatures de novo organ-wise, which in this case means pooling all the pediatric brain tumours, and using cosine similarity to link the signatures found back to known signatures.

Is this something you would be able to contribute? If so, the steps would be 1) filing a pull request (or likely multiple pull requests) adding the new results and 2) where possible using the existing code for the bubble plots you've referenced, that way the figure that will eventually go in the manuscript (per #571) will automatically reflect your new results. Let me know if you have any questions about the process!

arpoe commented 4 years ago

Thanks, yes, I will make the pull request and have a look at this next week. I am currently working on the genome evolution of coronavirus, which has to take priority at the moment.

arpoe commented 4 years ago

I am still planning to work on this this week, starting either today or on the weekend. I have been busy with coronavirus mutagenesis...

jaclyn-taroni commented 4 years ago

Yes, sounds good @arpoe! I tagged you over on #646 about testing mutational signatures for significance because I thought you may have some ideas about approaching that problem.

arpoe commented 4 years ago

I am currently running de novo calling of signatures. It takes a while (about a day, I hope), because I am somewhat limited by infrastructure. I have implemented it so that the calling does not have to be redone when the output file is provided. I will then assign the resulting signatures to the known signatures using cosine similarity. For this I will take the signature lists that are already implemented in the existing script, and possibly others such as the CNS set from the tissue-specific signature lists. The assigned signatures, and any de novo signatures that cannot be directly assigned to a pre-existing signature, will be visualised in the same way as in the existing fitting approach (bubble plots, etc.). I will also attempt to interpret the de novo signatures.
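The "skip the expensive step when its output file already exists" pattern described here could look like the following generic sketch (function and file names are hypothetical, not from the repository):

```python
import pickle
from pathlib import Path

def run_or_load(compute, out_path):
    """Run an expensive computation only if no cached output file exists;
    otherwise load and return the cached result."""
    path = Path(out_path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()  # the long-running step, e.g. de novo extraction
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

Usage would be along the lines of `signatures = run_or_load(lambda: extract_de_novo(counts), "de_novo_signatures.pkl")`, where `extract_de_novo` stands in for whatever day-long extraction is being cached.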

jaclyn-taroni commented 4 years ago

Thanks for the update @arpoe! Let us know if you need anything!

arpoe commented 4 years ago

I am sorry for the delay. I have now looked more closely at the current analysis, and the fitting is indeed causing some issues. The newest COSMIC set (COSMIC v3) was not yet included; however, it is likely to make things even worse, because the algorithm will have more signatures to choose from. To provide an alternative, I have implemented de novo calling of signatures, which reveals 10 known signatures. I have linked them back to known signatures and implemented output of the same kinds of plots as before. Of course this is not a perfect solution either, as not all signatures that may be present are picked up, so it is a much more conservative approach than fitting. I was also thinking about fitting a published CNS-specific signature set as a compromise; however, this has less resolution than the de novo calling approach. Please let me know what you think. I am also happy to polish this analysis further if you think it is worth proceeding in this direction.

aadamk commented 3 years ago

Just thought to reopen this discussion of optimizing the mutational signatures analysis with respect to OpenPBTA.

A benchmarking paper last year highlighted more favorable performance when utilizing Bayesian NMF approaches as opposed to the original NMF modeling (though I am not sure which de novo method was leveraged above). Further, sigfit provides a framework to execute the above, enabling optimal selection of mutational signatures based on a Bayesian HPD (highest posterior density) interval.

Given the Nik-Zainal paper linked earlier in this thread, it seems an optimal approach, if the primary goal is to assess the exposure of known signatures, would be:

  1. Leverage bayesian NMF fitting to known CNS signatures.
  2. Filter signatures based on a 'sufficiently non-zero' cutoff (default = 0.1) using the lower HPD estimate.
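A rough, self-contained sketch of this two-step idea, substituting a simple multiplicative-update non-negative least-squares fit and a point-estimate cutoff for sigfit's Bayesian NMF and lower-HPD filter (all names and values are illustrative):

```python
import numpy as np

def fit_exposures(counts, signatures, n_iter=2000):
    """Fit fixed, known signatures (channels x K) to one sample's mutation
    counts via non-negative multiplicative updates; returns relative exposures."""
    S = np.asarray(signatures, float)   # (channels, K)
    m = np.asarray(counts, float)       # (channels,)
    e = np.full(S.shape[1], m.sum() / S.shape[1])  # uniform start
    for _ in range(n_iter):
        e *= (S.T @ m) / (S.T @ (S @ e) + 1e-12)
    return e / e.sum()

def filter_exposures(exposures, names, cutoff=0.1):
    """Keep only signatures with a 'sufficiently non-zero' exposure.
    (sigfit filters on the lower HPD bound; a point estimate is used here.)"""
    return {n: x for n, x in zip(names, exposures) if x >= cutoff}

# Toy demo: two 4-channel signatures with true relative exposures 0.8 / 0.2.
S = np.array([[0.7, 0.1],
              [0.1, 0.1],
              [0.1, 0.1],
              [0.1, 0.7]])
counts = S @ np.array([800.0, 200.0])
exposures = fit_exposures(counts, S)
print(filter_exposures(exposures, ["SBS_A", "SBS_B"], cutoff=0.1))
```

The point of step 2 in either variant is the same: an exposure whose plausible lower bound is effectively zero should not be reported as present.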

An alternative approach, leveraging a de novo method in sigfit, could be (incorporating some suggestions from Sebastian Waszak of NCMM):

  1. Call de novo signatures in a cohort of tumors
  2. With K=2-10, pick number of signatures based on goodness-of-fit (elbow / change point)
  3. Match de novo signatures with known CNS-specific signatures (using a cosine similarity cutoff of 0.9), and re-fit each tumour sample with sigfit using the CNS-specific signatures.
  4. Convert to known reference signatures using the conversion matrix in the Nik-Zainal paper.
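Steps 1 and 2 could be sketched as follows, using a hand-rolled multiplicative-update NMF in place of sigfit's sampler (illustrative only; a real analysis would use sigfit or a comparable package, and step 3 would reuse the cosine-similarity matching discussed earlier in the thread):

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0):
    """Frobenius-norm NMF via multiplicative updates:
    V (samples x channels) ≈ E (samples x k) @ S (k x channels)."""
    rng = np.random.default_rng(seed)
    E = rng.random((V.shape[0], k)) + 1e-6
    S = rng.random((k, V.shape[1])) + 1e-6
    for _ in range(n_iter):
        S *= (E.T @ V) / (E.T @ E @ S + 1e-12)
        E *= (V @ S.T) / (E @ S @ S.T + 1e-12)
    return E, S

def reconstruction_errors(V, k_range=range(2, 11)):
    """Step 2: fit each K in 2..10 and report the error curve; the
    elbow / change point in this curve guides the choice of K."""
    errors = {}
    for k in k_range:
        E, S = nmf(V, k)
        errors[k] = float(np.linalg.norm(V - E @ S))
    return errors
```

On data generated by K underlying processes, the error curve drops sharply up to the true K and flattens afterwards, which is what the goodness-of-fit elbow in step 2 exploits.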

Thoughts?

arpoe commented 3 years ago

Hi @aadamk, this is almost precisely what I have done :-) There were some problems reintegrating the branch into the pipeline, because I had separated the signature calling itself out of the pipeline, since it generates large files and is quite compute-intensive. This caused some issues that we still have to solve. @jaclyn-taroni, have there been any developments? Is there something I can do?

aadamk commented 3 years ago

I see, thanks for the clarification @arpoe!

jaclyn-taroni commented 3 years ago

@aadamk you may find taking a look at the corresponding PR (#678) helpful! @arpoe I had prioritized subtyping issues that are blocking the next data release for this project, but since someone else has now picked up #509 I am hoping to revisit this in the next few days. (I've started a branch: https://github.com/jaclyn-taroni/OpenPBTA-analysis/tree/jaclyn-taroni/de-novo-mut-sig.)

sjspielman commented 2 years ago

For further discussion of mutational signatures, see these issues: #1173, #1220, #1248

Several PRs were merged to close this out. In the end, there remained convergence problems with de novo extraction, so we instead fit 8 known adult CNS signatures to the data.

#678 and #799 (closed without merge)

#806

#811

#974, #1018, #1100 (stacked)

#1190

#1192

#1226 and #1227 (stacked)