Request for help with statistically significant PBTA scores

letitiaismyname commented 4 years ago

What data file(s) does this issue pertain to?

nature_signatures_results.tsv cosmic_signatures_results.tsv

What release are you using?

V11 Release (#293)

Put a link to the relevant section of the OpenPBTA-manuscript here.

https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/mutational-signatures/results

Put your question or report your issue here.

Hi! My name is Letitia Mueller and I am a PhD student at the UCSC Genomics Institute, doing analysis on your mutational signature score data. Thank you for providing this data! I am interested in figuring out a statistical threshold for the signature scores (column 'num_mutations' in file 'nature_signatures_results.tsv'), so that I can look at only the significant ones. Many scores come in at 0, some are single digits or low double digits, whereas others are in the hundreds. Any chance you could suggest a method by which to check which are significant? Thank you very much for your help!

jaclyn-taroni commented 4 years ago

Hi @letitiaismyname 👋 welcome – there is some discussion over on #636 about limitations of our current approach to mutational signatures that I wanted to bring to your attention. I'm going to tag @arpoe to weigh in here because of relevant expertise!

letitiaismyname commented 4 years ago

Thank you all! I'm assuming that the zero values are non-significant :) But for the smaller ones, I'm not sure...

letitiaismyname commented 4 years ago

Hi @arpoe! Please let me know, would I be able to pick your brain for a couple of questions about figuring out a statistical threshold? :) Thanks again!

arpoe commented 4 years ago

Hi @letitiaismyname, sure, no problem. I am halfway through implementing de novo calling. I think just fitting signatures is problematic with these data. Could you please let me know exactly what you mean by statistical threshold? I am sorry that all this may take a bit longer than it usually would, but I am also working on mutagenesis of SARS-CoV-2, and with this I am very busy both with programming and with managing a group of volunteer scientists from my couch.

letitiaismyname commented 4 years ago

Wow!! Sounds awesome, thank you for your service!

Sure! Here's a histogram of the data I got from the github, where x is the mutational signature score value ('num_mutations'), and y is the number of samples that fell into a certain category. https://ibb.co/L0X6gk1 Often I see a lot of data in the first bucket, because many numbers are zero, or single digits. I'm wondering how I could come to a cut-off value in the x axis for data that is statistically significant. Thanks again for helping me out :)

arpoe commented 4 years ago

sorry, the link is dead.... could you please repost?

jaclyn-taroni commented 4 years ago

The link works for me when I copy and paste the URL - sticking that image here for convenience:

arpoe commented 4 years ago

could you increase the resolution of the x-axis? I think here it would be difficult to assess the distribution, because most samples fall into 0-40, which are binned together.
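
A minimal sketch of what finer binning in the low range could look like, assuming the 'num_mutations' values have already been loaded into a plain list. The function name `bin_scores`, the bin width of 2, the upper limit of 40, and the example values are all hypothetical and only illustrate the idea of counting zeros separately and zooming in on the crowded region.

```python
from collections import Counter


def bin_scores(scores, bin_width, upper=None):
    """Count nonzero scores in fixed-width bins; zeros are tallied separately.

    Bin index k covers the half-open interval [k * bin_width, (k + 1) * bin_width).
    Scores above `upper` (if given) are left out so the low range stays readable.
    """
    zeros = sum(1 for s in scores if s == 0)
    bins = Counter()
    for s in scores:
        if s == 0 or (upper is not None and s > upper):
            continue
        bins[int(s // bin_width)] += 1
    return zeros, dict(sorted(bins.items()))


if __name__ == "__main__":
    # Hypothetical values standing in for one signature's 'num_mutations' column.
    example_scores = [0, 0, 1.5, 3.5, 3.6, 39.9, 120]
    n_zero, counts = bin_scores(example_scores, bin_width=2, upper=40)
    print("zeros:", n_zero)
    for k, n in counts.items():
        print(f"[{k * 2}, {(k + 1) * 2}): {n}")
```

Keeping the zeros out of the first bin makes it easier to see whether the small nonzero scores form their own cluster or simply tail off.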

letitiaismyname commented 4 years ago

Sure! Let me re-run the analysis and make the buckets a bit smaller, and I will find a similar example (and I will also specify the mutational signature number, and cancer type, which I forgot to do this time :) )

letitiaismyname commented 4 years ago

Ok, here's an example of just Signature 3 in the LGAT tumor type. There are now 50 buckets, so in this case each bucket spans 3.5. I realize that nearly all data points in the first bucket are 0, so I should probably disregard those, or perhaps take them out altogether... But the values just above (in the 3.5-7 or 10.5-14 bucket, for example), are those signature scores large enough to be considered statistically significant? Any chance you could help me understand where the cut-off point would be? Would it be different depending on the range of signature scores for that signature number or tissue type, or is it a hard cut-off? https://ibb.co/cN0T82w

arpoe commented 4 years ago

Sorry, again I cannot open your picture. I don't know why.

jaclyn-taroni commented 4 years ago

@letitiaismyname if that picture is a PNG, you can drag and drop it into the text box and GitHub should automagically format the Markdown.

letitiaismyname commented 4 years ago

Ah sorry about that, thanks for the tip! Signature 3-LGAT

arpoe commented 4 years ago

Sorry, but I am afraid there is no easy answer to this. I think one of the important things to keep in mind is that even the assignment of each signature carries a certain uncertainty due to the fitting. This uncertainty differs a lot between signatures, and fitting works better or worse depending on the total number of mutations in the sample. In pediatric brain cancers, these numbers are frequently low, which makes things more difficult. Regarding signatures, S3 for example is among the most difficult ones, because it resembles random background. Therefore, especially for this one, the question is very hard to answer.

You are trying to find a cut-off in a distribution that is made of distributions, which are all poorly understood. If you wanted "statistical significance" in its original sense, you would actually have to build a quite complicated model. I am not 100% sure about the score that this particular package gives. But if you include the score in the plot above, you might find a way to determine a suitable cut-off based on the expectation of which signatures should be there. You could also simulate the combination of several signatures and use them as a toy example to determine with which settings you get the desired result. Here you have the benefit that you know the outcome exactly, so you can optimise the cut-off systematically. I would also recommend looking into the cut-off rules in the package "sigfit", which I generally find quite suitable.

For a general overview of the signatures, I would recommend looking into the original work by Ludmil Alexandrov and Serena Nik-Zainal, as well as the two very recent papers by Serena Nik-Zainal's lab. These last papers also touch on the overfitting problems and recommend strategies to deal with these issues.
(By the way, I am in Europe, so please excuse any typos when I am answering this late ;-))
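
The toy simulation suggested above could be sketched roughly as follows. Everything here is an assumption for illustration: the three signatures over six channels are made up (they are not COSMIC or Nature signatures), the fit is a plain EM estimate of mixture proportions with fixed components rather than whatever the package in this repository does, and the names `simulate_counts`, `fit_exposures`, `scan_cutoffs` and `run_toy_experiment` are hypothetical. The idea is to simulate samples where a flat, background-like signature is truly absent or truly present, fit both, and see how often each candidate cut-off calls it.

```python
import random

# Made-up signatures over 6 channels: one flat (background-like), two peaked.
TOY_SIGNATURES = [
    [1 / 6] * 6,
    [0.7, 0.1, 0.05, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.1, 0.7],
]


def simulate_counts(signatures, exposures, rng):
    """Draw per-channel mutation counts from a known mixture of signatures."""
    n_channels = len(signatures[0])
    counts = [0] * n_channels
    for sig, n_mut in zip(signatures, exposures):
        if n_mut == 0:
            continue
        for channel in rng.choices(range(n_channels), weights=sig, k=n_mut):
            counts[channel] += 1
    return counts


def fit_exposures(counts, signatures, n_iter=200):
    """Estimate mutations per signature by EM for a mixture with fixed components."""
    k = len(signatures)
    total = sum(counts)
    if total == 0:
        return [0.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        new = [0.0] * k
        for i, c in enumerate(counts):
            parts = [weights[j] * signatures[j][i] for j in range(k)]
            denom = sum(parts)
            if c == 0 or denom == 0:
                continue
            for j in range(k):
                new[j] += c * parts[j] / denom
        s = sum(new)
        if s == 0:
            break
        weights = [x / s for x in new]
    return [w * total for w in weights]


def scan_cutoffs(cutoffs, fitted_absent, fitted_present):
    """For each cut-off, return (cutoff, false positive rate, true positive rate)."""
    rows = []
    for c in cutoffs:
        fpr = sum(v > c for v in fitted_absent) / len(fitted_absent)
        tpr = sum(v > c for v in fitted_present) / len(fitted_present)
        rows.append((c, fpr, tpr))
    return rows


def run_toy_experiment(n_samples=200, seed=0):
    """Fit the flat signature in samples where it is absent vs. present."""
    rng = random.Random(seed)
    absent, present = [], []
    for _ in range(n_samples):
        # Low total mutation counts, chosen arbitrarily for the toy example.
        c0 = simulate_counts(TOY_SIGNATURES, [0, 40, 20], rng)
        c1 = simulate_counts(TOY_SIGNATURES, [15, 40, 20], rng)
        absent.append(fit_exposures(c0, TOY_SIGNATURES)[0])
        present.append(fit_exposures(c1, TOY_SIGNATURES)[0])
    return scan_cutoffs([0, 2, 5, 10, 15, 20], absent, present)


if __name__ == "__main__":
    for cutoff, fpr, tpr in run_toy_experiment():
        print(f"cutoff={cutoff:>2}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Because the true exposures are known, the cut-off can be chosen to hit an acceptable false positive rate. The result depends strongly on the total number of mutations and on how similar the signatures are, so the simulated settings would need to resemble the real data before the cut-off means much.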

letitiaismyname commented 4 years ago

Ah ok! Thank you so much Anna, I appreciate your explanation and help! I will check out sigfit and look into overfitting issues in the Alexandrov paper 👍 💯

jaclyn-taroni commented 4 years ago

Hi @letitiaismyname - wanted to bring @arpoe's pull request for de novo mutational signature calling to your attention: #678

sjspielman commented 2 years ago

Closing because the de novo extraction continued to experience convergence problems and is effectively deprecated, as described here.
