AlexandrovLab / SigProfilerExtractor

SigProfilerExtractor allows de novo extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.

Question regarding statistical power with small numbers of samples and mutations #180

Closed: mattjmeier closed this issue 1 year ago

mattjmeier commented 1 year ago

Hello,

Thanks for developing an excellent set of tools for this important field.

This is not a code issue but more of an operational question, so I apologize for opening an issue; however, your team seems responsive here and I thought others might find the discussion useful. I'm not sure what other forum would be appropriate, but feel free to suggest an alternative.

I have a few questions relating to small sample sizes. We are interested in understanding the mutational processes in the context of chemical exposures, using data produced from mutation assays (primarily Duplex Sequencing, currently) to gain mechanistic insights.

1) Do you have any recommendations about the minimum number of samples and the minimum number of mutations one should use for de novo signature extraction? We are aware that to extract the COSMIC signatures you are dealing with massive mutational catalogs encompassing thousands of tumors and millions of mutations. Discussing this with Ludmil in the past has led me to understand that it isn't really appropriate to use a tool like SigProfilerExtractor when the number of samples, groups, and mutations is comparatively small. The datasets we are talking about have on average 3 to 6 replicates per group, and somewhere between hundreds and thousands of mutations per sample. So for a case like this, is the recommended approach to use only SigProfilerAssignment? If I use the SigProfilerExtractor function to extract a de novo "signature" for one of these smaller datasets, the COSMIC-based decomposition of these signatures is much cleaner (i.e., fewer signatures are reported to reconstruct my data) than if I only use SigProfilerAssignment. This seems to suggest that de novo extraction may be preferable, but I'm not certain it is appropriate. Do you have any comment on this?

2) Comparisons between exposure types: this is another point where I would appreciate advice. Again, there are a few choices on how to approach this, and they won't give the same answer. Would you consider it a valid analytical approach to aggregate mutations from biological replicates (i.e., different individuals or different animals), in order to obtain enough mutations for de novo extraction?

3) Taking into account depth of sequencing for genomic regions: for Duplex Sequencing, we have seen people normalize by the sequencing depth of each of the 32 trinucleotide contexts to account for variability in the sampling of different genomic regions. Using other signature fitting algorithms, we have attempted to do the same thing, but in the case of SigProfilerExtractor, I believe you expect raw count data. Do you have any opinions on whether I should attempt to perform such a normalization prior to using SigProfilerExtractor?
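For concreteness, the kind of normalization we have been trying looks roughly like the sketch below (file names and layout are hypothetical; it assumes per-context duplex depths are available alongside a standard SBS-96 count matrix):

```python
# Hypothetical sketch: divide each of the 96 mutation-type counts by the duplex
# depth of its central trinucleotide context (32 pyrimidine-centred contexts),
# then rescale so the total mutation burden per sample is preserved.
import pandas as pd

counts = pd.read_csv("sbs96_matrix.txt", sep="\t", index_col=0)               # 96 types x samples
depth = pd.read_csv("context_depth.txt", sep="\t", index_col=0)["depth"]      # depth per 32 contexts

# Map each mutation type, e.g. "A[C>T]G", to its trinucleotide context "ACG"
context = counts.index.str.replace(r"\[([ACGT])>[ACGT]\]", r"\1", regex=True)

rates = counts.div(depth.loc[context].to_numpy(), axis=0)         # mutations per duplex base
normalized = rates * (counts.sum(axis=0) / rates.sum(axis=0))     # restore per-sample totals
```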

Thanks again! Matt

lalexandrov1018 commented 1 year ago

Dear Matt,

Thank you for the nice words about our tools! Also, my apologies for the late response -- everything has been quite busy in the last few weeks.

I am not sure that I have a really good answer to your questions. Indeed, we also struggle with these types of questions. In most cases, one would expect to have a very small number of replicates (less than 10 per model) for experimental data. I have listed below three possible ways to examine these data.

First, one can examine these samples by creating an average profile after subtracting the background rate from each experiment (i.e., the somatic mutations that have accumulated in an unexposed control). This could be an effective approach if the background rate is stable (e.g., SBS18) and the chemical exposure results in a consistent mutational signature different from the background (e.g., SBS22). Whether one should subsequently match/decompose the chemical exposure signature to COSMIC signatures is a separate question (see below).
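As a rough illustration only (the sample names and the zero-flooring are assumptions of the sketch, not a definitive recipe), this first approach amounts to something like:

```python
# A rough illustration of approach 1 (sample names are hypothetical):
# average the profiles of unexposed controls, subtract that background from
# each exposed replicate, floor negatives at zero, and average the residuals.
import pandas as pd

matrix = pd.read_csv("sbs96_matrix.txt", sep="\t", index_col=0)   # 96 types x samples
controls = ["control_1", "control_2", "control_3"]
exposed = ["exposed_1", "exposed_2", "exposed_3"]

# Work with normalized profiles so differences in mutation burden do not dominate
profiles = matrix / matrix.sum(axis=0)
background = profiles[controls].mean(axis=1)

residual = profiles[exposed].sub(background, axis=0).clip(lower=0)
exposure_signature = residual.mean(axis=1)
exposure_signature /= exposure_signature.sum()   # renormalize to a probability vector
```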

A second approach would be to directly assign the known set of COSMIC signatures to each sample with a tool like SigProfilerAssignment. Usually, we would add an extra background signature composed of the average pattern of somatic mutations that have accumulated in an unexposed control. This works very well when the chemical exposure signature is similar to COSMIC signatures; however, I am not sure whether this makes sense most of the time. Presumably, in most cases, many COSMIC signatures are in fact mixtures of multiple chemical exposure signatures, and one should be doing it the other way around. Specifically, if we have the mutational signatures of all known chemical exposures, one would be trying to decompose each COSMIC signature into these chemical exposure signatures.
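As a rough sketch of this second approach (the file names are hypothetical, and the parameter names follow recent SigProfilerAssignment releases, so please check the documentation for your version):

```python
# Sketch of approach 2: append an experiment-specific background signature to a
# reference signature table and fit that custom set to every sample.
import pandas as pd
from SigProfilerAssignment import Analyzer as Analyze

cosmic = pd.read_csv("COSMIC_reference_SBS96.txt", sep="\t", index_col=0)         # 96 x signatures
background = pd.read_csv("background_profile.txt", sep="\t", index_col=0)["Background"]

custom = cosmic.copy()
custom["Background"] = background                 # extra background signature
custom.to_csv("custom_signatures.txt", sep="\t")

Analyze.cosmic_fit(samples="sbs96_matrix.txt",
                   output="assignment_out",
                   input_type="matrix",
                   signature_database="custom_signatures.txt")
```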

A third approach is to run a de novo extraction of mutational signatures with a tool like SigProfilerExtractor. In principle, this should work quite well if one has a sufficient number of samples (>100 whole-genomes) and/or a sufficient number of mutations within a sample (e.g., >10k from UV-light). We have done it successfully for large-scale initiatives (e.g., PMID: 32989322) and for smaller analyses with potent mutagens. Unfortunately, this will not work for small datasets or for weak mutagens. Note that internally SigProfilerExtractor now utilizes SigProfilerAssignment to decompose all de novo signatures to known COSMIC signatures.
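A minimal sketch of such a run (hypothetical paths; the small signature range is only for illustration, and the parameter names are as in recent SigProfilerExtractor releases, so please check the documentation for your version):

```python
# Sketch of approach 3: de novo extraction over a small range of candidate
# signature numbers; the COSMIC decomposition of each de novo signature is
# handled internally via SigProfilerAssignment.
from SigProfilerExtractor import sigpro as sig

sig.sigProfilerExtractor(input_type="matrix",
                         output="denovo_out",
                         input_data="sbs96_matrix.txt",
                         reference_genome="GRCh37",
                         minimum_signatures=1,
                         maximum_signatures=5,    # small range chosen only for illustration
                         nmf_replicates=100)
```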

Regarding the numbers of samples, I would recommend using triplicates and, if these are discordant (e.g., you have samples with more than a 2x difference in mutations), moving to 5 or more replicates. In all cases, I would suggest doing these replicates within the same system (e.g., same cell line), as we have data showing that the same mutagen can induce completely different signatures in different cell lines (e.g., because of failure of some repair processes in a cell line) and also different signatures in mouse tumors (e.g., presumably due to selection).

Regarding duplex sequencing, this would very much depend on the duplex sequencing protocol. I would recommend using an approach that covers a large portion of the genome (e.g., NanoSeq), as the need for correction becomes negligible and you can probably ignore it. If you are using a panel duplex approach, then you can perform all analyses listed above with the exception of the comparison with COSMIC signatures. For this comparison, I would suggest correcting the COSMIC signatures based on the trinucleotide frequency of the panel and then performing any comparisons.
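As an illustration of that correction (the frequency files are hypothetical), one could rescale each COSMIC signature by the panel-to-genome ratio of trinucleotide frequencies and renormalize:

```python
# Sketch of the panel correction: rescale each COSMIC signature by the ratio of
# the panel's trinucleotide frequencies to the whole-genome frequencies, then
# renormalize each signature to sum to 1 before comparing it with panel-derived
# profiles.
import pandas as pd

cosmic = pd.read_csv("COSMIC_reference_SBS96.txt", sep="\t", index_col=0)            # 96 x signatures
genome_freq = pd.read_csv("genome_trinuc_freq.txt", sep="\t", index_col=0)["freq"]   # 32 contexts
panel_freq = pd.read_csv("panel_trinuc_freq.txt", sep="\t", index_col=0)["freq"]

# Map each mutation type, e.g. "A[C>T]G", to its trinucleotide context "ACG"
context = cosmic.index.str.replace(r"\[([ACGT])>[ACGT]\]", r"\1", regex=True)
weight = (panel_freq / genome_freq).loc[context].to_numpy()

panel_adjusted = cosmic.mul(weight, axis=0)
panel_adjusted /= panel_adjusted.sum(axis=0)      # each corrected signature sums to 1
```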

I hope this response makes sense. I will close this ticket now and am happy to discuss more over a Zoom call if needed.

Best wishes,

Ludmil

mattjmeier commented 1 year ago

Hi Ludmil,

Thank you so much for taking the time to reply! I appreciate the thought you have given the questions.

Very interesting idea about using individual chemical mutational profiles to "reconstruct" the COSMIC signatures themselves. I will certainly consider that.

Regarding duplex sequencing... we are indeed using the "mutagenesis panel" with which I am sure you're familiar. The panel's trinucleotide frequency was designed to be reasonably representative of the respective mouse or human genome, but yes, it makes sense to normalize the signatures rather than the data, as you have suggested previously (and as we have done for the lacZ gene). That gives me more food for thought.

All the best, Matt