NKI-CCB / DISCOVER

DISCOVER co-occurrence and mutual exclusivity analysis for cancer genomics data
Apache License 2.0

Samples with high mutation rates #24

Open kw10 opened 9 months ago

kw10 commented 9 months ago

Hello

I was wondering if DISCOVER is suitable for samples with high mutation rates (e.g. skin tumour samples with UV mutational signatures; ~5-10 mutations/Mb). I have a cohort that seems to have TP53 and RB1 co-mutated (which is common in several cancer types), but the FDR (BH correction) is 0.56 (16 TP53 mutations, 13 of which co-occur with an RB1 mutation; p-value 0.004). I was wondering if this is because most of these samples have a high mutation rate. If so, is there a way I can adjust for this when running DISCOVER?

Thanks! Kim

scanisius commented 8 months ago

DISCOVER can be applied to samples with high mutation rates. But indeed that particular situation may affect the power for detecting co-occurrence. DISCOVER's background model is based on the assumption that a gene is more likely to be mutated if a tumour has many mutations overall. A consequence of this assumption is that in highly mutated tumours you would expect to find many pairs of co-mutated genes purely by chance. So for a co-occurrence to reach significance the evidence against such a chance co-mutation needs to be stronger than in tumours with few mutations. That being said, apparently your data actually provides such evidence for TP53 and RB1. The nominal p-value of 0.004 indicates there is reason to assume that the co-mutation of those genes cannot be explained by chance alone. Since your sample size does not seem to be extremely high, this is not that bad a p-value.
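To make the point about chance co-mutation concrete, here is a minimal sketch in Python. It is not DISCOVER's code: it only assumes that each tumour has its own mutation probability per gene and that the two genes are independent within a tumour. The probabilities are hypothetical. Both scenarios give each gene an average mutation probability of 0.3, but in the second scenario half of the tumours are highly mutated.

```python
def overlap_distribution(probs):
    """Distribution of the number of co-mutated tumours, given each
    tumour's chance co-mutation probability (independent tumours)."""
    dist = [1.0]  # dist[k] = P(exactly k co-mutated tumours so far)
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)   # this tumour is not co-mutated
            nxt[k + 1] += mass * p     # this tumour is co-mutated
        dist = nxt
    return dist


def tail(dist, k):
    """P(number of co-mutated tumours >= k)."""
    return sum(dist[k:])


# Hypothetical per-tumour mutation probabilities, used for both genes.
homogeneous = [0.3] * 10                 # every tumour alike
heterogeneous = [0.1] * 5 + [0.5] * 5    # five quiet, five highly mutated

# Within a tumour, chance co-mutation probability is the product.
co_hom = [p * p for p in homogeneous]
co_het = [p * p for p in heterogeneous]

expected_hom = sum(co_hom)
expected_het = sum(co_het)
tail_hom = tail(overlap_distribution(co_hom), 3)
tail_het = tail(overlap_distribution(co_het), 3)

print(expected_hom, expected_het)
print(tail_hom, tail_het)
```

With the same average mutation frequency, the heterogeneous cohort has a higher expected number of chance co-mutations, and seeing three or more co-mutated tumours is less surprising. That is why an observed overlap counts as weaker evidence when it sits in highly mutated tumours.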

You do seem to suffer from a high multiple testing penalty, which is probably due to the fact that you test many pairs of genes. Are all of those genes actually of interest in your analysis, or could you make a more specific selection? That may reduce the multiple testing burden, and give you lower false discovery rates for the strongest co-occurrences in your data. When you say BH correction, is that the method you use with the pairwise_discover_test function as opposed to the default DBH? In that case, using DBH will also give lower FDRs, since it is a better fit for discrete test statistics like DISCOVER's. It is quite a bit slower though.
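As a rough illustration of the multiple testing penalty, the sketch below implements plain Benjamini-Hochberg (not DBH) in Python and applies it to a top p-value of 0.004 among a varying number of tests. The other p-values are hypothetical placeholders spread evenly between 0.05 and 1, and the test counts of 140 and 10 are made up for the example; the number of gene pairs actually tested in this thread is not stated.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the result stays monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def toy_pvalues(n_tests, top_p=0.004):
    """One strong p-value plus hypothetical unremarkable ones."""
    others = [0.05 + 0.95 * j / (n_tests - 1) for j in range(1, n_tests)]
    return [top_p] + others


fdr_many = bh_adjust(toy_pvalues(140))[0]  # many gene pairs tested
fdr_few = bh_adjust(toy_pvalues(10))[0]    # a focused selection

print(fdr_many, fdr_few)
```

In this toy setting the same nominal p-value of 0.004 is adjusted to about 0.56 when it is one of 140 tests (0.004 × 140) and to about 0.04 when it is one of 10 (0.004 × 10). The selection of genes has to be made independently of the test results for the adjusted values to remain meaningful.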

One other thing you might try is adjusting for subtypes of the tumours, if there are any. If your set of tumours consists of several different tumour subtypes that have different mutational profiles, it may be helpful to adjust your DISCOVER analysis for those subtypes. You can do so by passing a vector of subtype labels matching the columns in your mutation matrix to the function that constructs the background model: DiscoverMatrix(mut_matrix, strata=subtype_labels) in Python, or discover.matrix(mut_matrix, strata=subtype_labels) in R.

kw10 commented 8 months ago

Hi Sander,

Thanks for your reply and suggestions! I've now tried using DBH and BH (with pairwise.discover.test), but TP53 and RB1 did not come up significant. I then tried limiting my genes to COSMIC Cancer Gene Census genes. I also tried splitting my cohort into somewhat arbitrary groups of high and low UV signature activity, and removing the top 5 samples with the highest number of mutations. I tried these in various combinations. Unfortunately nothing is coming up significant.

Something that did work was providing a subset of genes that are mutated in more than 5 samples, but only if I included additional samples. The TP53/RB1 co-mutation is in a malignant tumour type, and we also have variants from its benign counterpart. TP53 and RB1 come up as significant if I use all of the samples from both cohorts and require that a gene is mutated in more than 5 samples. My guess is that it is significant because I am now including tumours with low mutation rates (the benign tumours) on top of comparing fewer gene pairs (I did not use subtype labels in this case). If I run DISCOVER with the requirement of more than 5 samples on the malignant cohort only, TP53/RB1 is not significant. Does combining these cohorts make sense? (It feels like maybe I am massaging the data too much?)

Thanks again! Kim

scanisius commented 8 months ago

It is difficult for me to say whether combining with the benign tumour type makes sense. It depends on your research question and also requires more insight into the data than I have. There are some other thoughts I can share with you though. For a gene pair to be found as a significant co-occurrence, it needs to be the case that if gene 1 is mutated, gene 2 tends to be mutated too, but also that if gene 1 is not mutated, gene 2 tends not to be mutated either. So for optimal statistical power, you'd need both samples with mutations in those genes and samples without mutations in those genes. Could it be that in your malignant cohort, most tumours have a mutation in TP53 and RB1? If so, there may not be enough samples without mutations in TP53 and/or RB1. Then adding the benign tumours may provide more samples without those mutations. But again, I can't tell whether that makes sense for your research question.

As a second observation: when you say genes with mutations in more than 5 samples, do you mean that those are the genes that you test for co-occurrence? It is advisable to have a higher threshold here. Because of statistical power it is very unlikely that you'll find significant co-occurrences involving genes with only a handful of mutations, but you are paying a price for them in terms of multiple testing. My suggestion is to require at least a few dozen mutations.

kw10 commented 8 months ago

Thank you, Sander, that is very helpful. In our case the malignant cohort has 23 samples; TP53 is mutated in 16 and RB1 in 13 (all co-mutated with TP53). The benign cohort has 42 samples. So it seems it is indeed the case you described, where I don't have enough samples without TP53/RB1 mutations.
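To see how extra samples without the mutations change the picture, here is a small Python sketch using a one-sided Fisher's exact test as a stand-in. This is not DISCOVER's test: Fisher's test treats all samples as exchangeable and ignores per-tumour mutation rates, so its p-values will differ from DISCOVER's. The malignant counts are the ones given above. For the combined cohort the sketch assumes that none of the 42 benign samples carries a TP53 or RB1 mutation, which is purely a hypothetical simplification.

```python
from math import comb


def fisher_cooccurrence_p(n_samples, n_gene1, n_gene2, n_both):
    """One-sided hypergeometric tail: P(overlap >= n_both) when the
    mutations of the two genes are spread over samples at random."""
    total = comb(n_samples, n_gene2)
    upper = min(n_gene1, n_gene2)
    favourable = sum(
        comb(n_gene1, k) * comb(n_samples - n_gene1, n_gene2 - k)
        for k in range(n_both, upper + 1)
    )
    return favourable / total


# Malignant cohort only: 23 samples, TP53 in 16, RB1 in 13, 13 co-mutated.
p_malignant = fisher_cooccurrence_p(23, 16, 13, 13)

# Combined cohort, ASSUMING the 42 benign samples have neither mutation.
p_combined = fisher_cooccurrence_p(23 + 42, 16, 13, 13)

print(p_malignant, p_combined)
```

The overlap is identical in both calls; only the number of samples without either mutation changes, and that alone makes the combined p-value much smaller. Whether pooling the cohorts is appropriate remains a question about the study design, not something this calculation can settle.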

Regarding the threshold I described, I'm referring to the selection of a subset of genes. For example, in the DISCOVER vignette, all genes are used to calculate the background mutation rate, but then a subset of genes is analysed:

subset <- rowSums(BRCA.mut) > 25

In our case we have only 23+42=65 samples and the benign tumours are rather "quiet", so increasing the requirement from 5 leaves very few gene pairs to test.

Thanks again for your help.

Kim