NKI-CCB / DISCOVER

DISCOVER co-occurrence and mutual exclusivity analysis for cancer genomics data
Apache License 2.0

Multi-omics analysis and FDR filtering #20

Closed Tesson98 closed 1 year ago

Tesson98 commented 1 year ago

Dear @scanisius,

Thanks for releasing this powerful tool.

I recently read a paper that uses DISCOVER to analyse mutual exclusivity between virus infection and gene mutations (Zapatka M, Borozan I, Brewer DS, et al. The landscape of viral associations in human cancers. Nat Genet. 2020;52(3):320-330. doi:10.1038/s41588-019-0558-9), which raised my interest in the relationship between gene fusions and gene mutations. I still have some questions and hope to get some suggestions from you.

First, I know DISCOVER was developed for genome-wide mutation analysis. In your view, is it also suitable for a genome-wide gene-fusion analysis? (I have more than 5000 gene fusions in a single tumor type.)

Second, if DISCOVER is suitable for gene-fusion analysis, can I apply it to combined data of gene fusions and gene mutations? (I only have data for 55 gene mutations, but the method's R introduction notes that "Even if only a subset of genes will subsequently be used in the analysis, a whole-genome view of the mutations is required for this first step".) Can I use this method to analyse exclusivity among gene fusions, among gene mutations, and between gene fusions and mutations? Should I create a matrix with rows corresponding to fusions and mutations, and columns to samples (example below)?

Finally, the method's R introduction notes that "To get the pairs of genes which are significantly mutually exclusive, we can use the as.data.frame method. This method, too, takes an optional FDR threshold as argument." However, when I use as.data.frame(results_myfusionandgene_ex), I can't find the argument for choosing the FDR threshold. Could you help me with this problem?

Thanks again. I really look forward to your reply.

Tesson, Dec 7th, 2022

|   | fusion1 | fusion2 | fusion3 | … | fusion5000 | mutation1 | mutation2 | … | mutation55 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| sample1 | 1 | 1 | 0 | … |   | 1 | 0 | 1 | 0 |
| sample2 | 1 | 1 | 0 | … |   | 1 | 0 | 0 | 1 |
| sample3 | 1 | 0 | 1 | … |   | 1 | 0 | 0 | 0 |
| sample4 | 1 | 0 | 1 | … |   | 0 | 0 | 1 | 1 |
| sample5 | 0 | 0 | 0 | … | 1 | 0 | 0 | 1 | 0 |
| … | … | … | … | … | … | … | … | … | … |
| sample200 | 0 | 0 | 1 | … | 1 | 0 | 0 | 1 | 0 |

scanisius commented 1 year ago

> First, I know "DISCOVER" was developed for the whole-genome wide mutation analysis. So in your view, is it suitable for the analysis in the whole-genome wide "gene-fusion" analysis? (I have more than 5000 gene-fusions in one single tumor type)

Let me start with the disclaimer that I have not used DISCOVER for gene-fusion data, so you will have to interpret my suggestions in the context of your knowledge of the data.

The DISCOVER model is based on the observation that the total number of mutations varies widely across tumours. It then assumes that, a priori, a gene is more likely to be mutated in a tumour with a large overall number of mutations than in a tumour with few mutations. For mutation data this is a reasonable assumption, since most mutations tend to be passenger mutations. Passenger mutations are under little positive or negative selection, so total mutation load is a good predictor of their occurrence.

To decide whether you can apply DISCOVER to your gene-fusion data, you have to answer two questions. First, is there large variation in total 'gene-fusion load' across tumours? And second, is it reasonable to assume that a large part of the gene fusions in a tumour are in fact passenger gene fusions, for which the occurrence is strongly correlated with total gene-fusion load?
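The first of these two questions can be checked directly from the data. The following is a minimal base-R sketch, assuming a binary matrix with fusions in the rows and tumours in the columns; the matrix `fusion_matrix` and its values are made up for illustration and are not part of DISCOVER.

```r
# Hypothetical binary fusion matrix: fusions in rows, tumours in columns.
fusion_matrix <- rbind(
  fusion1 = c(1, 1, 1, 1, 0, 0),
  fusion2 = c(1, 1, 0, 0, 0, 0),
  fusion3 = c(0, 0, 1, 1, 0, 1)
)
colnames(fusion_matrix) <- paste0("sample", 1:6)

# Total gene-fusion load per tumour: number of fusions observed in each column.
fusion_load <- colSums(fusion_matrix)

# Summarise how much the load varies across tumours.
load_summary <- summary(fusion_load)
load_range <- range(fusion_load)
print(load_summary)
print(load_range)
```

If the load barely differs between tumours, the background model has little to work with. A wide spread is closer to the situation described for mutation data.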


> Second, if "DISCOVER" is suitable for "gene-fusion" analysis, can I implement this method in the combination data of gene fusions and gene mutations.

If you have concluded that DISCOVER is a good match for your gene-fusion data, then combining gene fusions and gene mutations is certainly something you can do. We did something similar in the DISCOVER publication, where we combined gene mutations with copy number changes. My recommendation is that you estimate two background models, one for the mutations and another for the fusions, and then combine those. In R, that might look like the following.

mutation_model <- discover.matrix(mutation_matrix)
fusion_model <- discover.matrix(fusion_matrix)
combined_model <- rbind(mutation_model, fusion_model)

Both input matrices must have the alterations (mutations or fusions) in the rows and the tumours in the columns, which is the opposite of your example. For the above rbind to work, you will need to make sure that both matrices have the same tumours in their columns, in the same order.
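Preparing the inputs this way could look roughly like the sketch below, which uses base R only. All data and object names are hypothetical. It assumes the fusion data starts in the samples-by-alterations layout from the question, and that the two matrices do not cover exactly the same tumours.

```r
# Hypothetical fusion data in the layout from the question: samples in rows.
fusions_by_sample <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("sample1", "sample2", "sample4"),
                  c("fusion1", "fusion2", "fusion3"))
)

# Transpose so that fusions are in the rows and tumours in the columns.
fusion_matrix <- t(fusions_by_sample)

# Hypothetical mutation data, already alterations-by-tumours, in another column order.
mutation_matrix <- matrix(
  c(1, 0, 1,
    0, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("mutation1", "mutation2"),
                  c("sample3", "sample1", "sample2"))
)

# Keep only tumours present in both matrices, in one shared order.
shared_tumours <- intersect(colnames(mutation_matrix), colnames(fusion_matrix))
mutation_matrix <- mutation_matrix[, shared_tumours, drop = FALSE]
fusion_matrix <- fusion_matrix[, shared_tumours, drop = FALSE]

# Fail early if the columns are not matched.
stopifnot(identical(colnames(mutation_matrix), colnames(fusion_matrix)))
```

Dropping unmatched tumours is only one option. Whether that is acceptable depends on how many tumours would be lost.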

If you only want to test mutation-fusion pairs, but no mutation-mutation or fusion-fusion pairs, you can instruct DISCOVER to do so using the following code.

alteration_type <- rep(c("mutation", "fusion"), c(nrow(mutation_matrix), nrow(fusion_matrix)))
mutex_result <- pairwise.discover.test(combined_model, g = alteration_type)

Here, the alteration_type vector is basically a label for each row in your combined matrix. With this method of calling the pairwise.discover.test function, only pairs of rows with different labels are tested.
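To make the effect of the labels concrete, the sketch below enumerates in base R which row pairs have different labels. It only illustrates the rule described above and is not DISCOVER's implementation. The row names are hypothetical.

```r
# Hypothetical row names of a combined matrix: mutations first, then fusions.
mutation_names <- c("mutation1", "mutation2")
fusion_names <- c("fusion1", "fusion2", "fusion3")
row_names <- c(mutation_names, fusion_names)

# One label per row, built the same way as alteration_type above.
alteration_type <- rep(c("mutation", "fusion"),
                       c(length(mutation_names), length(fusion_names)))

# TRUE where the two rows of a pair carry different labels.
cross_group <- outer(alteration_type, alteration_type, "!=")

# Count each unordered pair once by restricting to the lower triangle.
pair_index <- which(cross_group & lower.tri(cross_group), arr.ind = TRUE)
tested_pairs <- data.frame(
  row1 = row_names[pair_index[, "col"]],
  row2 = row_names[pair_index[, "row"]]
)
print(tested_pairs)
```

With 2 mutations and 3 fusions this gives 2 × 3 mutation-fusion pairs, and no mutation-mutation or fusion-fusion pairs.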


> (I only have data of 55 gene mutations, but it's noted that "Even if only a subset of genes will subsequently be used in the analysis, a whole-genome view of the mutations is required for this first step" in the method's R introduction)?

Whole-genome or whole-exome data is the ideal situation, as it leads to the best estimates of background mutation rates. However, we have observed that smaller gene panels of a few hundred genes still lead to good background model fits. I don't think we have ever tried this with only 55 genes, so the best I can advise you is to critically evaluate your results.


> Finally, it's noted that "To get the pairs of genes which are significantly mutually exclusive, we can use the as.data.frame method. This method, too, takes an optional FDR threshold as argument." in the method's R introduction, but when I use as.data.frame(results_myfusionandgene_ex), I can't find the argument for FDR threshold chosing, could you help me with this problem?

The as.data.frame function takes an optional argument q.threshold, so you could use as.data.frame(results_myfusionandgene_ex, q.threshold = 0.05) for a 5% false discovery rate threshold.
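As a rough illustration of what a q-value threshold does, the sketch below filters a hypothetical table of pairwise p-values in base R. It uses the plain Benjamini-Hochberg adjustment from `p.adjust`, which is an assumption made for illustration: DISCOVER's own multiple-testing correction may differ. The column names and p-values are made up.

```r
# Hypothetical pairwise test results (names and p-values are invented).
pairs <- data.frame(
  gene1 = c("mutation1", "mutation1", "mutation2", "mutation2", "mutation2"),
  gene2 = c("fusion1", "fusion2", "fusion1", "fusion2", "fusion3"),
  p.value = c(0.0004, 0.003, 0.02, 0.2, 0.6)
)

# Plain Benjamini-Hochberg adjustment; DISCOVER's internal procedure may differ.
pairs$q.value <- p.adjust(pairs$p.value, method = "BH")

# Keep only pairs below a 5% false discovery rate threshold.
q_threshold <- 0.05
significant <- pairs[pairs$q.value < q_threshold, ]
print(significant)
```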

Tesson98 commented 1 year ago

@scanisius
Thank you very much for your kind and useful suggestions! To my knowledge, the total "gene-fusion load" also varies dramatically across my tumor samples: some patients harbor no fusions at all, while others harbor around 100 or more. Most fusions are also "passengers" according to a previous literature review (Mertens F, Johansson B, Fioretos T, Mitelman F. The emerging complexity of gene fusions in cancer. Nat Rev Cancer. 2015;15(6):371-381. doi:10.1038/nrc3947). Your code suggestions are helpful too, and I will try them in my next steps (maybe I can also expand my list of gene mutations).

Best wishes, Tesson Dec 14th, 2022

scanisius commented 1 year ago

This issue has been inactive for a while now, so I am closing it. Please open a new issue if you are still experiencing problems with DISCOVER.