Open jaclyn-taroni opened 4 years ago
I was looking more into what functionality maftools describes in its documentation, specifically the section "Detecting cancer driver genes based on positional clustering", which states:
"oncodrive is a based on algorithm oncodriveCLUST which was originally implemented in Python. Concept is based on the fact that most of the variants in cancer causing genes are enriched at few specific loci (aka hot-spots). This method takes advantage of such positions to identify cancer genes."
Following that to the OncodriveCLUST website, a couple of things caught my attention:
The method does not assume that the baseline mutation probability is homogeneous across all gene positions but it creates a background model using silent mutations. Coding silent mutations are supposed to be under no positive selection and may reflect the baseline clustering of somatic mutations. Given recent evidences of non-random mutation processes along the genome, the assumption of homogenous mutation probabilities is likely an oversimplication introducing bias in the detection of meaningful events.
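To make the idea in these two passages concrete, here is a toy sketch in Python. It is not OncodriveCLUST and not the maftools oncodrive function; it only illustrates the concept of comparing how concentrated protein-affecting mutations are against a baseline built from silent mutations. The function names, the min_count threshold, and the positions are all hypothetical.

```python
from collections import Counter


def hotspot_fraction(positions, min_count=2):
    """Fraction of mutations that fall on positions hit at least min_count times."""
    if not positions:
        return 0.0
    counts = Counter(positions)
    in_hotspots = sum(n for n in counts.values() if n >= min_count)
    return in_hotspots / len(positions)


def clustering_excess(protein_affecting, silent, min_count=2):
    """Hot-spot fraction of protein-affecting mutations minus that of silent ones.

    Silent mutations serve as a crude stand-in for baseline clustering,
    which is an assumption of this sketch rather than the published model.
    """
    return (hotspot_fraction(protein_affecting, min_count)
            - hotspot_fraction(silent, min_count))


# Made-up amino acid positions for a single hypothetical gene.
protein_affecting = [28, 28, 28, 28, 35, 102]
silent = [12, 47, 47, 88, 130, 151]

print(hotspot_fraction(protein_affecting))
print(hotspot_fraction(silent))
print(clustering_excess(protein_affecting, silent))
```

A positive excess means the protein-affecting mutations are more concentrated than the silent ones in this toy data. A real method would also need a significance estimate, which this sketch leaves out.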
I came across DriverPower with the PCAWG pan-cancer paper releases: https://www.nature.com/articles/s41467-019-13929-1#code-availability (code), but using it would entail a liftover from hg38 to hg19.
I was reading through this PCAWG paper https://www.nature.com/articles/s41586-020-1965-x.pdf and found their methods for identifying driver mutations, which might be useful:
We obtained results (P values) from 13 methods of driver discovery, including ActiveDriverWGS [54], CompositeDriver, DriverPower [55], dndscv [46], ExInAtor [56], LARVA [57], MutSig tools [3], NBR [10], ncdDetect [58], ncDriver [59], OncodriveFML [60] and regDriver [61]. We integrated the results of all these methods using a custom framework based on a previously published method [62] for combining P values. Results from individual methods that showed large deviations from the expected uniform null distribution of P values were excluded. This approach was evaluated on real and simulated data.
P value combination from multiple driver methods is available from https://github.com/broadinstitute/getzlab-PCAWG-pvalue_combination/
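For a sense of what combining P values across methods looks like, here is a minimal Python sketch of Fisher's method. This is a swapped-in classical technique, not the PCAWG framework linked above. Fisher's method assumes the P values are independent, which results from several driver methods run on the same data generally are not, so treat it as an illustration only. The method names and P values below are made up.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent P values with Fisher's method."""
    if not pvalues:
        raise ValueError("need at least one P value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("P values must be in (0, 1]")
    k = len(pvalues)
    # Test statistic: -2 * sum(ln p), chi-square distributed with 2k
    # degrees of freedom under the null hypothesis.
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    # For even degrees of freedom the chi-square survival function has a
    # closed form: sf(x; 2k) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Hypothetical per-method P values for a single gene.
gene_pvalues = {"method_a": 0.01, "method_b": 0.2, "method_c": 0.04}
combined = fisher_combined_pvalue(list(gene_pvalues.values()))
print(combined)
```

The closed-form survival function keeps the sketch free of third-party dependencies. With SciPy installed, scipy.stats.combine_pvalues offers Fisher's method among others.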
There are a number of ways to define or identify recurrent mutations. The purpose of this issue is to discuss how to define a "recurrent mutation" throughout the project, with some acknowledgment that the answer might be "it depends."
My goal is to document some of the things that I've been thinking about or reading recently (which is almost certainly not a complete look at all available literature) to get the discussion started.
Here are a few examples of analyses that use or may use the concept of a recurrent mutation:
interaction-plots - where mutations are processed in the following ways by default (see analyses/interaction-plots/scripts/02-process_mutations.R): remove synonymous mutations, remove non-transcribed mutations, remove non-coding mutations
recurrent-VUS from the draft pull request #362 - it looks like this includes a specific amino acid change (https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/362/files#diff-2058996fcf1edc695eff61268de983cfR101)
oncoprint-landscape - This code can accept a list of genes of interest, and we'll probably want to generate lists of recurrently altered genes to use as the genes of interest for the OncoPrint plots.
All this to say - is a recurrent mutation a specific alteration, e.g., H3F3A K28M, or is it any mutation in a gene given some constraints (e.g., drop synonymous mutations)?
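To show how much the two definitions can differ, here is a small Python sketch that counts recurrence at both levels from MAF-like records. The column names follow the MAF format; the recurrence helper, the sample IDs, and the variants are hypothetical and made up for illustration. Dropping silent mutations is only one of the possible constraints.

```python
from collections import defaultdict

# Hypothetical MAF-like records; sample IDs and variants are made up.
mutations = [
    {"Tumor_Sample_Barcode": "S1", "Hugo_Symbol": "H3F3A",
     "HGVSp_Short": "p.K28M", "Variant_Classification": "Missense_Mutation"},
    {"Tumor_Sample_Barcode": "S2", "Hugo_Symbol": "H3F3A",
     "HGVSp_Short": "p.K28M", "Variant_Classification": "Missense_Mutation"},
    {"Tumor_Sample_Barcode": "S3", "Hugo_Symbol": "H3F3A",
     "HGVSp_Short": "p.G35R", "Variant_Classification": "Missense_Mutation"},
    {"Tumor_Sample_Barcode": "S3", "Hugo_Symbol": "TP53",
     "HGVSp_Short": "p.R175H", "Variant_Classification": "Missense_Mutation"},
    {"Tumor_Sample_Barcode": "S4", "Hugo_Symbol": "TP53",
     "HGVSp_Short": "p.P36=", "Variant_Classification": "Silent"},
]


def recurrence(records, key_fields, drop_classes=("Silent",)):
    """Count distinct samples per key after dropping unwanted classifications."""
    samples = defaultdict(set)
    for rec in records:
        if rec["Variant_Classification"] in drop_classes:
            continue
        key = tuple(rec[field] for field in key_fields)
        samples[key].add(rec["Tumor_Sample_Barcode"])
    return {key: len(ids) for key, ids in samples.items()}


# Gene-level definition: any non-silent mutation in the gene counts.
by_gene = recurrence(mutations, ["Hugo_Symbol"])
# Alteration-level definition: only the exact amino acid change counts.
by_change = recurrence(mutations, ["Hugo_Symbol", "HGVSp_Short"])

print(by_gene)
print(by_change)
```

In this toy data H3F3A is altered in three samples at the gene level, but the specific K28M change appears in only two, so a recurrence cutoff of three would give different answers depending on the definition.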
I think interaction-plots and recurrent-VUS are good examples of why the answer may depend on the specific analysis, but it would be good to get some discussion around this going.
Significantly mutated genes
Beyond recurrent mutations, there is also the question of whether or not a gene is "significantly mutated" and what method could be used to make that determination. Here, I'll link to relevant literature and software/code.
From Ma et al. Nature. 2018:
Where the methods state
The GRIN R package is available here: https://www.stjuderesearch.org/site/depts/biostats/grin
MutSigCV v1 is available as a GenePattern module: https://www.genepattern.org/modules/docs/MutSigCV
Note: I happened upon some R code that implements the MutSig1.0 statistic: https://github.com/lixiangchun/lxctk/blob/ea74021f49393c65993b28f6a11a4c5cccbf66ae/R/mutsig.gene.R#L102
maftools also seems to have some functionality for using the output of MutSigCV, based on my skimming of Mayakonda et al. Genome Research. 2018.
From Gröbner et al. Nature. 2018:
And from the methods:
MuSiC2 is available on GitHub: https://github.com/ding-lab/MuSiC2
Some of the tests proposed by the MuSiC paper (Dees et al. Genome Research. 2012.), namely the Fisher's combined p-value test and likelihood ratio test, are implemented in the same function I linked to above: https://github.com/lixiangchun/lxctk/blob/ea74021f49393c65993b28f6a11a4c5cccbf66ae/R/mutsig.gene.R, where the method labeled PCT is from Kan et al. Nature. 2010. per the documentation.
Comparison to other literature
The Gröbner et al. Nature. 2018 cohort is enriched for CNS tumors.
A comparison to their results seems like a good thing to do as part of this project. Here's a link from that paper: http://www.pedpancan.com/ which mentions PedcBioPortal when you follow it!