[Discussion] How should we define a recurrent mutation and how do our mutation results compare to the literature?

jaclyn-taroni commented 4 years ago

There are a number of ways to define or identify recurrent mutations. The purpose of this issue is to discuss how to define a "recurrent mutation" throughout the project, with some acknowledgment that the answer might be "it depends."

My goal is to document some of the things that I've been thinking about or reading recently (which is almost certainly not a complete look at all available literature) to get the discussion started.

Here are a few examples of analyses that use or may use the concept of a recurrent mutation:

interaction-plots - where mutations are processed in the following ways by default (see analyses/interaction-plots/scripts/02-process_mutations.R): remove synonymous mutations, remove non-transcribed mutations, remove non-coding mutations
recurrent-VUS from the draft pull request #362 - looks like this includes a specific amino acid change (https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/362/files#diff-2058996fcf1edc695eff61268de983cfR101)
oncoprint-landscape - This code has the ability to accept a list of genes of interest and we'll probably want to generate lists of genes of interest that are comprised of recurrently altered genes to make the OncoPrint plots.

All this to say - is a recurrent mutation a specific alteration, e.g., H3F3A K28M, or is it any mutation in a gene given some constraints (e.g., drop synonymous mutations)?

I think the interaction-plots and recurrent-VUS are good examples of why the answer may depend on the specific analysis, but it would be good to get some discussion around this going.
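
To make the two candidate definitions concrete, here is a minimal sketch (not code from any of the analyses above). It assumes MAF-style records with the standard column names `Tumor_Sample_Barcode`, `Hugo_Symbol`, `HGVSp_Short`, and `Variant_Classification`. The toy records, the `recurrent` helper, and the `min_samples` threshold are all hypothetical and only there for illustration.

```python
from collections import defaultdict

# Hypothetical toy records; only the column names follow the MAF format.
mutations = [
    {"Tumor_Sample_Barcode": "S1", "Hugo_Symbol": "H3F3A",
     "HGVSp_Short": "p.K28M", "Variant_Classification": "Missense_Mutation"},
    {"Tumor_Sample_Barcode": "S2", "Hugo_Symbol": "H3F3A",
     "HGVSp_Short": "p.K28M", "Variant_Classification": "Missense_Mutation"},
    {"Tumor_Sample_Barcode": "S3", "Hugo_Symbol": "H3F3A",
     "HGVSp_Short": "p.G35R", "Variant_Classification": "Missense_Mutation"},
    {"Tumor_Sample_Barcode": "S3", "Hugo_Symbol": "TP53",
     "HGVSp_Short": "p.R175H", "Variant_Classification": "Missense_Mutation"},
    {"Tumor_Sample_Barcode": "S4", "Hugo_Symbol": "TP53",
     "HGVSp_Short": "p.R248Q", "Variant_Classification": "Missense_Mutation"},
    {"Tumor_Sample_Barcode": "S4", "Hugo_Symbol": "TP53",
     "HGVSp_Short": "p.P36=", "Variant_Classification": "Silent"},
]


def recurrent(records, key_fields, min_samples=2, drop=("Silent",)):
    """Count distinct samples per key and keep keys seen in >= min_samples.

    key_fields decides the definition: (gene, protein change) gives
    alteration-level recurrence, (gene,) gives gene-level recurrence.
    """
    samples_by_key = defaultdict(set)
    for rec in records:
        # Constraint step, e.g., drop synonymous mutations.
        if rec["Variant_Classification"] in drop:
            continue
        key = tuple(rec[field] for field in key_fields)
        samples_by_key[key].add(rec["Tumor_Sample_Barcode"])
    return {k: len(v) for k, v in samples_by_key.items() if len(v) >= min_samples}


alteration_level = recurrent(mutations, ("Hugo_Symbol", "HGVSp_Short"))
gene_level = recurrent(mutations, ("Hugo_Symbol",))
print(alteration_level)
print(gene_level)
```

With the toy data, only H3F3A K28M is recurrent at the alteration level, while both H3F3A and TP53 are recurrent at the gene level, which is the kind of difference that makes the answer analysis-dependent.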

Significantly mutated genes

Beyond recurrent mutations, there is also the question of whether or not a gene is "significantly mutated" and what method could be used to make that determination. Here, I'll link to relevant literature and software/code.

From Ma et al. Nature. 2018:

By analysing the enrichment [12, 13] of somatic alterations within each histotype or the pan-cancer cohort (see Methods), we identified 142 significantly mutated driver genes (Fig. 2a, Supplementary Table 2, Extended Data Fig. 3a).

Where the methods state

We discovered 142 candidate driver genes by this approach (Supplementary Table 2). Of these, 133 were significant by GRIN analysis (87 genes common to both GRIN and MutSigCV) and nine were significant only by MutSigCV.

The GRIN R package is available here: https://www.stjuderesearch.org/site/depts/biostats/grin

MutSigCV v1 is available as a GenePattern module: https://www.genepattern.org/modules/docs/MutSigCV

Note I happened upon some R code that implements the MutSig1.0 statistic: https://github.com/lixiangchun/lxctk/blob/ea74021f49393c65993b28f6a11a4c5cccbf66ae/R/mutsig.gene.R#L102

maftools also seems to have some functionality for using the output of MutSigCV, based on my skimming of Mayakonda et al. Genome Research. 2018.

From Gröbner et al. Nature. 2018:

MuSiC identified 77 significantly mutated genes (SMGs), which were ranked according to their pan-cancer mutation frequency [24] (Fig. 4, Supplementary Tables 9, 10). Most SMGs were mutually exclusively mutated across cancer types, demonstrating specificity of single putative driver genes in childhood cancers as compared to more frequent co-mutation in adult cancers in the TCGA study [7] (Extended Data Fig. 4c–e).

And from the methods:

Significantly mutated genes based on somatic SNVs and indels were identified with the SMG module of the MuSiC tools suite [24] separately from all cancer types and from the pan-cancer cohort, and then merged.

This kind of significance analysis often produces false positive hits (for example, very large genes), despite normalization procedures, and thus several filters were applied to the raw output [30].

MuSiC2 is available on GitHub: https://github.com/ding-lab/MuSiC2

Some of the tests proposed by the MuSiC paper (Dees et al. Genome Research. 2012.), namely the Fisher's combined p-value test and likelihood ratio test, are implemented in the same function I linked to above: https://github.com/lixiangchun/lxctk/blob/ea74021f49393c65993b28f6a11a4c5cccbf66ae/R/mutsig.gene.R, where the method labeled PCT is from Kan et al. Nature. 2010. per the documentation.
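
For reference, Fisher's combined p-value test itself is simple to write down. The sketch below is the generic textbook method, not the MuSiC or lxctk implementation, and the function name `fisher_combined_p` is hypothetical. It combines k independent p-values into the statistic X = -2 * sum(ln p_i), which follows a chi-square distribution with 2k degrees of freedom under the null. Because the degrees of freedom are even, the upper tail has a closed form and no external library is needed.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method (generic sketch).

    Statistic: X = -2 * sum(ln p_i), chi-square with 2k degrees of freedom.
    For even degrees of freedom the survival function is
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0   # (X/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i  # builds (X/2)^i / i! incrementally
        total += term
    return math.exp(-half_stat) * total


# Hypothetical example: two tests on the same gene, each with p = 0.05.
print(fisher_combined_p([0.05, 0.05]))
```

Note that this assumes the individual tests are independent, which may not hold for p-values derived from overlapping mutation categories.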

Comparison to other literature

The Gröbner et al. Nature. 2018 cohort is enriched for CNS tumors

This study is biased towards central nervous system tumours, and is complemented by an additional study of a non-overlapping paediatric cohort with mainly leukaemias and extracranial solid tumours [9].

A comparison to their results seems like a good thing to do as part of this project. Here's a link from that paper: http://www.pedpancan.com/ which mentions PedcBioPortal when you follow it!

jaclyn-taroni commented 4 years ago

I was looking more into what functionality maftools has in their documentation, specifically Detecting cancer driver genes based on positional clustering which states:

oncodrive is a based on algorithm oncodriveCLUST which was originally implemented in Python. Concept is based on the fact that most of the variants in cancer causing genes are enriched at few specific loci (aka hot-spots). This method takes advantage of such positions to identify cancer genes.

Following that to the OncodriveCLUST website, a couple things caught my attention -

There's now a new version called OncodriveCLUSTL available via pip (publication, bitbucket)
The method does not assume that the baseline mutation probability is homogeneous across all gene positions but it creates a background model using silent mutations. Coding silent mutations are supposed to be under no positive selection and may reflect the baseline clustering of somatic mutations. Given recent evidences of non-random mutation processes along the genome, the assumption of homogenous mutation probabilities is likely an oversimplication introducing bias in the detection of meaningful events.
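
A toy illustration of the positional clustering idea, with heavy caveats: this is not the OncodriveCLUST or OncodriveCLUSTL scoring, and the `hotspot_fraction` helper, the `min_count` threshold, and the positions are all made up. It only computes the fraction of a gene's mutations that land on repeatedly hit positions, once for non-silent mutations and once for silent mutations as a crude stand-in for the background.

```python
from collections import Counter


def hotspot_fraction(positions, min_count=2):
    """Fraction of mutations at positions hit at least min_count times."""
    if not positions:
        return 0.0
    counts = Counter(positions)
    in_hotspots = sum(n for n in counts.values() if n >= min_count)
    return in_hotspots / len(positions)


# Hypothetical protein positions of mutations in a single gene.
nonsilent_positions = [28, 28, 28, 35, 35, 101]
silent_positions = [12, 47, 88]

nonsilent_fraction = hotspot_fraction(nonsilent_positions)
silent_fraction = hotspot_fraction(silent_positions)
print(nonsilent_fraction, silent_fraction)
```

A gene where the non-silent fraction is much higher than the silent fraction would be the kind of candidate the clustering approach is designed to pick up. The real method builds a proper background model and significance estimate, which this sketch does not attempt.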

jharenza commented 4 years ago

I came across DriverPower with the PCAWG pan-cancer paper releases: https://www.nature.com/articles/s41467-019-13929-1#code-availability (code), but using it would entail a liftover from hg38 to hg19.

kgaonkar6 commented 4 years ago

I was reading through this PCAWG paper https://www.nature.com/articles/s41586-020-1965-x.pdf and found their methods for identifying driver mutations, which might be useful:

Candidate-driver-mutation identification methods and combination of results

We obtained results (P values) from 13 methods of driver discovery, including ActiveDriverWGS [54], CompositeDriver, DriverPower [55], dndscv [46], ExInAtor [56], LARVA [57], MutSig tools [3], NBR [10], ncdDetect [58], ncDriver [59], OncodriveFML [60] and regDriver [61]. We integrated the results of all these methods using a custom framework based on a previously published method [62] for combining P values. Results from individual methods that showed large deviations from the expected uniform null distribution of P values were excluded. This approach was evaluated on real and simulated data.

Code availability

P value combination from multiple driver methods is available from https://github.com/broadinstitute/getzlab-PCAWG-pvalue_combination/
