Open dhimmel opened 8 years ago
This is an interesting question. From some quick searching of the academic literature, I've dug up mutational signatures of heavy mutation load cancers (the types of mutations that occur in these cancers seem to be different). I didn't find anything on a gene expression pattern common to them. It may be important to control for confounding by cancer type (maybe you pick the most mutated 10% within each cancer type as positive and the least mutated 10% as negative). I think this is an interesting question that may have just created another use case!
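To make the within-cancer-type labeling idea concrete, here is a minimal sketch in plain Python. The function name, the `(sample_id, cancer_type, n_mutations)` tuple layout, and the toy data are all assumptions for illustration, not anything from our pipeline.

```python
from collections import defaultdict

def label_extremes_within_type(samples, fraction=0.10):
    """Label the most mutated `fraction` of each cancer type as 1 and the
    least mutated `fraction` as 0. Samples in between get no label.

    samples: list of (sample_id, cancer_type, n_mutations) tuples (assumed layout).
    """
    by_type = defaultdict(list)
    for sample_id, cancer_type, n_mutations in samples:
        by_type[cancer_type].append((n_mutations, sample_id))

    labels = {}
    for rows in by_type.values():
        rows.sort()  # ascending by mutation count, ties broken by sample id
        k = max(1, int(len(rows) * fraction))
        if 2 * k > len(rows):
            continue  # too few samples to form disjoint positive/negative sets
        for _, sample_id in rows[:k]:
            labels[sample_id] = 0
        for _, sample_id in rows[-k:]:
            labels[sample_id] = 1
    return labels

# Toy data: two cancer types with very different mutation loads.
samples = [(f"A{i}", "type_A", i + 1) for i in range(10)]
samples += [(f"B{i}", "type_B", 100 * (i + 1)) for i in range(10)]
labels = label_extremes_within_type(samples)
```

Because the cutoffs are computed per cancer type, a moderately mutated sample from a highly mutated cancer type is not automatically labeled positive.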
On a call now and this issue was mentioned. Really, the issue is mainly that these hyper-mutated tumors have a ton of passenger mutations that would contaminate gold standards. The solution proposed involved subsetting mutations using Cancer Hotspots as defined by Chang et al.
Essentially, what the group is doing is only considering a sample to have a mutation in a given gene if the mutation is found in this database. I don't know yet what to do with this info, or whether it even makes sense to use at all, but generally, using it would increase the percentage of true positives but simultaneously increase false negatives.
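A rough sketch of that hotspot filter in plain Python, to show how the labels change. The gene names, residues, and the `(sample_id, gene, residue)` layout are made-up placeholders, not entries from the Cancer Hotspots database.

```python
def mutation_status(mutations, hotspots, gene):
    """Return (samples with any mutation in `gene`,
               samples with a hotspot mutation in `gene`).

    mutations: iterable of (sample_id, gene, residue) tuples (assumed layout).
    hotspots: set of (gene, residue) pairs (placeholder for the real database).
    """
    any_mutation = set()
    hotspot_only = set()
    for sample_id, mutated_gene, residue in mutations:
        if mutated_gene != gene:
            continue
        any_mutation.add(sample_id)
        if (mutated_gene, residue) in hotspots:
            hotspot_only.add(sample_id)
    return any_mutation, hotspot_only

# Hypothetical hotspot table and mutation calls.
hotspots = {("GENE_A", "R10"), ("GENE_A", "K55")}
mutations = [
    ("S1", "GENE_A", "R10"),   # hotspot residue
    ("S2", "GENE_A", "L300"),  # not a hotspot, so dropped by the filter
    ("S3", "GENE_B", "R10"),   # different gene
    ("S4", "GENE_A", "K55"),   # hotspot residue
]
any_mutation, hotspot_only = mutation_status(mutations, hotspots, "GENE_A")
```

Here `S2` flips from "mutated" to "not mutated" under the filter, which is exactly the kind of sample that would become a false negative if its mutation does alter expression.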
it would increase the percentage of true positives but simultaneously increase false negatives
What do you mean by true positives and false negatives?
From Chang et al.:
Here, we developed a statistical algorithm to identify recurrently mutated residues in tumor samples. We applied the algorithm to 11,119 human tumors, spanning 41 cancer types, and identified 470 somatic substitution hotspots in 275 genes.
So if we were to only count mutations that were in recurrently mutated residues (cancer hotspots), we would only be able to offer our users a choice between 275 genes — not good? Additionally, I'm not sure I see:
- that restricting to hotspots will be able to fully eliminate mutation load confounding
- that restricting to hotspots makes sense given that we run a supervised algorithm that learns whether there's signal. Let it learn.
However, I still think a covariate is the way to go and can address most of the problem. A good first analysis to see the extent of this problem would be to measure the AUROC between TP53 mutation status versus total mutation count.
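For reference, this check needs nothing more than a rank-based AUROC with total mutation count as the score. Below is a minimal plain-Python sketch; the `auroc` helper and the toy numbers are assumptions for illustration, and in practice a library routine such as scikit-learn's `roc_auc_score` would do the same job.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as half (equivalent to the Mann-Whitney U statistic)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: mutation status of one gene vs. total mutation count per sample.
status = [0, 0, 0, 1, 0, 1, 1, 1]
total_mutations = [12, 30, 45, 50, 80, 200, 950, 4000]
auc = auroc(status, total_mutations)
```

An AUROC near 0.5 would suggest mutation load carries little information about the gene's status; a value near 1.0 would suggest the label is largely a proxy for load.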
What do you mean by true positives and false negatives?
True positives meaning samples that actually have a deleterious mutation in the given gene (either an activating or inactivating mutation) that leads to a gene-expression-based signature representative of the normal gene activity being lost. False negatives meaning samples that actually do have the irregular gene expression signature but are incorrectly considered a "0" or "not mutated". Either will decrease the classifier's performance. We can get a false negative from either:
we would only be able to offer our users a choice between 275 genes — not good?
Probably not good, I agree.
that restricting to hotspots will be able to fully eliminate mutation load confounding
Aside from removing samples with high mutation load, I don't think anything we do will fully eliminate this confounding. Restricting to hotspots for these samples will remove many passenger mutations that are less likely to alter the gene expression signatures associated with mutation of the input genes. Adjusting for them when building a model could work nicely too.
that restricting to hotspots makes sense given that we run a supervised algorithm that learns whether there's signal. Let it learn.
The 'let it learn' argument makes much more sense in an unsupervised setting. For a supervised algorithm we are severely impacted by false labeling information and the first question when troubleshooting performance should always be: "is my data good?"
A good first analysis to see the extent of this problem would be to measure the AUROC between TP53 mutation status versus total mutation count.
I think this is a great idea! Although we probably should approach it using a gene other than TP53. Since TP53 is crucial for DNA repair, tumors with the defective protein are likely to have more mutations than tumors with wildtype TP53. I would recommend building a new classifier for RAS or NF1, or we could even try using genes in a pathway (e.g., the Hippo signaling pathway) to test this hypothesis.
In general, I would be in favor of sticking with our filtered mutation calls as a gold standard for now (at least until cleaner data comes in :smiley:) and testing to see how much of an impact mutation load has on predictions.
Reproducing a comment by @gwaygenomics here:
I was at a talk by Olivier Elemento - he was building models for a different purpose (predict immunotherapy responders) but was adjusting for mutation burden as a covariate. We may want to consider checking out his stuff and adjusting for burden too
I did find the Elemento Lab's GitHub organization, but I couldn't find the handle for the doctor himself. However, I did find his Twitter, so I'll tweet him the link to this question:
Q: We're creating models to predict mutation status at a specific gene using gene expression on TCGA samples. We'd like to add a mutation load covariate and have explored adding n_mutations_log1p (the log of 1 plus the number of mutations per sample) to the model. Do you have any advice or can you point us to models you've created with a mutation load covariate?
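For anyone following along, here is a minimal sketch of how an `n_mutations_log1p` covariate could be computed and appended to the expression features. The helper name and the dict-based data layout are assumptions for illustration, not our actual pipeline code.

```python
import math

def add_mutation_load_covariate(expression, n_mutations):
    """Append log(1 + n_mutations) as an extra feature column per sample.

    expression: {sample_id: [expression values]} (assumed layout).
    n_mutations: {sample_id: total mutation count}.
    """
    features = {}
    for sample_id, values in expression.items():
        n_mutations_log1p = math.log1p(n_mutations[sample_id])
        features[sample_id] = list(values) + [n_mutations_log1p]
    return features

# Toy inputs: two samples, two expression features each.
expression = {"S1": [0.5, 1.2], "S2": [2.0, 0.1]}
n_mutations = {"S1": 0, "S2": 99}
features = add_mutation_load_covariate(expression, n_mutations)
```

The log transform keeps hyper-mutated samples from dominating the covariate, and `log1p` keeps it defined for samples with zero mutations.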
Update: link to Tweet
This issue has come up once again. The field appears to be keenly aware of it but does not know of a "best" solution. It also appears to be extremely important when trying to predict the gene expression signature of samples that have DNA damage repair response defects.
Some of the solutions I have seen so far:
I have also seen a number of different ways mutation burden is added to the model. I plan on looking into this today at the meetup and exploring some of the solutions
I think it's likely that there is a general expression pattern for how mutated a tumor is. For example, super mutated tumors may have wacky gene expression, solely because they're super mutated and not specifically because of which exact mutations they contain.
For a given gene, tumors with mutations are more likely to be highly mutated overall. This could cause confounding. It may appear that a mutation is associated with a specific expression pattern, although the signal is actually driven by general mutation-load.
So we may need to end up including a mutation-load covariate. In the meantime, someone should see whether it's possible to use gene expression to predict the mutation-load of each sample (labeling this a task and looking for a volunteer).