Closed sjspielman closed 1 year ago
My knee jerk reaction upon reading the relevant reviewer comment was "regress out" tumor purity. To quote this blog post from Rob Alderman:
By "regress out", we mean to "remove any correlative influence" of one variable from another, so that we can examine the variable in "isolation", so to speak. We do this by building a simple regression model for the two variables, treating one as the response and the other as the regressor (the one being "regressed out"), then taking the residuals and making some use of them (e.g. correlating them with another variable).
The residuals represent the "leftover" variance in the response variable after removing (i.e. "accounting for") any correlative relationship with the regressor variable. In other words, the residuals represent the portion of the response variable's variance that is NOT explained by the regressor variable.
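As a rough illustration of the quoted description, here is a minimal sketch in Python (standard library only). The `regress_out` helper and the `purity` and `score` values are hypothetical and are not taken from the OpenPBTA data or notebook; the point is only that the residuals of a simple least-squares fit carry no linear relationship with the regressor.

```python
from statistics import mean


def regress_out(response, regressor):
    """Return residuals of a simple least-squares fit: response ~ regressor."""
    x_bar, y_bar = mean(regressor), mean(response)
    sxx = sum((x - x_bar) ** 2 for x in regressor)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(regressor, response))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual = observed value minus the value predicted from the regressor
    return [y - (intercept + slope * x) for x, y in zip(regressor, response)]


# Hypothetical values, for illustration only
purity = [0.20, 0.45, 0.60, 0.75, 0.90, 1.00]
score = [0.31, 0.42, 0.55, 0.58, 0.74, 0.79]

residuals = regress_out(score, purity)
print([round(r, 3) for r in residuals])
```

The residuals sum to zero and are uncorrelated with `purity` by construction, which is what "regressed out" means here. What to do with them afterwards is a separate question.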
Knee jerk reactions, famously, are not always good.
It sounds to me like you are proposing (with knee jerk caveat!) something conceptually along these lines: `lm(EXTEND ~ histology + tumor purity)`, and then regress out tumor purity. What to then do with the resulting residuals...
Edit: As part of this, we should include exploratory viz - is there even a relationship between tumor purity and EXTEND and tp53? Quick scatterplots, maybe colored by histology, can help point us in a useful direction for this analysis.
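Alongside the scatterplots, a numeric check could be a per-histology correlation between tumor purity and a score such as EXTEND. The sketch below is in Python with made-up data; the histology labels, values, and the `pearson` helper are all hypothetical, and Pearson correlation is just one possible choice of summary.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical (histology, tumor purity, score) records, for illustration only
samples = [
    ("histology_A", 0.2, 0.1), ("histology_A", 0.5, 0.4), ("histology_A", 0.8, 0.7),
    ("histology_B", 0.3, 0.5), ("histology_B", 0.6, 0.5), ("histology_B", 0.9, 0.2),
]

# Group the (purity, score) pairs by histology
by_histology = defaultdict(list)
for histology, purity, score in samples:
    by_histology[histology].append((purity, score))

# One correlation per histology with at least 3 samples
correlations = {
    histology: pearson([p for p, _ in pairs], [s for _, s in pairs])
    for histology, pairs in by_histology.items()
    if len(pairs) >= 3
}
for histology, r in sorted(correlations.items()):
    print(f"{histology}: r = {r:.2f}")
```

Splitting by histology matters because a pooled correlation can be driven by differences between histologies rather than by purity itself.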
Another idea (h/t @envest!) could involve running transcriptomics analyses again but only focusing on samples with >=X % tumor purity. If vibes match between all samples and, say, only samples that are >=95% tumor, then we're good.
I've started (emphasis on started!) a notebook here https://github.com/sjspielman/OpenPBTA-analysis/blob/sjspielman/tumor-purity-transcriptomics/analyses/tumor-purity-exploration/02_tumor-purity-transcriptomics.Rmd
Something I think is worth exploring here is redoing some of these analyses but using only samples with tumor fraction >= some threshold. This will help us see to what extent results are influenced by low tumor purity samples. That said, doing this will really knock down the sample size which can affect analysis results as well!
This is the summary for tumor fraction across the full cohort:

| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|------|---------|--------|------|---------|------|------|
| 0.000123 | 0.463044 | 0.756648 | 0.696020 | 1.000000 | 1.000000 | 12 |
We might, for example, use a 0.75 threshold, which is roughly the median value (0.757), but this removes about half the data. A more stringent threshold would tank sample size even more, though, so there's a real tradeoff in this exploration to be aware of.
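To make the tradeoff concrete, here is a small Python sketch that counts how many samples survive a few candidate thresholds. The `purity` list and the `fraction_retained` helper are hypothetical; `None` stands in for the NA values, which are dropped before counting.

```python
# Hypothetical tumor fraction values; None marks a missing (NA) estimate
purity = [0.10, 0.35, 0.46, 0.60, 0.76, 0.80, 1.00, 1.00, None, None]


def fraction_retained(values, threshold):
    """Return (kept, known, kept / known) for samples with purity >= threshold."""
    known = [v for v in values if v is not None]
    kept = [v for v in known if v >= threshold]
    return len(kept), len(known), len(kept) / len(known)


for threshold in (0.50, 0.75, 0.95):
    kept, known, frac = fraction_retained(purity, threshold)
    print(f">= {threshold:.2f}: {kept}/{known} samples retained ({frac:.0%})")
```

By definition, a threshold at the median keeps roughly half of the samples with a known tumor fraction, and anything stricter keeps fewer.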
👍 Does this X% threshold affect all cancer types equally? Or would some cancer types be more affected than others by excluding low purity? (I'm guessing the latter)
And following up on this – if we use a median cutoff, what is the set of transcriptomics analyses in the paper that we can reasonably re-do? If we have a list of those analyses, we could:
To begin assessing feasibility, I'll explore in #1622 how the overall distributions of cancer groups/broad histologies are affected when a median (or other) threshold is applied.
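A per-group version of that check might look like the following Python sketch. The group labels, purity values, and the `counts_by_group` helper are hypothetical and only illustrate the kind of before/after comparison described; they say nothing about what #1622 actually does.

```python
from collections import Counter

# Hypothetical (cancer group, tumor purity) pairs, for illustration only
samples = [
    ("group_A", 0.95), ("group_A", 0.90), ("group_A", 0.80), ("group_A", 0.60),
    ("group_B", 0.30), ("group_B", 0.50), ("group_B", 0.78), ("group_B", 0.40),
    ("group_C", 0.20), ("group_C", 0.10),
]


def counts_by_group(records, threshold=None):
    """Count samples per group, optionally keeping only purity >= threshold."""
    return Counter(
        group for group, purity in records
        if threshold is None or purity >= threshold
    )


before = counts_by_group(samples)
after = counts_by_group(samples, threshold=0.75)

# A Counter returns 0 for groups that lose every sample
for group in sorted(before):
    print(f"{group}: {before[group]} -> {after[group]}")
```

In this toy data one group keeps most of its samples while another disappears entirely, which is the uneven impact the question above is asking about.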
Noting that I'm going to be opening a couple of other issues to reorganize our approach here. For the time being I'm going to leave this issue open, but it may get closed in the shuffle in the not-too-distant future.
I'm going to go ahead and close this issue to focus on the more granular other issues for revision of each transcriptomics analysis.
From https://github.com/AlexsLemonade/OpenPBTA-manuscript/issues/377
Add an exploratory notebook that looks at relationship between tumor purity and some results from transcriptomics analyses. For example, we can look at:
It's not immediately clear to me what we could do with GSVA scores, so tagging in some folks for discussion @jaclyn-taroni @jashapiro @jharenza
Since this notebook will rely on the output from a couple other modules (`telomerase-activity-prediction` and `tp53_nf1_score` for example), it might be a good idea to go ahead and add a `tumor_purity_exploration` module to hold this notebook.