Revision: Tumor purity and transcriptomics exploration

sjspielman commented 1 year ago

From https://github.com/AlexsLemonade/OpenPBTA-manuscript/issues/377

Add an exploratory notebook that looks at the relationship between tumor purity and some results from transcriptomics analyses. For example, we can look at:

EXTEND scores vs tumor purity
tp53 scores vs tumor purity

It's not immediately clear to me what we could do with GSVA scores, so tagging in some folks for discussion @jaclyn-taroni @jashapiro @jharenza

Since this notebook will rely on the output from a couple other modules (telomerase-activity-prediction and tp53_nf1_score for example), it might be a good idea to go ahead and add a tumor_purity_exploration module to hold this notebook.

jaclyn-taroni commented 1 year ago

My knee-jerk reaction upon reading the relevant reviewer comment was to "regress out" tumor purity. To quote this blog post from Rob Alderman:

By "regress out", we mean to "remove any correlative influence" of one variable from another, so that we can examine the variable in "isolation", so to speak. We do this by building a simple regression model for the two variables, treating one as the response and the other as the regressor (the one being "regressed out"), then taking the residuals and making some use of them (e.g. correlating them with another variable).

The residuals represent the "leftover" variance in the response variable after removing (i.e. "accounting for") any correlative relationship with the regressor variable. In other words, the residuals represent the portion of the response variable's variance that is NOT explained by the regressor variable.

Knee-jerk reactions, famously, are not always good.
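For concreteness, the "regress out" idea quoted above can be sketched as follows. This is only a minimal illustration in plain Python (the notebook linked in this thread is R-based), and the function name `regress_out` and the toy values are assumptions made up for the example, not anything from the OpenPBTA code base.

```python
from statistics import fmean


def regress_out(response, regressor):
    """Return the residuals of `response` after a simple least-squares fit on `regressor`."""
    x_bar = fmean(regressor)
    y_bar = fmean(response)
    sxx = sum((x - x_bar) ** 2 for x in regressor)
    if sxx == 0:
        raise ValueError("regressor has no variance, so nothing can be regressed out")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(regressor, response))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual = observed value minus the value predicted from the regressor.
    return [y - (intercept + slope * x) for x, y in zip(regressor, response)]


# Hypothetical toy values: tumor purity (regressor) and some score (response).
purity = [0.2, 0.4, 0.6, 0.8, 1.0]
score = [0.30, 0.42, 0.49, 0.65, 0.70]
residuals = regress_out(score, purity)
```

By construction, the residuals have zero mean and zero linear correlation with purity, which is what "removing any correlative influence" amounts to in this simple two-variable case.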

sjspielman commented 1 year ago

It sounds to me like you are proposing (with knee-jerk caveat!) something conceptually along these lines - lm(EXTEND ~ histology + tumor_purity), and then regress out tumor purity. What to then do with the resulting residuals...

Compare back to original residuals?
Correlate back to EXTEND?
?

Edit: As part of this, we should include exploratory viz - is there even a relationship between tumor purity and EXTEND and tp53? Quick scatterplots, maybe colored by histology, can help point us in a useful direction for this analysis.
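Alongside the scatterplots, a quick numeric check could be a per-histology correlation between tumor purity and each score. The sketch below is a hedged illustration in plain Python: the record keys (`histology`, `tumor_fraction`, `extend_score`), the minimum group size of 3, and the toy values are all assumptions for the example, not names or settings from the actual analysis.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns None if either variable is constant."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def correlation_by_group(records, score_key, min_samples=3):
    """Correlate tumor fraction with a score within each histology group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["histology"]].append((rec["tumor_fraction"], rec[score_key]))
    result = {}
    for histology, pairs in groups.items():
        if len(pairs) < min_samples:
            result[histology] = None  # too few samples to say anything
        else:
            result[histology] = pearson([p for p, _ in pairs], [s for _, s in pairs])
    return result


# Hypothetical toy records; the histology labels are placeholders.
records = [
    {"histology": "Histology A", "tumor_fraction": 0.2, "extend_score": 0.10},
    {"histology": "Histology A", "tumor_fraction": 0.5, "extend_score": 0.25},
    {"histology": "Histology A", "tumor_fraction": 0.8, "extend_score": 0.40},
    {"histology": "Histology B", "tumor_fraction": 0.3, "extend_score": 0.70},
    {"histology": "Histology B", "tumor_fraction": 0.9, "extend_score": 0.20},
]
correlations = correlation_by_group(records, "extend_score")
```

Groups with very few samples return `None` rather than a correlation, since a coefficient from two points is not informative.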

sjspielman commented 1 year ago

Another idea (h/t @envest!) could involve running transcriptomics analyses again but only focusing on samples with >=X % tumor purity. If vibes match between all samples and, say, only samples that are >=95% tumor, then we're good.

sjspielman commented 1 year ago

I've started (emphasis on started!) a notebook here https://github.com/sjspielman/OpenPBTA-analysis/blob/sjspielman/tumor-purity-transcriptomics/analyses/tumor-purity-exploration/02_tumor-purity-transcriptomics.Rmd

sjspielman commented 1 year ago

Something I think is worth exploring here is redoing some of these analyses but using only samples with tumor fraction >= some threshold. This will help us see to what extent results are influenced by low tumor purity samples. That said, doing this will really knock down the sample size which can affect analysis results as well!

This is the summary for tumor fraction across the full cohort -

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
0.000123 0.463044 0.756648 0.696020 1.000000 1.000000       12

We might, for example, use a 0.75 threshold, which is roughly the median value (0.757), but this removes about half the data. A more stringent threshold would tank sample size even more though, so there's a real tradeoff in this exploration to be aware of.
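To make the tradeoff concrete, one could tabulate how many samples survive at a few candidate thresholds. The following is a minimal sketch in plain Python; `retained_counts`, the candidate thresholds, and the toy values are assumptions for illustration, and samples with a missing tumor fraction (the NA's above) are simply dropped here, which is itself a choice that would need to be agreed on.

```python
def retained_counts(tumor_fractions, thresholds):
    """Count samples with tumor fraction >= each threshold, ignoring missing values."""
    known = [tf for tf in tumor_fractions if tf is not None]
    return {t: sum(tf >= t for tf in known) for t in thresholds}


# Hypothetical toy values; None stands in for a missing (NA) estimate.
tumor_fractions = [0.05, 0.31, 0.46, 0.62, 0.76, 0.88, 1.0, 1.0, None]
counts = retained_counts(tumor_fractions, [0.5, 0.75, 0.95])
```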

envest commented 1 year ago

👍 Does this X% threshold affect all cancer types equally? Or would some cancer types be more affected than others by excluding low purity? (I'm guessing the latter)

jaclyn-taroni commented 1 year ago

👍 Does this X% threshold affect all cancer types equally? Or would some cancer types be more affected than others by excluding low purity? (I'm guessing the latter)

And following up on this – if we use a median cutoff, what is the set of transcriptomics analyses in the paper that we can reasonably re-do? If we have a list of those analyses, we could:

Generate the list of biospecimen IDs to be included using that cutoff
Identify/describe a general approach to the re-doing of analyses
Add notebooks to the modules where we do those analyses, which would allow multiple people to work on this in parallel
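The first step in the list above (generating the biospecimen IDs for a median cutoff) might look roughly like this. It is a hedged sketch in plain Python: the function name, the input mapping, and the IDs are made up for the example, and dropping samples without a tumor fraction estimate is an assumption.

```python
from statistics import median


def ids_passing_median_cutoff(tumor_fraction_by_id):
    """Return (cutoff, sorted IDs) for samples at or above the median tumor fraction."""
    # Samples without an estimate cannot be compared against the cutoff.
    known = {bid: tf for bid, tf in tumor_fraction_by_id.items() if tf is not None}
    cutoff = median(known.values())
    passing = sorted(bid for bid, tf in known.items() if tf >= cutoff)
    return cutoff, passing


# Hypothetical biospecimen IDs and tumor fractions.
tumor_fraction_by_id = {
    "BS_A": 0.20,
    "BS_B": 0.55,
    "BS_C": 0.80,
    "BS_D": 1.00,
    "BS_E": None,
}
cutoff, passing_ids = ids_passing_median_cutoff(tumor_fraction_by_id)
```

Writing the resulting ID list to a single shared file would let each module's notebook filter to the same set of samples.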

sjspielman commented 1 year ago

To begin assessing feasibility, I'll explore in #1622 how the overall distributions of cancer groups/broad histologies are affected when a median (or other) threshold is applied.
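One way to frame that exploration is the fraction of each cancer group retained at a given threshold. The sketch below is illustrative only, in plain Python; `retention_by_group`, the group labels, and the values are assumptions, not results or code from #1622.

```python
from collections import Counter


def retention_by_group(samples, threshold):
    """Fraction of samples per group with tumor fraction >= threshold.

    `samples` is an iterable of (group, tumor_fraction) pairs; pairs with a
    missing tumor fraction are skipped entirely.
    """
    total = Counter()
    kept = Counter()
    for group, tumor_fraction in samples:
        if tumor_fraction is None:
            continue
        total[group] += 1
        if tumor_fraction >= threshold:
            kept[group] += 1
    return {group: kept[group] / total[group] for group in total}


# Hypothetical toy data with placeholder group labels.
samples = [
    ("Group A", 0.90), ("Group A", 0.80), ("Group A", 0.40),
    ("Group B", 0.30), ("Group B", 0.95), ("Group B", 0.20), ("Group B", None),
]
retention = retention_by_group(samples, 0.75)
```

Groups whose retained fraction drops sharply would be the ones where re-running an analysis on the filtered set is least likely to be feasible.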

sjspielman commented 1 year ago

Noting that I'm going to be opening a couple of other issues to reorganize our approach here. For the time being I'm going to leave this issue open, but it may get closed in the shuffle in the not-too-distant future.

sjspielman commented 1 year ago

I'm going to go ahead and close this issue to focus on the more granular other issues for revision of each transcriptomics analysis.

AlexsLemonade / OpenPBTA-analysis

Revision: Tumor purity and transcriptomics exploration #1621