Initiate tumor purity module

sjspielman commented 1 year ago

Closes #1620

This PR initiates a tumor-purity-exploration module for looking into tumor purity (called tumor_fraction in our metadata). I started a notebook for overall exploration as a jumping-off point. As part of this review, please feel free to suggest other initial explorations we can do in this notebook. If we like this overall structure, we can next address transcriptomic aspects in other notebooks (or in more sections of this notebook?) in this module to address #1621.

The module has been added to CI and the analyses/README.md file with TBD in the column for whether this is used in the MS.

jharenza commented 1 year ago

Thanks for starting this, @sjspielman! I was looking through and wondering if we should really be using the DNA tumor purity for RNA, though it does appear they might correlate. We do not have this formally calculated for RNA, but we could possibly run the ESTIMATE R package to do so. This doesn't get at purity, but another thought was to remove non-pass MENDQC samples, to address reproducibility concerns about samples with too few mapped exonic reads.

sjspielman commented 1 year ago

> I was looking through and wondering if we should really be using the DNA tumor purity for RNA, though it does appear they might correlate.

The initial exploration (very initial) I have done here is already using DNA tumor purity since we don't have the RNA equivalent in the metadata. I think it's overkill to derive for RNA.

jharenza commented 1 year ago

> I think it's overkill to derive for RNA.

Understood! :)

sjspielman commented 1 year ago

In f52472d I added some chunks to explore sample distributions with some filtering thresholds for tumor purity as discussed here: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1621#issuecomment-1332201013

sjspielman commented 1 year ago

These thresholds will need to be re-explored once #1629 is done - how do thresholds change when only co-extracted (edit; typo) samples are considered?

sjspielman commented 1 year ago

This is now actually ready for real review! The notebook now includes sections for exploring thresholds after filtering down to only Same Extraction samples, including both global thresholds and group-specific thresholds. A couple of metadata file versions are exported, each containing only the samples that make it past a given filtering step (same extraction, and then thresholding); these can potentially be used to facilitate sample filtering in other analyses.
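To make the difference between the two kinds of thresholds concrete, here is a minimal sketch in Python. It is not the notebook's code: the sample rows are made up, the `cancer_group` column name and helper functions are hypothetical, and taking each group's median as its threshold is an assumption made purely for illustration.

```python
from collections import defaultdict
from statistics import median

# Toy metadata rows; all values are made up for illustration only.
samples = [
    {"sample_id": "S1", "cancer_group": "A", "tumor_fraction": 0.90},
    {"sample_id": "S2", "cancer_group": "A", "tumor_fraction": 0.80},
    {"sample_id": "S3", "cancer_group": "A", "tumor_fraction": 0.60},
    {"sample_id": "S4", "cancer_group": "B", "tumor_fraction": 0.50},
    {"sample_id": "S5", "cancer_group": "B", "tumor_fraction": 0.30},
    {"sample_id": "S6", "cancer_group": "B", "tumor_fraction": 0.20},
]


def filter_global(rows, threshold):
    """Keep rows whose tumor_fraction is at or above one cohort-wide cutoff."""
    return [r for r in rows if r["tumor_fraction"] >= threshold]


def group_thresholds(rows):
    """Assumed rule: each group's threshold is its median tumor_fraction."""
    by_group = defaultdict(list)
    for r in rows:
        by_group[r["cancer_group"]].append(r["tumor_fraction"])
    return {group: median(values) for group, values in by_group.items()}


def filter_by_group(rows, thresholds):
    """Keep rows at or above the threshold of their own cancer group."""
    return [r for r in rows if r["tumor_fraction"] >= thresholds[r["cancer_group"]]]


global_kept = filter_global(samples, 0.7)
group_kept = filter_by_group(samples, group_thresholds(samples))

print([r["sample_id"] for r in global_kept])
print([r["sample_id"] for r in group_kept])
```

In this toy data the global cutoff drops every sample from the lower-purity group B, while the group-specific rule keeps some samples from each group at the cost of admitting lower-purity samples overall.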

01_explore-tumor-purity.nb.html.zip

jharenza commented 1 year ago

Hi @sjspielman thanks for working on this. I was able to go through and I had a few comments/suggestions.

1. I know you have this plotted, but can you also print (in the notebook) the values of the cancer_group_tumor_threshold? I generally like this idea, but I also wonder if we still need some minimum threshold across the board (for example, PXA have a low median, so many of these tumors do not have high purity at all; should we drop them, and how do we do this statistically?)
2. I saw this paper, which assesses tumor purity in a few TCGA cancers and then uses purity as a covariate for different expression analyses. I don't think we need to create new analysis models, but it was nice to see that our overall median tumor purity (~0.7) was exactly what they saw in the 3 cancers they assessed. They then used this as their cutoff for "high purity" samples.

So I think approach 1 is sound. Does it answer the question of whether the overall cohort results recapitulate those of highly pure samples? Not to make this more complicated, but what we could do is use the approach in 2 to run a few modules with that data subset (specifically GSVA and UMAP) and see if we reproduce our initial results. I think GSVA will be reproduced if we have a large enough N because, without purity correction, we have great signal in both of these modules. I just don't know if they would improve, because of the sample size reduction. If we fail to reproduce them, then it may be due to samples with still-low purity in the mix, although using approach 1 may reduce our sample N enough to get a great UMAP.

At this time, I don't think we need to worry about TP53/EXTEND.

jharenza commented 1 year ago

Can you also order the cancer group tumor_violins by ascending median? I think this would make a good supplemental figure.
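As a minimal sketch of the ordering being asked for (in Python for illustration, not the notebook's own code; the group names and purity values below are made up), the groups can be sorted by their median before plotting:

```python
from statistics import median

# Hypothetical tumor purity values per cancer group (illustrative only).
purity_by_group = {
    "group_x": [0.85, 0.90, 0.70],
    "group_y": [0.30, 0.45, 0.40],
    "group_z": [0.60, 0.75, 0.65],
}

# Order group names by ascending median purity; a plotting call would
# then draw one violin per group in this order.
ordered_groups = sorted(purity_by_group, key=lambda g: median(purity_by_group[g]))

print(ordered_groups)
```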

sjspielman commented 1 year ago

Hi @jharenza, thanks for having a look! @jaclyn-taroni and I have been discussing this PR a bit over the past few days which may have resulted in some confusing review requests and drafting, so sorry about any of that if there was a lot of spam!

I went ahead and arranged plots and printed the tables as you wrote in your previous comments. With regard to the paper you linked, we don't quite have the same sorts of analyses here where we can formally include tumor purity as a covariate, although I think what you are suggesting is similar to what @jaclyn-taroni thought of a little while ago - https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1621#issuecomment-1302403440 That said, I think the idea we're progressing to should accomplish the same overall goal? We're aiming to re-run modules with a subsetted version of the data based on a tumor purity threshold and see if the qualitative results hold up. Along these lines, I'd be curious to hear more about your specific thoughts on TP53 and EXTEND, which you mentioned briefly? In my mind, EXTEND is sort of "low-hanging fruit" as a place to start on a subsetted analysis.

Either way, more generally, here's the current version of this notebook: 01_explore-tumor-purity.nb.html.zip I went through this a couple of times to convince myself that the ID mapping is correct, and I think it is! Hopefully @jaclyn-taroni agrees! I ended up also exploring two minimum global thresholds: the overall median and 0.7. The former ended up removing a lot of samples, so I went with 0.7 for now. Also, you'll notice a new small section at the end looking for the hypermutators - only 1 sample ends up getting kept in the end, so there is probably not much we can say at the intersection of expression and mutation for these samples. We may want to add some form of caveat for the Fig 4 panel where we emphasize hypermutators among TP53 and EXTEND scores.
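For illustration only, comparing two candidate global minimum thresholds comes down to counting how many samples each one retains. This is a rough Python sketch with made-up purity values and a hypothetical helper, not the notebook's code:

```python
from statistics import median

# Hypothetical tumor_fraction values (illustrative only).
tumor_fractions = [0.95, 0.90, 0.85, 0.80, 0.75, 0.72, 0.60, 0.40]


def n_retained(values, threshold):
    """Count samples whose purity is at or above the threshold."""
    return sum(1 for v in values if v >= threshold)


# Two candidate global cutoffs: the overall median and a fixed 0.7.
candidates = {
    "overall_median": median(tumor_fractions),
    "fixed_0.7": 0.7,
}
retained = {name: n_retained(tumor_fractions, t) for name, t in candidates.items()}

for name, count in retained.items():
    print(f"{name}: keeps {count} of {len(tumor_fractions)} samples")
```

When the overall median sits above 0.7, as in this toy data, the median cutoff is the stricter of the two and retains fewer samples.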

jharenza commented 1 year ago

> With regards to the paper you linked, we don't quite have the same sorts of analyses here where we can formally include tumor purity as a covariate, although I think what you are suggesting is similar to what @jaclyn-taroni thought of a little while ago - https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1621#issuecomment-1302403440

Ah, I referenced the paper mainly to show they have a similar median tumor purity (I didn't mean we need to update our methods to try to regress out tumor purity), and because I think using high-purity samples will likely enable us to determine whether the two analyses reproduce each other. That is, keeping more samples (e.g., in groups with lower median cutoffs) will increase the N of lower-purity samples, so it may not help us with the rebuttal or detect what we should. I think our current plan to just rerun modules with the highly pure samples is good!

> specific thoughts on TP53 and EXTEND, which you mentioned briefly? In my mind, EXTEND is sort of "low-hanging fruit" as a place to start on a subsetted analysis.

The reviewer specifically mentioned UMAP, so I think that we should try this first, although we will probably lose too many samples to see "better" separation here. Either way, if that is the case, we should say so in the rebuttal. I also thought GSVA would be a higher priority because it gets at the overall oncogenic biological pathways that are upregulated - if we see that our total results are not impacted by purity (which I surmise they will not be, since we obtained results consistent with the literature), then we can check off the need to correct for tumor purity.

I think TP53/EXTEND might be slightly more difficult (or maybe I am overthinking!) because EXTEND doesn't have cutoffs for high/low activity - it's more continuous, so would we just replot the distributions? For TP53, we could redo the analysis and get a new ROC (this also depends on how enriched or "de-enriched" (is that a word?) for high-confidence TP53 mutations our cohort becomes). We can also plot the expression by phenotype (loss, activated, other) and see if we see the same results. We can do all 4 for sure! We just may not need to, but it could be easy enough!

sjspielman commented 1 year ago

@jharenza For EXTEND and TP53, I'm thinking exactly what you say - replot distributions (or re-run ROC) and see if the conceptual interpretations hold up. We'd like to see if certain trends across cancer groups (like we're showing in Fig 4) are recapitulated with the high tumor purity data. Nothing too fancy besides some if statements for filtering and re-running existing code (🤞 ).

sjspielman commented 1 year ago

> However, this notebook is also very long. I wonder if generating the results files – after the decisions are recorded in the notebook – is a logical place to split this up.

@jaclyn-taroni I can also split this up into one notebook for overall exploration and one for exploring different thresholds. Basically, pop everything under the Exploring thresholding header into a new notebook. Edit - I went ahead and did this in 38c0a34. Much more manageable in two notebooks!

AlexsLemonade / OpenPBTA-analysis

Initiate tumor purity module #1622