Tumor purity tp53 notebook

sjspielman commented 1 year ago

Closes #1624

This PR adds a notebook 10-tp53-tumor-purity-threshold.Rmd to the tp53_nf1_score module to complete the tumor purity re-analysis.

Notable file changes:

Whoops, I accepted this but didn't re-render notebooks before merge! https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1664#discussion_r1107830134 The 03 notebooks are re-rendered here.
Added 10-tp53-tumor-purity-threshold.Rmd and its output results/tumor-purity-threshold/10-tp53-tumor-purity-threshold.nb.html for result comparison
I added an export argument, with a default of TRUE, to the plot_roc() function so that plots can be shown in the notebook above without having to save ROCs to a file. So, I set export = FALSE in that notebook.
Added bullet to README.md about this notebook

The notebook itself raises some important manuscript revision points, including:

outdated P-values (this is an easy fix)
how many LFS patients?
Are all the hypermutants except for 1 really above 0.5?

Tagging in @jaclyn-taroni and @jharenza for some discussion here, since for the last two bullet points I'm not sure whether we have MS typos or I have some wonky calculations in this notebook.

Once this is merged and we have a game plan for all bullets that need game plans, I'll get those issues filed and moving along in OpenPBTA-manuscript.

Edit - It's also worth noting the new shuffled AUC is 0.34, which is rather less than 0.5.....

Another edit! - This notebook now also exports plots we can use in #1665. Associated changes were made in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1670/commits/b0793c534187ae705ea1e68c4203057df95fbbba and https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1670/commits/de513ac486be7f751ce148414159b8293248acb7.

sjspielman commented 1 year ago

Noting this passed CI through the tp53_nf1_score module before I merged in master.

sjspielman commented 1 year ago

In the spirit of moving https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1665 along, I'm going to go ahead and update this notebook in this PR to export some of the relevant plots.

jaclyn-taroni commented 1 year ago

how many LFS patients?

I poked around in the histologies file, independently of what you have here, and I believe 9 is correct.

jharenza commented 1 year ago

how many LFS patients?

I poked around in the histologies file, independently of what you have here, and I believe 9 is correct.

Will check this, but the MS states "tumors" (presumably those which have RNA-Seq and therefore TP53 scores).

jharenza commented 1 year ago

@sjspielman I am not seeing an HTML file for 10-tp53-tumor-purity-threshold.Rmd

jharenza commented 1 year ago

@sjspielman I am not seeing an HTML file for 10-tp53-tumor-purity-threshold.Rmd

Oh, I see them in a different folder. Can we keep them in the main folder, or do you want to move the actual Rmd files to the threshold folder?

sjspielman commented 1 year ago

@jharenza I am working on this now

@sjspielman I am not seeing an HTML file for 10-tp53-tumor-purity-threshold.Rmd

This is in the results/tumor-purity-threshold/ directory.

Also note the LFS is ok - there are a few samples per patient in there.

Something I'm finding while wrapping this up is that the shuffled AUC is not at all reproducible. I'm hunting down where some seed was probably not set..

jharenza commented 1 year ago

Something I'm finding while wrapping this up is that the shuffled AUC is not at all reproducible. I'm hunting down where some seed was probably not set..

👍

for

Here, it appears that all but 2 tumors have scores >0.5, so we may want to check this aspect as well.

I think I eyeballed this. Looking at panel 4D, I only see 7 tumors highlighted, but checking the bs_ids you posted in notebook 10 as hypermutators, I also see that they come from only 6 patients. I'm not sure we make it clear that figure 4D is not an independent sampling. Should we revise or clarify this?

> v23 %>%
+   filter(Kids_First_Biospecimen_ID %in% hypermutator_bs_ids) %>%
+   select(Kids_First_Participant_ID, Kids_First_Biospecimen_ID, pathology_diagnosis, tumor_descriptor) %>%
+   arrange(Kids_First_Participant_ID)
# A tibble: 8 × 4
  Kids_First_Participant_ID Kids_First_Biospecimen_ID pathology_diagnosis                              tumor_descriptor 
  <chr>                     <chr>                     <chr>                                            <chr>            
1 PT_0SPKM4S8               BS_VW4XN9Y7               High-grade glioma/astrocytoma (WHO grade III/IV) Initial CNS Tumor
2 PT_3CHB9PK5               BS_20TBZG09               High-grade glioma/astrocytoma (WHO grade III/IV) Initial CNS Tumor
3 PT_3CHB9PK5               BS_8AY2GM4G               High-grade glioma/astrocytoma (WHO grade III/IV) Progressive      
4 PT_EB0D3BXG               BS_F0GNWEJJ               Neuroblastoma                                    Progressive      
5 PT_JNEV57VK               BS_85Q5P8GF               High-grade glioma/astrocytoma (WHO grade III/IV) Initial CNS Tumor
6 PT_JNEV57VK               BS_P0QJ1QAH               High-grade glioma/astrocytoma (WHO grade III/IV) Progressive      
7 PT_S0Q27J13               BS_P3PF53V8               High-grade glioma/astrocytoma (WHO grade III/IV) Initial CNS Tumor
8 PT_VTM2STE3               BS_02YBZSBY               High-grade glioma/astrocytoma (WHO grade III/IV) Progressive
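
The sample-versus-patient distinction in the tibble above can be checked with a quick count. This is only a sketch: the ID pairs are copied from the output above, and the variable names are made up for illustration.

```python
from collections import Counter

# (participant ID, biospecimen ID) pairs copied from the tibble above
hypermutator_samples = [
    ("PT_0SPKM4S8", "BS_VW4XN9Y7"),
    ("PT_3CHB9PK5", "BS_20TBZG09"),
    ("PT_3CHB9PK5", "BS_8AY2GM4G"),
    ("PT_EB0D3BXG", "BS_F0GNWEJJ"),
    ("PT_JNEV57VK", "BS_85Q5P8GF"),
    ("PT_JNEV57VK", "BS_P0QJ1QAH"),
    ("PT_S0Q27J13", "BS_P3PF53V8"),
    ("PT_VTM2STE3", "BS_02YBZSBY"),
]

# Count how many biospecimens each participant contributes
samples_per_patient = Counter(patient for patient, _ in hypermutator_samples)

n_samples = len(hypermutator_samples)
n_patients = len(samples_per_patient)
patients_with_multiple = sorted(
    patient for patient, n in samples_per_patient.items() if n > 1
)

print(n_samples, n_patients, patients_with_multiple)
```

The 8 rows map to 6 participants, with two participants contributing both an initial and a progressive sample, which is why the rows are not independent samples.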

sjspielman commented 1 year ago

and not sure if we are clear figure 4D is not an indep sampling - should we revise or clarify this?

At this stage, I generally think clarifying >> revising. Just make sure this one says "samples" maybe and not "tumors"?

sjspielman commented 1 year ago

@jaclyn-taroni I might want to loop you back in for the reproducibility issue here. I have been observing that the shuffled AUC is not at all reproducible across runs, and I was hoping to track it down.

I tried re-setting a seed again in the function that actually performs the shuffling: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/dd675dd681a5a69b94ec753d1009ef259f2405a6/analyses/tp53_nf1_score/utils.py#L54-L61

using two approaches

Simply another np.random.seed(123) - didn't help
Using the newer NumPy seeding methods, which I believe we should be using at this 1.17 version anyway (though the existing np.random.seed() code is still supported as legacy in this version; see docs: https://numpy.org/doc/1.17/reference/random/generator.html). I updated the function as follows, but it also didn't help with reproducibility:
```
import numpy as np
rng = np.random.default_rng(123)
return rng.permutation(gene.tolist())
```
Also didn't help.
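
For reference, here is a minimal, self-contained sketch of seeded shuffling with the Generator API. It is not the module's actual code: `shuffle_each_column`, `shuffled_copy`, and the toy matrix are hypothetical, and it assumes the shuffle permutes values within each column. It only shows what a reproducible shuffle looks like in isolation, so it does not explain the non-reproducibility described here, which may come from a different source of randomness in the pipeline.

```python
import numpy as np


def shuffle_each_column(matrix, rng):
    """Permute the values within each column, drawing from one shared Generator."""
    shuffled = np.empty_like(matrix)
    for j in range(matrix.shape[1]):
        # Each call advances the same Generator, so columns get different orders
        shuffled[:, j] = rng.permutation(matrix[:, j])
    return shuffled


def shuffled_copy(matrix, seed):
    """Create the Generator once from `seed`, then shuffle all columns with it."""
    rng = np.random.default_rng(seed)
    return shuffle_each_column(matrix, rng)


# Toy matrix standing in for real data (hypothetical values)
toy = np.arange(20).reshape(10, 2)

first = shuffled_copy(toy, seed=123)
second = shuffled_copy(toy, seed=123)

# Two runs with the same seed should give identical shuffles
same_seed_matches = np.array_equal(first, second)
print(same_seed_matches)
```

Creating the Generator once and passing it in is a deliberate choice: constructing `np.random.default_rng(123)` inside a function that is called once per column would restart the stream on every call and apply the same permutation to every column of equal length.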

For my final run through of the tumor purity pipeline, AUC ended up at 0.49 which is what I've pushed in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1670/commits/a0fa06ec9e318ef5135a86fed72e68e86308c19f.

My gut still tells me this is a version vs. code problem, but it's not a quick fix. I don't want this to really hold us up, so perhaps this comment represents the official documentation that there is something funky specifically with reproducibility for shuffled tumor purity AUC (maybe shuffled more generally..).

jaclyn-taroni commented 1 year ago

So I would say: given that the display/calculation & reporting is a single instance of shuffling the gene labels (as I understand it), there are some inherent limitations, and no result you've encountered really materially changes interpretation. I think noting it in the README is appropriate.
