AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Planned Analysis: Survival Analysis across PBTA #18

Closed PichaiRaman closed 4 years ago

PichaiRaman commented 5 years ago

Comparison of genomic/transcriptomic features and derivatives (such as TMB, pathway activity, etc.) to survival. Determine markers that could add prognostic value for different PBTA cancer types. KM plots for significant and interesting features.

alexfelmeister commented 5 years ago

I may be able to help out here. I've done this with CBTTC data for demonstration purposes, and I've also been through the back and forth of how to derive survival from that data structure.

jharenza commented 5 years ago

@alexfelmeister - Thanks for volunteering! We have to work with @allisonheath to hammer down survival before adding it to the clinical file, so I have marked this issue as blocked for now.

jharenza commented 5 years ago

Survival has been added to the clinical file with release-v4-20190909.

PichaiRaman commented 4 years ago

I would like to tackle this. I plan to work not only on categorical features such as mutations or fusions but also on RNA-Sequencing data and any other continuous measures. I have some experience with this and have some code pretty much ready to go:

https://www.sciencedirect.com/science/article/abs/pii/S2210776218304897

I was also thinking of potentially using other derived data such as Gene Set Scores, TMB, and immune cell type enrichment scores.

jaclyn-taroni commented 4 years ago

@PichaiRaman do you have an idea of when you expect to file the first pull request for this?

I also wanted to ask if the code you mentioned above was publicly available. If so, we could potentially have someone else get started on adapting it for this project and have you review. Let me know what you think!

jharenza commented 4 years ago

@PichaiRaman should we un-assign you from this?

cansavvy commented 4 years ago

I'm going to start looking into this and planning for it this week.

cansavvy commented 4 years ago

A few questions for me to get this analysis going:

1) What basic analyses and output plots/tables do we need for this issue? Can I get a list of the plots/tables we need?

It seems we will want some Kaplan-Meier survival probability curves like these from Pajtler et al., 2015. Is this a good example to follow? I can build it so that we split the data by short_histology for now; then, when we get molecular subtyping information, we can graph the data by subtype.

[Screenshot: Kaplan-Meier survival curves from Pajtler et al., 2015]

2) Related to question 1, are there downstream analyses that will use this information, and what information (file format, output) will they need?

I notice some mention of survival analyses on #27 but it is unclear to me what, if anything, would be needed for that analysis from this one.

3) Do you have recommended tools or packages that I should look at?

I've seen survminer and survMisc used in this nice tutorial. Are there other packages, more specific to cancer genomics, that I should look into?

4) A more specific question: how should I handle the data when a sample has an NA for OS_days?

What do the NAs represent, and how should I use them (or not use them)?

cgreene commented 4 years ago

Another question, probably for @jharenza at this point: are there any items that one would want to adjust for? For adult tumors this could be something like age at the time of diagnosis. If so, it would be good to write this such that something like this can be specified. If not, there'd be no need to include that functionality.

alexfelmeister commented 4 years ago

I like https://lifelines.readthedocs.io/en/latest/. Good tutorials and good information about censoring for missing data and what to do and when with records where we don't know OS days etc.
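To make the censoring point concrete, here is a minimal pure-Python sketch of a Kaplan-Meier estimate. It is not lifelines code. The `kaplan_meier` helper, the toy values, and the decision to drop records whose OS days are missing are all assumptions for illustration; what an NA actually means in the clinical file still has to be confirmed with the data providers.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time in days, or None when unknown (stand-in for NA).
    events: True if death was observed, False if the record is censored.
    """
    # Assumption: a record with no follow-up time cannot be placed on the
    # time axis, so it is dropped rather than treated as censored.
    pairs = [(t, e) for t, e in zip(durations, events) if t is not None]

    deaths = Counter(t for t, e in pairs if e)
    exits = Counter(t for t, _ in pairs)  # deaths and censorings both leave the risk set

    at_risk = len(pairs)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        if deaths[t]:
            # Censored records still count as at risk up to their last follow-up.
            survival *= 1 - deaths[t] / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Toy data, not PBTA values. The third record has unknown OS days.
os_days = [100, 200, None, 200, 300, 400]
died = [True, False, True, True, True, False]
curve = kaplan_meier(os_days, died)
```

A censored record (died is False) lowers the number at risk for later time points without causing a drop in the curve, which is the difference between "censored" and "missing".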

jharenza commented 4 years ago

@cgreene - I don't think we need to adjust for that per se, as that is usually embedded within subtype info (e.g., DIPGs segregate by age based on the two different histone variants, so we would probably do separate survival analyses for them). I do think we should list the variables for which we want to test whether survival is higher or lower, as @PichaiRaman mentioned in the description. For example: were there any significant findings for any tumor type in GSEA #133 and, for that tumor type, was there a survival difference between samples enriched for pathway X and those that were not? (I think that is what he was going for here.) Similarly with TMB: if we have any tumor types with a good distribution of TMBs (low, high, ultra-high, or however they are designated), we can assess survival based on that variable. If we find novel recurrent fusions or fusion partners in a tumor type or class of tumors, is survival different based on their presence?

jaclyn-taroni commented 4 years ago

@cansavvy echoing @jharenza's sentiment here a bit: I would conceptualize survival analysis as the downstream analysis. That is to say, you want to design this so that it can consume both continuous and categorical information from other analyses, so I believe you'll want to plan to support Cox regression, Kaplan-Meier estimates, log-rank tests, and whatever plots are appropriate for each of those.
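For the categorical case, a two-group log-rank test can be sketched in plain Python as below. This is an illustrative sketch only: the `logrank_test` name and the toy data are assumptions, and a real analysis would more likely rely on an established survival package. Cox regression for continuous variables is not covered here.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)  # at risk, both groups
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk, group A
        d = sum(1 for u, e, _ in data if u == t and e)  # deaths at t, both groups
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        # Expected deaths in group A if both groups shared one hazard.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group A death count at time t.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi_square = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Toy data, not PBTA values: every patient in both groups has an observed death.
early = [1, 2, 3]
late = [10, 11, 12]
observed = [True, True, True]
chi_square, p_value = logrank_test(early, observed, late, observed)
same_chi_square, same_p_value = logrank_test(early, observed, early, observed)
```

Comparing a group with itself gives a statistic of zero, which is a quick sanity check on the bookkeeping.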

cansavvy commented 4 years ago

Okay. This all gives me something I can start with. 👍

cansavvy commented 4 years ago

We have a basic survival analysis notebook that lays out the various questions and models for survival analysis depending on what independent variables we have questions about.

For the next steps, what would we like to see? Is there a specific list of variables that we know we want to model against survival? Or should I create a function that can be applied to other data and returns a plot or table?

What kind of setup would people get the most use out of, and what would best prepare us for the upcoming survival-based scientific questions?

cansavvy commented 4 years ago

@sjspielman Do you have any thoughts/recommendations about these questions above as far as future structure?

sjspielman commented 4 years ago

Thanks for the ping. Frankly, this is not my knowledge domain in terms of what variables would be appropriate for survival analysis, but I think a function would be ideal if possible, so that all survival analyses are as consistent as possible. Perhaps a tidyeval setup that has users pass in the specific variables, ensures the input types are acceptable, and returns a plot and two tables, as you've done in the template?
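The shape of such a function could look roughly like the sketch below. It is written in Python rather than R, so plain column-name strings stand in for tidyeval, and the plot is left out. The `survival_summary` name, the list-of-dicts layout, the boolean `died` column, and the group labels are hypothetical. Only `OS_days` and `short_histology` are names taken from this thread.

```python
def survival_summary(records, time_col, status_col, group_col):
    """Validate inputs and return (group_table, dropped_rows).

    records: list of dicts, one per patient (hypothetical layout).
    """
    required = (time_col, status_col, group_col)
    for i, row in enumerate(records):
        missing = [c for c in required if c not in row]
        if missing:
            raise KeyError(f"row {i} is missing columns: {missing}")

    kept, dropped = [], []
    for row in records:
        t, s = row[time_col], row[status_col]
        # bool is a subclass of int, so reject it explicitly for the time column.
        time_ok = isinstance(t, (int, float)) and not isinstance(t, bool) and t >= 0
        status_ok = isinstance(s, bool)
        (kept if time_ok and status_ok else dropped).append(row)

    groups = {}
    for row in kept:
        stats = groups.setdefault(row[group_col], {"n": 0, "events": 0})
        stats["n"] += 1
        stats["events"] += int(row[status_col])

    group_table = [{"group": g, **stats} for g, stats in sorted(groups.items())]
    return group_table, dropped


# Toy data, not PBTA values.
patients = [
    {"OS_days": 120, "died": True, "short_histology": "group_1"},
    {"OS_days": 400, "died": False, "short_histology": "group_1"},
    {"OS_days": None, "died": True, "short_histology": "group_2"},
    {"OS_days": 250, "died": True, "short_histology": "group_2"},
]
group_table, dropped = survival_summary(patients, "OS_days", "died", "short_histology")
```

Returning the rows that failed validation alongside the summary keeps exclusions visible instead of silently discarding them.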

jaclyn-taroni commented 4 years ago

Closing all planned analysis tickets in favor of opening new proposed analysis/updated analysis tickets as needed.