Closed PichaiRaman closed 4 years ago
I may be able to help out here. I've done this with CBTTC data for demonstration purposes, and I've also been through the back and forth about how to get this from that data structure.
@alexfelmeister - Thanks for volunteering! We have to work with @allisonheath to hammer down survival before adding it to the clinical file, so I have marked this issue as blocked for now.
Survival has been added to the clinical file with release-v4-20190909.
I would like to potentially tackle this. I will work on not only categorical features such as mutations or fusions but was also going to look at RNA-Sequencing data and any other continuous measure. I have some experience with this and have some code pretty much ready to go
https://www.sciencedirect.com/science/article/abs/pii/S2210776218304897
I was also thinking of potentially using other derived data such as Gene Set Scores, TMB, and immune cell type enrichment scores.
@PichaiRaman do you have an idea of when you expect to file the first pull request for this?
I also wanted to ask if the code you mentioned above was publicly available. If so, we could potentially have someone else get started on adapting it for this project and have you review. Let me know what you think!
@PichaiRaman should we un-assign you from this?
I'm going to start looking into this and planning for it this week.
A few questions for me to get this analysis going:
It seems we will want some Kaplan-Meier survival probability curves like these from Pajtler et al, 2015. Is this a good example to follow? I can build it in such a way that we can split it up by short_histology for now, but then when we get molecular subtyping information, we can graph the data by subtype.
I notice some mention of survival analyses on #27 but it is unclear to me what, if anything, would be needed for that analysis from this one.
I've seen survminer and survMisc used in this nice tutorial.
Are there other, more specific to cancer genomics, packages I should look into?
NA for OS_days? What do NAs represent and how should I use them (or not use them)?
Another question, probably for @jharenza at this point: are there any items that one would want to adjust for? For adult tumors this could be something like age at the time of diagnosis. If so, it would be good to write this such that something like this can be specified. If not, there'd be no need to include that functionality.
I like https://lifelines.readthedocs.io/en/latest/. Good tutorials and good information about censoring for missing data and what to do and when with records where we don't know OS days etc.
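To make the missing-versus-censored distinction concrete, here is a minimal sketch in plain Python. It assumes each row is a dictionary with an OS_days field and a hypothetical OS_status field whose value "DECEASED" marks an observed death; the function name, the OS_status field, and its labels are illustrative assumptions, not something confirmed in this thread. It sets aside rows with a missing duration instead of treating them as censored, which is only one possible choice until we know what NA actually represents.

```python
import math


def prepare_survival_records(rows, time_key="OS_days",
                             status_key="OS_status",   # hypothetical column name
                             event_label="DECEASED"):  # hypothetical label
    """Split rows into usable (duration, event) pairs and rows without a duration.

    A row with a known duration but no observed death is kept as censored
    (event=False). A row with no usable duration cannot be placed on the time
    axis at all, so it is returned separately for the analyst to inspect.
    """
    usable, skipped = [], []
    for row in rows:
        try:
            duration = float(row.get(time_key))
        except (TypeError, ValueError):
            # None, "NA", "" and other non-numeric values end up here.
            skipped.append(row)
            continue
        if math.isnan(duration):
            skipped.append(row)
            continue
        event_observed = row.get(status_key) == event_label
        usable.append((duration, event_observed))
    return usable, skipped


if __name__ == "__main__":
    example = [
        {"OS_days": "120", "OS_status": "DECEASED"},
        {"OS_days": "NA", "OS_status": "LIVING"},
        {"OS_days": 300, "OS_status": "LIVING"},
    ]
    kept, dropped = prepare_survival_records(example)
    print(kept)
    print(len(dropped), "row(s) without a usable duration")
```

Returning the skipped rows, rather than silently discarding them, keeps it easy to report how many samples were excluded from any given curve.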
@cgreene - I don't think we need to adjust for that per se, as usually that is embedded within subtype info (eg DIPGs segregate by age based on the two different histone variants so we would probably do separate survival analyses due to that). I do think maybe we should try to list out certain variables in which we want to test whether survival is higher/lower, as @PichaiRaman mentioned in the description. Eg: were there any significant findings for any tumor type in GSEA #133 and for that tumor type, was there a survival difference based on the enriched samples for X pathway vs not. (I think that is what he was going for here). Similarly with TMB, if we have any tumor types with a good distribution of TMBs - low, high, ultra-high (or however they are designated), we can assess survival based on that variable. If we find novel recurrent fusions or fusion partners in a tumor type/class of tumors, is survival different based on the presence of these?
@cansavvy echoing @jharenza's sentiment here a bit: I would conceptualize survival analysis as the downstream analysis. That is to say, you want to design this such that it can consume both continuous and categorical information from other analyses, so I believe you'll want to plan to support Cox regression, Kaplan-Meier estimates, log-rank tests, and whatever plots are appropriate for each of those.
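For the categorical case, the two pieces named above can be sketched in a few lines of plain Python: a Kaplan-Meier estimate and a two-group log-rank test. This is only an illustration under stated assumptions, not the project's implementation. The function names and the toy inputs are hypothetical, durations are assumed to be non-missing numbers, and Cox regression for continuous variables is not covered here. In practice a maintained package (survival/survminer in R, or lifelines in Python) would be the safer choice.

```python
import math
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # every subject leaves the risk set at its own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


def logrank_test(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value), 1 degree of freedom."""
    all_durations = list(durations_a) + list(durations_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_durations, all_events) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in durations_a if x >= t)
        n_b = sum(1 for x in durations_b if x >= t)
        d_a = sum(1 for x, e in zip(durations_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(durations_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of events in group A.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi_square = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


if __name__ == "__main__":
    # Toy data only; True means the event (death) was observed.
    print(kaplan_meier([1, 2, 3, 4], [True, False, True, True]))
    print(logrank_test([1, 2, 3], [True] * 3, [10, 11, 12], [True] * 3))
```

Keeping the estimator and the test as separate functions that take plain durations and event flags means any upstream analysis (fusion calls, TMB bins, pathway enrichment) only has to supply a grouping.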
Okay. This all gives me something I can start with. 👍
We have a basic survival analysis notebook that lays out the various questions and models for survival analysis depending on what independent variables we have questions about.
For the next steps, what would we like to see? Is there a specific list of variables that we know we want modeled with survival? Or should I create a sort of function that can be applied to other data and give back a plot or table?
What kind of set up would people get the most use out of and would best set us up for the upcoming survival based scientific questions?
@sjspielman Do you have any thoughts/recommendations about these questions above as far as future structure?
Thanks for the ping - frankly, this is not my knowledge domain in terms of what variables would be appropriate for survival analysis, but I think a function would be ideal if possible so that all survival analyses are as consistent as possible. Using a tidyeval setup to have users pass in the specific variables, ensure inputted types are acceptable, and return a plot and two tables as you've done in the template?
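The shape of such a function (caller names the variables, inputs are checked, a table comes back) could look roughly like the sketch below. It is written in plain Python rather than R/tidyeval purely for illustration. The function name, the column names in the example, and the contents of the returned table are all hypothetical; it does not reproduce the plot or the two tables from the notebook template.

```python
def summarize_by_group(records, time_col, status_col, group_col):
    """Validate the requested columns and return a per-group summary table.

    records: list of dicts; time_col must hold non-negative numbers and
    status_col must hold booleans (True = event observed).
    Returns {group: {"n": ..., "events": ..., "total_time": ...}}.
    """
    if not records:
        raise ValueError("records is empty")
    for col in (time_col, status_col, group_col):
        missing = [i for i, row in enumerate(records) if col not in row]
        if missing:
            raise KeyError(f"column {col!r} missing from row(s) {missing}")

    summary = {}
    for i, row in enumerate(records):
        duration, event = row[time_col], row[status_col]
        # bool is a subclass of int, so reject it explicitly for durations.
        if isinstance(duration, bool) or not isinstance(duration, (int, float)):
            raise TypeError(f"row {i}: {time_col!r} must be numeric")
        if duration < 0:
            raise ValueError(f"row {i}: {time_col!r} must be non-negative")
        if not isinstance(event, bool):
            raise TypeError(f"row {i}: {status_col!r} must be True/False")
        group = summary.setdefault(
            row[group_col], {"n": 0, "events": 0, "total_time": 0.0}
        )
        group["n"] += 1
        group["events"] += int(event)
        group["total_time"] += duration
    return summary


if __name__ == "__main__":
    # Hypothetical column names and toy values, for illustration only.
    toy = [
        {"OS_days": 100, "died": True, "short_histology": "A"},
        {"OS_days": 250, "died": False, "short_histology": "A"},
        {"OS_days": 80, "died": True, "short_histology": "B"},
    ]
    print(summarize_by_group(toy, "OS_days", "died", "short_histology"))
```

Failing early with a message that names the offending column and row is the main point: every analysis that calls the function gets the same checks.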
Closing all planned analysis tickets in favor of opening new proposed analysis/updated analysis tickets as needed.
Comparison of genomic/transcriptomic features and derivatives such as TMB, pathway activity, etc. to survival. Determine markers that could add prognostic value to different PBTA cancer types. KM plots for significant and interesting features.