Add survival curves by TP53 and telomerase scores

jharenza commented 2 years ago

Purpose/implementation Section

What scientific question is your analysis addressing?

Plotting survival curves for all PBTA, HGAT only, and non-HGAT samples for various categorizations of TP53 and telomerase scores for publication

What was your approach?

Important points:

Utilized primary stranded RNA samples only (primary so that the survival curves reflect a single phase of therapy, and stranded because the TP53 classifier did not perform well on the polyA dataset)
Binned TP53 and telomerase scores in the following ways:
1. tp53_0.5 = TP53 high or low using a 0.5 cutoff
2. tel_0.5 = telomerase high or low using a 0.5 cutoff
3. tp53_strict = TP53 stricter cutoff using <= 25th and >= 75th quantiles as low/high and in between == mid
4. tel_strict = telomerase stricter cutoff using 25th and 75th quantiles as low/high and in between == mid
5. pheno_0.5 = TP53 & telomerase scores binned together using 0.5 as cutoff in both to be high or low
6. pheno_strict = TP53 & telomerase scores binned together using 25th and 75th quantiles to determine high/low/mid
7. pheno_extremes = TP53 & telomerase scores binned together using only 25th and 75th quantiles to determine high/low (left mid out of plot)
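The two single-score binning schemes above can be sketched as follows. This is only an illustration in Python (the module itself is written in R), and it assumes that a score exactly at 0.5 counts as "high" and that missing scores stay missing; neither detail is stated above. The function names `bin_half` and `bin_strict` are hypothetical.

```python
from statistics import quantiles


def bin_half(score, cutoff=0.5):
    """Label a score 'high' or 'low' around a fixed cutoff (assumption: >= cutoff is high)."""
    if score is None:
        return None  # assumption: missing scores are left unlabeled
    return "high" if score >= cutoff else "low"


def bin_strict(scores):
    """Label scores low/mid/high using the 25th and 75th percentiles of the observed scores."""
    observed = [s for s in scores if s is not None]
    # method="inclusive" interpolates linearly between the observed data points
    q1, _, q3 = quantiles(observed, n=4, method="inclusive")
    labels = []
    for s in scores:
        if s is None:
            labels.append(None)
        elif s <= q1:
            labels.append("low")
        elif s >= q3:
            labels.append("high")
        else:
            labels.append("mid")
    return labels


# Made-up scores, for illustration only
example_scores = [0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95]
half_labels = [bin_half(s) for s in example_scores]
strict_labels = bin_strict(example_scores)
```

With the fixed cutoff every sample is labeled, whereas the strict scheme leaves roughly the middle half of the samples in "mid".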

What GitHub issue does your pull request address?

NA

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

No

Results

What types of results are included (e.g., table, figure)?

KM plots and results tables

What is your summary of the results?

As expected, high telomerase scores/activity correspond to lower survival across all tumor types. High TP53 scores also correlate with lower survival compared to lower scores in all PBTA and non-HGAT samples. However, all HGAT have poor survival, which masks any trend with TP53/telomerase, so I took HGAT out of the full analysis. In both all PBTA and non-HGAT PBTA, we observe that higher TP53 scores are associated with a survival advantage. This has previously been reported in gliomas in "Association of Mutant TP53 with Alternative Lengthening of Telomeres and Favorable Prognosis in Glioma" https://cancerres.aacrjournals.org/content/66/13/6473. In that study, telomerase was measured using a PCR assay and TP53 mutations were identified via sequencing. Here, we can confirm these previous results using RNA-Seq-based classification.

Reproducibility Checklist

[ ] The dependencies required to run the code in this pull request have been added to the project Dockerfile.
[ ] This analysis has been added to continuous integration.

Documentation Checklist

[ ] This analysis module has a README and it is up to date.
[ ] This analysis is recorded in the table in analyses/README.md and the entry is up to date.
[ ] The analytical code is documented and contains comments.

runjin326 commented 2 years ago

@jashapiro , I have made the following updates: 1) Printed the statistical results for all models 2) Added a README.md for the module and added more description/comments in the notebook 3) Made the function in quantile_calc.R more succinct - please check whether it makes sense.

There are still some issues left: 1) Some legends still get cut off in the figures; any suggestions as to how to fix that? 2) When running pairwise_survdiff with HGAT-only samples on pheno_strict and pheno_extremes, I got:

Error in survdiff.fit(y, groups, strata.keep, rho) : 
There is only 1 group

However, I checked and there is always more than one group in the df. I also tried setting the pheno column as a factor before calling the function, but neither approach worked. So I wrapped the call in try(), and those two comparisons output an error message instead of actual statistics. Any idea what else I can try?

3) Another question is just clarification - @jharenza: previously for pheno_strict, other than 4 extreme calls, only tp53_low telomerase_mid and tp53_mid telomerase_low were defined and everything else was called tp53 tel mid. This would count NA, tp53_high telomerase_mid and tp53_mid telomerase_high as mid as well? Is this intentional? Currently, I modified it to be a combination of tp53_strict and telomerase_strict but feel free to let me know if that is not what you want and I can change it back.
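To make the modified behavior concrete, here is a minimal sketch of building the combined label from the two per-score labels, so that all nine combinations are named explicitly and a missing label stays missing rather than being folded into "mid". It is written in Python purely for illustration (the notebook is in R); `combine_strict` is a hypothetical name, and the label strings follow the "tp53_low telomerase_mid" pattern used above.

```python
from itertools import product


def combine_strict(tp53_label, tel_label):
    """Combine per-score low/mid/high labels into a single phenotype label."""
    if tp53_label is None or tel_label is None:
        return None  # keep NA as NA instead of counting it as mid
    return f"tp53_{tp53_label} telomerase_{tel_label}"


# Enumerate every combination the combined column can take
levels = ["low", "mid", "high"]
all_phenotypes = [combine_strict(a, b) for a, b in product(levels, levels)]
```

Under this scheme tp53_high telomerase_mid and tp53_mid telomerase_high are their own groups.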

I will now look into the files used in CI and try to figure out what the issue might be.

jharenza commented 2 years ago

other than 4 extreme calls, only tp53_low telomerase_mid and tp53_mid telomerase_low were defined and everything else was called tp53 tel mid. This would count NA, tp53_high telomerase_mid and tp53_mid telomerase_high as mid as well? Is this intentional?

Oh no this was not intentional- thank you for catching this!

jharenza commented 2 years ago

@jashapiro I think this is ready for another look

runjin326 commented 2 years ago

@jashapiro - thanks for reviewing this! I think removing pheno_extremes is fine and I can implement using the median instead of 0.5 for splitting. I think the idea behind all these pheno groups is to see which ones give statistically significant survival results, and I agree that pheno_strict can be the main one we look at. And yes, we can use multiple covariates for survival analysis (tp53_strict + telomerase_strict) instead of the combined pheno_strict column, but I will need to modify the function a little bit. Some questions for @jharenza before I make any changes: 1) Should we use median instead of 0.5? 2) Is it OK to drop pheno_extremes? 3) Should we re-write the notebook to center around pheno_strict and then add analysis for other pheno later on in the notebook? 4) Do we want to modify the function to take two separate covariates (tp53_strict and telomerase_strict) instead of one pheno_strict? I will start implementing these once we have a consensus as to what to do next.

jharenza commented 2 years ago

Should we use median instead of 0.5?

I'm not sure if we should do this because the median will be biased higher or lower based on the cancer type in some cases and not necessarily reflect a predicted classification phenotype, for example, in HGG, where most classifier scores are high and TP53 is altered. We would not expect balanced groups there. (This was also seen previously in Osteosarcomas, where TP53 is the major driver, hence almost all tumors had scores >0.7). This is why I had looked at HGG separately from the rest of the samples and had planned to have the main figure of all cancers but a supplement showing the HGG only and one with HGG removed confirming that with or without removal, the overall results are the same.
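A small numeric illustration of this point, using made-up scores for a tumor type where nearly every classifier score is high: a median split forces two equal-sized groups, while a fixed 0.5 cutoff reflects that almost all samples are predicted "high". The scores and the helper name `split_counts` are hypothetical.

```python
from statistics import median

# Hypothetical scores for a tumor type in which TP53 is almost always altered
hgg_like_scores = [0.72, 0.78, 0.81, 0.85, 0.88, 0.93]


def split_counts(scores, cutoff):
    """Count how many scores fall at/above versus below a cutoff."""
    high = sum(1 for s in scores if s >= cutoff)
    return {"high": high, "low": len(scores) - high}


fixed_split = split_counts(hgg_like_scores, 0.5)
median_split = split_counts(hgg_like_scores, median(hgg_like_scores))
```

Here the median split labels half of the samples "low" even though every one of them scores well above 0.5.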

My question here is more: should we use the top and bottom extremes and classify the rest as "mid" because we don't have the exact cutoff (0.5 was used previously in the original paper describing TP53 classifier and there wasn't one used for telomerase high/low), or just split at 0.5?

Additionally, as you mentioned @jashapiro we lose a lot of data using only extremes but the middle samples may also be harder to interpret using that terminology.

Is it OK to drop pheno_extremes?

Sure - I think we can subset out points for the plots using pheno_strict, which is essentially what is done currently

Should we re-write the notebook to center around pheno_strict and then add analysis for other pheno later on in the notebook?

This may be ok if you also mean you'd keep the individual tel and TP53 analyses as well.

Do we want to modify the function to take two separate covariates (tp53_strict and telomerase_strict) instead of one pheno_strict?

I'm not an expert in survival analyses, either, but I'm not sure these variables are independent since, biologically, they are linked, which is why I tried to bin the samples into high for both, low for both, mid for both, or a mix of high/low/mid. Can you do some research on this @runjin326 - what constitutes covariates and how to analyze them if they may be dependent on one another?

Maybe we also have to add the Cox proportional hazards function within this notebook for the covariates? https://www.datacamp.com/community/tutorials/survival-analysis-R

jashapiro commented 2 years ago

Maybe we also have to add the Cox proportional hazards function within this notebook for the covariates? https://www.datacamp.com/community/tutorials/survival-analysis-R

I think this may be the correct solution, as it would allow us to use the scores directly, rather than by turning them into a discrete variable. You don't get pretty KM plots out of it, but the stats are more appropriate to the data.

runjin326 commented 2 years ago

Thanks both for the input - I will do the following: 1) keep 0.5 as a cutoff for now 2) drop pheno_extreme category and first plot out survival using pheno_strict and then the rest (e.g., tel 0.5 and tp53 0.5) 3) Research on what to do when two covariates might be inter-dependent when defining the model 4) Implement CoxPH survival analysis and see how the results look (I can also work on making better figures from CoxPH models). Will ping you once everything is pushed and ready for another look.

jashapiro commented 2 years ago

@runjin326 I think we may want to hold off a bit before you go much further on this branch. We have been thinking about some changes that may require a more substantial restructuring. From discussions with @jaclyn-taroni, I think it will include transitioning to the coxph model, which should allow incorporating multiple covariates as well as continuous variables.

So I think steps 3 & 4 are probably fine to start to think about, but you may want to start working in a different branch/notebook. We will post more of a plan soon!

runjin326 commented 2 years ago

@jashapiro , sure thanks! I will hold off until a more solid plan is in place.

jaclyn-taroni commented 2 years ago

Okay, I considered waiting and making this more detailed but I have a busy day tomorrow and better here, less detailed, where others can see it than only in my head/notebook!

From what's here and chatting with @jharenza, here's my understanding of the hypotheses we want to test:

High TP53 classifier scores mean poorer OS
High telomerase activity scores mean poorer OS

We are also interested in whether TP53 alterations (where we will use the classifier score as our proxy) in the presence of high telomerase activity improve survival, based on some literature.

We also know that disease types have different OS.

We don't necessarily need to try to identify cut points to discretize the scores; we can instead use Cox regression.
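To show what using a score directly means in a Cox model, here is a deliberately tiny sketch: one continuous covariate, no tied event times, and the coefficient found by solving the score equation (the derivative of the log partial likelihood) with bisection. It is plain Python for illustration only; the actual analysis would presumably use `coxph` from the R survival package, and the data below are invented. The bisection assumes the score is positive at the lower bracket and negative at the upper one, which holds for this toy data.

```python
import math


def cox_score(beta, times, events, x):
    """Derivative of the Cox log partial likelihood for a single covariate (no tied events)."""
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects only contribute through the risk sets
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        weights = [math.exp(beta * x[j]) for j in risk]
        weighted_mean = sum(w * x[j] for w, j in zip(weights, risk)) / sum(weights)
        total += x[i] - weighted_mean
    return total


def fit_cox_1d(times, events, x, lo=-50.0, hi=50.0, iters=200):
    """Find beta where the score is zero; the score decreases as beta increases."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if cox_score(mid, times, events, x) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Invented toy data: survival time, event indicator (1 = death, 0 = censored), score in [0, 1]
toy_times = [2, 4, 5, 7, 9, 12]
toy_events = [1, 1, 1, 0, 1, 1]
toy_scores = [0.9, 0.4, 0.8, 0.3, 0.6, 0.2]

beta_hat = fit_cox_1d(toy_times, toy_events, toy_scores)
# Hazard ratio associated with a 0.1 increase in the score
hr_per_tenth = math.exp(0.1 * beta_hat)
```

A positive coefficient means that, in this toy data, higher scores go with higher hazard; no cut point had to be chosen to reach that statement.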

As @jashapiro mentioned, we chatted about this a bit today and we came up with the following general plan.

We understand that it is common practice to do univariate analyses before you perform the multivariate analysis. We propose we do the following univariate analyses:

TP53 classifier score (as a continuous variable)
EXTEND score (as a continuous variable)
Cancer group, removing cancer groups with fewer than 10 samples
HGG vs. non-HGG
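The cancer group filter in the list above could look something like the sketch below, which drops any group with fewer than 10 samples before modeling. It is illustrative Python rather than the project's R code; the `cancer_group` field name, the group names, and the `keep_large_groups` helper are all hypothetical.

```python
from collections import Counter


def keep_large_groups(samples, min_n=10):
    """Keep only samples whose cancer group has at least min_n samples."""
    counts = Counter(s["cancer_group"] for s in samples)
    return [s for s in samples if counts[s["cancer_group"]] >= min_n]


# Invented example: one group with 12 samples and one with 3
example_samples = (
    [{"sample_id": f"A{i}", "cancer_group": "group_A"} for i in range(12)]
    + [{"sample_id": f"B{i}", "cancer_group": "group_B"} for i in range(3)]
)
filtered = keep_large_groups(example_samples)
```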

I understand from @jharenza that we are interested in demonstrating if a relationship exists between the TP53 scores and telomerase activity and OS in brain cancers. That can be accomplished with these univariate analyses (but we also may decide/find out that's not a good idea because we need to include disease label as a covariate!).

Finally, we can build a multivariate Cox regression analysis (TP53 classifier score, EXTEND score and whatever disease label we determine is appropriate). @envest has generously agreed to help us out with the specifics of setting up that analysis to make sure we're doing what we want to be doing! One of the ways I think about all of this, as a non-expert, is that we want to control for disease type. Then we can plan to include something like a forest plot to display this information.

To summarize the biggest departures from what we're currently doing: 1. we won't "slice" by histology, we'll include it in the model instead 2. we won't try to discretize the scores

As far as moving forward with this plan, should folks agree: I think it might be easier to start a new branch and notebook (that would be my preference as a coder/analyst), but we should keep this branch around.

I also want to mention that we don't need to worry about publication ready figures just yet. We have this notion of the separate figures directory because sometimes it's helpful to keep initial analysis and figure polishing separate. I think it's better to focus on getting the analysis right with the notebook I'm proposing and then we can polish the viz later!

jharenza commented 2 years ago

Thank you @jaclyn-taroni! This all sounds good!

As far as moving forward with this plan, should folks agree: I think it might be easier to start a new branch and notebook (that would be my preference as a coder/analyst), but we should keep this branch around.

@runjin326 when you start on the above, will you create a new branch? Thanks!

runjin326 commented 2 years ago

Yes - I will start working on this from a new branch.

jaclyn-taroni commented 2 years ago

I'm going to close this to avoid confusion – we should be taking a look at #1212 now!

AlexsLemonade / OpenPBTA-analysis