Prelim results added for survival analysis

runjin326 commented 2 years ago

Purpose/implementation Section

What scientific question is your analysis addressing?

This PR addresses the discussion we have in this PR

What was your approach?

Currently, the notebook only does univariate analysis for the following:

TP53 classifier score (as a continuous variable)
EXTEND score (as a continuous variable)
Cancer group, removing cancer groups with fewer than 10 samples
HGG vs. non-HGG

The Cox regression cannot be plotted, but the p-values were output. For the categorical variables (items 3 and 4 above: cancer group and HGG vs. non-HGG), survival plots were generated.
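For readers who want a concrete picture of the two categorical steps, here is a minimal sketch, written in Python purely for illustration (the notebook itself is in R and its code is not shown in this thread). It assumes the survival plots are Kaplan-Meier-style curves; the function names, the `cancer_group` key, and the toy data in the usage note are all hypothetical.

```python
from collections import Counter


def drop_small_groups(records, key="cancer_group", min_n=10):
    """Keep only records whose group has at least min_n samples."""
    counts = Counter(r[key] for r in records)
    return [r for r in records if counts[r[key]] >= min_n]


def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times:  follow-up time for each sample
    events: 1 if death was observed at that time, 0 if censored
    Returns [(time, survival_probability)] at each time with a death.
    """
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths + censored at t
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve
```

For example, `kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])` steps down at times 1, 3, and 4; the sample censored at time 2 leaves the risk set without lowering the curve.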

What GitHub issue does your pull request address?

NA

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

To keep things simple, I did not use the survival_analysis function in the notebook. Please check whether the code that fits the survival models and generates the plots makes sense.

Is there anything that you want to discuss further?

Based on the results, it looks like disease label definitely has an effect on survival (as do TP53 score and telomerase activity). What should we use as the disease label for the multivariate analysis?
Any additional analysis that would be of interest? Should we set up the forest plot?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

Results

What types of results are included (e.g., table, figure)?

Plots

coxph_survival_per_hgg_group.png
coxph_survival_per_short_histology.png

Results

cox_reg_results_per_telomerase_score.tsv
cox_reg_results_per_tp53_score.tsv
log_rank_survival_per_short_histology.RDS
log_rank_survival_per_hgg_group.RDS

What is your summary of the results?

All 4 models that we looked at produced significant results.

Reproducibility Checklist

[x] The dependencies required to run the code in this pull request have been added to the project Dockerfile.
[x] This analysis has been added to continuous integration.

Documentation Checklist

[ ] This analysis module has a README and it is up to date.
[ ] This analysis is recorded in the table in analyses/README.md and the entry is up to date.
[x] The analytical code is documented and contains comments.

jaclyn-taroni commented 2 years ago

I'm going to take the first look at this and then I'll loop in @envest for thoughts on next steps!

runjin326 commented 2 years ago

@envest - thanks so much for the detailed review! I have made the suggested changes and commented/resolved some of your comments above. I think this is ready for another look! @jaclyn-taroni

jaclyn-taroni commented 2 years ago

CI failure is unrelated to the changes in this PR. I just hit rerun.

jaclyn-taroni commented 2 years ago

I noticed that the step testing these changes ran after the one that timed out, so in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1212/commits/4ad379100f07710967c24a517c9b1feaeab19f9f I moved it up. That way, if we still have the timeout problem, we can at least check whether everything for these changes looks okay!

runjin326 commented 2 years ago

A summary of the development:

1) Using broad_histology_display still does not converge - hence I added the oncoprint_group which seemed to be converging and generated informative results
2) Density plot stratified by oncoprint_group is also included

Things to discuss for next steps:

1) Do we want to only keep the Initial CNS Tumor during the histology file filtering step?
2) Does oncoprint_group make sense? If so, I can then remove broad_histology_display part. Alternatively, other suggestions for grouping them are welcome.

envest commented 2 years ago

A summary of the development:

Using broad_histology_display still does not converge - hence I added the oncoprint_group which seemed to be converging and generated informative results

Density plot stratified by oncoprint_group is also included

The addition of oncoprint_group looks interesting -- according to the model, within the same oncoprint_group, tp53 and extend are less important as predictors.

Things to discuss for next steps:

Do we want to only keep the Initial CNS Tumor during the histology file filtering step?

Does oncoprint_group make sense? If so, I can then remove broad_histology_display part. Alternatively, other suggestions for grouping them are welcome.

My feeling on oncoprint_group is: would this analysis make sense a priori before looking at the data? If yes, then that's something to consider including. Unfortunately I am not up to speed on the biological implications of oncoprint_group for this project.

With the multivariate models and visualizations in place 👍 , I think it's best I leave it to @jaclyn-taroni to help wrap up / summarize next steps.

jaclyn-taroni commented 2 years ago

To weigh in on the oncoprint_group discussion, I'm not sure that makes sense here or anywhere outside of the specific purpose it is used for – we expect it to only be used in display for Oncoprints, where individual cancer groups are also displayed and when we have essentially curated lists of genes to display.

runjin326 commented 2 years ago

@jaclyn-taroni, thanks for the feedback! Yes - I tapped into this column because the broad_histology_display groups are too granular for the multivariate analysis, so I was trying to see whether there were even broader terms to use. Since this is not desired, should we just drop it and broad histology and only keep HGAT vs. non-HGAT for our final analysis?

jaclyn-taroni commented 2 years ago

Since this is not desired, should we just drop it and broad histology and only keep HGAT vs. non-HGAT for our final analysis?

Yeah, I think that sounds good @runjin326, thank you! Those comparisons both seem well-justified to me, but only one of them (HGAT vs. non-HGAT) appears to have sufficient data.

runjin326 commented 2 years ago

@jaclyn-taroni, changes pushed - now the only question would be the sample selection portion.

jaclyn-taroni commented 2 years ago

now the only question would be the sample selection portion.

I can definitely see an argument for sticking with Initial CNS tumor only. In that case, I don't know why we need to use the primary plus list of independent specimens unless there's no other way to consistently pick an Initial CNS Tumor specimen (if there are multiple) across different analyses without using that list.

runjin326 commented 2 years ago

@jaclyn-taroni, I see your point now! So I modified the code to not use the independent RNA primary-plus list and instead called distinct(sample_id, .keep_all=TRUE) on the meta-indep after combining the TP53 and telomerase scores. Please check to see whether it looks good now.
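As a rough illustration of what that deduplication does, here is a small Python sketch of "keep the first row per sample_id, with all columns", which is how dplyr's distinct() behaves with .keep_all = TRUE. The helper name `distinct_by` and the example rows in the test are hypothetical; the actual notebook code is in R and is not reproduced here.

```python
def distinct_by(rows, key):
    """Keep the first row seen for each value of `key`, preserving order.

    Mirrors the behaviour of dplyr::distinct(<key>, .keep_all = TRUE):
    all columns are kept, and later rows with a repeated key are dropped.
    """
    seen = set()
    kept = []
    for row in rows:
        value = row[key]
        if value not in seen:
            seen.add(value)
            kept.append(row)
    return kept
```

Because the first occurrence wins, which specimen survives for a given sample_id depends on the row order of the table at the time of the call, so a stable sort before deduplicating may be worth considering if the choice needs to be consistent across analyses.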

runjin326 commented 2 years ago

@jaclyn-taroni - I have made corresponding changes. Please review!

jaclyn-taroni commented 2 years ago

Thanks @runjin326 - looks good! I'll merge once CI passes.
