AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project
Other
99 stars 66 forks source link

Survival analysis for immune cells added #1264

Closed runjin326 closed 2 years ago

runjin326 commented 2 years ago

Purpose/implementation Section

What scientific question is your analysis addressing?

Determining whether there is an immune cell type which might predict better or worse prognosis for cancer types and/or molecular subtypes within a cancer group.

What was your approach?

  1. Perform multivariate survival analysis using immune cell types (minus uncharacterized), PDL1 expression, and cancer group as covariates
  2. Within each cancer group, perform survival analysis using immune cell types (minus uncharacterized), PDL1 expression, and molecular subtype as covariates (while acknowledging we may have low Ns here)

What GitHub issue does your pull request address?

https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1258

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Two things: 1) For using immune cell results - each immune cell type is considered a covariate and I used the score for fitting the model - let me know whether this is the right way. 2) After subsetting for each cancer group + molecular subtypes, none of them converge and cannot generate meaningful hazard plots - maybe we need to think about alternatives.

Is there anything that you want to discuss further?

See above.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

Results

What types of results are included (e.g., table, figure)?

1) html file with forest plot 2) tsv files with the model fit output with names of cox_reg_results_in_*.tsv (* is cancer group of interest).

What is your summary of the results?

Macrophage_M1, Macrophage_M2, and all cancer groups are significant. However, expression of CD274 and other immune cell types are not.

Reproducibility Checklist

Documentation Checklist

jharenza commented 2 years ago

Hi @runjin326. Thanks for this! Can you double check the subtype analyses - I would think we would have enough samples for medullo and some LGG subtypes to perform this analysis.

I also should have asked for one more analysis:

runjin326 commented 2 years ago

@jharenza - I have now added figures for each cancer group. For molecular subtypes, the issue for plotting is either there are some covariates that would give Inf for 95CI or NA for coefficient OR they just would not converge at all (see red warning messages in the html file). However, I did print out the summary of the model in the console so that we have some numeric data and we can generate some figures ourselves if necessary.

I also tried to clean up the model based on the fit results (i.e., if the coef is NA or 95CI is Inf), I just filter them out and then re-build the model with remaining covariates (not so sure whether this is legit though). I was hoping this way I can at least make some figures. However, this method only worked for DMG (included the figure at the end) and maybe we should just generate figures using the summary(fit) output.

jharenza commented 2 years ago

@jharenza - I have now added figures for each cancer group. For molecular subtypes, the issue for plotting is either there are some covariates that would give Inf for 95CI or NA for coefficient OR they just would not converge at all (see red warning messages in the html file). However, I did print out the summary of the model in the console so that we have some numeric data and we can generate some figures ourselves if necessary.

I also tried to clean up the model based on the fit results (i.e., if the coef is NA or 95CI is Inf), I just filter them out and then re-build the model with remaining covariates (not so sure whether this is legit though). I was hoping this way I can at least make some figures. However, this method only worked for DMG (included the figure at the end) and maybe we should just generate figures using the summary(fit) output.

@envest, would you be able to provide some guidance on whether the above is sound? Thanks!

envest commented 2 years ago

Thanks -- I think my instinct would be to try to understand why the Inf and NA results are showing up (e.g. low sample size for a group, lack of variation in that predictor, etc.). I would feel better about excluding predictors from a biologically-motivated model if there was a known reason to justify it and that reason was documented in the code. Good luck!

runjin326 commented 2 years ago

@sjspielman - I cleaned up the notebook a little bit and added more explanation. Additionally, I finally figured out why the plots were not printing - it was because the CI range was too big so I modified a bit. The way I did it is I took the original ggforest function and modified some lines to do the following:

  1. change the y axis range (main reason why plots don't plot is because the ranges are too big to be plotted)
  2. from the coefficient table, remomve any covaraites that are inf/NA so that they do not get plotted

This modification does make the plots a bit ugly (there are more overlap and some p values were cutoff) but should be good for initial review and if we decided to use any one of them in the figures, we can modify accordingly.

The figures are also saved in folders.

sjspielman commented 2 years ago

Thanks for making some of those changes, @runjin326! We still need to get the code more modular though. I will do a thorough review on this in the next few days about how we can structure that.

jharenza commented 2 years ago

Hi @runjin326! Thanks for figuring out why those plots weren't rendering - that makes sense! Is there any way you can expand the plotting frame so we can see the p-values on the plots? Can you also check your code for why there are multiple references in these analyses?

runjin326 commented 2 years ago

@jharenza, @sjspielman - thanks so much for the comments. Before I make any additional modifications, I think it would be better if we reach consensus as to how to proceed.

1) Do we feel comfortable removing covariates with inf/na in the plot? - The reason I did it was NOT to "cherry pick" anything. It is solely for the purpose of plotting. And we can report a table with all covariates and then a figure with just the ones that can be plotted. However, this can be misleading as well. So open to ideas.

2) Do we want to modify the function to make every possible plot? - The way I think about this module is more so to generate figures that can be used for the paper. Not so much to generate every single possible figure. And the reason for modifying the function was just to make sure they can be plotted and we can choose which one to use. So maybe we can decide which ones we would likely use and fine-tune those ones? It is really hard to find a parameter that works for every plot (some would have overlap of axis and some would have cutoff pvalue).

3) I can do a similar function once we decided on the above. Haven't done that since I want to print everything out (how the model was build etc.) for easier review. Once we decided on the above, I can make changes accordingly.

Please let me know your thoughts.

sjspielman commented 2 years ago

Do we feel comfortable removing covariates with inf/na in the plot?

To me, the issue is not about where to report the NA and Inf. It's more about - why are we getting these values in the first place? It suggests there is an issue with the data or model itself. My guess is this happens when there is insufficient signal in the data to build a strong model in the first place.

Do we want to modify the function to make every possible plot? - The way I think about this module is more so to generate figures that can be used for the paper

The figures for the paper will be made in figures/scripts/, and we can think of this notebook as a guide for what figures to make that are pub-ready. My point is: I do not suggest spending a lot of time making publication-ready figures as part of this notebook. The plots that come out of ggforest are really busy with a lot of visual noise that doesn't need to be included in the figures. I have templated out code for making our own forest plots based on model results that is ready to go when this analysis module is completed and we decide which model(s) should be included in the manuscript.

My overall sense is: this PR and #1276 are doing a lot of exploratory work on survival models. While this is not necessarily a problem in and of itself, I'm getting worried we're moving into data dredging territory. Building a bunch of models and then choosing one that "looks the best" for the manuscript is problematic. Indeed, both this and #1276 build, for example, a model for all cancer groups and then models for each individual cancer group, which is getting very into multiple testing gray areas.

I think more discussion is merited here @jaclyn-taroni @jharenza to solidify specific hypotheses about survival that we can build model(s) for.

runjin326 commented 2 years ago

@jharenza - to answer your question about multiple references - I think it is due to the fact that they are all indistinguishable from each other and have a 0 in estimate.1 (which is what is used to plot the graph). Here is a screenshot of the data frame that is used for plotting (ependymoma in this case) image And you can see both CD8 and Tregs are 0 - hence all coded as references.

jharenza commented 2 years ago

@jharenza - to answer your question about multiple references - I think it is due to the fact that they are all indistinguishable from each other and have a 0 in estimate.1

I see, that makes sense. I didn't realize default behavior was to use the lowest value as the ref. I think there is also a way to set the reference, which we might want to do. Let me go through the comments above and we can figure out the best course of action.

sjspielman commented 2 years ago

Noting some important updates:

sjspielman commented 2 years ago

I have gone through and substantially updated the survival notebook @jaclyn-taroni @jharenza -

My overall impressions of these analyses are that we don't have sufficient signal in the data to detect much of anything.

jharenza commented 2 years ago

@sjspielman based on some things I am writing about immune microenvironment for figure 5, can you also add "Neurofibroma Plexiform" and "Schwannoma" to the cancer groups of interest here? Thanks!

sjspielman commented 2 years ago

For this comment,

For the Run multivariate analysis for individual cancer group parsed by molecular subtypes , I am not seeing medulloblastoma group 3 samples in the table and I just went back and checked that they are all stranded, so can you see if you can find out why they are missing?

Maybe we are missing each other on what this analysis is meant to be. I interpreted it as modeling survival differences across molecular subtypes within a given cancer group. Therefore, as I see it, we can only assess cancer groups where each subtype has >= 3 deceased. This isn't the case for MB, so it wasn't included. There are no deceased in WNT.

  cancer_group    molecular_subtype    DECEASED LIVING
  <chr>           <chr>                   <dbl>  <dbl>
1 Medulloblastoma MB, Group3                  4      7
2 Medulloblastoma MB, Group4                 12     36
3 Medulloblastoma MB, SHH                     5     18
4 Medulloblastoma MB, To be classified        7      9
5 Medulloblastoma MB, WNT                     0     10

Did you have something else in mind? Or a different approach?

jharenza commented 2 years ago

Oh I see what you're doing. No, we won't want to put any limitations on the subtypes because then we won't be able to recapitulate know. OS differences. If we focused only on subtypes with poor prognosis, we'd use that approach, but this is an overall ask whether our subtyping aligns with what is known, and do we see additional differences?

sjspielman commented 2 years ago

I've updated this notebook to have less stringent checks to perform molecular subtype analyses. Cancer groups with >=3 deceased overall, and those with more than 1 subtype are analyzed and results are exported.

Due to very small sample sizes, none of the resulting estimates are reliable - CIs are often infinite.

jharenza commented 2 years ago
  • That being said, I think we should probably group all HGG together in this plot (would match up with the HGG tp53 telomerase plot TBM - to be made 😄) and if possible, use HGG, H3 wildtype as a reference.

Hi @sjspielman - can you group all HGGs together here to do the univariate x subtype analysis, which would match our HGG, non-HGG analysis in the previous notebook? Thanks! After that is in, I think we are good to go.

sjspielman commented 2 years ago

Ahhh, @jharenza I think I misunderstood. Do you mean two models, one for HGG and one for non-HGG, each univariate with subtype?

jharenza commented 2 years ago

Ahhh, @jharenza I think I misunderstood. Do you mean two models, one for HGG and one for non-HGG, each univariate with subtype?

Sorry for being confusing and for the late response! No, only meant to combine all HGG cancer groups together. We had separated them mainly because of subtypes. For example, DIPGs can be broken down into DMG H3K28 or DMG H3K35 or HGG H3 wildtype. The DMG designation was something clinicians had wanted us to use for only those with an H3 alteration. However, there are also midline H3 wildtype tumors which may or may not be diffuse, but we would need to confirm with radiology. As a result, we have this umbrella of brainstem/pons-originating tumors within groups such as HGG, DIPG, and DMG, so for this specific analysis, which is testing subtype, I think we ought to combine them.

sjspielman commented 2 years ago

Edited:

I updated to run a model on HGG groups, defined as - hgg_groups <- c("High-grade glioma astrocytoma", "Diffuse midline glioma", "Diffuse intrinsic pontine glioma"). This code is on lines 208-223.

Nothing is significant in this model either.

jharenza commented 2 years ago

This should be ready for merge once CI completes.