Closed runjin326 closed 2 years ago
Hi @runjin326. Thanks for this! Can you double check the subtype analyses - I would think we would have enough samples for medullo and some LGG subtypes to perform this analysis.
I also should have asked for one more analysis:
@jharenza - I have now added figures for each cancer group.
For molecular subtypes, the issue for plotting is that some covariates either give an infinite 95% CI or an NA coefficient, or the models do not converge at all (see the red warning messages in the html file). However, I did print the summary of each model to the console, so we have the numeric results and can generate some figures ourselves if necessary.
I also tried to clean up the models based on the fit results: if a coefficient is NA or its 95% CI is infinite, I filter that covariate out and rebuild the model with the remaining covariates (not so sure whether this is legit, though). I was hoping this way I could at least make some figures. However, this only worked for DMG (figure included at the end), so maybe we should just generate figures from the summary(fit) output instead.
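For concreteness, the filter-and-refit step described above could be sketched as follows. This is only an illustration, not the notebook's actual code: the object names (`surv_df`, `OS_days`, `OS_status`) and the specific covariates are hypothetical, and it assumes all covariates are numeric scores (factor terms would expand to per-level coefficient names that don't match column names).

```r
library(survival)

# Fit the initial Cox model; immune cell scores as numeric covariates
# (all object and column names here are hypothetical)
fit <- coxph(Surv(OS_days, OS_status) ~ CD8 + Tregs + Macrophage_M1,
             data = surv_df)

# Flag covariates with an NA coefficient or a non-finite 95% CI
ci <- confint(fit)  # CIs on the log-hazard scale
bad <- is.na(coef(fit)) | !is.finite(ci[, 1]) | !is.finite(ci[, 2])

# Rebuild the model with only the remaining covariates
keep <- names(coef(fit))[!bad]
refit <- coxph(
  as.formula(paste("Surv(OS_days, OS_status) ~",
                   paste(keep, collapse = " + "))),
  data = surv_df
)
```

Note that this only makes plotting possible; it doesn't explain why those covariates fail in the first place (e.g. too few events or no variation in the predictor).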
@envest, would you be able to provide some guidance on whether the above is sound? Thanks!
Thanks -- I think my instinct would be to try to understand why the Inf and NA results are showing up (e.g. low sample size for a group, lack of variation in that predictor, etc.). I would feel better about excluding predictors from a biologically-motivated model if there was a known reason to justify it and that reason was documented in the code. Good luck!
@sjspielman - I cleaned up the notebook a little bit and added more explanation. I also finally figured out why the plots were not printing: the CI range was too large. So I took the original ggforest function and modified a few lines so that very wide confidence intervals no longer prevent the plots from rendering.
This modification does make the plots a bit ugly (there is more overlap, and some p-values are cut off), but they should be good for an initial review; if we decide to use any of them in the figures, we can modify accordingly.
The figures are also saved in folders.
Thanks for making some of those changes, @runjin326! We still need to get the code more modular though. I will do a thorough review on this in the next few days about how we can structure that.
Hi @runjin326! Thanks for figuring out why those plots weren't rendering - that makes sense! Is there any way you can expand the plotting frame so we can see the p-values on the plots? Can you also check your code for why there are multiple references in these analyses?
@jharenza, @sjspielman - thanks so much for the comments. Before I make any additional modifications, I think it would be better if we reach consensus as to how to proceed.
1) Do we feel comfortable removing covariates with Inf/NA from the plot? The reason I did it was NOT to "cherry pick" anything; it was solely for the purpose of plotting. We could report a table with all covariates and then a figure with just the ones that can be plotted. However, that could be misleading as well, so I am open to ideas.
2) Do we want to modify the function to make every possible plot? The way I think about this module is that it should generate figures that can be used for the paper, not every single possible figure. The reason for modifying the function was just to make sure the models could be plotted so that we can choose which ones to use. So maybe we can decide which ones we would likely use and fine-tune those? It is really hard to find a parameter that works for every plot (some have overlapping axes and some have cut-off p-values).
3) I can write a similar function once we decide on the above. I haven't done that yet because I wanted to print everything out (how each model was built, etc.) for easier review. Once we decide on the above, I can make changes accordingly.
Please let me know your thoughts.
Do we feel comfortable removing covariates with inf/na in the plot?
To me, the issue is not about where to report the NA and Inf. It's more about - why are we getting these values in the first place? It suggests there is an issue with the data or model itself. My guess is this happens when there is insufficient signal in the data to build a strong model in the first place.
Do we want to modify the function to make every possible plot? - The way I think about this module is more so to generate figures that can be used for the paper
The figures for the paper will be made in figures/scripts/, and we can think of this notebook as a guide for what figures to make that are pub-ready. My point is: I do not suggest spending a lot of time making publication-ready figures as part of this notebook. The plots that come out of ggforest are really busy, with a lot of visual noise that doesn't need to be included in the figures. I have templated out code for making our own forest plots based on model results that is ready to go when this analysis module is completed and we decide which model(s) should be included in the manuscript.
My overall sense is: this PR and #1276 are doing a lot of exploratory work on survival models. While this is not necessarily a problem in and of itself, I'm getting worried we're moving into data dredging territory. Building a bunch of models and then choosing one that "looks the best" for the manuscript is problematic. Indeed, both this and #1276 build, for example, a model for all cancer groups and then models for each individual cancer group, which is getting very into multiple testing gray areas.
I think more discussion is merited here @jaclyn-taroni @jharenza to solidify specific hypotheses about survival that we can build model(s) for.
@jharenza - to answer your question about multiple references: I think it is due to the fact that they are all indistinguishable from each other and have a 0 in estimate.1 (which is what is used to plot the graph). Here is a screenshot of the data frame used for plotting (ependymoma in this case); you can see both CD8 and Tregs are 0, hence all are coded as references.
@jharenza - to answer your question about multiple references - I think it is due to the fact that they are all indistinguishable from each other and have a 0 in estimate.1
I see, that makes sense. I didn't realize default behavior was to use the lowest value as the ref. I think there is also a way to set the reference, which we might want to do. Let me go through the comments above and we can figure out the best course of action.
Noting some important updates:
I have gone through and substantially updated the survival notebook @jaclyn-taroni @jharenza -
- Added util/survival_models.R to perform fitting and export, increasing overall modularity
- PDL-1 expression + library (polya and stranded levels) across the entire cohort. This result showed a mostly significant effect of library, so it seemed prudent to proceed with stranded data only and avoid the batch effect. All subsequent models consider only stranded data.
- PDL-1 expression across the entire cohort
- cell type fractions + PDL-1 expression across the entire cohort
- cell type fractions + PDL-1 expression per cancer group, considering only groups with suitable sample size: there have to be at least 3 DECEASED in the stranded data
- molecular signature + PDL-1 expression per cancer group, considering only groups where all molecular subtypes have at least 3 DECEASED in the stranded data. There is only 1 suitable group: DIPG

My overall impression of these analyses is that we don't have sufficient signal in the data to detect much of anything.
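The ">= 3 DECEASED in the stranded data" eligibility check could be implemented along these lines. This is only a sketch: `metadata` and its column names (`RNA_library`, `cancer_group`, `OS_status`) are hypothetical stand-ins for the module's actual objects.

```r
library(dplyr)

# Cancer groups with at least 3 DECEASED samples in the stranded data
# (data frame and column names are hypothetical)
eligible_groups <- metadata %>%
  filter(RNA_library == "stranded") %>%
  count(cancer_group, OS_status) %>%
  filter(OS_status == "DECEASED", n >= 3) %>%
  pull(cancer_group)
```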
@sjspielman based on some things I am writing about immune microenvironment for figure 5, can you also add "Neurofibroma Plexiform" and "Schwannoma" to the cancer groups of interest here? Thanks!
For this comment,
For the Run multivariate analysis for individual cancer group parsed by molecular subtypes , I am not seeing medulloblastoma group 3 samples in the table and I just went back and checked that they are all stranded, so can you see if you can find out why they are missing?
Maybe we are missing each other on what this analysis is meant to be. I interpreted it as modeling survival differences across molecular subtypes within a given cancer group. Therefore, as I see it, we can only assess cancer groups where each subtype has >= 3 deceased. This isn't the case for MB, so it wasn't included. There are no deceased in WNT.
cancer_group molecular_subtype DECEASED LIVING
<chr> <chr> <dbl> <dbl>
1 Medulloblastoma MB, Group3 4 7
2 Medulloblastoma MB, Group4 12 36
3 Medulloblastoma MB, SHH 5 18
4 Medulloblastoma MB, To be classified 7 9
5 Medulloblastoma MB, WNT 0 10
Did you have something else in mind? Or a different approach?
Oh, I see what you're doing. No, we won't want to put any limitations on the subtypes, because then we won't be able to recapitulate known OS differences. If we focused only on subtypes with poor prognosis, we'd use that approach, but this is an overall ask: does our subtyping align with what is known, and do we see additional differences?
I've updated this notebook to have less stringent checks to perform molecular subtype analyses. Cancer groups with >=3 deceased overall, and those with more than 1 subtype are analyzed and results are exported.
Due to very small sample sizes, none of the resulting estimates are reliable - CIs are often infinite.
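The relaxed eligibility rule described above (>= 3 deceased overall, and more than one subtype) might look like the following. Again a sketch only, with hypothetical data frame and column names.

```r
library(dplyr)

# Cancer groups with >= 3 deceased overall and > 1 molecular subtype
# (data frame and column names are hypothetical)
eligible <- metadata %>%
  group_by(cancer_group) %>%
  summarize(
    n_deceased = sum(OS_status == "DECEASED"),
    n_subtypes = n_distinct(molecular_subtype)
  ) %>%
  filter(n_deceased >= 3, n_subtypes > 1) %>%
  pull(cancer_group)
```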
- That being said, I think we should probably group all HGG together in this plot (would match up with the HGG tp53 telomerase plot TBM - to be made 😄) and, if possible, use `HGG, H3 wildtype` as a reference.
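One way to pin the reference level in R is to relevel the subtype factor before fitting, since coxph treats the first factor level as the reference. A sketch only; `surv_df` and the `molecular_subtype` column name are hypothetical.

```r
# Make "HGG, H3 wildtype" the reference level; hazard ratios for the
# other subtypes are then reported relative to it
# (data frame and column names are hypothetical)
surv_df$molecular_subtype <- relevel(
  factor(surv_df$molecular_subtype),
  ref = "HGG, H3 wildtype"
)
```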
Hi @sjspielman - can you group all HGGs together here to do the univariate x subtype analysis, which would match our HGG, non-HGG analysis in the previous notebook? Thanks! After that is in, I think we are good to go.
Ahhh, @jharenza I think I misunderstood. Do you mean two models, one for HGG and one for non-HGG, each univariate with subtype?
Sorry for being confusing and for the late response! No, only meant to combine all HGG cancer groups together. We had separated them mainly because of subtypes. For example, DIPGs can be broken down into DMG H3K28 or DMG H3K35 or HGG H3 wildtype. The DMG designation was something clinicians had wanted us to use for only those with an H3 alteration. However, there are also midline H3 wildtype tumors which may or may not be diffuse, but we would need to confirm with radiology. As a result, we have this umbrella of brainstem/pons-originating tumors within groups such as HGG, DIPG, and DMG, so for this specific analysis, which is testing subtype, I think we ought to combine them.
Edited:
I updated to run a model on HGG groups, defined as `hgg_groups <- c("High-grade glioma astrocytoma", "Diffuse midline glioma", "Diffuse intrinsic pontine glioma")`. This code is on lines 208-223.
Nothing is significant in this model either.
This should be ready for merge once CI completes.
Purpose/implementation Section
What scientific question is your analysis addressing?
Determining whether there is an immune cell type which might predict better or worse prognosis for cancer types and/or molecular subtypes within a cancer group.
What was your approach?
What GitHub issue does your pull request address?
https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1258
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Two things:
1) For using immune cell results - each immune cell type is considered a covariate, and I used the score for fitting the model. Let me know whether this is the right way.
2) After subsetting for each cancer group + molecular subtype, none of the models converge, so they cannot generate meaningful hazard plots - maybe we need to think about alternatives.
Is there anything that you want to discuss further?
See above.
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes.
Results
What types of results are included (e.g., table, figure)?
1) html file with forest plot 2) tsv files with the model fit output, with names of cox_reg_results_in_*.tsv (* is the cancer group of interest).
What is your summary of the results?
Macrophage_M1, Macrophage_M2, and all cancer groups are significant. However, expression of CD274 and the other immune cell types are not.
Reproducibility Checklist
Documentation Checklist
- The module has a README and it is up to date.
- The analysis is recorded in analyses/README.md and the entry is up to date.