AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

TP53 and telomerase panels for figure 4 #1280

Closed sjspielman closed 2 years ago

sjspielman commented 2 years ago

This PR partially addresses Issue #1272 and creates a new script fig4-tp53-telomerase.R for populating figure 4 panels.

fig4-tp53-telomerase.R produces the following figures:

| Panel | Figure File |
|-------|-------------|
| A | figures/pdfs/fig4/panels/tp53_stranded_roc_panel.pdf |
| B | figures/pdfs/fig4/panels/tp53_scores_by_altered_panel.pdf |
| C | figures/pdfs/fig4/panels/tp53_expression_by_altered_panel.pdf |
| D | figures/pdfs/fig4/panels/tp53_scores_boxplot_panel.pdf and figures/pdfs/fig4/panels/tp53_scores_boxplot_legend.pdf |
| E | figures/pdfs/fig4/panels/telomerase_scores_boxplot_panel.pdf |
| F | Hazard ratio figure NOT MADE in this PR (see bullets below) |

Notable changes:

Places for reviewers

Documentation Checklist

sjspielman commented 2 years ago

> However, I also had an idea that these can be y-faceted by score, and we can align by TP53 score(?).

I'm not sure what you mean "faceted by score." Score is a continuous variable, and I do not see any reason to discretize it.

In terms of the "other" TP53 samples, we can keep them in to show the distribution, but I am really uncomfortable comparing their distribution to the classified activated vs. lost categories with a formal stats test, because I still don't see a clear hypothesis about what the "other" samples are. Not having certain mutations (biased by which previously studied mutations we have access to) is, in my mind, absence of evidence for any hypothesis. @jaclyn-taroni, do you have any thoughts?

> Should this be median +/- sd?

If anything, it should probably be median + IQR, since sd tends to "go with" the mean, and the median is more appropriate for a nonparametric test. I can update this.
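For reference, a minimal sketch of a median-and-IQR summary per group in base R. The data, column names, and the helper `summarize_median_iqr()` are all hypothetical and only illustrate the idea; they are not taken from the PR's script.

```r
# Hypothetical example data: scores for two made-up groups.
scores <- data.frame(
  group = rep(c("group_a", "group_b"), each = 5),
  score = c(0.10, 0.20, 0.30, 0.40, 0.90,
            0.50, 0.60, 0.70, 0.80, 0.95)
)

# Hypothetical helper: median plus first and third quartiles per group.
summarize_median_iqr <- function(df, value_col, group_col) {
  pieces <- split(df[[value_col]], df[[group_col]])
  data.frame(
    group  = names(pieces),
    median = vapply(pieces, median, numeric(1)),
    q1     = vapply(pieces, quantile, numeric(1), probs = 0.25, names = FALSE),
    q3     = vapply(pieces, quantile, numeric(1), probs = 0.75, names = FALSE),
    row.names = NULL
  )
}

score_summary <- summarize_median_iqr(scores, "score", "group")
print(score_summary)
```

The quartiles use R's default `quantile()` method (type 7); the `q1` and `q3` columns could then be drawn as error bars around the median.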

jharenza commented 2 years ago

> I'm not sure what you mean "faceted by score." Score is a continuous variable, and I do not see any reason to discretize it.

Oh, I just meant putting the facets on the y-axis, with the scores still continuous, and paneling the plots by row, as in the sketch below. The top panel would be TP53 scores and the bottom panel EXTEND scores; they would share the x-axis of cancer groups, ordered by median TP53 score as you have them currently.

image
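A rough sketch of that layout in ggplot2, assuming a data frame with one row per sample. The data are simulated and the column names (`cancer_group`, `tp53_score`, `extend_score`) are made up for illustration; the actual script may be organized differently.

```r
library(ggplot2)

# Hypothetical data: group names, column names, and values are made up.
set.seed(2022)
scores <- data.frame(
  cancer_group = rep(c("group_a", "group_b", "group_c"), each = 10),
  tp53_score   = c(runif(10, 0.70, 1.00), runif(10, 0.00, 0.30), runif(10, 0.35, 0.65)),
  extend_score = runif(30)
)

# Order the cancer groups by their median TP53 score.
group_medians <- tapply(scores$tp53_score, scores$cancer_group, median)
group_order <- names(group_medians)[order(group_medians)]

# Reshape to long format so each score type becomes one row of facets.
long_scores <- rbind(
  data.frame(cancer_group = scores$cancer_group,
             score_type = "TP53 score", score = scores$tp53_score),
  data.frame(cancer_group = scores$cancer_group,
             score_type = "Telomerase score", score = scores$extend_score)
)
long_scores$cancer_group <- factor(long_scores$cancer_group, levels = group_order)
long_scores$score_type <- factor(long_scores$score_type,
                                 levels = c("TP53 score", "Telomerase score"))

# Scores stay continuous; the panels are stacked by row and share the x-axis.
faceted_plot <- ggplot(long_scores, aes(x = cancer_group, y = score)) +
  geom_boxplot() +
  facet_grid(rows = vars(score_type), scales = "free_y") +
  labs(x = "Cancer group", y = NULL)
```

Because the x-axis factor levels are set once on the long data, both rows use the same TP53-based ordering.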

sjspielman commented 2 years ago

With the faceting they'd end up with a shared axis, so we'll also see how it looks when the full figure PDF is compiled to decide whether separate or shared panel labels look better!

sjspielman commented 2 years ago

I've just pushed some changes:

Violins

Boxplots

jharenza commented 2 years ago

> "Other" is now included in violin plots, but it's not included in the statistical tests. The p-value labels were moved closer to lost/activated groupings to hopefully emphasize this. But I can see how this could be confusing.

Can you add a line between activated/lost showing that the p-value comparison applies to those groups? Maybe we do this in Illustrator, @jaclyn-taroni?

> Now as a single figure faceted vertically by scores where x-axis is in TP53 order, and we're using the mutator colors. If we want to use cancer group colors, then this plot has to go back to being 2 separate vertical panels. What do we think?

I like this because you can start to see trends that we saw in the correlation plots - some groups have high TP53 and high telomerase scores, but others (meningioma) have the opposite trend.

> I have "Telomerase score" as a label - do we prefer "Normalized EXTEND scores?"

Commented on #1283 that I think either "Telomerase score" or "Telomerase score (EXTEND)" is good

sjspielman commented 2 years ago

Updated to use stat_pvalue_manual(), removed the old legend file, and updated the expression violin plots to show log(FPKM + 1), which is now also reflected in the axis title.
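As an illustration of this kind of annotation, here is a minimal sketch using `ggpubr::stat_pvalue_manual()` to draw a bracket between only the two tested groups while the "other" group stays in the plot. The data are simulated, and the column names, the choice of `wilcox.test()`, and the bracket height are assumptions rather than what the script actually does.

```r
library(ggplot2)
library(ggpubr)

# Hypothetical data: status labels and scores are simulated.
set.seed(2022)
tp53 <- data.frame(
  status = rep(c("activated", "lost", "other"), each = 12),
  score  = c(runif(12, 0.6, 1.0), runif(12, 0.7, 1.0), runif(12, 0.0, 1.0))
)
tp53$status <- factor(tp53$status, levels = c("activated", "lost", "other"))

# Test only the two classified groups; "other" is plotted but not tested.
tested <- droplevels(subset(tp53, status != "other"))
test_result <- wilcox.test(score ~ status, data = tested)

# stat_pvalue_manual() reads the bracket endpoints from group1/group2
# and the bracket height from y.position.
pvalue_table <- data.frame(
  group1 = "activated",
  group2 = "lost",
  p = signif(test_result$p.value, 2),
  y.position = 1.05
)

violin_plot <- ggplot(tp53, aes(x = status, y = score)) +
  geom_violin() +
  stat_pvalue_manual(pvalue_table, label = "p", tip.length = 0.01)
```

The bracket spans only "activated" and "lost", which makes it visually explicit that "other" is not part of the comparison.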

jharenza commented 2 years ago

P-values look good!

One more thing: can we print out N, R, and p-values (and adjusted p-values when necessary), or add N to the x-axis in parentheses for each group, for the respective plots for the manuscript legends? I think this info was previously in notebooks and/or TSV files.

sjspielman commented 2 years ago

> One more thing: can we print out N, R, and p-values (and adjusted p-values when necessary), or add N to the x-axis in parentheses for each group, for the respective plots for the manuscript legends? I think this info was previously in notebooks and/or TSV files.

Which plots are you referring to? We don't have any correlations in these plots. Do you mean adding N to the tp53 violin plots? I can definitely label those x-axes!

jharenza commented 2 years ago

> > One more thing: can we print out N, R, and p-values (and adjusted p-values when necessary), or add N to the x-axis in parentheses for each group, for the respective plots for the manuscript legends? I think this info was previously in notebooks and/or TSV files.
>
> Which plots are you referring to? We don't have any correlations in these plots. Do you mean adding N to the tp53 violin plots? I can definitely label those x-axes!

Oh yes, I was generalizing this to include the telomerase TERT/TERC plots. I'm good either way, whether on the plots, printed in a notebook, or in an exported table.

sjspielman commented 2 years ago

> Oh yes, I was generalizing this to include the telomerase TERT/TERC plots.

Ok, these are over in PR #1283 and already have R labels. For this PR, I'll add sample sizes to the violin plot x-axis labels.

sjspielman commented 2 years ago

@jharenza Are the labels I added here what you had in mind? I added N= info to x-axes for violin plots and to the mutation status legend.
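For anyone following along, one way to build such labels is sketched below. The data and column names are hypothetical, and the label format "(N = ...)" is an assumption; the labels in the actual figure may be formatted differently.

```r
library(ggplot2)

# Hypothetical data: 4 activated, 7 lost, and 9 other samples.
samples <- data.frame(
  status = c(rep("activated", 4), rep("lost", 7), rep("other", 9)),
  score  = seq(0.05, 1, length.out = 20)
)

# Count samples per group and build a named vector of axis labels,
# keyed by the original group names so ggplot2 can match them.
group_counts <- table(samples$status)
n_labels <- setNames(
  paste0(names(group_counts), "\n(N = ", as.integer(group_counts), ")"),
  names(group_counts)
)

labeled_plot <- ggplot(samples, aes(x = status, y = score)) +
  geom_violin() +
  scale_x_discrete(labels = n_labels)
```

Keying the labels by group name, rather than relying on position, keeps them correct if the x-axis order changes later.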

jharenza commented 2 years ago

> @jharenza Are the labels I added here what you had in mind? I added N= info to x-axes for violin plots and to the mutation status legend.

Yes, looks great!