figure 5 correlation added

runjin326 commented 2 years ago

Purpose/implementation Section

What scientific question is your analysis addressing?

Adding tp53 vs. normEXTEND and tp53 vs. breaks density plots

What was your approach?

For tp53 scores vs. normEXTEND - 1) table S3 from the supplementary tables is used 2) results are joined using Kids_First_Biospecimen_ID_RNA

For tp53 scores vs. breaks density - 1) breakpoint density data from chromosome instability module is used 2) results are joined using Kids_First_Biospecimen_ID_DNA

What GitHub issue does your pull request address?

https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1237

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Please check to the see whether the legends, font, color etc for the figures look good.

Is there anything that you want to discuss further?

No.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

Results

What types of results are included (e.g., table, figure)?

2 figures in pdf format.

What is your summary of the results?

tp53 vs. NormEXTEND - R = 0.3, p <2.2e-16 tp53 vs. breaks density - R = 0.25, p = 1.2e-13

Reproducibility Checklist

[x] The dependencies required to run the code in this pull request have been added to the project Dockerfile.
[ ] This analysis has been added to continuous integration.

Documentation Checklist

[ ] This analysis module has a README and it is up to date.
[ ] This analysis is recorded in the table in analyses/README.md and the entry is up to date.
[x] The analytical code is documented and contains comments.

sjspielman commented 2 years ago

Hi @runjin326, thanks for getting started on this! The immediate thing I notice is, I'm concerned that lumping together cancer groups may not be the right way to go here. This is because, for example, there may be different relationships when comparing values cancer groups. I've seen some previous plots made where there are negative correlations between This means the overall positive relationship is a sort of "paradox" resulting from combining different groupings, essentially this phenomenon.

Can you make these scatterplots faceted by cancer group as well? This will give us a sense of how pervasive this issue might be and we can go from there!

In addition, for final plots, we'll want to make sure to use ggpubr::theme_pubr() as described here.

runjin326 commented 2 years ago

@sjspielman thanks so much for the feedback. I tried running facet by cancer group but it looks like there are way too many cancer groups for this to be meaningful. Should we try either broad/short histology instead?

And for the theme part: do you mean the following? I.e., instead of theme_bw() just add ggtheme = theme_pubr() in the ggplot? Should I also remove the theme(axis.text.x ...) part?

ggplot(data = tp53_extend , 
       aes(x = NormEXTENDScores_fpkm, y = tp53_score),
       ggtheme = theme_pubr()) + 
  geom_point() + 
  xlab("NormEXTEND Scores") + 
  ylab("TP53 Scores") + 
  geom_smooth(method=lm, se=TRUE) + 
  stat_cor(method = "pearson", label.x = 0.05, label.y = 1, size = 6) + 
  theme(axis.text.x = element_text(size = 14), axis.title.x = element_text(size = 16),
        axis.text.y = element_text(size = 14), axis.title.y = element_text(size = 16))

cc: @jharenza

sjspielman commented 2 years ago

Hi @runjin326, for faceting across cancer groups we want to focus on the names in the column cancer_group_display, not the cancer_group column. This information is here, so you will need to do some joining in there.

The other theme additions are fine - text size always needs to be tweaked a bit on a per site basis! The idea is just that the overall "look" is a pubr theme :)

runjin326 commented 2 years ago

@sjspielman - thanks so much for the feedback. The figures are updated now. Please check to see whether that is what you expected.

sjspielman commented 2 years ago

This looks good faceted, thanks! One more quick change to the figures would be great - can we remove the NA cancer groups so there are no NA panels?

I think we might have an issue with the correlations themselves, though. Because we are now showing these separately, we may need to correct P-values for multiple testing. The stat_cor() function doesn't perform any corrections , so the P-values it returns are not right for this scenario. The correlation coefficients are right, just not the significance.

One way to deal with this in the code is to calculate correlation coefficients and associated P-values separately (before plotting!), and correct those P-values using Bonferroni correction simply by multiplying P-values by the number of tests performed. Then, this information can be added to the plots with geom_text() from a separate data frame, without using stat_cor().

Let me know if this makes sense and/or how I can clarify anything!

runjin326 commented 2 years ago

@sjspielman - thanks so much for the feedback. The figures are modified accordingly. Please check the code to see whether the calculation was performed as what you expected, particularly the Bonferroni correction. I looked up online and seemed like the correction should be decided by the number of tests instead of multiply? I used division but please double check to see which one is correct.

sjspielman commented 2 years ago

I looked up online and seemed like the correction should be decided by the number of tests instead of multiply? I used division but please double check to see which one is correct.

It depends on how you are assessing significance. With multiple tests, we expect more false positives so we want to be more stringent with our assessments. Therefore, we want p-values to increase (because we expect fewer things are significant!) when you perform multiple comparisons, so you multiply P-values by the number of tests performed. Another perspective is to not change P-values but instead reconsider the alpha significance threshold and lower it by dividing by the number of tests, so that fewer P-values pass the threshold. In this case, we want to modify the P-values with multiplication, and I'll have a look at what you did!

runjin326 commented 2 years ago

@sjspielman - thanks! The problem with multiplying is that we then would have p value that is >1, which does not make much sense. Please let me know whether there is an alternative method for doing the correction.

sjspielman commented 2 years ago

The problem with multiplying is that we then would have p value that is >1, which does not make much sense. It's actually exactly right, and just what happens when you perform corrections for multiple comparisons! When you make the correction, you max out P-values at 1 by setting any P-value that becomes >= 1 back down to 1.

I'm actually going through your code now to make some of these updates directly.. stay tuned here shortly for commits on your branch!

sjspielman commented 2 years ago

I've updated some of your code to ensure accuracy of the P-value calculation as well as specify size of the output plots, so we can ensure they will always be the same size (7x7 inches for now) reproducibly!

These figures are mostly done, but now that we have corrected the P-values, there is very little that is still statistically significant. I'm not necessarily sure this should still be a main text figure or we should move to SI. Let's see what thoughts @jaclyn-taroni and @jharenza might have now that we don't see any salient trends here.

jaclyn-taroni commented 2 years ago

These figures are mostly done, but now that we have corrected the P-values, there is very little that is still statistically significant. I'm not necessarily sure this should still be a main text figure or we should move to SI.

Hm yea, let's change the output and script name to indicate that this is a supplemental figure for now. It is easy enough to "convert" to a main text figure if that's what we decide to do later.

jharenza commented 2 years ago

These figures are mostly done, but now that we have corrected the P-values, there is very little that is still statistically significant. I'm not necessarily sure this should still be a main text figure or we should move to SI. Let's see what thoughts @jaclyn-taroni and @jharenza might have now that we don't see any salient trends here.

I agree with @jaclyn-taroni on changing the output / script. I am not sure if we should use in SI either, but we can save as supplemental for now @runjin326. Thanks!

runjin326 commented 2 years ago

@jaclyn-taroni, @sjspielman, @jharenza - thanks so much for the feedbacks and updates! I have now moved the script and figures to the supp folder.

sjspielman commented 2 years ago

Noting issue #1250 - build fails likely because MRAN is down. This PR is not immediately urgent, so let's keep this PR open (instead of overriding the merge) for a little bit until MRAN comes back and we can pass the CI checks.

sjspielman commented 2 years ago

Noting this wasn't merged because MRAN was down at the time. Coming back now to re-run CI and merge.

AlexsLemonade / OpenPBTA-analysis