LieberInstitute / zandiHyde_bipolar_rnaseq

RNA-seq analysis for psychENCODE R01
6 stars 1 forks source link

TWAS: sACC vs Amygdala + TWAS vs case-control signal #4

Closed lcolladotor closed 3 years ago

lcolladotor commented 3 years ago

Hi Arta,

Similar to #3, once #2 is done, can you make plots like Figure S45 and S47 from https://www.cell.com/cms/10.1016/j.neuron.2019.05.013/attachment/0f56af5c-ec8d-42eb-b6d8-33bf7495e635/mmc1.pdf? Thanks!

Figure S45

Figure S45 involves this code https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/explore_twas_psycm.R#L262-L325 though we could also make a gene-level only plot like https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/explore_twas_psycm.R#L327-L345. The code is the same one that was then adapted for https://github.com/LieberInstitute/twas/blob/master/bsp2/read_twas.R#L528-L575.

We need to use the same colors for the regions as the ones Emily used for the rest of the BPD project.

Figure S47

For Figure S47, we need to adapt code from https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/explore_twas_psycm.R#L662-L680 if I'm remembering correctly. This involves reading in the BPD case-control differential expression results, so adapting this code might involve multiple steps.

It might be easier to have a single R script for each plot.

Best, Leo

aseyedia commented 3 years ago

Attaching the figures for future reference

image image

aseyedia commented 3 years ago

We need to use the same colors for the regions as the ones Emily used for the rest of the BPD project.

Where can I find examples of these colors? So I know for the Manhattan plots as well

lcolladotor commented 3 years ago

Hm....

image

Those are colors Peter used in one figure. Though it might be best to double check with him. Emily had used other colors at https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/blob/master/case_control/venn_qSVAjoint_gene.png

aseyedia commented 3 years ago

B5ADE6 (Lavender)

B6D2E6 (Light Blue)

FFFF7F (Yellow)

B2CCFF (sACC blue)

BFBFBF (Gray)

aseyedia commented 3 years ago

Having a bit of trouble understanding the ggplot code here, specifically lines 328-329 and lines 337-338:

ggplot(subset(region_twas_z, feature == 'gene'),
    aes(x = DLPFC, y = HIPPO, color = FDR.5perc, shape = in_both))

region_twas_z was generated using a custom function called get_variable_by_region on line 262, defined here , wherein the argument TWAS.Z was input .

I'm having trouble understanding what this function does with this column without having the data and going through the function myself. I see that only two lines actually use the var input on lines 43 & 44, but it seems a bit beyond me what that means.

Of course, I could also try to recreate this plot on my own using the data I have available, which might be the best course of action, but seeing that we have a meeting together tomorrow, I will convene with you and see what you think. Regardless, I don't think figuring out how to adapt this code would be difficult, but it's not as simple as change the names of the arguments in ggplot.

aseyedia commented 3 years ago

I find myself to be a bit lost. Since I'm not familiar with the source code, I'm not sure what the input arguments to the ggplot are supposed to be. Here's what I'm working with:

fdr.z <- fdrtool(twas_z$TWAS.Z)

ggplot(twas_z,
    aes(
        x = twas_z[region == "amygdala",]$TWAS.Z,
        y = twas_z[region == "sacc"]$TWAS.Z,
        color = fdr.z$pval,
        shape = in_both
    )) +
    geom_point() +
    # facet_grid(BEST.GWAS.status ~ feature) +
    coord_fixed() +
    theme_bw(base_size = 30) +
    ggtitle('TWAS Z by brain region') +
    scale_color_manual(values = c('grey80', 'dark orange', 'skyblue3', 'purple'))

I'm using fdrtool because I'm not totally familiar with how FDR.5perc was generated in the source code, so I went with this package that seems to provide FDR-corrected values for Z-scores.

I'm also not sure about the nature of the aes() arguments. As it stands right now, I am getting the following error whenever I try to run this ggplot:

Error: Aesthetics must be either length 1 or the same as the data (18975): x and y

Which, to me, means that both arguments need to be 18975 values in length.

When I tried to test this ggplot using 18975 random values for each x and y argument:

x_test <- runif(18975, min = -3, max = 3)
y_test <- runif(18975, min = -3, max = 3)

I get this:

Error: Continuous value supplied to discrete scale

Which I think has something to do with the color argument in aes():

aes(
        x = twas_z[region == "amygdala",]$TWAS.Z,
        y = twas_z[region == "sacc"]$TWAS.Z,
        color = fdr.z$pval,
        shape = in_both
    )

One way I could figure out how to solve this issue is to get the source data that was used in the script from which I am adapting this ggplot, but that does not seem like the most efficient way to approach this problem.

lcolladotor commented 3 years ago

FDR.5perc is a factor with 4 levels that gets calculated at https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/twas_functions.R#L55-L58 and https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/twas_functions.R#L75 which says whether both brain regions (DLPFC and HIPPO in the BrainSeq Phase II case, sACC and Amygdala here) are have TWAS FDR-adjusted p-values less than 0.05 (both, DLPFC only, HIPPO only, or none of them).

The above function is used at https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/explore_twas_psycm.R#L262 for the psycm GWAS. I made it into a function because I used it also for exploring the results from a second GWAS. It's likely that we'll need to tweak that function for the sACC and Amygdala data you are working on.

The FDR-adjusted p-values are calculated using p.adjust() as in https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/explore_twas_psycm.R#L148. I wouldn't use a different package (fdrtool) for that, to keep it consistent with what we've done. I'm not familiar with fdrtool, so we could look into it later in an R club session, but let's avoid using it for now.

aseyedia commented 3 years ago

With @lahuuki's help, I was able to generate the following: image

Doesn't look exactly like what you've linked but it's a good start. I'm going to push the changes I've made to JHPCE and we can meet next Monday.

aseyedia commented 3 years ago

The plot above is colored according to quadrant and not according to significance, and is not plotting fdr-corrected values.

After our meeting this morning, we came up with the following actions for the above plot:

aseyedia commented 3 years ago

image Here's an update. Turns out I was supposed to be using FDR-corrected P-values and not Z-scores. Now I have to figure out how to get in_both back.

aseyedia commented 3 years ago

image Here we go.

lcolladotor commented 3 years ago

Nice, it looks better! Though the colors in the cross need some tweaks.

Good job Arta ^^

aseyedia commented 3 years ago

image

aseyedia commented 3 years ago

image Fixed it after it became apparent during the Moods meeting that the FDR.5perc column was inaccurately categorizing based on raw P values and not on FDR-corrected P values.

lcolladotor commented 3 years ago

Nice, thanks Arta! This looks good =)

aseyedia commented 3 years ago

Thanks Leo.

Posting relevant code for the second plot here:

image

https://github.com/LieberInstitute/brainseq_phase2/blob/e1a67fe8900ffcc637ad7f405b620dc15a93d4af/twas/explore_twas_psycm.R#L662-L680

tt_sigonly <- tt[tt$TWAS.FDR < 0.05, ]
with(tt_sigonly, addmargins(table(BEST.GWAS.status, EQTL.status, useNA = 'ifany')))
#                 EQTL.status
# BEST.GWAS.status Index Other Proxy  Sum
#            Index     2   660   113  775
#            Other     0  6912     0 6912
#            Proxy     0  1787   367 2154
#            Sum       2  9359   480 9841

tt_sigonly_bonf <- tt[tt$TWAS.Bonf < 0.05, ]
with(tt_sigonly_bonf, addmargins(table(BEST.GWAS.status, EQTL.status, useNA = 'ifany')))
#                 EQTL.status
# BEST.GWAS.status Index Other Proxy  Sum
#            Index     2   252    60  314
#            Other     0   230     0  230
#            Proxy     0   517   285  802
#            Sum       2   999   345 1346

create_gwas_or_eqtl(tt_sigonly, 'pdf/twas_fdr5perc_vs_gwas_or_eqtl.pdf', 'FDR')

https://github.com/LieberInstitute/brainseq_phase2/blob/e1a67fe8900ffcc637ad7f405b620dc15a93d4af/twas/twas_functions.R#L86-L145

aseyedia commented 3 years ago

@lcolladotor, I am able to generate a plot using the code you supplied to me for the second scatterplot that is supposed to resemble Figure S47, but I am a bit lost as to what exactly you want me to plot, because the figure you showed me as an example and the code you sent me to be adapted are a bit different. Here's what I have:

image

twas_z_scatter <-
    select(twas_z,
        geneid,
        genesymbol,
        TWAS.Z,
        TWAS.P,
        EQTL.Z,
        BEST.GWAS.Z,
        region) %>%
    as.data.table()

twas_z_scatter$region <-
    factor(twas_z_scatter$region,
        levels = c('amygdala', 'sacc'))

print(
    ggplot(twas_z_scatter,
        aes(
            x = TWAS.Z,
            y = EQTL.Z,
            color = pnorm(BEST.GWAS.Z) < 5e-08
        )) + geom_point()
    + labs("") + facet_grid( ~ region) +
        theme_bw(base_size = 30) +
        scale_color_discrete(name = "Best\nGWAS\nP-value") +
        labs(caption = 'Risk Loci by EQTL')
)

Does not quite look similar to the plots you showed me so I'll wait until lab meeting to ask you what exactly you're looking for.

lcolladotor commented 3 years ago

Use the following code to obtain the t-statistics for the case-control differences at the gene level in both amygadala and the sACC.

library("SummarizedExperiment")
load("/dcl01/lieber/ajaffe/lab/zandiHyde_bipolar_rnaseq/case_control/bipolarControl_deStats_byRegion_qSVAjoint_withAnnotation.rda", verbose=TRUE)
statOutGene <- statOut[statOut$Type == "Gene",]

You'll want to use the t_Amyg and the t_sACC columns.

> head( statOutGene)
DataFrame with 6 rows and 17 columns
                  logFC_Amyg AveExpr_Amyg    t_Amyg P.Value_Amyg adj.P.Val_Amyg
                   <numeric>    <numeric> <numeric>    <numeric>      <numeric>
ENSG00000227232.5 -0.0330728     1.340582 -0.414687   0.67877922       0.909528
ENSG00000278267.1  0.1612828    -1.412959  1.236884   0.21746318       0.658652
ENSG00000269981.1  0.0810748    -2.146991  0.470339   0.63858384       0.896763
ENSG00000279457.3  0.2116777     2.184323  3.114915   0.00208757       0.117127
ENSG00000228463.9  0.0745462     1.862994  0.635151   0.52599712       0.852000
ENSG00000236679.2  0.1817031    -0.492242  2.053826   0.04118791       0.365443
                     B_Amyg logFC_sACC AveExpr_sACC    t_sACC P.Value_sACC
                  <numeric>  <numeric>    <numeric> <numeric>    <numeric>
ENSG00000227232.5  -5.84638 0.00132064      1.35091 0.0162341    0.9870610
ENSG00000278267.1  -4.75310 0.05092425     -1.30810 0.4402495    0.6601466
ENSG00000269981.1  -5.16546 0.33776932     -2.44735 2.0371945    0.0427113
ENSG00000279457.3  -1.57665 0.11896450      2.20583 1.8870357    0.0603442
ENSG00000228463.9  -5.84781 0.01144101      1.78457 0.1266352    0.8993336
ENSG00000236679.2  -3.75354 0.08722632     -0.55163 1.0176247    0.3098656
                  adj.P.Val_sACC    B_sACC         gencodeID      Symbol
                       <numeric> <numeric>       <character> <character>
ENSG00000227232.5       0.996119  -5.73480 ENSG00000227232.5      WASH7P
ENSG00000278267.1       0.886615  -5.12574 ENSG00000278267.1   MIR6859-1
ENSG00000269981.1       0.290631  -3.84247 ENSG00000269981.1
ENSG00000279457.3       0.340016  -4.28670 ENSG00000279457.3
ENSG00000228463.9       0.971597  -5.82341 ENSG00000228463.9
ENSG00000236679.2       0.683774  -4.91603 ENSG00000236679.2   RPL23AP24
                         Type       Class  EntrezID
                  <character> <character> <integer>
ENSG00000227232.5        Gene       InGen        NA
ENSG00000278267.1        Gene       InGen 102466751
ENSG00000269981.1        Gene       InGen        NA
ENSG00000279457.3        Gene       InGen 102723897
ENSG00000228463.9        Gene       InGen        NA
ENSG00000236679.2        Gene       InGen        NA
> dim(statOutGene)
[1] 25136    17

That's the information we want on the Y-axis. On the X-axis we want the TWAS Z score. I wouldn't worry about faceting by "risk locus" or coloring by "best GWAS p-value < e5-08".

Also, for the TWAS Amygadala vs TWAS sACC plot, can you use the colors for the regions you listed earlier? The orange & blue colors you used are those that we had used in the past for DLPFC and HIPPO in the BSP2 paper. Thanks.

aseyedia commented 3 years ago

I'm on it.

Also, for the TWAS Amygadala vs TWAS sACC plot, can you use the colors for the regions you listed earlier? The orange & blue colors you used are those that we had used in the past for DLPFC and HIPPO in the BSP2 paper. Thanks.

I tried earlier and the colors were illegible/did not look very good, especially the faded pastel yellow against a white background. I can generate it again so you can see what I mean.

aseyedia commented 3 years ago

image Here's a draft of the second scatterplot. It looks pretty bare; let me know if you want me to add anything to it or embellish it at all. I would have facet_wrapped by region but the subregion data are in separate columns so it was just easier to make them as separate plots.

image Above is the other plot with the colors you requested. I used #FFFF7F ("Laser Lemon" according to coolors.co) for amygdala, #B2CCFF ("Baby Blue Eyes") for sACC, #B2CCBF (Opal) for Both, and a slightly darker shade of gray (#BFBFBF or "Gray X 11 Gray") for neither. I would strongly recommend using a different color palette even though it would be inconsistent with the one used throughout the rest of the paper. If it's too much trouble, this could obviously be used, and of course, it's ultimately not my decision, but as the person who generated the plot, I will say that it's a) difficult to even look at the amygdala points and b) difficult to distinguish "both" points from "sACC."

aseyedia commented 3 years ago

Here are the TWAS Z vs t-stat scatterplots:

image image

re: the "TWAS Z by Brain Region" scatterplot, we decided we are going to wait until next Tuesday to see what their idea is on changing the color palette. Once we get an answer from them, I will update the colors and then (unless there is anything else to cover) close this issue.

lcolladotor commented 3 years ago

The new plots look good. Is the red line a regression line? It's basically at y = 0.

Can you make the X and Y axis legends bigger? This can be done using theme_bw(base_size = 20) for example (though the exact number will depend on the size of your PDF file). Basically, imagine that someone is showing this plot in a presentation. Right now, the axis labels and axis ticks are too small to read.

aseyedia commented 3 years ago

The new plots look good. Is the red line a regression line? It's basically at y = 0.

Yeah, I thought you wanted me to add one?

Can you make the X and Y axis legends bigger?

The labels were originally at theme_bw(base_size = 30) but it was too hard to look at so I set them to default. But sure, I'll tweak them to make it look good.

lcolladotor commented 3 years ago

Err no, I meant the correlation coefficients, like at https://github.com/LieberInstitute/brainseq_phase2/tree/master/twas#figure-1.

And ok, tweaking the base_size sounds good.

aseyedia commented 3 years ago

image image

Unfortunately, I wasn't able to figure out how to force the rho value to display in scientific notation with sigfigs, despite formatting it that way more than once.

amyg_rho <- format(cor(merged_t$TWAS.Z_amygdala, merged_t$t_Amyg), digits = 3, scientific = TRUE)
sacc_rho <- format(cor(merged_t$TWAS.Z_sacc, merged_t$t_sACC), digits = 3, scientific = TRUE)

ggplot(merged_t,
       aes(x = TWAS.Z_amygdala,
           y = t_Amyg)) + geom_point() + labs(title = "TWAS vs BD differential expression in Amygdala", x = "TWAS Z score", y = "BD vs control t-statistic") + annotate("text", x = -5.5, y = 6, label = paste0("rho == ", formatC(amyg_rho, format = "e")), parse = TRUE) + scale_y_continuous(breaks = c(-6, -3, 0, 3, 6)) + xlim(-6, 6) +
    theme_bw(base_size = 20)

ggplot(merged_t,
       aes(x = TWAS.Z_sacc,
           y = t_sACC)) + geom_point() + labs(title = "TWAS vs BD differential expression in sACC", x = "TWAS Z score", y = "BD vs control t-statistic")+ annotate("text", x = -5.5, y = 6, label = paste0("rho == ", formatC(sacc_rho, format = "e")), parse = TRUE) + scale_y_continuous(breaks = c(-6, -3, 0, 3, 6)) +
    scale_x_continuous(breaks = waiver()) +
    theme_bw(base_size = 20)
aseyedia commented 3 years ago

I also updated the legend titles for the scatterplot and base_size: image

aseyedia commented 3 years ago
aseyedia commented 3 years ago

image It still doesn't really look good if I turn down the brightness and turn up the saturation. I think yellow and green are just bad colors for a white background.

aseyedia commented 3 years ago

generate_twas_plots.R is basically done. I haven't run it on the cluster yet; I've only run parts of it and then uploaded the results onto the cluster, but it should work. I'm keeping this issue open for when we settle on a color scheme for the above scatter plot, but that should only take a day or so.

aseyedia commented 3 years ago

After meeting with Peter, we agreed that the original color scheme with the purple and orange is the best for the plot (https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/issues/4#issuecomment-729117914), and there's no real need to use the same color scheme throughout the paper as long as the plot is clearly labeled. I will develop an SGE script that will generate these plots on the cluster and eventually run it to regenerate the plots, but for the time being, I will close this issue.