LieberInstitute / zandiHyde_bipolar_rnaseq

RNA-seq analysis for psychENCODE R01
6 stars 1 forks source link

Re-create Gandal et al 2018 figure #3

Closed lcolladotor closed 3 years ago

lcolladotor commented 3 years ago

Hi Arta,

Once you are done with #2, using the gene-level TWAS results (all of them, not just the significant ones), can you make a plot like Figure 4 from https://science.sciencemag.org/content/362/6420/eaat8127? We don't have code for this, but we'll likely need to use ggplot2 and ggrepel to make the labels. We need two plots here. One for the gene-level TWAS results for sACC and a second one for Amygdala.

Make this into a single script, and potentially into a function that we can pass to it the gene-level sACC data or the gene-level Amygdala data. By having that function, we could potentially re-use it for the exon, exon-exon junction or transcript-level results if we need to later.

Thanks, Leo

aseyedia commented 3 years ago

F5 large

uploading the image for posterity

aseyedia commented 3 years ago

I intend on using this hyperlinked guide to making a Manhattan plot for these data, before I figure out how to modify the y-axis to include negative values. In order to do that, I need to compute the "the cumulative position of SNP."

First of all, we need to compute the cumulative position of SNP.


don <- gwasResults %>% 

Compute chromosome size

group_by(CHR) %>% summarise(chr_len=max(BP)) %>%

Calculate cumulative position of each chromosome

mutate(tot=cumsum(chr_len)-chr_len) %>% select(-chr_len) %>%

Add this info to the initial dataset

left_join(gwasResults, ., by=c("CHR"="CHR")) %>%

Add a cumulative position of each SNP

arrange(CHR, BP) %>% mutate( BPcum=BP+tot)


However, base-pair positions are not included with the output of `read_twas.R`, `twas_exp.Rdata`. So, I need to modify `read_twas.R` in order to retrieve and include the positions of each gene.

`rowRanges()` gives me a GRanges object with the base-pair location of each gene in the `rse` object:

rowRanges(rse) GRanges object with 25136 ranges and 10 metadata columns: seqnames ranges strand | Length gencodeID

| ENSG00000227232.5 chr1 14404-29570 - | 1351 ENSG00000227232.5 ENSG00000278267.1 chr1 17369-17436 - | 68 ENSG00000278267.1 ENSG00000269981.1 chr1 137682-137965 - | 284 ENSG00000269981.1 ENSG00000279457.3 chr1 184923-200322 - | 1982 ENSG00000279457.3 ENSG00000228463.9 chr1 257864-297502 - | 4039 ENSG00000228463.9 ... ... ... ... . ... ... ENSG00000198695.2 chrM 14149-14673 - | 525 ENSG00000198695.2 ENSG00000210194.1 chrM 14674-14742 - | 69 ENSG00000210194.1 ENSG00000198727.2 chrM 14747-15887 + | 1141 ENSG00000198727.2 ENSG00000210195.2 chrM 15888-15953 + | 66 ENSG00000210195.2 ENSG00000210196.2 chrM 15956-16023 - | 68 ENSG00000210196.2 ```

My plan is to extract IRanges for each gene, calculate the average to find the rough "midpoint" for each gene present in the TWAS data, and then include that with the output.

Of course, I could be overthinking this all and "doing too much," but it does seem clear to me that I need the base-pair position of each gene expressed along the transcriptome as in the figure you pointed me to in the first post.

I was able to find Dr. Gusev's code as an alternative for creating a "mirrored" Manhattan plot here, but after cloning it and examining the code, it seemed even more difficult to approach than what I already had.

I also explored other options, such as this package for making mirrored Manhattan plots, or this one (with an example here), but I figured I'd touch base with you. Let me know what you think.

aseyedia commented 3 years ago

Update: I was successfully able to generate the plot. I'm now working on the plotly html that allows the user to pinpoint and view each gene, but I am having trouble with this.

aseyedia commented 3 years ago

I have successfully generated an interactive Manhattan plot for our TWAS results using plotly. We can change it here and there and make modifications but the plot itself is finished. Interactive plot is accessible here: /dcl01/lieber/ajaffe/lab/zandiHyde_bipolar_rnaseq/dev_twas/BIP_TWAS_Plotly.html. Plots.zip

aseyedia commented 3 years ago

From Leo:

Can you separate it into 2, one per region. Also, can you add a horizontal line (it’ll be 2, one positive and one negative) to show the Bonf p-value < 0.05? Thanks!

lcolladotor commented 3 years ago

Can you also add the Gene symbol and a search box? For the gene symbol, that just means adding it as a column to the ggplot2 data.frame you are using. For the search box, that will involve some of the options from https://gist.github.com/lcolladotor/350ac8843e153d135f64880a1d029b14#file-2020-09-11_plotly-r-L88-L95

aseyedia commented 3 years ago

The TWAS Z score Manhattan plot PDF as well as the interactive (with gene symbol search box) plot is now available in dev_twas/. I am not going to close this issue yet, because there still remains the problem of adding the Bonferroni P-value threshold https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/issues/3#issuecomment-718159960. I'm having trouble understanding how to apply the Bonferroni-corrected P value threshold on a Manhattan plot for TWAS Z scores, but maybe after a meeting we could clear it up. For now, I'm going to move onto generating the other plots.

lcolladotor commented 3 years ago

So in the past we used p.adjust() to compute the Bonferroni-adjusted p-values for each region / feature at a time (see https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/explore_twas_psycm.R#L149). In this case, we have only gene-level results and 2 brain regions, so we need to compute 2 Bonferroni thresholds.

Now, for the plot, since it's showing Z-scores, we need to go from a Bonferroni adjusted p-value back to Z values. Our test is two sided, so we divide our initial alpha (0.05) by 2. Then we can use the qnorm() function to find the corresponding Z value that we need to set a horizontal line at (both + Z and -Z).

> qnorm(1 - (0.05 / 2))
[1] 1.959964

Now that we know that, we need to find the number of tests. So I initially loaded your data following https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/blob/master/dev_twas/generate_twas_plots.R#L1-L12. I see tht we have about 15 to 16k TWAS results by brain region.

> table(twas_exp_fin$region)

amygdala     sacc
   15278    16331

So the corresponding Z value we need for each can be calculated from.

> qnorm(1 - 0.025 / table(twas_exp_fin$region))

amygdala     sacc
4.652920 4.666639

However, I then noticed the NAs.

> summary(twas_exp_fin$TWAS.Z)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 -5.860  -0.846  -0.011  -0.015   0.830   6.221   12634
> summary(twas_exp_fin$TWAS.P)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   0.149   0.401   0.429   0.692   1.000   12634

which I see that you removed yourself with https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/blob/master/dev_twas/generate_twas_plots.R#L12 and we had done something similar in BrainSeq Phase II at https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/explore_twas_psycm.R#L17-L18.

>
> summary(twas_z$TWAS.P)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.1490  0.4010  0.4292  0.6920  1.0000
> summary(twas_z$TWAS.Z)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-5.86030 -0.84645 -0.01097 -0.01538  0.83035  6.22090

So, after removing the NA TWAS results, we have about 9 to 10k results per brain region. So then, the Z values you need for your plots are:

> table(twas_z$region)

amygdala     sacc
    8975    10000
> qnorm(1 - 0.025 / table(twas_z$region))

amygdala     sacc
4.542048 4.564788
aseyedia commented 3 years ago

Done. I've added the Bonferroni-corrected Z-score threshold and generated the resulting plots. I will post all plots to the Slack channel but they are available in the dev_twas/analysis/ path directory.