Enrichment tests on TWAS FDR < 5% genes

aseyedia commented 3 years ago

Peter mentioned that he wanted us to do enrichment tests on the FDR significant genes for each subregion and for both. He also mentioned that Leo has code that I could adapt to this end. I'm opening this issue as a place holder until I meet with Leo and/or Peter reaches out via email and and gives more information (also re: TWAS up/downregulation).

lcolladotor commented 3 years ago

Leo - we discussed the possibility of Arta running the following pathway enrichment tests:

GO enrichment tests of TWAS significant genes (FDR<5%) for sACC

GO enrichment tests of TWAS significant genes (FDR<5%) for AMYG

Go enrichment tests of TWAS significant genes (FDR<5%) for union of two brain regions

Arta, for each of the 3 sets of genes that Peter mentioned you need to:

Define the background universe of genes (for the union, use the union across both brain regions) based on the genes that you had p-values for (that is, remove the NAs). You might want to find the set of Gencode gene IDs for such genes (as well as the TWAS significant ones) , so you can then subset the rse_gene object in the next step (well, rowData(rse_gene)$EntrezID).
For the background universe of genes and the TWAS FDR-significant genes, you then need to find their Entrez gene ID (not all of them have them, so we'll lose genes on both sets). It looks like this info is already stored in the rse_gene objects like at https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/blob/master/wgcna/analyze_results.R#L65-L70
Run clusterProfiler and save the Rdata file similar to https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/blob/master/wgcna/analyze_results.R#L72-L75
Export the results to a CSV (so 3 files in the end) as.data.frame() + write.csv() like in https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/blob/master/wgcna/analyze_results.R#L81. Actually, order them by q-value (FDR-adjusted p-value) and do the trick for Excel as in https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/blob/65a0ef6ee48462e1417155fae11dffae9bede13d/case_control/run_gene_set.R#L96-L99.

lcolladotor commented 3 years ago

If we want to make a plot, we could also make a heatplot() as in http://research.libd.org/SPEAQeasy-example/bootcamp_intro#53_GO_enrichment_analyses though the 3 table files (or a single Excel file with 3 sheets) is what Peter is looking for now.

aseyedia commented 3 years ago

As of right now, I've completed the first three steps and I'm running compareCluster as a job submission as it appears to be a computationally intense step and therefore not feasible on the command line. I'll check back in a little bit later, maybe on the weekend even, to see if it ran correctly. If so, hopefully the remaining steps will be easy and we'll have something ready by next Tuesday. You can take a look at enrichment_test.R if you'd like, but I found it to be pretty simple.

aseyedia commented 3 years ago

**** Job starts ****
Fri Dec  4 14:35:16 EST 2020
...
Error in compareCluster(twas_z_both_fdr$fdr, univ = twas_z_both$fdr, OrgDb = "org.Hs.eg.db",  :
  No enrichment found in any of gene cluster, please check your input...
Execution halted
**** Job ends ****
Fri Dec  4 16:31:39 EST 2020

Above is the error generated last Friday by the enrichment test script. Not only does compareCluster take two hours to run, it also apparently does not work with the data I've supplied it.

Upon further inspection, this is due to the fact that I input a column that does not exist in any of the tables for all three compareCluster commands. In trying to run on the FDR-adjusted P-values, I called dt$fdr, when I should have called dt$fdr.p.

twas_z_both$fdr.p <- p.adjust(twas_z_both$TWAS.P, 'fdr')
twas_z_amyg$fdr.p <- p.adjust(twas_z_amyg$TWAS.P, 'fdr')
twas_z_sacc$fdr.p <- p.adjust(twas_z_sacc$TWAS.P, 'fdr')

twas_z_both_fdr <- twas_z_both[fdr.p < 0.05,]
twas_z_amyg_fdr <- twas_z_amyg[fdr.p < 0.05,]
twas_z_sacc_fdr <- twas_z_sacc[fdr.p < 0.05,]
...
go_both <- compareCluster(twas_z_both_fdr$fdr.p, univ = twas_z_both$fdr.p,
    OrgDb = "org.Hs.eg.db", ont = "ALL",
    readable = TRUE, pvalueCutoff = 1, qvalueCutoff = 1)

go_amyg <- compareCluster(twas_z_amyg_fdr$fdr.p, univ = twas_z_amyg$fdr.p,
    OrgDb = "org.Hs.eg.db", ont = "ALL",
    readable = TRUE, pvalueCutoff = 1, qvalueCutoff = 1)

go_sacc <- compareCluster(twas_z_sacc_fdr$fdr.p, univ = twas_z_sacc$fdr.p,
    OrgDb = "org.Hs.eg.db", ont = "ALL",
    readable = TRUE, pvalueCutoff = 1, qvalueCutoff = 1)

It's a simple mistake. I'm going to rerun it now and see if it produces anything differently.

aseyedia commented 3 years ago

Since my last post, I have also discovered that I am supposed to be inputting Entrez ID values into compareCluster and not p-values.

go_both <- compareCluster(twas_z_both_fdr$EntrezID, univ = twas_z_both$EntrezID,
    OrgDb = "org.Hs.eg.db", ont = "ALL",
    readable = TRUE, pvalueCutoff = 1, qvalueCutoff = 1)

go_amyg <- compareCluster(twas_z_amyg_fdr$EntrezID, univ = twas_z_amyg$EntrezID,
    OrgDb = "org.Hs.eg.db", ont = "ALL",
    readable = TRUE, pvalueCutoff = 1, qvalueCutoff = 1)

go_sacc <- compareCluster(twas_z_sacc_fdr$EntrezID, univ = twas_z_sacc$EntrezID,
    OrgDb = "org.Hs.eg.db", ont = "ALL",
    readable = TRUE, pvalueCutoff = 1, qvalueCutoff = 1)

Unfortunately, I am still receiving an error, but now it's a different type of error:

The following `from` values were not present in `x`: .id
Error in `$<-.data.frame`(`*tmp*`, "Cluster", value = integer(0)) :
  replacement has 0 rows, data has 6128
Calls: compareCluster -> $<- -> $<-.data.frame
Execution halted
**** Job ends ****
Mon Dec  7 16:20:33 EST 2020

It's obvious that I am doing something wrong, so I'm going to focus on going through the documentation carefully and setting up the package that way. There are probably steps that I am neglecting to do.

aseyedia commented 3 years ago

It just occurred to me that you were perhaps giving me an example in your original post about using clusterProfiler and not suggesting that I adapt specifically that function, compareCluster. I'm going to continue to read the documentation more thoroughly but now with consideration to other functions that would better suit our particular circumstance.

aseyedia commented 3 years ago

I'm right now working with the enrichGO function and this Bioconductor support post is informative re: the generation of both "FDR-adjusted p-values" and "q-values" by this package.

aseyedia commented 3 years ago

It appears as though using enrichGO instead of compareCluster will work, according to the test subset that I used:

> test_twas_z_both_fdr <- twas_z_both_fdr[1:100]
> test_twas_z_both <- twas_z_both[1:100]
> go_test_both <- enrichGO(gene = test_twas_z_both_fdr$EntrezID, OrgDb = "org.Hs.eg.db",
+          keyType = "ENTREZID", ont = "ALL", pvalueCutoff = 1,
+          pAdjustMethod = "fdr", universe = test_twas_z_both$EntrezID,
+          qvalueCutoff = 1)
> go_test_both
#
# over-representation test
#
#...@organism    Homo sapiens
#...@ontology    GOALL
#...@keytype     ENTREZID
#...@gene        chr [1:100] "7812" "2011" "9671" "23259" "55741" "22803" "51291" "4069" ...
#...pvalues adjusted by 'fdr' with cutoff <1
#...1921 enriched terms found
'data.frame':   1921 obs. of  10 variables:
 $ ONTOLOGY   : chr  "BP" "BP" "BP" "BP" ...
 $ ID         : chr  "GO:0030203" "GO:1904526" "GO:0006022" "GO:0006024" ...
 $ Description: chr  "glycosaminoglycan metabolic process" "regulation of microtubule binding" "aminoglycan metabolic process" "glycosaminoglycan biosynthetic process" ...
 $ GeneRatio  : chr  "5/80" "2/80" "5/80" "4/80" ...
 $ BgRatio    : chr  "158/18866" "10/18866" "172/18866" "108/18866" ...
 $ pvalue     : num  0.000561 0.000782 0.000823 0.00115 0.001344 ...
 $ p.adjust   : num  0.365 0.365 0.365 0.365 0.365 ...
 $ qvalue     : num  0.358 0.358 0.358 0.358 0.358 ...
 $ geneID     : chr  "SLC10A7/LYG1/SDC2/B4GAT1/CHST1" "MARK2/GAS8" "SLC10A7/LYG1/SDC2/B4GAT1/CHST1" "SLC10A7/SDC2/B4GAT1/CHST1" ...
 $ Count      : int  5 2 5 4 2 4 3 3 2 2 ...
#...Citation
  Guangchuang Yu, Li-Gen Wang, Yanyan Han and Qing-Yu He.
  clusterProfiler: an R package for comparing biological themes among
  gene clusters. OMICS: A Journal of Integrative Biology
  2012, 16(5):284-287

This is good news. I've saved my changes and am now running this script on the cluster. Will report back with results.

aseyedia commented 3 years ago

Running enrichGO was a success.

aseyedia commented 3 years ago

BD_EnrichmentTest.xlsx

Attached is the final enrichment test that I successfully ran yesterday. With it comes 3 sheets, one for each enrichment test. I also converted the geneID to their corresponding gene symbols. I will email this to Peter to see what he thinks. Since he didn't ask us for a heat map or anything beyond just producing the table(s), I will close this issue for now, but will re-open it if there's more work to do.

aseyedia commented 3 years ago

Cross-posting @lcolladotor 's email from email thread (omitting the part about union vs intersection in twas_z_both):

Hi Arta,

Can you calculate the TWAS FDR for each brain region using all the genes that have a non-NA TWAS p-value? That is do https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/blob/master/dev_twas/enrichment_test.R#L69-L73 before dropping the genes without Entrez IDs https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/blob/master/dev_twas/enrichment_test.R#L61-L64.

Once you do that, you can do:

with(twas_z_sacc, addmargins(table(is.na(EntrezID), fdr.p < 0.05)))
with(twas_z_amyg, addmargins(table(is.na(EntrezID), fdr.p < 0.05)))

Thanks

aseyedia commented 3 years ago

# with(twas_z_sacc, addmargins(table(is.na(EntrezID), fdr.p < 0.05, deparse.level = 2)))
               fdr.p < 0.05
is.na(EntrezID) FALSE  TRUE   Sum
          FALSE  7122   124  7246
          TRUE   2722    32  2754
          Sum    9844   156 10000
# with(twas_z_amyg, addmargins(table(is.na(EntrezID), fdr.p < 0.05, deparse.level = 2)))
               fdr.p < 0.05
is.na(EntrezID) FALSE TRUE  Sum
          FALSE  6343   94 6437
          TRUE   2507   31 2538
          Sum    8850  125 8975

Running the script on JHPSCE now.

lcolladotor commented 3 years ago

Hm... I don't think that this matters, but well, in https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/blob/master/dev_twas/enrichment_test.R#L91-L94 we want to use unique(twas_z_both_fdr$EntrezID) instead of twas_z_both_fdr$EntrezID for the genes of interest and unique(twas_z_both$EntrezID) for the universe. I don't think that it matters, because maybe enrichGO does the unique() step inside also. Can you check this? Thanks! Like, I wouldn't want the GO results to be weight weirdly genes that show up twice in the set of genes of interest and/or the universe background.

aseyedia commented 3 years ago

BD_EnrichmentTest.xlsx The script was successful, although {xlsx} once again gave me this bizarre warning:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.poi.util.SAXHelper (file:/jhpce/shared/jhpce/core/conda/miniconda3-4.6.14/envs/svnR-4.0.x/R/4.0.x/lib64/R/site-library/xlsxjars/java/poi-ooxml-3.10.1-20140818.jar) to method com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)
WARNING: Please consider reporting this to the maintainers of org.apache.poi.util.SAXHelper
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

Does not bode well for reproducibility. Regardless, the table was generated and I will be sending it to Peter.

aseyedia commented 3 years ago

From Peter:

Thanks, Arta! Unfortunately, the results didn't change all that much and the few pathways that come up as q<0.05 are not ones we would think are relevant.... hmm... Anyway, appreciate you running them again, pz

LieberInstitute / zandiHyde_bipolar_rnaseq

Enrichment tests on TWAS FDR < 5% genes #6