Compute enrichment of gene sets in our predictions

tmmurali commented 4 years ago

We have a ranked list of predictions coming from network propagation or from host-virus PPI prediction. This issue is relevant mainly for human proteins. We also have a set of gene sets, e.g., from https://amp.pharm.mssm.edu/covid19/. We want to assess to what extent each gene set is enriched in our list of predictions.

There are two approaches I suggest:

1. For each set of top-k predictions, use Fisher's exact test (one-sided, equivalent to the hypergeometric test) to compute the p-value of the overlap between the top-k predictions and a gene set. Plot the absolute value of the logarithm of the p-value (i.e., -log p) as k increases. Alternatively, plot the size of the overlap and colour each point according to whether the overlap is statistically significant. There is no need to try every value of k; increments of 10, 50, or 100 may be sufficient. The increment can be a parameter to the code.
2. Use an enrichment method such as GSEA that considers the entire ranked list of predictions.
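
The first approach above can be sketched as follows. This is only a minimal illustration in pure Python, not the code in our repo: the function names (`hypergeom_pvalue`, `topk_enrichment`) and the `universe_size` parameter are assumptions. It further assumes that the universe is the set of all human proteins that could appear in the ranking and that the gene set has already been restricted to that universe.

```python
from math import comb


def hypergeom_pvalue(overlap, k, set_size, universe_size):
    """One-sided p-value P(X >= overlap) for drawing k genes without
    replacement from a universe that contains set_size gene-set members."""
    total = comb(universe_size, k)
    upper = min(k, set_size)
    # Sum the hypergeometric probabilities of every overlap at least as large.
    tail = sum(
        comb(set_size, i) * comb(universe_size - set_size, k - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def topk_enrichment(ranked_genes, gene_set, universe_size, step=50):
    """Return (k, overlap, p-value) for k = step, 2*step, ... and for the
    full list. ranked_genes must be sorted from best to worst prediction."""
    gene_set = set(gene_set)
    results = []
    overlap = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in gene_set:
            overlap += 1
        # Only test at multiples of the step (and at the end of the list).
        if rank % step == 0 or rank == len(ranked_genes):
            p = hypergeom_pvalue(overlap, rank, len(gene_set), universe_size)
            results.append((rank, overlap, p))
    return results


# Toy example with made-up gene names: 5 gene-set members ranked first.
ranked = ["g%d" % i for i in range(100)]
toy_set = {"g0", "g1", "g2", "g3", "g4"}
toy_results = topk_enrichment(ranked, toy_set, universe_size=100, step=10)
```

Each tuple gives one point for either plot: the p-value for the -log p curve, or the overlap size with a colour chosen from the (corrected) p-value.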

We must correct for testing multiple hypotheses.
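
One common choice for this correction is the Benjamini-Hochberg procedure, which controls the false discovery rate. The issue does not say which correction to use, so this is an assumption, and `benjamini_hochberg` is a hypothetical helper name. Here every (gene set, k) pair would count as one hypothesis.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy example with made-up p-values.
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

An overlap would then be called significant if its adjusted p-value falls below the chosen threshold (e.g., 0.05).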

tmmurali commented 4 years ago

Let us catalogue gene sets here. We need to download each one (see #5) and add it to the enrichment analysis.

[ ] COVID-19 Crowd Generated Gene and Drug Set Library. Ignore all the gene sets with Krogan in the name, since we are already using them for prediction.
[ ] Gene expression datasets
[ ] Protein expression datasets

jlaw9 commented 4 years ago

Currently, the downloadable GMT file for the COVID-19 Crowd Generated Gene sets does not include the main descriptor text of each gene set, making most gene sets unidentifiable.

I made an issue on their repo (#82) asking them to fix it.
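
For reference, a GMT file is tab-separated with one gene set per line: the set name, a description, then the member genes. Below is a minimal parser sketch that assumes this layout. `parse_gmt` and `exclude_substring` are hypothetical names, not functions from our repo. The exclusion option covers skipping the Krogan gene sets mentioned above.

```python
def parse_gmt(lines, exclude_substring=None):
    """Parse GMT lines into {set name: {"description": str, "genes": set}}.

    Sets whose name contains exclude_substring (case-insensitive) are skipped.
    """
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # A valid line needs a name, a description, and at least one gene.
        if len(fields) < 3:
            continue
        name, description, genes = fields[0], fields[1], fields[2:]
        if exclude_substring and exclude_substring.lower() in name.lower():
            continue
        gene_sets[name] = {
            "description": description,
            "genes": {g for g in genes if g},
        }
    return gene_sets


# Toy example with made-up set names; real use would pass an open file handle.
toy_lines = [
    "Set A\tupregulated genes in a screen\tACE2\tTMPRSS2\n",
    "Krogan interactors\tAP-MS hits\tBRD2\tBRD4\n",
    "\n",
]
toy_sets = parse_gmt(toy_lines, exclude_substring="krogan")
```

If the description field is empty or uninformative, the set name is the only way to identify a gene set, which is why the missing descriptor text matters.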

jlaw9 commented 4 years ago

Just found out that, besides running GSEA, GSEApy also has an enrichr module, which lets you run Enrichr's analysis through its API. This could be very useful, as Enrichr has tons of gene sets!

jlaw9 commented 4 years ago

They fixed the GMT file for the COVID-19 Crowd Generated Gene sets!

tmmurali commented 4 years ago

@jlaw9 @n-tasnina what is the status of running our enrichment pipeline on the COVID-19 gene sets?

jlaw9 commented 4 years ago

We have the COVID-19 gene sets in GMT format; we just need to update our scripts to test them for enrichment. Here's the clusterProfiler documentation for using our own gene sets (https://guangchuangyu.github.io/2015/05/use-clusterprofiler-as-an-universal-enrichment-analysis-tool/). @n-tasnina can you add a function for that in our enrichment.py?

n-tasnina commented 4 years ago

Yeah, sure. I will add a function in enrichment.py to do this.

n-tasnina commented 3 years ago

We can close this issue as well. Here is the link to the Python script where we did the enrichment analysis: https://github.com/Murali-group/SARS-CoV-2-network-analysis/blob/enrichment/src/Enrichment/fss_enrichment.py

Murali-group / SARS-CoV-2-network-analysis

Compute enrichment of gene sets in our predictions #6