A Snakemake 8 workflow for enrichment analysis and visualization of human (hg19 or hg38) or mouse (mm9 or mm10) based genomic region sets and (ranked) gene sets. Together with the respective background region/gene sets, the enrichment within the configured databases is determined using LOLA, GREAT, GSEApy (over-representation analysis (ORA) & preranked GSEA), pycisTarget, RcisTarget and results saved as CSV files. Additionally, the most significant results are plotted for each region/gene set, database queried, and analysis performed. Finally, the results within the same "group" (e.g., stemming from the same analysis) are aggregated per database and analysis in summary CSV files and visualized using hierarchically clustered heatmaps and bubble plots. For collaboration, communication and documentation of results, methods and workflow information a detailed self-contained HTML report can be generated.
[!NOTE]
This workflow adheres to the module specifications of MR.PARETO, an effort to augment research by modularizing (biomedical) data science. For more details, instructions, and modules check out the project's repository.⭐️ Star and share modules you find valuable 📤 - help others discover them, and guide our focus for future work!
[!IMPORTANT]
If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI 10.5281/zenodo.7810621.
This project wouldn't be possible without the following software and their dependencies:
Software | Reference (DOI) |
---|---|
Enrichr | https://doi.org/10.1002/cpz1.90 |
ggplot2 | https://ggplot2.tidyverse.org/ |
GREAT | https://doi.org/10.1371/journal.pcbi.1010378 |
GSEA | https://doi.org/10.1073/pnas.0506580102 |
GSEApy | https://doi.org/10.1093/bioinformatics/btac757 |
LOLA | https://doi.org/10.1093/bioinformatics/btv612 |
pandas | https://doi.org/10.5281/zenodo.3509134 |
pheatmap | https://cran.r-project.org/package=pheatmap |
pycisTarget | https://doi.org/10.1038/s41592-023-01938-4 |
RcisTarget | https://doi.org/10.1038/nmeth.4463 |
rGREAT | https://doi.org/10.1093/bioinformatics/btac745 |
Snakemake | https://doi.org/10.12688/f1000research.29032.2 |
This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (workflow/envs/*.yaml file
) or post-execution in the result directory (enrichment_analysis/envs/*.yaml
). Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g., [X].
The outlined analyses were performed using the programming languages R (ver) [ref] and Python (ver) [ref] unless stated otherwise. All approaches statistically correct their results using expressed/accessible background genomic region/gene sets from the respective analyses that yielded the query region/gene sets.
Genomic region set enrichment analyses
LOLA. Genomic region set enrichment analysis was performed using LOLA (ver) [ref], which uses Fisher’s exact test. The following databases were queried [lola_databases].
GREAT. Genomic region set enrichment analysis was performed using GREAT [ref] implemented with rGREAT (ver) [ref]. The following databases were queried [local_databases].
pycisTarget. Genomic region set TFBS (Transcription Factor Binding Site) motif enrichment analysis was performed using pycisTarget (ver) [ref]. The following databases were queried [pycisTarget_databases].
Furthermore, genomic regions (query- and background-sets) were mapped to genes using GREAT (without background) and then analyzed as gene sets as described below for a complementary and extended perspective.
Gene set enrichment analyses (GSEA)
Over-representation analysis (ORA). Gene set ORA was performed using Enrichr [ref], which uses Fisher’s exact test (i.e., hypergeometric test), implemented with GSEApy's (ver) [ref] function enrich. The following databases were queried [local_databases].
Preranked GSEA. Preranked GSEA was performed using GSEA [ref], implemented with GSEApy's (ver) [ref] function prerank. The following databases were queried [local_databases].
RcisTarget. Gene set TFBS (Transcription Factor Binding Site) motif enrichment analysis was performed using RcisTarget (ver) [ref]. The following databases were queried [RcisTarget_databases].
Aggregation The results of all queries belonging to the same analysis [group] were aggregated by method and database. Additionally, we filtered the results by retaining only the union of terms that were statistically significant (i.e. [adj_pvalue]<=[adjp_th]) in at least one query.
Visualization All analysis results were visualized in the same way.
For each query, method and database combination an enrichment dot plot was used to visualize the most important results. The top [top_n] terms were ranked (along the y-axis) by the mean rank of statistical significance ([p_value]), effect-size ([effect_size]), and overlap ([overlap]) with the goal to make the results more balanced and interpretable. The significance (adjusted p-value) is denoted by the dot color, effect-size by the x-axis position, and overlap by the dot size.
The aggregated results per analysis [group], method and database combination were visualized using hierarchically clustered heatmaps and bubble plots. The union of the top [top_terms_n] most significant terms per query were determined and their effect-size and significance were visualized as hierarchically clustered heatmaps, and statistical significance ([adj_pvalue] < [adjp_th]) was denoted by *. Furthermore, a hierarchically clustered bubble plot encoding both effect-size (color) and statistical significance (size) is provided, with statistical significance denoted by *. All summary visualizations’ values were capped by [adjp_cap]/[or_cap]/[nes_cap] to avoid shifts in the coloring scheme caused by outliers.
The analysis and visualizations described here were performed using a publicly available Snakemake (ver) [ref] workflow [10.5281/zenodo.7810621].
The five tools LOLA, GREAT, pycisTarget, RcisTarget and GSEApy (over-representation analysis (ORA) & preranked GSEA) are used for various enrichment analyses. Databases to be queried can be configured (see ./config/config.yaml
). All approaches statistically correct their results using the provided background region/gene sets.
\*.bed
)
lola_databases
) taken from LOLA Region Databases or custom created using these instructionslocal_databases
), additional resources are downloaded automatically during the analysis.pycistarget_parameters:databases
) from the cisTarget resources.\*.txt
) over-representation analysis (ORA_GSEApy)
local_databases
).Rcistarget_parameters:databases
) from the cisTarget resources.\*.bed
) over-representation analysis (ORA_GSEApy) & TFBS motif enrichment analysis (RcisTarget)
\*.csv
) enrichment analysis (preranked_GSEApy)
local_databases
).local_databases
) for rGREAT and GSEApy
{adjp_th}
) terms per query is also saved as a long-format table (CSV).{top_n}
terms are ranked (along the y-axis) by the mean rank of statistical significance ({p_value}
), effect-size ({efect_size}
e.g., log2(odds ratio) or normalized enrichemnt scores), and overlap ({overlap}
e.g., coverage or support) with the goal to make the ranking more balanced and interpretable{top_terms_n}
most significant terms per query, method, and database within a group is determined. \*
(PDF).\*
(PNG).{adjp_cap}
/{or_cap}
/{nes_cap}
) to avoid shifts in the coloring scheme caused by outliers.{result_path}/enrichment_analysis
)
{query}
and {group}
{query}/{method}/{database}/
containing:
{query}\_{database}.csv
{query}\_{database}.{png}
{group}/{method}/{database}/
containing
{group}\_{database}\_all.csv
{group}\_{database}\_sig.csv
{top_terms_n}
terms (PDF): {group}\_{database}\_{adjp|effect}\_heatmap.pdf
{group}\_{database}\_summary.{png}
Note:
column_names:GREAT
.Here are some tips for the usage of this workflow:
Detailed specifications can be found here ./config/README.md
We provide four example queries across all tools with four different databases:
score=-log10(p-value) * sign(LFC)
).Follow these steps to run the complete analysis:
Download all necessary data (query and resources)
# change working directory to the cloned worklfow/module enrichment_analysis
cd enrichment_analysis
# download and extract the region set test data
wget -c http://cloud.databio.org.s3.amazonaws.com/vignettes/lola_vignette_data_150505.tgz -O - | tar -xz -C test/data/
# create and enter resources folder
mkdir resources
cd resources
# download LOLACore databases and move to the correct location
wget http://big.databio.org/regiondb/LOLACoreCaches_180412.tgz
tar -xzvf LOLACoreCaches_180412.tgz
mv nm/t1/resources/regions/LOLACore/ .
rm -rf nm
# download a local database
wget https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.2.Hs/c2.cgp.v2023.2.Hs.symbols.gmt
# download cisTarget resources
mkdir cistarget
cd cistarget
wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc_v10_clust/gene_based/hg38_500bp_up_100bp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather
wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.rankings.feather
wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl
# change your working directory back to the root of the module
cd ../../
conda activate snakemake
# dry-run
snakemake -p --use-conda --configfile test/config/example_enrichment_analysis_config.yaml -n
# real run
snakemake -p --use-conda --configfile test/config/example_enrichment_analysis_config.yaml
# report
snakemake --report test/report.html --configfile test/config/example_enrichment_analysis_config.yaml
The following publications successfully used this module for their analyses.