This repository contains all the necessary code to perform the evaluations and analyses from our preprint available on bioRxiv.
Analyses discussed in the Differential state analysis of mouse cortex exposed to LPS treatment results section are provided as a browsable workflowr
1 website HERE.
For installation of the required libraries, we'll fist install the r BiocStyle::Biocpkg("BiocManager")
package:
install.packages("BiocManager")
The code in this repository was developed using R v3.6.2 and Bioconductor v3.10. Versions of R and Bioconductor that are currently being run should be checked via:
version
BiocManager::version()
Finally, the code chunk below will install all package dependencies:
# install 'ggrastr' from GitHub
BiocManager::install("VPetukhov/ggrastr")
# install packages from CRAN & Bioconductor
pkgs <- c("AnnotationDbi","circlize","countsimQC","cowplot","data.table",
"DESeq2","DropletUtils","dplyr","edgeR","ggplot2","iCOBRA","kSamples",
"jsonlite","limma","M3C","magrittr","MAST","Matrix","muscat","msigdbr",
"org.Mm.eg.db","pheatmap","purrr","RColorBrewer","readxl","reshape2",
"S4Vectors","scater","scDD","scds","scran","sctransform","Seurat",
"SingleCellExperiment","topGO","UpSetR","viridis","workflowr","yaml")
BiocManager::install(pkgs, ask = FALSE)
R version and library have to be specified under R
in the config.yaml
file (e.g., R: "R_LIBS_USER=/path/to/library /path/to/R/executable"
). If you run into any issues, I recommend running the specified character string from the command line and assuring that the outputs of version
and .libPaths()
are what you expect them to be.
Without modifications, the Snakemake
comparison relies on 2 reference datasets for data simulation, method execution and comparison. These (or any other references) have to be downloaded, saved as .rds
objects with appropriate names, and placed inside the data/raw_data
directory. This is exemplified here for the Kang et al. reference used in the preprint:
# install & load 'ExperimentHub'
BiocManager::install("ExperimentHub")
library(ExperimentHub)
# initialize hub instance
eh <- ExperimentHub()
# list data available in 'muscData'
(q <- query(eh, c("Kang", "muscData")))
# load 'SingleCellExperiment's using IDs from above
sce <- eh[[q$ah_id]]
# save as .rds; name should be '<id>_sce0.rds'
fn <- "kang_sce0.rds"
dir <- file.path("...", "data", "raw_data")
saveRDS(sce, file.path(dir, fn))
Finally, execution of the Snakemake
file requires running the setup.R
script once to create all required directories as well as simulation, method and run parameters:
# from within R
source("setup.R")
# from the terminal
Rscript setup.R
The Snakemake
should run now. A couple more points to note:
sim/run/meth_pars.R
in the scripts
are re-exected with every Snakemake
run, and any changes made to them will automatically be recognized (e.g., when a new simulation scenario or method is added).<id>_sce0.rds
has to be in place as described above"<id>"
has to be added under dids
in the config.yaml
filescripts/prep_<id>.R
has to be added to, for example, assure unique sample identifiers exist, remove un-assigned cells or cell multiplets etc.dids
in config.yaml
data.frame
constructed in scripts/meth_pars.R
id
under ids
in the first line of scripts/meth_pars.R
and code to construct a data.frame
of appropriate format (must include a id
column; see current methods for examples). Secondly, add a apply_<id>.R
script under scripts
that takes as input a SCE with colData
columns cluster/sample/group_id
and returns a data.frame
with p_adj.loc
and p_adj.glb
values for each cluster-gene (see current apply_x.R
scripts for exmples)ids
in the first line of scripts/meth_pars.R
(e.g., to skip all mixed model based methods, one would comment out "mm"
)data.frame
in scripts/meth_pars.R
In brief, our Snakemake
workflow for method comparison is organized into
config.yaml
file specify key parameters and directoriesscripts
folder housing all utilized scripts (see below)data
folder containing raw (reference) and simulated datameta
folder for simulation, runmode, and method parametersresults
folder where all results are generated (as .rds
files)plots
folder where all output plots are generated.pdf
or .png
and .rds
files for ggplot
objects)The table below summarizes the different R scripts in scripts
:
script | description |
---|---|
prep_X |
generates a references SCE for simulation by i) keeping samples from one condition only; and, ii) unifying relevant cell metadata names to "cluster/sample/group_id" |
prep_sim |
prepares a reference SCE for simulation by i) retaining subpopulation-sample combinations with at least 100 cells; and, ii) estimating cell / gene parameters (offsets / coefficients and dispersions) |
sim_pars |
for ea. simulation ID, generates a .json file in meta/sim_pars that specifies simulation parameters (e.g., prob. of DS, nb. of simulation replicates) |
run_pars |
for ea. reference and simulation ID, generates a .json file in meta/run_pars that specifies runmode parameters (e.g., nb. of cells/genes to sample, nb. of run replicates) |
meth_pars |
for ea. method ID, generates a .json file in meta/meth_pars that specifies method parameters |
sim_data |
provided with a reference dataset and simulation parameters, simulates data and writes a SCE to data/sim_data |
apply_X |
wrapper to run DS method of type X (pb , mm , ad , mast , scdd ) |
run_meth |
reads in simulated data, method parameters, and performs DS analysis by running the corresponding apply_X script |
run_meth_lps |
wrapper to apply method to the LPS dataset |
plot_null |
for ea. reference ID, plots nominal p-value distributions for all null simulations |
plot_perf_cat |
plots TPR-FDR-points across DD categories for ea. p-value adjustment type (p_adj.loc/glb ) |
plot_perf_by_nx |
plots TPR-FDR-points across the nb. of x (cells = c , samples = s ) |
plot_perf_by_xs |
plots TPR-FDR-points across increasingly unbalanced sample/group-sizes |
plot_perf_by_expr |
plots TPR-FDR-points across expression-level groups |
plot_upset |
plots an upset plot for the top gene-subpopulation combinations across methods and simulation replications |
plot_lfc |
scatter plots of simulated vs. estimated logFC stratified by method and DD category |
plot_pb_mean_disp |
provided with a reference dataset, simulates a null dataset (no DS, no type-genes) and plots pseudobulk-level mean-dispersion estimates for simulated vs. reference data |
plot_runtimes |
barplots of runtimes vs. nb. of genes/cells |
utils |
various helpers for data handling, formatting, and plotting |
session_info |
generates a .txt file capturing the output of session_info() |
[1]:
John Blischak, Peter Carbonetto and Matthew Stephens (2019).
workflowr: A Framework for Reproducible and Collaborative Data Science.
R package version 1.4.0. https://CRAN.R-project.org/package=workflowr