ZhangLabGT / scMultiSim

A simulator for single cell multi-omics and spatial omics data that provides ground truth to benchmark a wide range of methods.
19 stars 4 forks source link

Table of contents

scMultiSim

scMultiSim is an in silico simulator that generates multi-modality data of single-cells, including gene expression, chromatin accessibility, RNA velocity, and spatial location of cells. It takes a cell differential tree and a gene regulatory network (GRN) as input, and simulates spliced and unspliced counts while accounting for the relationships between modalities. The output single cell gene expression data is determined by three factors: cell-cell interactions, within-cell GRNs and chromatin accessibility. Users can tune the effect of each factor on the output data and set various parameters for the underlying model. Furthermore, the GRN can be set in a time-varying mode where the network's structure changes temporally to reflect the dynamic nature of biological networks. We also provide options to simulate technical variations such as batch effects. scMultiSim can be used to benchmark challenging computational tasks on single-cell multi-omics data, including the inference of GRNs, estimation of RNA velocity, integration of single-cell datasets from multiple batches and modalities, and analysis of cell-cell interaction using the cell spatial location data.

Please refer to the following vignettes for examples and tutorials.

Alternatively, you can also browse the documentation and vignettes after installing the package.

Overview

Please Cite

Li, H., Zhang, Z., Squires, M., Chen, X., & Zhang, X. (2023). scMultiSim: simulation of multi-modality single cell data guided by cell-cell interactions and gene regulatory networks. BioRxiv, 2022.10.15.512320. https://doi.org/10.1101/2022.10.15.512320

Results

The following figure briefly shows results from the same cell differential tree:

  1. Connected scATAC-seq and scRNA-seq, in continuous or discrete mode. Visualized by t-SNE.
  2. GRN correlation heatmap, where genes regulated by the same regulator have similar correlations with others.
  3. Unspliced counts and RNA velocity ground truth visualized by t-SNE.
  4. Spatial cell locations and cell-cell interaction ground truth.
  5. Discrete cell population with added batch effects.

Results

Installation

scMultiSim can be installed from BioConductor using the following command:

if (!require("BiocManager")) {
  install.packages("BiocManager")
}
BiocManager::install("scMultiSim")

Alternatively, you can install the latest version from GitHub:

devtools::install_github("ZhangLabGT/scMultiSim")

Installing from the main branch

scMultiSim is still under active development. This branch is for the BioConductor submission, therefore, to ensure stability, it may not contain the latest changes. To reproduce the results in the paper, please use the main branch.

Go to the main branch.

devtools::install_github("ZhangLabGT/scMultiSim@main")

Basic Workflow

scMultiSim provides several example GRN and trajectory data:

A typical workflow consists these three main steps:

  1. Simulate the true count:
    results <- sim_true_counts(list2(
       GRN = GRN_params_100,
       tree = Phyla5(),
       num.cells = 1000,
       # optional options
       num.cif = 50,
       discrete.cif = F,
       cif.sigma = 1,
       do.velocity = T,
       # ... other options
    ))
  2. Add technical noise to the dataset:
    add_expr_noise(results)
  3. Add batch effects:
    divide_batches(results, nbatch = 4)

Options

scMultiSim requires users to provide the following options:

Typically, you may also want to adjust the following options to control the basic cell population:

Here, we provide a quick guide on adjusting the effect of each biological factors:

Options: General

rand.seed

integer (default: 0)

scMultiSim should produce the same result if all other parameters are the same.

threads

integer (default: 1)

Use multithreading only when generating the CIF matrix. It will not speed up the simulation a lot, thus not recommended.

Options: Genes

GRN

A data frame with 3 columns as below. Supply NA to disable the GRN effect. (required)

Column Value
1 target gene ID: integer or character;
2 regulator gene ID: integer or character;
3 effect: number.

If num.genes presents, the gene IDs should not exceed this number. The gene IDs should start from 1 and should not ship any intermidiate numbers.

Two sample datasets GRN_params_100 and GRN_params_1000 from Dibaeinia, P., & Sinha, S. (2020) are provided for testing and inspection.

num.genes

integer (default: NULL)

If a GRN is supplied, override the total number of genes. It should be larger than the largest gene ID in the GRN. Otherwise, the number of genes will be determined by N_genes * (1 + r_u), where r_u is unregulated.gene.ratio.

If GRN is disabled, this option specifies the total number of genes.

unregulated.gene.ratio

number > 0 (default: 0.1)

Ratio of unreulated to regulated genes. When a GRN is supplied with N genes, scMultiSim will simulate N * r_u extra (unregulated) genes.

giv.mean, giv.sd, giv.prob

(default: 0, 1, 0.3)

The parameters used to sample the GIV matrix. With probability giv.prob, the value is sampled from N(giv.mean, giv.sd). Otherwise the value is 0.

dynamic.GRN

list (default: NULL)

Enables dynamic (cell-specific GRN). Run scmultisim_help("dynamic.GRN") to see more explaination.

hge.prop, hge.mean, hge.sd

(default: 0, 5, 1)

Treat some random genes as highly-expressed (house-keeping) genes. A proportion of hge.prop genes will have expression scaled by a multiplier sampled from N(hge.mean, hge.sd).

hge.range

integer (default: 1)

When selecting highly-expressed genes, only choose genes with ID > hge.range.

hge.max.var

number (default: 500)

When selecting highly-expressed genes, only choose genes with variation < hge.max.var.

Options: Cells

num.cells

integer (default: 1000)

The number of cells to be simulated.

tree

phylo (default: Phyla5())

The cell differential tree, which will be used to generate cell trajectories (if discrete.cif = T) or clusters (if discrete.cif = F). In discrete population mode, only the tree tips will be used. Three demo trees, Phyla5(), Phyla3() and Phyla1(), are provided.

discrete.cif

logical (default: FALSE)

Whether the cell population is discrete (continuous otherwise).

discrete.min.pop.size, discrete.min.pop.index

integer, integer (default: 70, 1)

In discrete population mode, specify one cluster to have the smallest cell population. The cluster will contain discrete.min.pop.size cells. discrete.min.pop.index should be a valid cluster index (tree tip number).

discrete.pop.size

integer vector (default: NA); e.g. c(200, 250, 300)

Manually specify the size of each cluster.

Options: CIF

num.cifs

integer (default: 50)

Total number of differential and non-differential CIFs, which can be viewed as latent representation of cells.

diff.cif.fraction

number (default: 0.9)

Fraction of differential CIFs. Differential CIFs encode the cell type information, while non-differential CIFs are randomly sampled for each cell.

cif.center, cif.sigma

(default: 1, 0.1)

The distribution used to sample CIF values.

use.impulse

logical (default: FALSE)

In continuous population mode, when sampling CIFs along the tree, use the impulse model rather than the default gaussian random walk.

Options: Simulation - ATAC

atac.effect

number ∈ [0, 1] (default: 0.5)

The influence of chromatin accessability data on gene expression.

region.distrib

vector of length 3, should sum to 1 (default: c(0.1, 0.5, 0.4))

The probability that a gene is regulated by 0, 1, 2 consecutive regions, respectively.

atac.p_zero

number ∈ [0, 1] (default: 0.8)

The proportion of zeros we see in the simulated scATAC-seq data.

riv.mean, riv.sd, riv.prob

(default: 0, 1, 0.3)

The parameters used to sample the RIV (Region Identity Vectors). With probability riv.prob, the value is sampled from N(riv.mean, riv.sd). Otherwise the value is 0.

Options: Simulation - RNA

do.velocity

logical (default: FALSE)

When set to TRUE, simulate using the full kinetic model and generate RNA velocity data. Otherwise, the Beta-Poission model will be used.

beta

number (default: 0.4)

The splicing rate of each gene in the kinetic model.

d

number (default: 1)

The degradation rate of each gene in the kinetic model.

num.cycles

number (default: 3)

The number of cycles run before sampling the gene expression of a cell.

cycle.len

number (default: 1)

In velocity mode, a multiplier for the cell cycle length. It is multiplied by the expected time to transition from k_on to k_off and back to form the the length of a cycle.

Options: Simulation - Spatial Cell-Cell Interaction

cci

Enables cell-cell interaction. See scmultisim_help("cci") for details.

Technical Noise and Batch Effects

add_expr_noise

Options:

divide_batches

Options:

Output

scMultiSim returns an environment with the following fields:

If do.velocity is enabled, it has these additional fields:

If dynamic GRN is enabled, it has these additional fields:

If cell-cell interaction is enabled, it has these additional fields:

If it is a debug session (debug = TRUE), a sim field is available, which is an environment contains all internal states and data structures.

Reference

Hechen Li, Ziqi Zhang, Michael Squires, Xi Chen, and Xiuwei Zhang. 2023. “scMultiSim: Simulation of Multi-Modality Single Cell Data Guided by Cell-Cell Interactions and Gene Regulatory Networks.” bioRxiv.

FAQ

Running Speed

Simulations should finish in a reasonable time in most cases. On a machine with an i7-12700K CPU and 64GB RAM, using 1000 cells, 100 genes and 50 CIFs, the simulation took under 1 mimute to generate both scRNA-seq and scATAC-seq data. If also generating unspliced and spliced counts, or enabling cell-cell interactions, the running time is longer (~3 minutes when RNA velocity is enabled, and 30 minutes for 500 cells with spatial cell-cell interaction enabled).

Contact

GitHub issues are welcomed. It is also possible to send email to the main author Hechen Li (hli691 at gatech.edu).