PayamDiba / SERGIO

A simulator for single-cell expression data guided by gene regulatory networks
GNU General Public License v3.0

SERGIO (Single-cell ExpRession of Genes In silicO)


SERGIO v1.0.0

Saurabh Sinha’s Lab, University of Illinois at Urbana-Champaign (Sinha Lab)

Developed by Payam Dibaeinia

Description

SERGIO is a simulator for single-cell expression data guided by gene regulatory networks. A command-line, easy-to-use version of SERGIO will soon be uploaded to PyPI. This documentation covers using SERGIO v1.0.0 as a Python module.

Dependencies

Python >= 2.7.14

numpy >= 1.13.3

scipy >= 1.1.0

networkx >= 2.0

The tool has been successfully tested on macOS Sierra (v10.12.6) and Scientific Linux 6.9.

Getting Started

To download SERGIO, clone the repository via the following command (should take < 1 minute):

git clone https://github.com/PayamDiba/SERGIO

Usage

run_sergio.ipynb is a Jupyter notebook that runs SERGIO for steady-state and differentiation simulations and adds technical noise. A version of SERGIO with an easier interface for simulations and adding technical noise will soon be uploaded to PyPI.

Simulating Clean Data

A synthetic data set can be simulated in four lines of Python code:

  1. An instance of the SERGIO simulator is constructed as follows:
import numpy as np
from sergio import sergio
sim = sergio(number_genes, number_bins, number_sc, noise_params,
    noise_type, decays, dynamics, sampling_state, dt, bifurcation_matrix, 
    noise_params_splice, noise_type_splice, splice_ratio, dt_splice)
  2. The GRN structure and the master regulators’ profiles are fed into the simulator by invoking the build_graph method:
sim.build_graph(input_file_taregts, input_file_regs, shared_coop_state)
Note: Before preparing the input files, name all gene IDs in the GRN (both master regulators and non-master regulators) using zero-based numerical indices. For example, if there are 10 genes in the GRN, name them 0 through 9.
  3. To run steady-state simulations, invoke the simulate method:
    sim.simulate()

To run differentiation simulations, invoke the simulate_dynamics method:

sim.simulate_dynamics()
  4. To get the clean simulated expression matrix after steady-state simulations, invoke the getExpressions method:
    expr = sim.getExpressions()

This returns a 3d numpy array of size (#cell_types * #genes * #cells_per_type). To convert it into a 2d matrix of size (#genes * #cells), do:

expr = np.concatenate(expr, axis = 1)

Now each row represents a gene and each column represents a simulated single cell. Gene IDs match their row index in this expression matrix. Cell types are grouped by columns: the first #cells_per_type columns correspond to the first simulated cell type, the next #cells_per_type columns correspond to the second cell type, and so on.

To get the clean simulated expression matrix after differentiation simulations, invoke the getExpressions_dynamics method:

exprU, exprS = sim.getExpressions_dynamics()

This returns two 3d numpy arrays of size (#cell_types * #genes * #cells_per_type), one for unspliced (exprU) and one for spliced (exprS) transcripts. To convert them into 2d matrices of size (#genes * #cells), do:

exprU = np.concatenate(exprU, axis = 1)
exprS = np.concatenate(exprS, axis = 1)

Now each row represents a gene and each column represents a simulated single cell. Gene IDs match their row index in these expression matrices. Cell types are grouped by columns: the first #cells_per_type columns correspond to the first simulated cell type, the next #cells_per_type columns correspond to the second cell type, and so on.
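
The reshaping described above can be checked on a stand-in array without running a simulation. The sketch below assumes a random 3d array in place of the output of getExpressions; the names n_types, n_genes, n_cells_per_type and cell_type_labels are illustrative and not part of SERGIO. It flattens the array and builds a per-column cell-type label vector that follows the column grouping described above.

```python
import numpy as np

# Hypothetical dimensions; a real run takes these from the simulator settings.
n_types, n_genes, n_cells_per_type = 3, 5, 4

# Stand-in for the 3d array returned by getExpressions (random values, not simulated data).
rng = np.random.RandomState(0)
expr = rng.rand(n_types, n_genes, n_cells_per_type)

# Join the per-type (genes x cells) blocks side by side along the cell axis.
expr_2d = np.concatenate(expr, axis=1)

# Column j belongs to cell type j // n_cells_per_type.
cell_type_labels = np.repeat(np.arange(n_types), n_cells_per_type)

print(expr_2d.shape)      # (5, 12)
print(cell_type_labels)   # [0 0 0 0 1 1 1 1 2 2 2 2]
```

The same pattern would apply to exprU and exprS, since both share the 3d layout.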

Adding Technical Noise

SERGIO can add three types of technical noise (outlier genes, library size, and dropouts) to the clean simulated data. These noise modules can be invoked in any combination and order. A fourth module converts an expression matrix to an mRNA count matrix. All of these modules operate on the 3d expression matrix (not the 2d concatenated version).

First use SERGIO to simulate a clean data set and obtain the 3d expression matrix:
In steady-state simulations:

expr = sim.getExpressions()

In differentiation simulations:

exprU, exprS = sim.getExpressions_dynamics()

Here we show how to add outlier genes, followed by library size effects and then dropouts. Please refer to the manuscript for the definitions of the input parameters to each of the noise modules:

  1. Outlier Genes:

In steady-state simulations invoke the outlier_effect method:

expr_O = sim.outlier_effect(expr, outlier_prob, mean, scale)

In differentiation simulations invoke the outlier_effect_dynamics method:

exprU_O, exprS_O = sim.outlier_effect_dynamics(exprU, exprS, outlier_prob, mean, scale)
  2. Library Size:

In steady-state simulations invoke the lib_size_effect method:

expr_O_L = sim.lib_size_effect(expr_O, mean, scale)

In differentiation simulations invoke the lib_size_effect_dynamics method:

exprU_O_L, exprS_O_L = sim.lib_size_effect_dynamics(exprU_O, exprS_O, mean, scale)
  3. Dropouts:

In steady-state simulations invoke the dropout_indicator method:

binary_ind = sim.dropout_indicator(expr_O_L, shape, percentile)
expr_O_L_D = np.multiply(binary_ind, expr_O_L)

In differentiation simulations invoke the dropout_indicator_dynamics method:

binary_indU, binary_indS = sim.dropout_indicator_dynamics(exprU_O_L, exprS_O_L, shape, percentile)
exprU_O_L_D = np.multiply(binary_indU, exprU_O_L)
exprS_O_L_D = np.multiply(binary_indS, exprS_O_L)
  4. mRNA Count Matrix:

In steady-state simulations invoke the convert_to_UMIcounts method:

count_matrix = sim.convert_to_UMIcounts(expr_O_L_D)

In differentiation simulations invoke the convert_to_UMIcounts_dynamics method:

count_matrix_U = sim.convert_to_UMIcounts_dynamics(exprU_O_L_D)
count_matrix_S = sim.convert_to_UMIcounts_dynamics(exprS_O_L_D)

The output of each of these modules, including the "count matrix conversion" module, is a 3d numpy array of size (#cell_types * #genes * #cells_per_type). To convert it into a 2d expression matrix, invoke numpy.concatenate as shown before.
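
As a shape check for the steps above, the sketch below applies a binary mask to a stand-in 3d array and then flattens the result. The mask is drawn from a Bernoulli distribution purely for illustration; this is an assumption and not SERGIO's dropout model, whose indicator comes from dropout_indicator. The input array is random data, not a simulated expression matrix.

```python
import numpy as np

rng = np.random.RandomState(1)

# Stand-in for a noisy 3d expression array (#cell_types, #genes, #cells_per_type).
expr_O_L = rng.gamma(shape=2.0, scale=1.0, size=(2, 4, 3))

# Illustrative binary indicator with the same shape (1 = keep, 0 = dropped).
# Assumption: a real indicator would come from sim.dropout_indicator.
binary_ind = rng.binomial(n=1, p=0.7, size=expr_O_L.shape)

# The element-wise product zeroes the dropped entries and keeps the 3d shape.
expr_O_L_D = np.multiply(binary_ind, expr_O_L)

# Flatten to (#genes, #cells) only after all noise modules have been applied.
expr_2d = np.concatenate(expr_O_L_D, axis=1)

print(expr_O_L_D.shape)  # (2, 4, 3)
print(expr_2d.shape)     # (4, 6)
```

Keeping the data in 3d form until the last step matters because the noise modules expect the per-cell-type layout.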

Repository Contents