Individual level Differential Expression Analysis for Single cells

This is the R package for differential expression analysis using single-cell RNA-seq (scRNA-seq) data from multiple individuals. The inputs are scRNA-seq data and cell-level and/or individual-level covariates, and the outputs are p-values for all genes tested. This project is licensed under the terms of the MIT license.

Figure: An overview of the IDEAS pipeline, illustrated with a toy example of 2 cases and 3 controls, with 2 or 3 cells per individual.

Installation

To install this package in R, use

# run install.packages("devtools") first if devtools is not yet installed
library(devtools)
install_github("Sun-lab/ideas")

Usage

Here is example code to run IDEAS on simulated data. First load the libraries and the simulated data. We use 100 genes for illustration. The complete code can be found here.

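# IDEAS and the packages used for parallel computation
library(ideas)
library(foreach)
library(doRNG)
RNGkind("L'Ecuyer-CMRG")     # RNG suited to reproducible parallel streams
library(doParallel)
registerDoParallel(cores=6)  # adjust to the number of cores available

# load the simulated data
simu_data_rds = "sim_data_ncase_10_nctrl_10_ncell_120_fold_mean_1.2_var_1.5.rds"
sim_data      = readRDS(paste0("data/", simu_data_rds))

# keep the first 100 genes for illustration
count_matrix = sim_data$count_matrix[1:100,]
meta_cell    = sim_data$meta_cell
meta_ind     = sim_data$meta_ind

# variable to test, covariate to adjust for, and per-cell variable
var2test      = "phenotype"
var2adjust    = "RIN"
var2test_type = "binary"
var_per_cell  = "cell_rd"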
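
To make the expected shape of the three inputs concrete, the following base R sketch builds a tiny toy data set and checks that the pieces line up. It does not use the simulated data above. The column names `cell_id` and `individual`, the `toy_` object names, and the choice of total count per cell for `cell_rd` are assumptions made for illustration; consult the package documentation for the exact columns that `ideas_dist` requires.

```r
# Toy inputs with the same overall layout as the objects loaded above.
# Column names cell_id / individual are assumptions for illustration.
set.seed(1)
n_ind          <- 4   # individuals
n_cell_per_ind <- 3   # cells per individual
n_gene         <- 5   # genes
n_cell         <- n_ind * n_cell_per_ind

# one row per individual: the variable to test and the covariate to adjust for
toy_meta_ind <- data.frame(
  individual = paste0("ind", seq_len(n_ind)),
  phenotype  = rep(c(0, 1), each = n_ind / 2),
  RIN        = round(runif(n_ind, 6, 9), 1)
)

# genes in rows, cells in columns
toy_counts <- matrix(
  rpois(n_gene * n_cell, lambda = 2), nrow = n_gene,
  dimnames = list(paste0("gene", seq_len(n_gene)),
                  paste0("cell", seq_len(n_cell)))
)

# one row per cell; cell_rd is taken here to be the total count per cell
toy_meta_cell <- data.frame(
  cell_id    = colnames(toy_counts),
  individual = rep(toy_meta_ind$individual, each = n_cell_per_ind),
  cell_rd    = unname(colSums(toy_counts))
)

# sanity checks that are worth running before any analysis
stopifnot(
  ncol(toy_counts) == nrow(toy_meta_cell),
  all(toy_meta_cell$cell_id == colnames(toy_counts)),
  all(toy_meta_cell$individual %in% toy_meta_ind$individual),
  all(c("phenotype", "RIN") %in% names(toy_meta_ind)),
  "cell_rd" %in% names(toy_meta_cell)
)
```

The same checks can be applied to `count_matrix`, `meta_cell`, and `meta_ind` from the simulated data, provided the assumed column names match.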

Next, run the analysis in two steps. First calculate the distance array with the function ideas_dist, and then evaluate the p-values with the function permanova.

# step 1: distances between individuals, computed for each gene
dist1 = ideas_dist(count_matrix, meta_cell, meta_ind,
                   var_per_cell, var2test, var2test_type,
                   d_metric = "Was", fit_method = "nb")

# step 2: permutation-based p-value for each gene
pval_ideas = permanova(dist1, meta_ind, var2test, var2adjust,
                       var2test_type, n_perm=999, r.seed=903)

As the usage example above shows, the two functions that users need are ideas_dist, which calculates distances across all individuals, and permanova, which calculates the testing p-values given the distance array. Here we give a brief description of the inputs and outputs of these two functions.

ideas_dist

The output of ideas_dist is a three-dimensional array, with the first dimension for the number of genes and the next two dimensions for the number of individuals. For example, if we study 1000 genes and 20 individuals, it is an array of dimension 1000 x 20 x 20. Parameters of ideas_dist that users often need to set include d_metric and fit_method, as in the usage example above.
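
As a rough illustration of this layout, the base R sketch below builds a stand-in array of dimension 1000 x 20 x 20 and pulls out the distance matrix for one gene. The array `toy_dist` is filled with arbitrary Euclidean distances between random numbers; it is a hypothetical placeholder, not the output of ideas_dist.

```r
# Stand-in for the output of ideas_dist: genes x individuals x individuals.
set.seed(1)
n_gene <- 1000
n_ind  <- 20

toy_dist <- array(NA_real_, dim = c(n_gene, n_ind, n_ind))
for (g in seq_len(n_gene)) {
  # arbitrary pairwise distances between individuals for gene g
  toy_dist[g, , ] <- as.matrix(dist(rnorm(n_ind)))
}

dim(toy_dist)

# the slice for one gene is an individual-by-individual distance matrix
gene1 <- toy_dist[1, , ]
stopifnot(isSymmetric(gene1), all(diag(gene1) == 0))
```

Indexing the first dimension, as in `toy_dist[1, , ]`, is how a single gene's distance matrix would be inspected under this layout.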

permanova

permanova takes the distance array as input, and its output is a vector of p-values, one for each gene. Most other inputs of permanova are the same as the inputs for ideas_dist, such as the information for individuals (meta_ind), var2test, and var2test_type.
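
To give a feel for how a permutation p-value can be obtained from a distance matrix, here is a simplified base R sketch for a single gene and a binary group label. It is not the package's implementation: it ignores covariate adjustment (var2adjust), handles one distance matrix only, and uses a standard pseudo-F statistic together with the common (1 + count) / (n_perm + 1) convention for the p-value. The function name `pseudo_F` and the toy data are hypothetical.

```r
# Pseudo-F statistic computed from a distance matrix D and group labels.
pseudo_F <- function(D, group) {
  n  <- nrow(D)
  D2 <- D^2
  # total sum of squares from all pairwise squared distances
  ss_total <- sum(D2[upper.tri(D2)]) / n
  # within-group sum of squares, accumulated group by group
  ss_within <- 0
  for (g in unique(group)) {
    idx <- which(group == g)
    sub <- D2[idx, idx, drop = FALSE]
    ss_within <- ss_within + sum(sub[upper.tri(sub)]) / length(idx)
  }
  a <- length(unique(group))
  ((ss_total - ss_within) / (a - 1)) / (ss_within / (n - a))
}

# toy data: 10 controls and 10 cases, with a shift between the groups
set.seed(903)
n_per_group <- 10
group <- rep(c(0, 1), each = n_per_group)
x     <- c(rnorm(n_per_group, mean = 0), rnorm(n_per_group, mean = 3))
D     <- as.matrix(dist(x))

# observed statistic and its permutation null distribution
f_obs   <- pseudo_F(D, group)
n_perm  <- 999
f_perm  <- replicate(n_perm, pseudo_F(D, sample(group)))
p_value <- (1 + sum(f_perm >= f_obs)) / (n_perm + 1)
p_value
```

With Euclidean distances on a single variable, as in this toy case, the pseudo-F statistic coincides with the usual one-way ANOVA F statistic, which gives a convenient check on the function.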

Note on input for dca_direct

One of the inputs for dca_direct is the mean_norm.tsv output from DCA. DCA versions before 2021 produced mean_norm.tsv as one of their direct output files, but the version as of September 2021 no longer provides it. The command line for running DCA as of September 2021 is here. mean_norm.tsv therefore needs to be reconstructed from the mean.tsv output of DCA and the original count matrix, which can be done using this code.

Citation

Zhang, M., Liu, S., Miao, Z., Han, F., Gottardo, R., Sun, W. (2022). IDEAS: individual level differential expression analysis for single-cell RNA-seq data. Genome Biology, 23(1), 1-17.