Katsevich-Lab / sceptre

An R package for single-cell CRISPR screen data analysis emphasizing statistical rigor, massive scalability, and ease of use.
https://katsevich-lab.github.io/sceptre/
GNU General Public License v3.0
26 stars 8 forks source link

run_sceptre_in_memory and run_sceptre_gRNA_gene_pair results are uncorrelated #5

Closed LucasSilvaFerreira closed 3 years ago

LucasSilvaFerreira commented 3 years ago

Hi, I used some data simulated by the Moderately-sized analysis tutorial to run in the run_sceptre_gRNA_gene_pair. To my surprise the results of the function run_sceptre_in_memory is totally uncorrelated with the results generated by run_sceptre_gRNA_gene_pair. Even using the same seeds.

Should I expect some difference in these results?

timothy-barry commented 3 years ago

Hi,

It would be helpful if you included a reproducible example, but I can handle it time time around :)

Please see the code below, where I compute example p-values using run_sceptre_in_memory and run_sceptre_gRNA_gene_pair. You will see that the p-values produced by these two functions coincide exactly.

# download sceptre
devtools::install_github("katsevich-lab/sceptre")

# load packages
library(dplyr)
library(sceptre)

# generate a small, random dataset
set.seed(4)
n_genes <- 40
n_gRNAs <- 50
n_cells <- 5000
# gene expression matrix 
expression_matrix <- replicate(n = n_genes, rpois(n_cells, 1)) %>% t()
# perturbation matrix
perturbation_matrix <- replicate(n = n_gRNAs, rbinom(n_cells, 1, 0.05)) %>% t()
# assign row names
row.names(expression_matrix) <- paste0("gene", seq(1, nrow(expression_matrix)))
row.names(perturbation_matrix) <- paste0("gRNA", seq(1, nrow(perturbation_matrix)))
# simulate batch
batch <- sample(x = c("batch1", "batch2"), size = n_cells, replace = TRUE, prob = c(0.5, 0.5))
# compute log mRNA library size, construct covariate matrix
lg_mRNA_lib_size <- log(colSums(expression_matrix))
# remove cells with 0 expression (if any exist)
zero_idxs <- lg_mRNA_lib_size == 0
lg_mRNA_lib_size <- lg_mRNA_lib_size[!zero_idxs]
expression_matrix <- expression_matrix[,!zero_idxs]
batch <- batch[!zero_idxs]
# construct covariate matrix
covariate_matrix <- data.frame(lg_mRNA_lib_size = lg_mRNA_lib_size,
                               batch = factor(batch))
gene_gRNA_pairs <- expand.grid(gene_id = row.names(expression_matrix),
                               gRNA_id = row.names(perturbation_matrix)) %>% slice_sample(n = 10)
# set a couple additional arguments
storage_dir <- tempdir()
pod_sizes <- c(gene = 2, gRNA = 2, pair = 3)

# run scale function
result <- run_sceptre_in_memory(storage_dir = storage_dir,
                                expression_matrix = expression_matrix,
                                perturbation_matrix = perturbation_matrix,
                                covariate_matrix = covariate_matrix,
                                gene_gRNA_pairs =  gene_gRNA_pairs,
                                side = "left",
                                pod_sizes = pod_sizes,
                                regularization_amount = 0,
                                seed = 1,
                                B = 1000)
scale_p_vals <- result$p_value

# call the demo function, looping over the pairs
demo_p_vals <- sapply(seq(1L, nrow(gene_gRNA_pairs)), function(i) {
  curr_gene <- as.character(gene_gRNA_pairs[[i, "gene_id"]])
  curr_gRNA <- as.character(gene_gRNA_pairs[[i, "gRNA_id"]])
  curr_expressions <- expression_matrix[curr_gene,]
  curr_indicators <- perturbation_matrix[curr_gRNA,]
  res <- run_sceptre_gRNA_gene_pair(expressions = curr_expressions,
                                    gRNA_indicators = curr_indicators,
                                    covariate_matrix = covariate_matrix,
                                    side = "left",
                                    B = 1000,
                                    seed = 1,
                                    reduced_output = TRUE)
  return(res$p_value)
})

# verify p-values coincide
identical(scale_p_vals, demo_p_vals)

Are you certain that the arguments that you are passing to these functions are identical? For example, as noted in the documentation,run_sceptre_gRNA_gene_pair does not apply any regularization, thereby implicitly setting regularization_amount to zero. By contrast, the regularization_amount argument in run_sceptre_in_memory defaults to 0.1.

LucasSilvaFerreira commented 3 years ago

Hi Tim, I had a nice result when comparing a simulated example using side='left'. I tried to run the same comparison using the side parameter in the run_sceptre_in_memory = "both" , but the results are diverging.

I did notice that your side parameters is hardcoded to 'left' in your run_sceptre_in_memory implementation.

run at-scale analysis

foreach(pod_id = seq(1, dicts[["pairs"]])) %dopar% { run_gRNA_gene_pair_analysis_at_scale(pod_id = pod_id, gene_precomp_dir = dirs[["gene_precomp_dir"]], gRNA_precomp_dir = dirs[["gRNA_precomp_dir"]], results_dir = dirs[["results_dir"]], log_dir = dirs[["log_dir"]], expression_matrix = expression_matrix, perturbation_matrix = perturbation_matrix, covariate_matrix = covariate_matrix, regularization_amount = regularization_amount, side = "left", seed = seed, B = B) }

collect results

results <- collect_results(results_dir = dirs[["results_dir"]]) return(results) }

timothy-barry commented 3 years ago

Thanks, Lucas. This indeed was a bug. I fixed it. You should be able to download and install the latest version of the package from Github.

LucasSilvaFerreira commented 3 years ago

Thanks