SONGDONGYUAN1994 / scDesign3

scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics
https://songdongyuan1994.github.io/scDesign3/docs/index.html
MIT License
86 stars 24 forks source link

Howt to generate spatially invariant genes #43

Closed rocketeer1998 closed 6 months ago

rocketeer1998 commented 6 months ago

Hi @SONGDONGYUAN1994 @qw130 , thanks for your contribution to scDesign3! I have an interesting question to discuss: Do you have any ideas for generating spatially invariant genes when simulating new spatial datasets? This is helpful to benchmark new spatially variable genes (SVGs) algorithms against SVGs detection accuracy after running each software and calculating AUC curve.

SONGDONGYUAN1994 commented 6 months ago

@rocketeer1998 Hi Jiayu, Please check this turorial: https://songdongyuan1994.github.io/scDesign3/docs/articles/scDesign3-DEanalysis-vignette.html

In this paper, another group independently uses scDesign3 for generating spatially variable/invariable genes.

Best, Dongyuan

rocketeer1998 commented 6 months ago

Thanks for your quick response! How did you pre-select the top spatial variable genes? I see that you chose top 200 genes to save time, on what basis?

rocketeer1998 commented 6 months ago

I've followed your instructions on above mentioned tutorial. But it seems like there is a memory error when I run in fit copula step. Below is my code and session info.

code

scDesign3_simulator <- function(h5ad_path,dataset_name){

  t1 <- Sys.time()

  ### read matrix
  anndata <- read_h5ad(h5ad_path, backed = 'r')
  mx <- anndata$X
  mx <- t(mx) # row as gene 
  mx <- Matrix(mx, sparse = T)

  ### read coord
  coord <- as.data.frame(anndata$obsm$spatial[,1:2])
  colnames(coord) <- c('X', 'Y')

  ### create SCE object
  example_sce <- SingleCellExperiment(list(counts = mx), 
                                      colData = DataFrame(spatial1 = coord$X, spatial2 = coord$Y))

  ### read pre-selected SVG and nSVG results to filter genes to speed up 
  svg_nsvg <- read.csv('Results/8-Simulate_SVG_nSVG/SVG_nSVG.csv') %>% filter(dataset == dataset_name)
  example_sce <- example_sce[svg_nsvg$gene,]

  ### remove mitochondrial genes
  mt_idx<- grep("mt-",rownames(example_sce))
  if(length(mt_idx)!=0){
    example_sce   <- example_sce[-mt_idx,]
  }

  ### scDesign3 step-by-step simulation
  set.seed(1)

  ### construct input dataset
  print(Sys.time())
  print("Step1 construct input dataset")
  example_data <- construct_data(
    sce = example_sce,
    assay_use = "counts",
    celltype = NULL,
    pseudotime = NULL,
    spatial = c("spatial1", "spatial2"),
    other_covariates = NULL,
    corr_by = "1"
  )

  ### fit regression models for each feature based on specific specification
  print(Sys.time())
  print("Step2 fit regression models for each feature based on specific specification")
  example_marginal <- fit_marginal(
    data = example_data,
    predictor = "gene",
    mu_formula = "s(spatial1, spatial2, bs = 'gp', k= 50)",
    sigma_formula = "1",
    family_use = "nb",
    n_cores = 20,
    usebam = FALSE
  )

  ### fit a copulam obtain AIC and BIC
  print(Sys.time())
  print("Step3 fit a copulam obtain AIC and BIC")
  example_copula <- fit_copula(
    sce = example_sce,
    assay_use = "counts",
    marginal_list = example_marginal,
    family_use = "nb",
    copula = "gaussian",
    n_cores = 20,
    input_data = example_data$dat
  )

  ### extract parameters
  print(Sys.time())
  print("Step4 extract parameters")
  example_para <- extract_para(
    sce = example_sce,
    marginal_list = example_marginal,
    n_cores = 20,
    family_use = "nb",
    new_covariate = example_data$newCovariate,
    data = example_data$dat
  )

  ### select SVG and non SVG
  dev_explain <- sapply(example_marginal, function(x){
    sum = summary(x$fit)
    return(sum$dev.expl)
  })
  dev_ordered <- order(dev_explain, decreasing = TRUE)
  num_de <- 100
  ordered <- dev_explain[dev_ordered]
  de_idx <- names(ordered)[1:num_de]
  non_de_idx <- names(ordered)[-(1:num_de)]
  non_de_mat <- apply(example_para$mean_mat[,non_de_idx], 2, function(x){
    avg <- (max(x)+min(x))/2
    new_mean <- rep(avg, length(x))
    return(new_mean)
  })
  example_para$mean_mat[,non_de_idx] <- non_de_mat

  ### simulate new data
  set.seed(1)
  example_newcount <- simu_new(
    sce = example_sce,
    mean_mat = example_para$mean_mat,
    sigma_mat = example_para$sigma_mat,
    zero_mat = example_para$zero_mat,
    quantile_mat = NULL,
    copula_list = example_copula$copula_list,
    n_cores = 1,
    family_use = "nb",
    input_data = example_data$dat,
    new_covariate = example_data$newCovariate,
    important_feature = rep(TRUE, dim(example_sce)[1]),
    filtered_gene = example_data$filtered_gene
  )

  simu_sce <- SingleCellExperiment(list(counts = example_newcount), colData = example_data$newCovariate)
  logcounts(simu_sce) <- log1p(counts(simu_sce))

  t2 <- Sys.time()

  return(list(SVG_index = de_idx, nSVG_index = non_de_idx, new_data = simu_sce, time = t2-t1))

}
result1 <- scDesign3_simulator(h5ad_path = MyData.h5ad',
                               dataset_name = 'MyData')

When I ran fit_copula, it threw me an error Error in serialize(data, node$con) : 链结书写发生了错误. I've tested my pipeline in a small datasets and it worked. This dataset has ~2700 spots and ~12000 genes. Is it a computer memory problem?

session info

image

SONGDONGYUAN1994 commented 6 months ago

Hi Jiayu, That sounds like a run-out-of-memory issue. The copula fitting can be space expensive since it will estimate a gene x gene corr matrix. One solution is to give up the copula part if you only care about marginal SVG, e.g., set corr_by = 'ind' in construct_data().

rocketeer1998 commented 6 months ago

I've made up my mind to upgrade my PC to a Linux server! Thanks a lot!