dmcable / spacexr

Spatial-eXpression-R: Cell type identification (including cell type mixtures) and cell type-specific differential expression for spatial transcriptomics
GNU General Public License v3.0
275 stars 67 forks source link

cell-type specific differential expression analysis among samples #100

Closed Feng-Zhang closed 1 year ago

Feng-Zhang commented 1 year ago

I am wondering whether we can run cell-type specific differential expression analysis among samples, rather than in one slice? Let's say we have 8 slices, of which 4 for control, and 4 for case. How to run differential expression analysis for same cell types between 4 control and 4 case? We finally can obtain differential expression genes in same cell types, not the differential expression genes in different cell types?

dmcable commented 1 year ago

C-SIDE has a feature to run DE across multiple samples, the function CSIDE.population.inference. It will estimate a population mean and standard error.

We don't have a particular implementation for the case vs. control setting, but I recommend running on cases and controls separately, and then doing a Z-test on the means afterwards (which requires the means and standard errors).

Feng-Zhang commented 1 year ago

@dmcable Thank you for your reply. How can I run CSIDE for each slices without explanatory.variable? Can we define the explanatory.variable as 0 for all spots in control slices and 1 for all spots in case slices?

dmcable commented 1 year ago

To run C-SIDE with just an intercept term, see https://github.com/dmcable/spacexr/issues/87#issuecomment-1192718351

Dylan

Feng-Zhang commented 1 year ago

@dmcable Hi Dylan, There are two questions still confusing me :

  1. How do I set the exvar_list variable when using run.CSIDE.replicates?
    barcodes <- intersect(names(myRCTD@spatialRNA@nUMI), colnames(myRCTD@spatialRNA@counts))
    X <- as.matrix(rep(1, length(barcodes)))
    rownames(X) <- barcodes
    myRCTD <- run.CSIDE(myRCTD, X, barcodes)

    The above example you give used X as a design matrix, however, the elements in exvar_list are integer objects as running run.CSIDE.replicates function blow.

    myRCTD.reps <- run.CSIDE.replicates(myRCTD.reps, exvar_list, cell_types, gene_threshold = .01, 
                                    cell_type_threshold = 5, fdr = 0.25) 
  2. If run.CSIDE.replicates is not suitable in this situation, can I run run.CSIDE(myRCTD, X, barcodes) for each sample one by one, and then using CSIDE.population.inference to obtain cell type specific DEG?
dmcable commented 1 year ago

This is a good point. I have updated the run.CSIDE.replicates function to handle this case. You should set de_mode = 'general', explanatory.variable.replicates = NULL, and X.replicates = X.replicates. X.replicates should be a list of the X matrices. Also, a warning that your parameters are for testing and the recommended default parameters for proper running.

Feng-Zhang commented 1 year ago

@dmcable Hi Dylan, I get an error:

Error in run.CSIDE.replicates(myRCTD_reps, de_mode = "general", explanatory.variable.replicates = NULL,  : 
  run.CSIDE.replicates: explanatory.variable.replicates must be a list of explanatory variable vectors for each replicate.

I guess it is due to this common in run.CSIDE.replicates function:

 if (class(explanatory.variable.replicates) != "list") 
        stop("run.CSIDE.replicates: explanatory.variable.replicates must be a list of explanatory variable vectors for each replicate.")

Could you please fix this small bug when you are free?

dmcable commented 1 year ago

Thanks for pointing this out. The error checking has been fixed now.

Feng-Zhang commented 1 year ago

@dmcable Hi Dylan,

I modified a little bit of your code in the vignette:

library(spacexr)
library(ggplot2)
library(Matrix)
library(doParallel)
prefix <- "spacexr-"
output_prefix <- file.path(work_dir,prefix)
savedir <- 'RCTD_batch_results'
if(!dir.exists(savedir))
  dir.create(savedir)

# load Reference object --------------------
refdir <- system.file("extdata",'Reference/Vignette',package = 'spacexr') # directory for the reference
counts <- read.csv(file.path(refdir,"dge.csv")) # load in counts matrix
rownames(counts) <- counts[,1]; counts[,1] <- NULL # Move first column to rownames
meta_data <- read.csv(file.path(refdir,"meta_data.csv")) # load in meta_data (barcodes, clusters, and nUMI)
cell_types <- meta_data$cluster; names(cell_types) <- meta_data$barcode # create cell_types named list
cell_types <- as.factor(cell_types) # convert to factor data type
nUMI <- meta_data$nUMI; names(nUMI) <- meta_data$barcode # create nUMI named list
reference <- Reference(counts, cell_types, nUMI)

# load SpatialRNA object and desing matrix --------------
spatial_replicates <- NULL
X_replicates <- NULL
samples <- c("Vignette","Vignette_rep2","Vignette_rep3","Vignette_rep4")
for(i in samples){
  # i <- 1
  datadir <- system.file("extdata",'SpatialRNA',i,package = 'spacexr')
  puck <- read.SpatialRNA(datadir, count_file = "MappedDGEForR.csv", coords_file = 'BeadLocationsForR.csv')
  spatial_replicates[[i]] <- puck
  barcodes <- intersect(names(puck@nUMI), colnames(puck@counts))
  X_design <- as.matrix(rep(1, length(barcodes)))
  rownames(X_design) <- barcodes
  X_replicates[[i]] <- X_design
}
group_ids <- c(1,1,1,1)

# run RCTD and CSIDE -----------------------------
myRCTD_reps <- create.RCTD.replicates(spatial_replicates, reference, samples, group_ids = group_ids, max_cores = 6)
myRCTD_reps <- run.RCTD.replicates(myRCTD_reps)
cell_types <- c('10','18')
myRCTD_reps <- run.CSIDE.replicates(myRCTD_reps, de_mode = 'general', explanatory.variable.replicates = NULL, X.replicates = X_replicates,cell_types=cell_types,
                                    gene_threshold = .01, cell_type_threshold = 0, fdr = 0.25)

After running run.CSIDE.replicates, I still have two problems confusing me.

  1. When cell_type_threshold=1, which is larger than 0, it would throw an error below:

    run.CSIDE.replicates: testing CSIDE for errors for replicate 1
    Error in choose_cell_types(myRCTD, barcodes, doublet_mode, cell_type_threshold,  : 
    choose_cell_types: cell types: 10, 18 detected using aggregate_cell_types to have less than the minimum cell_type_threshold of 1. To fix this issue, please remove these cell types or reduce the cell_type_threshold
    Warning message:
    In run.CSIDE.general(myRCTD, X1, X2, barcodes, cell_types, cell_type_threshold = cell_type_threshold,  :
    
    Error in choose_cell_types(myRCTD, barcodes, doublet_mode, cell_type_threshold, : 
    choose_cell_types: cell types: 10, 18 detected using aggregate_cell_types to have less than the minimum cell_type_threshold of 1. To fix this issue, please remove these cell types or reduce the cell_type_threshold

    If I set cell_type_threshold = 0 and keep the cell , myRCTD_reps@population_de_results is list() although no error was throwed.

  2. In this way, how do we design the group to define slice is case or control?

dmcable commented 1 year ago

Your code is correct. There was a bug in run.CSIDE.replciates which is fixed now.

  1. cell_type_threshold is the minimum that each cell type must appear from each cell type. Outside of debugging / testing, I recommend setting it to the default parameter. For the vignette data, it is set to 5. It is not recommended to set it to 1 or 0.

  2. You should run it separately for cases vs controls. C-SIDE will not be aware of whether you are running it for cases or controls.

Feng-Zhang commented 1 year ago

@dmcable Thank you for your reminder. I would use the default parameter on my real data. I have got the CSID results without any error, and the structure of myRCTD_reps is shown as :

> str(myRCTD_reps,max.level = 2)
Formal class 'RCTD.replicates' [package "spacexr"] with 5 slots
  ..@ RCTD.reps               :List of 4
  ..@ population_de_results   : list()
  ..@ population_sig_gene_list: list()
  ..@ population_sig_gene_df  : list()
  ..@ group_ids               : Named num [1:4] 1 1 1 1
  .. ..- attr(*, "names")= chr [1:4] "Vignette" "Vignette_rep2" "Vignette_rep3" "Vignette_rep4"

However, how can I run differential analysis grouped by slices in detail? What do you think I can get started with? As you mentioned above, doing a Z-test on the means and standard errors?

dmcable commented 1 year ago

You can obtain means and standard errors for each gene for each group, as documented within the package.

Then, you can do a Z-test for testing for gene expression differences across groups. The Z-test is directly computable from the C-SIDE output. The Z-statistic is computed is Z = (m1 - m2)/sqrt(s1^2+s2^2). Where m1,m2 are the estimated expression of each cell type, whereas s1,s2 are the standard errors. This can then be converted to a p-value using the normal distribution CDF.

Feng-Zhang commented 1 year ago

@dmcable Sorry for asking the questions below after you closed this issue. I still can't obtain the estimated expression of each type.

  1. when I checked out the estimated expression of each cell type, the myRCTD_reps@RCTD.reps[[4]]@de_results is list().
  2. how can I obtain the estimated expression of each cell type as I have multiple slices? I mean I will have 4 mean_val for each gene of each cell type in the control group and the case group, respectively. How can I get the one estimated expression for the control and case group?
dmcable commented 1 year ago

You need to get the population-level results. This is documented in the spatial transcriptomics vignette within "CSIDE Results": https://raw.githack.com/dmcable/spacexr/master/vignettes/replicates.html

In this case, you want to use all genes including nonsignificant, found here: myRCTD.reps@population_de_results.

Next time, please fully read the documentation before asking questions that are covered within the documentation.

Feng-Zhang commented 1 year ago

Sorry for misleading you. I have read and run this document before. I mean the run.CSIDE.replciates seems not to return these results, and myRCTD.reps@population_de_results and myRCTD_reps@RCTD.reps[[1]]@de_results are empty.

dmcable commented 1 year ago

Ok, thanks for pointing out this issue. There was a couple bugs in the run.CSIDE.replicates, which have now fixed this issue.

Feng-Zhang commented 1 year ago

Thank you for your prompt reply. Now I can obtain the population inference, but myRCTD_reps@population_de_results[["10"]] doesn't return all gene inference. It only has significant gene information as same as elements in myRCTD_reps@population_sig_gene_df. Should it return all gene inferences?

> dim(myRCTD_reps@population_de_results[["10"]])
[1] 10  9
> dim(myRCTD_reps@population_sig_gene_df[["10"]])
[1] 10 17

In addition, I am not familiar with biostatistics, and still don't understand how to draw the z statistic value grouped by slices, even though we can get all mean and standard variance. I also read the one_ct_genes and get_de_pop functions. If I want to define the first two samples as a control group and other samples as a case group and estimate Z-statistic value, using the group_ids <- c(1,1,2,2) and use.groups =T can make it, is it? If yes, how can I pass the use.groups =T to run.CSIDE.replicates function?

head(myRCTD_reps@population_sig_gene_df[["10"]])
      tau log_fc_est     sd_est     Z_est p_cross   ct_prop         expr         q_val             p
Plp1    0  -8.043942 0.32468262 -24.77479       1 1.0000000 2.029427e-05 1.676547e-135 1.676547e-135
Itm2b   0  -5.608008 0.12565190 -44.63130       1 1.0000000 4.975799e-04  0.000000e+00  0.000000e+00
Rora    0  -5.557170 0.14095045 -39.42640       1 1.0000000 2.325036e-03  0.000000e+00  0.000000e+00
Calm2   0  -4.994137 0.08651132 -57.72814       1 0.8599780 5.844512e-04  0.000000e+00  0.000000e+00
Lsamp   0  -4.743579 0.12002038 -39.52312       1 1.0000000 6.584614e-03  0.000000e+00  0.000000e+00
Calm1   0  -4.636973 0.08176442 -56.71138       1 0.8396981 1.895912e-04  0.000000e+00  0.000000e+00
         mean 1    mean 2    mean 3    mean 4      sd 1      sd 2      sd 3      sd 4
Plp1  -8.043942 -8.043942 -8.043942 -8.043942 0.6493652 0.6493652 0.6493652 0.6493652
Itm2b -5.608008 -5.608008 -5.608008 -5.608008 0.2513038 0.2513038 0.2513038 0.2513038
Rora  -5.557170 -5.557170 -5.557170 -5.557170 0.2819009 0.2819009 0.2819009 0.2819009
Calm2 -4.994137 -4.994137 -4.994137 -4.994137 0.1730226 0.1730226 0.1730226 0.1730226
Lsamp -4.743579 -4.743579 -4.743579 -4.743579 0.2400408 0.2400408 0.2400408 0.2400408
Calm1 -4.636973 -4.636973 -4.636973 -4.636973 0.1635288 0.1635288 0.1635288 0.1635288
dmcable commented 1 year ago

There is no concern; it is possible that all genes were considered significant in this case, but that should be ignored. These are all the genes.

No, the use of group_ids is not applicable in your case. You should create two separate RCTD.replicates objects for each group of samples and then compare the mean estimates (log_fc variable in the table) using the Z test formula I provided.

On Thu, Sep 22, 2022 at 10:00 PM fengzhang @.***> wrote:

Thank you for your prompt reply. Now I can obtain the population inference, but @._de_results[["10"]] doesn't return all gene inference. It only has significant gene information as same as elements in @._sig_gene_df. Should it return all gene inferences?

@._de_results[["10"]]) [1] 10 9 @._sig_gene_df[["10"]]) [1] 10 17

In addition, I am not familiar with biostatistics, and still don't understand how to draw the z statistic value grouped by slices, even though we can get all mean and standard variance. I also read the one_ct_genes and get_de_pop functions. If I want to define the first two samples as a control group and other samples as a case group, using the group_ids <- c(1,1,2,2) and use.groups =T can make it, is it? If yes, how can I pass the use.groups =T to run.CSIDE.replicates function?

@.***_sig_gene_df[["10"]]) tau log_fc_est sd_est Z_est p_cross ct_prop expr q_val p Plp1 0 -8.043942 0.32468262 -24.77479 1 1.0000000 2.029427e-05 1.676547e-135 1.676547e-135 Itm2b 0 -5.608008 0.12565190 -44.63130 1 1.0000000 4.975799e-04 0.000000e+00 0.000000e+00 Rora 0 -5.557170 0.14095045 -39.42640 1 1.0000000 2.325036e-03 0.000000e+00 0.000000e+00 Calm2 0 -4.994137 0.08651132 -57.72814 1 0.8599780 5.844512e-04 0.000000e+00 0.000000e+00 Lsamp 0 -4.743579 0.12002038 -39.52312 1 1.0000000 6.584614e-03 0.000000e+00 0.000000e+00 Calm1 0 -4.636973 0.08176442 -56.71138 1 0.8396981 1.895912e-04 0.000000e+00 0.000000e+00 mean 1 mean 2 mean 3 mean 4 sd 1 sd 2 sd 3 sd 4 Plp1 -8.043942 -8.043942 -8.043942 -8.043942 0.6493652 0.6493652 0.6493652 0.6493652 Itm2b -5.608008 -5.608008 -5.608008 -5.608008 0.2513038 0.2513038 0.2513038 0.2513038 Rora -5.557170 -5.557170 -5.557170 -5.557170 0.2819009 0.2819009 0.2819009 0.2819009 Calm2 -4.994137 -4.994137 -4.994137 -4.994137 0.1730226 0.1730226 0.1730226 0.1730226 Lsamp -4.743579 -4.743579 -4.743579 -4.743579 0.2400408 0.2400408 0.2400408 0.2400408 Calm1 -4.636973 -4.636973 -4.636973 -4.636973 0.1635288 0.1635288 0.1635288 0.1635288

— Reply to this email directly, view it on GitHub https://github.com/dmcable/spacexr/issues/100#issuecomment-1255725110, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJUR66LMRVTW3ZUW473HLJLV7UFKRANCNFSM6AAAAAAQKGMCNY . You are receiving this because you were mentioned.Message ID: @.***>

Feng-Zhang commented 1 year ago

Thank you. As your suggestion, I run the code on my visium data:

RCTD_controls <- create.RCTD.replicates(spatial_controls, reference, samples_controls, max_cores = 20)
RCTD_controls <- run.RCTD.replicates(RCTD_controls, doublet_mode = 'full')
cell_types <- as.character(unique(reference@cell_types))
CSIDE_controls <- run.CSIDE.replicates(RCTD_controls, de_mode = 'general', explanatory.variable.replicates = NULL, X.replicates = X_controls, cell_types=cell_types,
                                       gene_threshold = 5e-05, cell_type_threshold = 5, fdr = 0.01)

When doublet_mode = "full" to run RCTD analysis, the run.CSIDE.replicates will throw error:

run.CSIDE.replicates: running CSIDE for replicate 1
Error in if (gene_threshold == 0.01 || fdr == 0.25 || cell_type_threshold ==  : 
  missing value where TRUE/FALSE needed

However, it is no problem using the default value of doublet_mode on example data from this .

dmcable commented 1 year ago

Ok, I am happy to look into debugging the error. Can you please send me the script and input used to reproduce the error? Thanks.

Feng-Zhang commented 1 year ago

Due to the large input file, I have no idea how to post, and thus change to use the example data from your vignette. But the error is different from last time:

run.CSIDE.replicates: running CSIDE for replicate 1
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'colSums': invalid character indexing
In addition: Warning message:
In run.CSIDE.general(myRCTD, X1, X2, barcodes, cell_types, cell_type_threshold = cell_type_threshold,  :
  run.CSIDE.general: some parameters are set to the CSIDE vignette values, which are intended for testing but not proper execution. For more accurate results, consider using the default parameters to this function.

The code used is here:

library(tidyverse)
library(spacexr)
library(ggplot2)
library(Matrix)
library(doParallel)
prefix <- "spacexr-"
output_prefix <- file.path(work_dir,prefix)
savedir <- 'RCTD_batch_results'
if(!dir.exists(savedir))
  dir.create(savedir)

# load Reference object --------------------
refdir <- system.file("extdata",'Reference/Vignette',package = 'spacexr') # directory for the reference
counts <- read.csv(file.path(refdir,"dge.csv")) # load in counts matrix
rownames(counts) <- counts[,1]; counts[,1] <- NULL # Move first column to rownames
meta_data <- read.csv(file.path(refdir,"meta_data.csv")) # load in meta_data (barcodes, clusters, and nUMI)
cell_types <- meta_data$cluster; names(cell_types) <- meta_data$barcode # create cell_types named list
cell_types <- as.factor(cell_types) # convert to factor data type
nUMI <- meta_data$nUMI; names(nUMI) <- meta_data$barcode # create nUMI named list
reference <- Reference(counts, cell_types, nUMI)

# load SpatialRNA object and desing matrix --------------
# control samples
spatial_controls <- NULL
X_controls <- NULL
samples_controls <- c("Vignette","Vignette_rep2")
for(i in samples_controls){
  # i <- samples_controls[2]
  datadir <- system.file("extdata",'SpatialRNA',i,package = 'spacexr')
  puck <- read.SpatialRNA(datadir, count_file = "MappedDGEForR.csv", coords_file = 'BeadLocationsForR.csv')
  spatial_controls[[i]] <- puck
  barcodes <- intersect(names(puck@nUMI), colnames(puck@counts))
  X_design <- as.matrix(rep(1, length(barcodes)))
  rownames(X_design) <- barcodes
  X_controls[[i]] <- X_design
}

# treatment samples
spatial_treats <- NULL
X_treats <- NULL
samples_treats <- c("Vignette_rep3","Vignette_rep4")
for(i in samples_treats){
  # i <- 1
  datadir <- system.file("extdata",'SpatialRNA',i,package = 'spacexr')
  puck <- read.SpatialRNA(datadir, count_file = "MappedDGEForR.csv", coords_file = 'BeadLocationsForR.csv')
  spatial_treats[[i]] <- puck
  barcodes <- intersect(names(puck@nUMI), colnames(puck@counts))
  X_design <- as.matrix(rep(1, length(barcodes)))
  rownames(X_design) <- barcodes
  X_treats[[i]] <- X_design
}

# run RCTD and CSIDE for control and treatment group -----------------------------
RCTD_controls <- create.RCTD.replicates(spatial_controls, reference, samples_controls, max_cores = 6)
RCTD_controls <- run.RCTD.replicates(RCTD_controls,doublet_mode = "full")
cell_types <- c('10','18')
CSIDE_controls <- run.CSIDE.replicates(RCTD_controls, de_mode = 'general', explanatory.variable.replicates = NULL, X.replicates = X_controls, cell_types=cell_types,
                                    gene_threshold = .01, cell_type_threshold = 5, fdr = 0.25,doublet_mode = F)
Feng-Zhang commented 1 year ago

Hi @dmcable , I have pushed a test data and script on here, which can reproduce the error:

run.CSIDE.replicates: running CSIDE for replicate 1
Error in if (gene_threshold == 0.01 || fdr == 0.25 || cell_type_threshold ==  : 
  missing value where TRUE/FALSE needed

Thanks.

dmcable commented 1 year ago

Thank you for discovering these errors. There were two separate bugs that are now fixed.

Feng-Zhang commented 1 year ago

It still throws an error using the data on google drive:

CSIDE.population.inference: running population DE inference with use.groups=FALSE
Error in CSIDE.population.inference(RCTD.replicates, log_fc_thresh = log_fc_thresh) : 
  CSIDE.population.inference: minimum of three replicates required for population mode.

I remove a cell type to avoid the error caused by the minimal number of cells so that the code was modified a bit. The code on google drive is updtaed.

cell_types <- as.character(unique(reference@cell_types))[1:2]
CSIDE_controls <- run.CSIDE.replicates(RCTD_controls, de_mode = 'general', explanatory.variable.replicates = NULL, X.replicates = X_controls,
                                       cell_types=cell_types,doublet_mode =F,
                                       cell_type_threshold=5)
dmcable commented 1 year ago

This is intended behavior of C-SIDE. Please refer to the error message above. The minimum is 3 replicates.

Feng-Zhang commented 1 year ago

It is weird because we have 4 replicates for the control group.

length(RCTD_controls@RCTD.reps)
[1] 4
dmcable commented 1 year ago

Interesting, I see. There was a bug in the code, but it is now fixed, thank you! I was able to run your code to completion in the population-DE mode.

Feng-Zhang commented 1 year ago

@dmcable really appreciated. There aren't any errors now. As you know, I am newer in biostatistics. I wonder if could you confirm the process of calculating the p-value?

cell_type <- 'Neurons'
z <- (CSIDE_controls@population_de_results[[cell_type]]["log_fc_est"]-CSIDE_treats@population_de_results[[cell_type]]["log_fc_est"])/sqrt(CSIDE_controls@population_de_results[[cell_type]]["sd_est"]^2+CSIDE_treats@population_de_results[[cell_type]]["sd_est"]^2)
2*pnorm(-abs(z$log_fc_est))

Is the code get the right p-value?

dmcable commented 1 year ago

Yes, that looks correct.

Dylan

On Sun, Sep 25, 2022 at 5:44 AM fengzhang @.***> wrote:

@dmcable https://github.com/dmcable really appreciated. There aren't any errors now. As you know, I am newer in biostatistics. I wonder if could you confirm the process of the p-value?

cell_type <- 'Neurons'

z <- @.**@*.**@*.**@._de_results[[cell_type]]["sd_est"]^2)

2*pnorm(-abs(z$log_fc_est))

Is the code get the right p-value?

— Reply to this email directly, view it on GitHub https://github.com/dmcable/spacexr/issues/100#issuecomment-1257157437, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJUR66IN63WHNSMJOEHTSETWAANI3ANCNFSM6AAAAAAQKGMCNY . You are receiving this because you were mentioned.Message ID: @.***>

Feng-Zhang commented 1 year ago

Thank you a lot.

glzhang90095 commented 1 year ago

@dmcable really appreciated. There aren't any errors now. As you know, I am newer in biostatistics. I wonder if could you confirm the process of calculating the p-value?

cell_type <- 'Neurons'
z <- (CSIDE_controls@population_de_results[[cell_type]]["log_fc_est"]-CSIDE_treats@population_de_results[[cell_type]]["log_fc_est"])/sqrt(CSIDE_controls@population_de_results[[cell_type]]["sd_est"]^2+CSIDE_treats@population_de_results[[cell_type]]["sd_est"]^2)
2*pnorm(-abs(z$log_fc_est))

Is the code get the right p-value?

Hi @dmcable @Feng-Zhang , Here, since the explanatory.variable.replicates = NULL in run.CSIDE.replicates,

  1. how do you interpret log_fc_est in CSIDE output? which is used as a baseline to compare?
  2. Does the log_fc_est column represent Fold Change between control and treatment in the Z-test output?
  3. Do you have to adjust the p-value calculated by Z-test?
  4. In population_de_results, the highest gene number saved in population_de_results among cell types (neuron, biggest population in brain) is only 2,000. Is that a little low? Is it because this setting -gene_threshold = 5e-5? How to increase the gene number?

Thanks in advance! Guanglin

dmcable commented 1 year ago
  1. log_fc_est represents the parameter of interest that is estimated by the model. In this case, it would be the intercept term, representing average expression of the cell type.
  2. No, it is the parameter value within one sample. The fold-change term would be a misnomer in this case.
  3. It would be a good idea to adjust for multiple hypothesis testing.
  4. This seems typical. It's not that low because there is an additional filtering step for genes per cell type (see the CT.PROP parameter to CSIDE.population.inference with default value 0.5. This parameter is the minimum ratio of gene expression within cell type compared to other cell types. To get more genes per cell type, it could be lowered, along with gene_threshold.
glzhang90095 commented 1 year ago

Hi Dylan and Feng @dmcable @Feng-Zhang ,

Following up on the last question, as you explained, "log_fc_est" is the intercept term, representing the average expression of the cell type. However, in my below dataset, they are all negative. Did you see this in your datasets? So how can I interpret the meaning?

Really appreciate your time and help! Guanglin

    head(test@population_de_results$neuron)
    tau log_fc_est  sd_est  Z_est   p_cross ct_prop expr    q_val   p
Mrpl15  0   -10.050433  0.33313872  -30.16891   0.9073508   0.5178239   5.128992e-05    6.751968e-200   6.059863e-200
Tcea1   0   -9.413501   0.24295646  -38.74563   0.6440019   0.6286491   5.394797e-05    0.000000e+00    0.000000e+00
Atp6v1h 0   -8.202751   0.07948133  -103.20348  0.9702251   1.0000000   1.328114e-04    0.000000e+00    0.000000e+00
Rb1cc1  0   -8.802166   0.11677720  -75.37573   0.8318435   1.0000000   4.525399e-04    0.000000e+00    0.000000e+00
Pcmtd1  0   -9.206746   0.16736923  -55.00859   0.6969403   0.4847225   2.016156e-04    0.000000e+00    0.000000e+00
Vcpip1  0   -9.399355   0.18967900  -49.55401   0.3559265   1.0000000   1.153743e-04    0.000000e+00    0.000000e+00
dmcable commented 1 year ago

Hey @glzhang90095 it is the average log expression, so it would be expected to be negative. If you take exp(log_fc_est) you will have units of 'counts per 1', multiply by 1,000,000 if you want counts per million etc.

krademaker commented 1 year ago

Hey @glzhang90095 it is the average log expression, so it would be expected to be negative. If you take exp(log_fc_est) you will have units of 'counts per 1', multiply by 1,000,000 if you want counts per million etc.

How would you go about to get the counts for a library size (from nUMI) to compare the predicted spatial gene expression, decomposed over cell-types, to the observed gene expression in your @spatialRNA object? I am curious to visualize how the count distributions looks like, using the cell type-specific expression estimates, but I can't seem to figure it out.

krademaker commented 1 year ago

Hey @glzhang90095 it is the average log expression, so it would be expected to be negative. If you take exp(log_fc_est) you will have units of 'counts per 1', multiply by 1,000,000 if you want counts per million etc.

How would you go about to get the counts for a library size (from nUMI) to compare the predicted spatial gene expression, decomposed over cell-types, to the observed gene expression in your @spatialRNA object? I am curious to visualize how the count distributions looks like, using the cell type-specific expression estimates, but I can't seem to figure it out.

dmcable commented 7 months ago

That's a great question! What you are looking for are the functions predict_CSIDE_all (summed across all cell types) and predict_CSIDE (a single cell type) in CSIDE_plots.R to obtain the predicted counts for each pixel in the CSIDE model. Note that these functions aren't documented/exported, so you should load them into memory. Hope this helps. We used these functions in our paper to make plots comparing CSIDE predictions to observed counts. I think it's a useful analysis.

Within the make_all_de_plots function that makes plots for CSIDE, the function make_de_plots_quant should generate plots of this nature and place them in the 'de_plots_quant' folder.

Best, Dylan

krademaker commented 6 months ago

Hi Dylan,

Thank you for the explanation, this was exactly what I was looking for and will be trying out!

Best, Koen