Nanostring-Biostats / SpatialDecon

The SpatialDecon library implements the SpatialDecon algorithm for mixed cell deconvolution in spatial gene expression datasets. (This algorithm also works in bulk expression profiling data.)
MIT License
33 stars 8 forks source link

Error in Differential Expression testing after spatialdecon and reverseDecon #36

Open gianfilippo opened 2 years ago

gianfilippo commented 2 years ago

Hi,

could you please clarify how to run Differential Expression on the cell type specific data generated by reverseDecon ?

Below is what I did so far:

I followed the GeoMxWorkflows tutorial to QC, filter, process the data from DCC files

Next, I used assayData(myData)[["q_norm"]] as final matrix for analysis and tried to follow the SpatialDecon tutorial: norm = assayData(myData)[["q_norm"]]

infer background using derive_GeoMx_background

run the spatialdecon function as follow: spatialdecon(norm = norm, bg = bg, X = mouseAllenBrain, align_genes = TRUE) where mouseAllenBrain = download_profile_matrix(species = "Mouse", age_group = "Adult", matrixname = "Brain_AllenBrainAtlas")

Then I used reverseDecon to get cell type specific gene expression levels rdecon = reverseDecon(norm = norm, beta = res$beta)

from here I derived the cell type specific gene expression levels (not sure about this). for instance, for Actrocytes, for (i in 1:ncol(norm)) normAstro[,i] = rdecon$coefs[,"Astro"] * res$beta["Astro",i] + rdecon$coefs[,1]

the normAstro matrix has some entries equal (or very close) to zero, so, log2(normAstro) not an option. I use log2(normAstro+1) and assign it to the GeoMXset object assayDataElement(object = myData, elt = "log_q_norm_Astro") = log2(normCellType[["Astro"]]+1.0)

At this point I do differential expression testing as follow (as in the tutorial) for each segment. NOTE: the experimental design is such that each slide has both "Ref" and "Pert" in testClass, but I may have two slides each individual, so I used a LMM without random slope

segment = "TypeA"
ind <- pData(target_demoData)$segment == segment
mixedOutmc <- mixedModelDE(target_demoData[, ind], elt = "log_q_norm_Astro", modelFormula = ~ testClass + (1 | slide),
                           groupVar = "testClass", nCores = parallel::detectCores(), multiCore = F)
r_test <- do.call(rbind, mixedOutmc["lsmeans", ])
tests <- rownames(r_test)
r_test <- as.data.frame(r_test)
r_test$Contrast <- tests
r_test$Gene <- unlist(lapply(colnames(mixedOutmc), rep, nrow(mixedOutmc["lsmeans", ][[1]])))
r_test$Subset <- segment
r_test$FDR <- p.adjust(r_test$`Pr(>|t|)`, method = "fdr")
r_test <- r_test[, c("Gene", "Subset", "Contrast", "Estimate", "Pr(>|t|)", "FDR")]
resultsCellTypeSpec <- rbind(resultsCellTypeSpec, r_test)

The code crashes at this point with the following error "log_q_norm_Astro" Error in checkForRemoteErrors(val) : 20 nodes produced errors; first error: invalid class “corMatrix” object: 'sd' slot has non-finite entries

Please let me know if you need more details

Thanks Gianfilippo

gianfilippo commented 2 years ago

Hi Maddy,

I take the reverseDecon output, "rdecon" in my case, and, for each sample, I generate cell type specific gene expression levels. See blow for an example:

for (i in 1:ncol(norm)) normAstro[,i] = rdecon$coefs[,"Astro"] * res$beta["Astro",i] + rdecon$coefs[,1]

I am not sure this is correct. Anyway, the resulting cell type specific gene expression matrix, normAstro is a gene X sample matrix, with the same gene names and sample IDs as the original matrix.

I then add it to the NanoStringGeoMxSet and git is a 'name', as follow: assayDataElement(object = myData, elt = "log_q_norm_Astro") = log2(normCellType[["Astro"]]+1.0)

Then I try to test for differential expression as described in my previous post.

Unless there is a problem with the way I add the assayDataElement to the NanoStringGeoMxSet, I cannot see the matrix format to be a problem. May be the way I estimate cell type specific expression levels by gene is wrong? what do you think ?

Thanks Gianfilippo

maddygriz commented 2 years ago

Hi Gianfilippo,

I was reading your issue incorrectly and noticed my mistake right after I sent my previous comment. I am looking into your issue, now that I understand it correctly, and I will get back to you shortly.

Maddy

maddygriz commented 2 years ago

Hi Gianfilippo,

So you are creating the cell type specific gene expression levels in the expected format so that is not the issue. The issue seems to stem from the significant number of zeros that are created from generating this matrix. Our biostatistician suggests

"limiting this analysis to genes that are mainly coming from the cell type of interest. E.g. if a gene's "astrocyte-specific" expression is >50% of the total background-subtracted expression, then it could be worth analyzing in this way. This kind of rule does remove most genes from consideration."

But if you don't want to reduce the gene list that much, you at least need to remove the genes that have zeros across all of the samples. Having the same value across the row is causing the standard deviation error.

Maddy

gianfilippo commented 2 years ago

How naive of me!! I did not think about re-filtering the matrix. Thank you (and your biostatistician) for noticing it. I will let you know how it goes Best Gianfilippo

gianfilippo commented 2 years ago

ok, I made some changes but I am still having problems. I get the following error

Error in if (maxmingrad > ccl$tol) { : missing value where TRUE/FALSE needed

Can you please help ?

Below is what I did. In the first few lines, I create a copy of the original data, then I filter out from the
myData is the original data object.

Copy data obj

myData_test = myData

Add cell type specific matrix to data Obj and store into "q_norm_Astro" assay element

assayDataElement(object=myData_test, elt="q_norm_Astro")) <- normCellType[["Astro"]]

Filter genes. Keep only genes with a min expr of 0.0001 in 9 samples or more

keep = which(rowSums(normCellType[["Astro"]]>=0.0001)>8)
myData_test = myData_test[keep,]

log cell type specific matrix and store into "log_q_norm_Astro" assay element

assayDataElement(object=myData_test, elt="log_q_norm_Astro") <- assayDataApply(myData_test, 2, FUN=log, base=2, elt="q_norm_Astro")

get index for Astrocyte specific samples to be tested

ind <- pData(myData_test)$segment == "AstroMarker"
mixedOutmc <- mixedModelDE(myData_test[, ind], elt = "log_q_norm_Astro", modelFormula = ~ testClass + (1 | slide), groupVar = "testClass", nCores = 20, multiCore = F)
r_test <- do.call(rbind, mixedOutmc["lsmeans", ])
tests <- rownames(r_test)
r_test <- as.data.frame(r_test)
r_test$Contrast <- tests
r_test$Gene <- unlist(lapply(colnames(mixedOutmc), rep, nrow(mixedOutmc["lsmeans", ][[1]])))
r_test$Subset <- segment
r_test$FDR <- p.adjust(r_test$`Pr(>|t|)`, method = "fdr")
r_test <- r_test[, c("Gene", "Subset", "Contrast", "Estimate", "Pr(>|t|)", "FDR")]
gianfilippo commented 2 years ago

Hi, I think I found the issue. The way I was getting the cell type specific gene expression estimate is wrong, since I was adding the intercept value to the gene expression estimate, i.e. for (i in 1:ncol(norm)) normAstro[,i] = rdecon$coefs[,"Astro"] * res$beta["Astro",i] + rdecon$coefs[,1]

In reality this intercept, as I understand it, is part of the model for overall gene expression. So, when I removed it, and estimated cell type gene expression as for (i in 1:ncol(norm)) normAstro[,i] = rdecon$coefs[,"Astro"] * res$beta["Astro",i] the code works. The reason for the crash, as you mentioned also, is constant expression across samples, which happens anywhere the cell-sample coeff is zero, making the only non-zero entry the intercept coefficient. Perhaps, a question is how good are these estimates for differential expression analysis, since, I believe, you have not assessed that in the paper. i can see that the difference from actual measurement is non-negligible. Also, what is the meaning of the intercept coefficient ? background noise ? Thanks Gianfilippo

patrickjdanaher commented 2 years ago

Hi Gianfilippo, Glad you found the fix! Answers to your questions: