RGLab / MAST

Tools and methods for analysis of single cell assay data in R
224 stars 57 forks source link

Low logFCs and failure to converge? #173

Open learning-MD opened 2 years ago

learning-MD commented 2 years ago

I've been attempting to use MAST to perform differential gene expression analysis between 3 conditions (control, disease_state1, and disease_state2) from single-cell data and have been running into issues with very low logFCs (all of them with absolute values of <1) that are different in comparison to using Wilcoxon Rank Sum test with Seurat or when I perform pseudobulk methods (such as DESeq2 or edgeR).

# Subsetting my cluster of interest
cluster_seurat_obj <- subset(seurat_obj, idents = cluster_1")
DefaultAssay(cluster_seurat_obj) <- "RNA"

# Converting the Seurat object into a SingleCellExperiment object
sce <- as.SingleCellExperiment(DietSeurat(cluster_seurat_obj, graphs = c("pca", "umap")))

# Log-2 transforming the raw counts
logcounts(sce) <- log2(counts(sce) + 1)

# Converting the SingleCellExperiment object into a MAST SingleCellAssay object
sca <- MAST::SceToSingleCellAssay(sce, class = "SingleCellAssay")

# Scale gene detection rate
colData(sca)$nFeature_RNA = scale(colData(sca)$nFeature_RNA)

# Cellular detection rate
cdr2 <- colSums(assay(sca)>0)
colData(sca)$cngeneson <- scale(cdr2)

# Using column header for my variable of interest
cond <- factor(colData(sca)$process)
cond <- relevel(cond, "control") #setting my control, as there are 2 different disease processes as well
colData(sca)$condition <- cond

# Adding batch to cond to control for batch effect (using each individual sample as its own batch)
cond <- factor(colData(sca)$sample_ID)
colData(sca)$batch_ID <- cond

# Filtering genes to only include those with non-zero expression in >5% of cells
expressed_genes <- freq(sca) >0.05
sca <- sca[expressed_genes,]

# Performing the DEG analysis
zlmCond <- suppressMessages(MAST::zlm(~condition + cngeneson + (1 | batch_ID), exprs_value = "logcounts", sca,method='glmer',ebayes = FALSE,strictConvergence = FALSE, parallel=TRUE))

# Using "condition<diseaseofinterest>" annotation to highlight my condition of interest
summaryCond <- summary(zlmCond, doLRT='conditiondisease1')

summaryDt <- summaryCond$datatable

fcHurdle <- merge(summaryDt[contrast=='conditiondisease1' & component=='H',.(primerid, `Pr(>Chisq)`)],
                  summaryDt[contrast=='conditiondisease1' & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], by='primerid')

fcHurdle[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]

I do get the following warning when I run the zlm, suggesting that MAST cannot find a good model because the variability is too high. However, this is for a cluster with ~15,000 cells and a very common cell type, so I'm wondering if there's something wrong with my code itself:

Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.00241813 (tol = 0.002, component 1)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.0464445 (tol = 0.002, component 1)
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.0719693 (tol = 0.002, component 1)
4: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?
5: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.0824029 (tol = 0.002, component 1)
6: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
 - Rescale variables?

The results end up with around 70 significant genes (FDR <0.05) but in the "coef" column correlating to the logFC, the absolute values range from 0.03 to 0.9 or so. This is very different in comparison to the Wilcoxon Rank Sum test or other pseudobulk approaches (where the log2FCs are typically in the 3s for a lot of the genes). My suspicion is that I'm doing something wrong, as a relative beginner to MAST, in what I'm coding but I'm not sure what steps to take next to troubleshoot this. I've tried searching around on the GitHub, Biostars, Bioconductor forum, and scoured through Nature/Cell papers for code published with manuscripts - I've been unsuccessful, unfortunately (especially since nearly every manuscript uses the Seurat wrapper for MAST instead).

Any guidance would be much appreciated. Thanks in advance!