Error: grouping factors must have > 1 sampled level

esrasefik commented 3 years ago

Dear MAST Team, I am trying to use MAST for genotype-based differential expression analysis in a number of cell-type specific clusters that I identified in my single cell RNA-Seq data. I have two genotypes (wild type and a mutant strain with a hemizygous deletion of 21 genes). I have 4 replicates per genotype, so 8 mice in total (metadata on replicate ids is referred to as orig.ident below). Due to concerns related to pseudo-replication, I would like to add a random effect for replicate to my model. However, when I try to run the code pasted below, I get an error that states:

Error: grouping factors must have > 1 sampled level no 'nobs' method is available. 
Error: number of levels of each grouping factor must be < number of observations (problems: orig.ident).

The pipeline does not crash in response to this error, but I would like to understand what is causing the error message. Any feedback would be appreciated.

Note: I performed my data preprocessing and clustering using Seurat. Hence, I first had to transform my Seurat object into an SCA object. I believe I followed the vignette correctly, but in case the error is caused by a mistake in one of the steps prior to running zlm, I included all relevant code:

s_obj4.sce <- as.SingleCellExperiment(s_obj4, assay = "RNA") # s_obj4 is my Seurat object that includes the raw counts
s_obj4.sce <- logNormCounts(s_obj4.sce, log = TRUE) # normalization
s_obj4.sca <- SceToSingleCellAssay(s_obj4.sce, check_sanity = FALSE) 

cdr <-colSums(assay(s_obj4.sca)>0) # calculcate cellular detection rate
colData(s_obj4.sca)$cngeneson <- scale(cdr)

cond <-factor(colData(s_obj4.sca)$genotype)
cond <-relevel(cond, "1") # "genotype = 1" refers to wild type
colData(s_obj4.sca)$genotype <-cond

s_obj4.sca_cl1 <- subset(s_obj4.sca, CellType =='1') # subset the data to only cells from CellType =='1' to perform the analysis on only neurons.

zlm_cl1_test <- zlm(~ genotype + cngeneson + (1 | orig.ident),
     sca = s_obj4.sca_cl1, 
     exprs_value = 'logcounts', 
     method="glmer", 
     ebayes=FALSE,
      silent=T, 
     fitArgsD = list(nAGQ = 0))

Thank you for any help you can give me! Esra

gfinak commented 3 years ago

Look at the crosstabulation of what you're actually modeling. What does the table of genotype by orig.ident for cell type 1 look like? The error suggests that you have only one genotype or one value of orig.ident in your data, probably after subsetting to celltype 1. The error is probably only occuring for some genes, and may be in the continuous part of the model since that is conditional on a gene being expressed.

esrasefik commented 3 years ago

Ah that makes so much sense! Thank you for your quick response. In that case, do you think filtering the data for each cluster in the following way could help fix this issue: A gene should be expressed in at least 10% (just as an example) of the wild type or mutant cells of a given cluster. I believe the FindMarkers function in Seurat, which provides an option to use MAST for DE analysis, implements a thresholding rule like this. Is there a similar way I could filter my data in MAST prior to zlm or in the same function as zlm? Thank you again!

Esra

gfinak commented 3 years ago

Basically yes. When I do per cluster analysis I'll do the filtering in a loop. There's no specific API for this in MAST.

esrasefik commented 3 years ago

I see. Thank you again! Given the need for cluster-specific filtering prior to zlm with random effect, I assume that cellular detection rate should be re-calculated for each cluster after filtering. Could you please confirm if that is accurate?

Esra

RGLab / MAST

Error: grouping factors must have > 1 sampled level #146