How to calculate logFC for condition in a mixed effects model with MAST?

RGLab / MAST

Tools and methods for analysis of single cell assay data in R

224 stars 57 forks source link

How to calculate logFC for condition in a mixed effects model with MAST? #147

Closed esrasefik closed 2 years ago

esrasefik commented 3 years ago

Could you please help me understand how logFC for a condition variable of interest is calculated when MAST is ran by using a mixed effects model. In the code below, orig.ident indicates subject id which is added as a random effect, cngeneson indicates cellular detection rate, and genotype indicates my condition of interest with two levels (1 = wild type, 2 = mutant). I have single cell RNA-Seq data from an experiment that compares the gene expression profiles of two genotypes in specific cell types. My mutant line is known to harbor a hemizygous deletion of ~20 genes. So, my expected log2FC for those genes is approximately -1 (they should be expressed half as much in the mutant line relative to the wild type line). However, when I calculate the log2FC for these genes using the following approach, I get results that are very far from -1. Could you help me understand whether I am calculating logFC wrong or interpreting the results of your summary function wrong. Any feedback is appreciated!

I used the following code to fit the mixed effects model:

zlm_cl1_test <- zlm(~ genotype + cngeneson + (1 | orig.ident),
     sca = s_obj4.sca_cl1, 
     exprs_value = 'logcounts', 
     method="glmer", 
     ebayes=FALSE,
      silent=T, 
     fitArgsD = list(nAGQ = 0),
     strictConvergence = FALSE)

Here are the names of my modeled coefficients:

colnames(coef(zlm_cl1_test, 'D'))
[1] "(Intercept)" "genotype2"   "cngeneson"

Per your vignette, I used the following approach to identify differentially expressed genes due to the genotype effect, and performed a likelihood ratio test (LRT) by comparing the model with and without the genotype factor. I am only interested in testing the genotype coefficient and I want to calculate the logFC for each gene in genotype2 (mutant) relative to genotype 1 (reference level - wild type):

summaryCondtest <- summary(zlm_cl1_test, doLRT='genotype2') 
summaryDttest <- summaryCondtest$datatable
fcHurdletest <- merge(summaryDttest[contrast=='genotype2' & component=='H',.(primerid, `Pr(>Chisq)`)],
     summaryDttest[contrast=='genotype2' & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], by='primerid')
fcHurdletest[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]

I had normalized my data by using the logNormCounts function, the default log base of which is base 2. So, the logFC I obtain by running the above copied code should be in log base 2. Attached is a screenshot of my summaryDttest output for one of those genes (Dlg1) that should have a log2FC of ~-1 in the mutant condition compared to wild type. The row corresponding to logFC as component, and genotype2 as contrast indicates that the estimated coefficient for this gene is -0.09, which is pretty far from my expected value of -1. Am I interpreting the results wrongly?

I also see in the same table that the estimated coefficient for the discrete ("D") component of genotype2 is -0.7, which is closer to -1. I wonder whether this is the value I should be paying attention to in order to identify the fold change in my mutant condition.

Thank you for all your help! Esra

combiz commented 3 years ago

Subscribing. Also, it's interesting that in the Seurat implementation, the logFC values calculated by MAST are seemingly discarded/substituted for a simple rowMeans log fold-change calculation:https://github.com/satijalab/seurat/blob/b56d194939379460db23380426d3896b54d91ab6/R/differential_expression.R#L553 .

gfinak commented 3 years ago

First make sure you have the latest MAST version from GitHub. MAST does normalization through the cngeneson model component. It's not clear to me that feeding it prenormalized log2 expression will still allow the model to behave as expected. That aside: The genotype2 coefficient is indeed the one you should be looking at for the log fold change if that's what you're interested in. I see you have orig.ident as a random effect, so I assume you have multiple clusters of cells. What do those correspond to in your experiment? Second most of the observed genotype effect is in the discrete component (cells either do or do not express the transcript), with not much difference in the expression level. You also have a significant cngeneson effect. Have you looked to see if that is correlated with your effect of interest? Basically I think there are a few things to investigate here the effect of feeding in prenormalized data. MAST wasn't designed for this. Do have a look at the vignette. The relationship between cngeneson and the effect of interest. For us to better understand: what is orig.ident in this model?

esrasefik commented 3 years ago

@gfinak Thank you for this detailed response!

I made sure that I am using the latest MAST version by installing the package via the following steps: install.packages("BiocManager") BiocManager::install("MAST") My current MAST package version is ‘1.16.0’. Is this the correct version?
I thought normalized gene expression values (log-transformed tpm data) were supposed to be used as input to MAST based on this tutorial. And this tutorial on the interoptability between MAST and SingleCellExperiment-derived packages states that MAST assumes that log-transformed approximately scale-normalized data is provided as input. This was the reason why I used the logNormCounts function prior to fitting the mixed effects model. Thank you for pointing out that pre-normalized values should not be used as input. In that case, how does the following approach look with raw UMI counts used as input?
Also, just to quickly summarize my experimental design: I have two genotypes (wild type and mutant), 8 mice (4 wild type, 4 mutant), and several clusters of cells that I already annotated as distinct brain cell types. I want to look at an individual cluster of cells (e.g., neurons) and determine the effect of genotype on gene expression, while also accounting for the correlation between cells obtained from the same animal (hence mouse id number is modeled as a random effect; orig.ident corresponds to mouse id number in my code).

s_obj.sce <- as.SingleCellExperiment(s_obj, assay = "RNA") # I start off with a Seurat object that contains the raw UMI data for multiple clusters of cells (which correspond to different cell types). I convert this Seurat object into a SingleCellExperiment object for compatibility.
s_obj.sca <- SceToSingleCellAssay(s_obj.sce) # I then upcast the SingleCellExperiment object to MAST’s subclass SingleCellAssay
cdr <-colSums(assay(s_obj4.sca)>0)
colData(s_obj.sca)$cngeneson <- scale(cdr) # I calculate CDR

cond <-factor(colData(s_obj.sca)$genotype)
cond <-relevel(cond,"1") # In order to have more interpretable coefficients, I set the reference level of my genotype condition to be my wild type cells.
colData(s_obj.sca)$genotype <-cond
s_obj.sca_cl1 <- subset(s_obj.sca, CellType =='1') # I subset my SingleCellAssay object to only those cells from cluster 1 (which represents a distinct cell type that I am interested in)

zlm_cl_test <- zlm(~ genotype + cngeneson + (1 | orig.ident), 
     sca = s_obj.sca_cl1, 
     method="glmer", 
     ebayes=FALSE,
      silent=T, 
     fitArgsD = list(nAGQ = 0),
     strictConvergence = FALSE)

Thank you again for all of your feedback! Esra

gfinak commented 3 years ago

The input data SHOULD be log2 transformed but NOT scaled (i.e. Normalized). If you do not log2(x+1) transform the data you will have meaningless estimates of log fold change since the data are assumed to be on the log scale. Again look carefully at the first tutorial you linked.

esrasefik commented 3 years ago

@gfinak I see! Thanks for that clarification. I modified my code in the following way, and added a line for log2(x+1) transformation of my raw UMI counts. Just to double check, is MAST version ‘1.16.0’. the correct version to use? And could you please confirm whether the following workflow looks right? For many of us who started off by using Seurat's interface of the MAST test (which currently doesn't support mixed effects modeling), I think it has been pretty challenging to find reliable and easy to follow documentation online that provides a step by step workflow of how to start with a Seurat object and carry it over to MAST for differential expression analysis.

s_obj.sce <- as.SingleCellExperiment(s_obj, assay = "RNA") # I start off with a Seurat object that contains the raw UMI data for multiple clusters of cells (which correspond to different cell types). I convert this Seurat object into a SingleCellExperiment object for compatibility.

assayNames(s_obj.sce) # the counts assay in this object includes raw UMI counts
**logcounts(s_obj.sce) <- log2(counts(s_obj.sce) + 1) #log2 transform the raw data**

s_obj.sca <- SceToSingleCellAssay(s_obj.sce) # I then upcast the SingleCellExperiment object to MAST’s subclass SingleCellAssay

cdr <-colSums(assay(s_obj4.sca)>0)
colData(s_obj.sca)$cngeneson <- scale(cdr) # I calculate CDR

cond <-factor(colData(s_obj.sca)$genotype)
cond <-relevel(cond,"1") # In order to have more interpretable coefficients, I set the reference level of my genotype condition to be my wild type cells.
colData(s_obj.sca)$genotype <-cond

s_obj.sca_cl1 <- subset(s_obj.sca, CellType =='1') # I subset my SingleCellAssay object to only those cells from cluster 1 (which represents a distinct cell type that I am interested in)

#Fit the mixed effects model:
zlm_cl1_test <- zlm(~ genotype + cngeneson + (1 | orig.ident), 
     sca = s_obj.sca_cl1,
     exprs_value = 'logcounts',
     method="glmer", 
     ebayes=FALSE,
      silent=T, 
     fitArgsD = list(nAGQ = 0),
     strictConvergence = FALSE)

#Likelihood ratio test
summaryCondtest <- summary(zlm_cl1_test, doLRT='genotype2') 
summaryDttest <- summaryCondtest$datatable

fcHurdletest <- merge(summaryDttest[contrast=='genotype2' & component=='H',.(primerid, `Pr(>Chisq)`)], 
                      summaryDttest[contrast=='genotype2' & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], 
                      by='primerid')

fcHurdletest[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]

Thank you so much for your time and guidance!! Esra

gfinak commented 3 years ago

Latest version of MAST is 1.17.1 I think this workflow looks reasonable though we typically filter out genes that's aren't expressed in at least 10% of cells, though that depends a bit on how many cells are measured in total. A couple of questions: What is orig.ident in this case, and does this model converge when you set nAGQ=1? How many cells in each group for your cell type of interest?

combiz commented 3 years ago

@gfinak To be clear, the expected input for MAST is log2(TPM+1), right?

esrasefik commented 3 years ago

@gfinak It turns out I had to first update my R version to '4.0.3' to install the latest version of MAST via github. I now have version ‘1.17.3’ (screenshot attached). You had mentioned that version '1.17.1' was the latest version of the package. Is this difference something that I should be concerned about?

Regarding your question about how many cells are in each group for my cell type of interest: In my full dataset, I have 21,617 genes and 71,494 cells (from 4 wild type and 4 mutant mice). In cluster 1 (neurons), I have 3,542 wild type cells and 4,733 mutant cells. Would it be possible for you to share any code (or function) that we can use to add a filtering step to our pipeline to filter out genes that aren't expressed in at least 10% of the wild type or mutant cells? Im assuming this filtering step would come before CDR calculation, correct? If so, should I subset my data to only cluster 1 cells after CDR calculation or before?

After filtering, my next step will be to test if setting nAGQ=1 works. Thank you so much for all your help and time!!

@combiz I think TPM is a form of scaling that @gfinak recommended not to apply in this case. So, x in my log2(x+1) is raw UMI counts.

combiz commented 3 years ago

@combiz I think TPM is a form of scaling that @gfinak recommended not to apply in this case. So, x in my log2(x+1) is raw UMI counts.

@esrasefik Yes, I'm keen to clarify this as the MAST vignette's going back to 2016 use log2(TPM+1) inputs, e.g. https://www.bioconductor.org/help/course-materials/2016/BioC2016/ConcurrentWorkshops2/McDavid/MAITAnalysis.html and https://bioconductor.org/packages/release/bioc/vignettes/MAST/inst/doc/MAITAnalysis.html . Also, there are references in the literature to MAST using normalized/scaled inputs, e.g. "MAST [50] is applied to both log2(CPM+1) and log2(TPM+1) values, both with and without including the cellular detection rate (the fraction of genes that are detected with non-zero counts) as a covariate in the model." (https://www.biorxiv.org/content/10.1101/143289v1.full.pdf)

esrasefik commented 3 years ago

Hi @combiz - I am checking in to see if you made any progress in getting the clarification you needed about whether TPM should be used as input to MAST.

gfinak commented 3 years ago

TPM can be used, they should be log transformed however. MAST computes log fold changes assuming data are log transformed. They should not be recentered in any way however (i.e. no negative values allowed).

Greg Finak

On Fri, Feb 12, 2021, 12:41 Esra Sefik notifications@github.com wrote:

Hi @combiz https://github.com/combiz - I am checking in to see if you made any progress in getting the clarification you needed about whether TPM should be as input to MAST.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/RGLab/MAST/issues/147#issuecomment-778443236, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKSI6IK4BFLUX45WC3S3ADS6WG7NANCNFSM4UMCKRDA .

esrasefik commented 3 years ago

@gfinak Thank you for this clarification! Do you have a preference for TPM vs raw counts (assuming both are log transformed)? Or can we expect both inputs to give the same answer?

And I wanted to ask if you could respond to my questions above about which version of MAST to use and whether there is any code you can share with us for filtering genes based on percent cells expressed. Im copying the questions here again:

The version of MAST that I got from Github is ‘1.17.3’ (screenshot provided above). You had mentioned that version '1.17.1' was the latest version of the package. Is this difference something that I should be concerned about?
In my full dataset, I have 21,617 genes and 71,494 cells (from 4 wild type and 4 mutant mice). In cluster 1 (neurons), I have 3,542 wild type cells and 4,733 mutant cells. Would it be possible for you to share any code (or function) that we can use to add a filtering step to our pipeline to filter out genes that aren't expressed in at least 10% of the wild type or mutant cells? Im assuming this filtering step would come before CDR calculation, correct? If so, should I subset my data to only cluster 1 cells after CDR calculation or before?

Thank you for all your help! Esra

esrasefik commented 3 years ago

Hi @gfinak! I wanted to follow up regarding my last message to see if you have any recommendations for the questions I raised. Thank you!

danli349 commented 2 years ago

Hi Greg, In the tutorial: https://bioconductor.org/packages/release/bioc/vignettes/MAST/inst/doc/MAITAnalysis.html

summaryDt <- summaryCond$datatable
fcHurdle <- merge(summaryDt[contrast=='conditionStim' & component=='H',.(primerid, `Pr(>Chisq)`)], #hurdle P values
                      summaryDt[contrast=='conditionStim' & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], by='primerid') #logFC coefficients

fcHurdle[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]
fcHurdleSig <- merge(fcHurdle[fdr<.05 & abs(coef)>FCTHRESHOLD], as.data.table(mcols(sca)), by='primerid')
setorder(fcHurdleSig, fdr)

Can you please tell me what is the relationship of the logFC coefficients with rowMeans log fold-change? Can I use the logFC coefficients as the rank to run GSEA?

Thanks a lot Dan

First make sure you have the latest MAST version from GitHub. MAST does normalization through the cngeneson model component. It's not clear to me that feeding it prenormalized log2 expression will still allow the model to behave as expected. That aside: The genotype2 coefficient is indeed the one you should be looking at for the log fold change if that's what you're interested in. I see you have orig.ident as a random effect, so I assume you have multiple clusters of cells. What do those correspond to in your experiment? Second most of the observed genotype effect is in the discrete component (cells either do or do not express the transcript), with not much difference in the expression level. You also have a significant cngeneson effect. Have you looked to see if that is correlated with your effect of interest? Basically I think there are a few things to investigate here the effect of feeding in prenormalized data. MAST wasn't designed for this. Do have a look at the vignette. The relationship between cngeneson and the effect of interest. For us to better understand: what is orig.ident in this model?

amcdavid commented 2 years ago

@danli349 Please see the documentation for getLogFC, your first question is answered there -- if you are still confused open a question on support.bioconductor.org. In general, I wouldn't use the logFC coefficients as input to GSEA unless you have a fairly simple (unadjusted) model. I'd use the -log10(pvalue) * sign(logFC) instead.

I'm going to close this issue because it's not clear if there is a defect, or even what the question is anymore.