RGLab / MAST

Tools and methods for analysis of single cell assay data in R
224 stars 57 forks source link

No significant DE genes after MAST (FDR < 0.05) #90

Open fbrundu opened 6 years ago

fbrundu commented 6 years ago

I am using MAST on log-transformed tpm data. The scRNA-seq matrix is composed by 1346 cells × 27933 genes. I used a similar matrix before and I had not any problem in getting meaningful results from MAST. Earlier I was using a different preprocessing procedure:

Reading the original MAST paper, I decided to use the proposed log(1+TPM) normalization. However in this case I get no significant False Discovery Rate for all genes. What may be the problem here?

Thanks, Francesco

gfinak commented 6 years ago

You are not providing enough information for us to help you. Does your data meet the assumptions of MAST? Is it bimodal, does it have exact zeros, is it approximately normal after log(1+TPM) transformation (this is not a normalization procedure). I question whether the UQ normalization is necessary for single cell RNA seq. We prefer to use the cellular detection rate to normalize within the model. You say you do UQ normalization and log transformation, then in the next paragraph you mention the log(1+TPM) transformation. Is this in addition to the other steps or instead of the UQ normalization and log transformation? I don't see any issues with the filtering criteria, but lots of questions left open by your description. Perhaps you could post some of your code and plots that show how the data look. Then we can hope to get some concrete answers.

fbrundu commented 6 years ago

Thanks @gfinak, I'll try to update this thread accordingly. I used the total distribution over the whole set of genes. Let me know if that's what you intended.

Let me know if I answered your questions, and if you need more information.

gfinak commented 6 years ago

Well none of that looks surprising, so that's good. Do you run the thresholding function? I'm wondering if the extra mode in the distribution is confusing the thresholding algorithm. Can you show your code? @amcdavid any thoughts?

fbrundu commented 6 years ago

Thanks, this is the code I'm using. Unfortunately I cannot disclose the data. I also noticed that with smaller matrices this issue is not present. But I'm trying to understand if there is a mistake somewhere in my code.

suppressPackageStartupMessages({
    library(ggplot2)
    library(GGally)
    library(GSEABase)
    library(limma)
    library(reshape2)
    library(data.table)
    library(knitr)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    library(stringr)
    library(NMF)
    library(rsvd)
    library(RColorBrewer)
    library(MAST)
})
n <- 2 #number of groups
#if you have multiple cores to spin
#options(mc.cores = detectCores() - 1) 
options(mc.cores = 1)
df <- read.table('matrix.txt'), sep = '\t', header = TRUE, row.names = 1, check.names = FALSE)
df <- t(df)
cData <- read.table('groups.txt', sep = '\t', row.names = 1, header = TRUE, check.names = FALSE)
colnames(cData) <- c('group')
fData <- data.frame(primerid = rownames(df))
for(cls in 0:(n-1)) {
    cData_cls <- cData
    cData_cls[cData_cls != cls, 'group'] <- 'rest'
    sca <- FromMatrix(as.matrix(df), cData = cData_cls, fData = fData)
    cdr2 <-colSums(assay(sca)>0)
    colData(sca)$cngeneson <- scale(cdr2)
    cond <- factor(colData(sca)$group)
    cond <- relevel(cond, 'rest')
    colData(sca)$group<-cond
    zlmCond <- zlm(~ group + cngeneson, sca, parallel = TRUE)
    groupName <- paste('group', cls, sep='')
    summary <- summary(zlmCond, doLRT=groupName)
    dt <- summary$datatable
    fcHurdle <- merge(dt[contrast==groupName & component=='H',.(primerid, `Pr(>Chisq)`)], #hurdle P values
                      dt[contrast==groupName & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], by='primerid') #logFC coefficients
    fcHurdle[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]
    write.table(paste(groupName, '.MAST-DE.data.table.txt', sep=''), x = fcHurdle, sep = '\t', quote = FALSE, row.names = FALSE)
}
gfinak commented 6 years ago

A few more questions:

amcdavid commented 6 years ago

And one other question. When you say

I used a similar matrix before and I had not any problem in getting meaningful results from MAST.

do you mean, you literally had the same expression matrix (same cells) just with slightly different preprocessing, and now you get widely different results? Or is it different cells and different preprocessing?

fbrundu commented 6 years ago

Sorry for the late response:

@amcdavid on several other matrices (included the same matrix) but with the preprocessing defined in the original post.