No significant DE genes after MAST (FDR < 0.05)

fbrundu commented 6 years ago

I am using MAST on log-transformed TPM data. The scRNA-seq matrix is composed of 1346 cells × 27933 genes. I used a similar matrix before and I had not any problem in getting meaningful results from MAST. Earlier I was using a different preprocessing procedure:

Removal of ribosomal genes
Filtering of cells based on spike-ins, mitochondrial genes, and removal of such genes
Removal of genes expressed in few cells (less than 10%)
Upper quartile normalization and log-transformation

Reading the original MAST paper, I decided to use the proposed log(1+TPM) normalization. However, in this case no gene reaches a significant false discovery rate. What may be the problem here?

Thanks, Francesco

gfinak commented 6 years ago

You are not providing enough information for us to help you. Does your data meet the assumptions of MAST? Is it bimodal, does it have exact zeros, is it approximately normal after log(1+TPM) transformation (this is not a normalization procedure)? I question whether the UQ normalization is necessary for single cell RNA seq. We prefer to use the cellular detection rate to normalize within the model. You say you do UQ normalization and log transformation, then in the next paragraph you mention the log(1+TPM) transformation. Is this in addition to the other steps or instead of the UQ normalization and log transformation? I don't see any issues with the filtering criteria, but lots of questions are left open by your description. Perhaps you could post some of your code and plots that show how the data look. Then we can hope to get some concrete answers.
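The checks asked for above (exact zeros, shape of the non-zero values, cellular detection rate) can be sketched in base R. This is only an illustration on simulated data: the matrix `tpm`, its dimensions, the dropout rate, and the choice of log base 2 are all assumptions, not the poster's data or settings.

```r
# Simulated genes x cells TPM matrix (hypothetical, for illustration only).
set.seed(1)
n_genes <- 200
n_cells <- 50
tpm <- matrix(rexp(n_genes * n_cells, rate = 1 / 50), nrow = n_genes, ncol = n_cells)

# Introduce exact zeros to mimic dropout (60% is an arbitrary assumption).
tpm[matrix(runif(n_genes * n_cells) < 0.6, nrow = n_genes)] <- 0

# log(1 + TPM) transformation; base 2 is assumed here.
log_tpm <- log2(tpm + 1)

# Fraction of exact zeros in the matrix.
zero_fraction <- mean(log_tpm == 0)

# Cellular detection rate: number of genes detected in each cell,
# centered and scaled so it can be used as a model covariate.
cdr <- colSums(log_tpm > 0)
cdr_scaled <- as.numeric(scale(cdr))

# Non-zero values only; their histogram should look roughly normal.
nonzero <- log_tpm[log_tpm > 0]
# hist(nonzero, breaks = 50)  # uncomment to inspect the distribution
```

Plotting `nonzero` separately from the zeros is the relevant view here, because the hurdle model treats the zero and non-zero parts of the data as two components.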

fbrundu commented 6 years ago

Thanks @gfinak, I'll try to update this thread accordingly. I used the total distribution over the whole set of genes. Let me know if that's what you intended.

Is it bimodal: No, if you include zeroes, otherwise it's bimodal
Does it have exact zeros: yes
Is it approx. normal: approximately it may be, if you consider non-zero continuous measurements (if I understood correctly, the MAST paper concludes that "Although MAST has greatest efficiency when the continuous (log)-expression is normally distributed, transformations (such as the Box-Cox) could also be applied if the non-zero continuous measurements are skewed.")
Regarding TPM, yes we can say it is not a normalization procedure
UQ normalization is mutually exclusive with the TPM transformation in my code; I used UQ normalization in a previous work. For this one I decided to use log(1+TPM) to be more in line with the methodology of the paper

Let me know if I answered your questions, and if you need more information.

gfinak commented 6 years ago

Well none of that looks surprising, so that's good. Do you run the thresholding function? I'm wondering if the extra mode in the distribution is confusing the thresholding algorithm. Can you show your code? @amcdavid any thoughts?

fbrundu commented 6 years ago

Thanks, this is the code I'm using. Unfortunately I cannot disclose the data. I also noticed that with smaller matrices this issue is not present. But I'm trying to understand if there is a mistake somewhere in my code.

suppressPackageStartupMessages({
    library(ggplot2)
    library(GGally)
    library(GSEABase)
    library(limma)
    library(reshape2)
    library(data.table)
    library(knitr)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    library(stringr)
    library(NMF)
    library(rsvd)
    library(RColorBrewer)
    library(MAST)
})
n <- 2 #number of groups
#if you have multiple cores to spin
#options(mc.cores = detectCores() - 1) 
options(mc.cores = 1)
df <- read.table('matrix.txt', sep = '\t', header = TRUE, row.names = 1, check.names = FALSE)
df <- t(df)
cData <- read.table('groups.txt', sep = '\t', row.names = 1, header = TRUE, check.names = FALSE)
colnames(cData) <- c('group')
fData <- data.frame(primerid = rownames(df))
for(cls in 0:(n-1)) {
    cData_cls <- cData
    cData_cls[cData_cls != cls, 'group'] <- 'rest'
    sca <- FromMatrix(as.matrix(df), cData = cData_cls, fData = fData)
    cdr2 <-colSums(assay(sca)>0)
    colData(sca)$cngeneson <- scale(cdr2)
    cond <- factor(colData(sca)$group)
    cond <- relevel(cond, 'rest')
    colData(sca)$group<-cond
    zlmCond <- zlm(~ group + cngeneson, sca, parallel = TRUE)
    groupName <- paste('group', cls, sep='')
    summaryCond <- summary(zlmCond, doLRT = groupName) # named so it does not mask summary()
    dt <- summaryCond$datatable
    fcHurdle <- merge(dt[contrast==groupName & component=='H',.(primerid, `Pr(>Chisq)`)], #hurdle P values
                      dt[contrast==groupName & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], by='primerid') #logFC coefficients
    fcHurdle[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]
    write.table(fcHurdle, file = paste(groupName, '.MAST-DE.data.table.txt', sep=''), sep = '\t', quote = FALSE, row.names = FALSE)
}
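The script above passes `groupName` (for example `group0`) to `doLRT`, so that string has to match a coefficient name in the design. A minimal base R sketch of how the one-vs-rest recoding with `rest` as the reference level produces that name is shown here. The labels and covariate values are hypothetical, and `model.matrix` is used only to display the coefficient names that a `~ group + cngeneson` formula generates; it is not a substitute for fitting the model.

```r
# Hypothetical cluster labels for six cells (not the poster's data).
labels <- c(0, 1, 0, 1, 1, 0)
cls <- 0

# One-vs-rest recoding: cells outside the current cluster become 'rest',
# and 'rest' is made the reference level of the factor.
group <- ifelse(labels == cls, as.character(cls), "rest")
group <- relevel(factor(group), ref = "rest")

# Hypothetical detection-rate covariate, centered and scaled.
cngeneson <- as.numeric(scale(c(120, 95, 130, 90, 100, 125)))

# Coefficient names implied by the formula used in the script.
design <- model.matrix(~ group + cngeneson, data.frame(group, cngeneson))
coef_names <- colnames(design)

# The name that would be passed to doLRT for this cluster.
group_name <- paste0("group", cls)
```

If `group_name` is not among `coef_names`, for instance because the reference level was not set to `rest`, the likelihood ratio test would be requested for a coefficient that does not exist in the model.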

gfinak commented 6 years ago

A few more questions:

is the matrix you read in already log transformed?
not that it makes any difference, but why are you looping over the groups?
out of curiosity do you have the same issue when parallel=FALSE?
do your significant genes based on unadjusted p-values make sense?
what kind of effect sizes are you seeing?
when you say "smaller matrices" do you mean fewer genes or fewer cells?
Do you do any filtering of genes and cells prior to model fitting?

amcdavid commented 6 years ago

And one other question. When you say

I used a similar matrix before and I had not any problem in getting meaningful results from MAST.

do you mean, you literally had the same expression matrix (same cells) just with slightly different preprocessing, and now you get widely different results? Or is it different cells and different preprocessing?

fbrundu commented 6 years ago

Sorry for the late response:

Yes, the matrix is already log(1+TPM) transformed
I'm looping over the groups because I want to compare each group vs. the rest, but there is probably a smarter way to do it
When possible I will try with parallel = FALSE and let you know

Unadjusted p-values (Pr(>Chisq)) are distributed approx. in this way in one of the groups:

count    27933.000000
mean         0.804736
std          0.276876
min          0.000462
25%          0.675734
50%          0.988951
75%          1.000000
max          1.000000

Effect sizes (coef) are distributed in the same group as:

count    10151.000000
mean         0.011468
std          0.209171
min         -1.311648
25%         -0.088395
50%          0.008869
75%          0.101489
max          1.243300

(the count is different because when the p-value is 1, the coef is NaN)

By smaller matrices I mean approx. the same number of cells (except for a small set of outliers selected by looking at the percentage of spike-in and mitochondrial transcripts vs. the total transcripts of each cell), but a stronger filter on genes: besides removing ribosomal, spike-in and mitochondrial genes, I also removed genes with fewer than 1 transcript every n cells
Filtering before model fitting is mainly outlier detection based (for now) on arbitrary cutoffs chosen by looking at the distribution of values
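One possible reason the stronger gene filter changes the outcome is the multiple-testing burden: `p.adjust(..., 'fdr')` applies the Benjamini-Hochberg correction over every gene passed to it, including genes whose test returns a p-value of 1. The sketch below uses made-up p-values (200 genes with modest signal, 9800 null-like genes, 18000 untestable genes) to show the effect. These counts and values are assumptions for illustration, not the poster's results, and this is not a diagnosis of the issue.

```r
# Hypothetical p-values, constructed deterministically for illustration.
p_signal <- seq(1e-5, 5e-4, length.out = 200)   # genes with modest signal
p_null   <- seq(0.01, 1, length.out = 9800)     # null-like genes
p_one    <- rep(1, 18000)                       # sparse genes, p-value of 1

# Case 1: adjust over all genes, including the untestable ones.
p_all <- c(p_signal, p_null, p_one)
fdr_all <- p.adjust(p_all, method = "fdr")

# Case 2: the sparse genes were removed before model fitting
# by an expression filter, so they never enter the correction.
p_filtered <- c(p_signal, p_null)
fdr_filtered <- p.adjust(p_filtered, method = "fdr")

# Number of genes passing FDR < 0.05 in each case.
n_sig_all <- sum(fdr_all < 0.05)            # 0 with these values
n_sig_filtered <- sum(fdr_filtered < 0.05)  # 200 with these values
```

The filter has to be applied on a criterion that does not depend on the test result, such as the fraction of cells in which a gene is detected, and before the model is fitted. Dropping genes after looking at their p-values would invalidate the FDR control.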

@amcdavid I got meaningful results on several other matrices (including this same matrix), but with the preprocessing described in the original post.

RGLab / MAST

No significant DE genes after MAST (FDR < 0.05) #90