GabrielHoffman / variancePartition

Quantify and interpret divers of variation in multilevel gene expression experiments
http://gabrielhoffman.github.io/variancePartition/
57 stars 14 forks source link

dream(): Evaluate function error --> "Error evaluating fxn:" #66

Closed aidarripoll closed 1 year ago

aidarripoll commented 1 year ago

Hi,

I'm calling variancePartition::dream() function to fit a mixed-model and I'm getting the following error:

_Error in dream(vobjDream_obj, form_model, metadata_obj, L) : Error evaluating fxn: Error in evalf(x, ...): Downdated VtV is not positive definite

I've gone through the variancePartition::dream() function and it seems the error appears when evaluating the function (line 582-604). Do you have any recommendations to handle the error and continue the execution for the genes which do not face this problem (with tryCatch() for instance). I'm asking this since biocParallel::book() is used to make the evaluation, and there's maybe something to consider regarding when/where to handle the error.

Also, I'm wondering whether I could do some previous check on the exprObj or data variancePartition::dream() arguments before calling variancePartition::dream() function to see what could be going on...

Moreover, what I found a bit strange is that the error _Error in evalf(x, ...): Downdated VtV is not positive definite seems to be related to the complete separation problem, which I haven't faced before and it seems more related to logistic models...

Thanks a lot, Aida

GabrielHoffman commented 1 year ago

Can you show sessionInfo(). This can happen if the covariates are almost colinear, or if the residual variance is near zero. Can you provide a reproducible example?

aidarripoll commented 1 year ago

As I mentioned, when calling variancePartition::dream() in a exprObj of 21,315 genes (rows) and 981 donors (cols), it stops due to this error. I'm using try() to skip the error (gene with the failure), but it's not working:

> dim(geneExpr)
[1] 21315   981

> fit = try(dream(vobjDream_obj, form_model, metadata_obj, L))

How can I know in which gene I have the error?

At the moment I can't share the data, but you can see some features of the arguments here:

> class(vobjDream_obj)
[1] "EList"
attr(,"package")
[1] "limma"

> form_model
~Gender + Age + (1 | date)

> head(metadata_obj)
        Age Gender   date   lane Age_cat Age_cat_all
692_693  62      F pool_1 pool_1       O           O
686_687  69      M pool_1 pool_1       O           O
684_685  75      F pool_1 pool_1       O           O
682_683  73      M pool_1 pool_1       O           O
689_690  59      M pool_1 pool_1       M           O
690_691  68      M pool_1 pool_1       O           O

> L
(Intercept)     GenderF         Age
          0           1           0

Here you can find the sessionInfo():

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: SUSE Linux Enterprise Server 12 SP2

Matrix products: default
BLAS:   /gpfs/apps/MN4/R/4.1.2/GCC/lib64/R/lib/libRblas.so
LAPACK: /gpfs/apps/MN4/R/4.1.2/GCC/lib64/R/lib/libRlapack.so

locale:
[1] C

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] metafor_3.4-0               metadat_1.2-0
 [3] variancePartition_1.27.2    BiocParallel_1.28.3
 [5] ggplot2_3.3.6               edgeR_3.36.0
 [7] limma_3.50.3                Matrix.utils_0.9.8
 [9] MAST_1.21.3                 SingleCellExperiment_1.16.0
[11] SummarizedExperiment_1.24.0 Biobase_2.54.0
[13] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1
[15] IRanges_2.28.0              S4Vectors_0.32.4
[17] BiocGenerics_0.40.0         MatrixGenerics_1.6.0
[19] matrixStats_0.62.0          stringr_1.4.0
[21] stringi_1.7.6               reshape2_1.4.4
[23] tibble_3.1.7                tidyr_1.2.0
[25] dplyr_1.0.9                 plyr_1.8.7
[27] lme4_1.1-28                 Matrix_1.4-1
[29] data.table_1.14.2           optparse_1.7.1

loaded via a namespace (and not attached):
 [1] nlme_3.1-157           bitops_1.0-7           pbkrtest_0.5.1
 [4] doParallel_1.0.17      progress_1.2.2         numDeriv_2016.8-1.1
 [7] tools_4.1.2            backports_1.4.1        utf8_1.2.2
[10] R6_2.5.1               KernSmooth_2.23-20     DBI_1.1.2
[13] colorspace_2.0-3       withr_2.5.0            tidyselect_1.1.2
[16] prettyunits_1.1.1      compiler_4.1.2         cli_3.3.0
[19] DelayedArray_0.20.0    caTools_1.18.2         scales_1.2.0
[22] minqa_1.2.4            aod_1.3.2              XVector_0.34.0
[25] RhpcBLASctl_0.21-247.1 pkgconfig_2.0.3        rlang_1.0.2
[28] generics_0.1.2         gtools_3.9.2           RCurl_1.98-1.6
[31] magrittr_2.0.3         GenomeInfoDbData_1.2.7 Rcpp_1.0.8.3
[34] munsell_0.5.0          fansi_1.0.3            abind_1.4-5
[37] lifecycle_1.0.1        mathjaxr_1.6-0         MASS_7.3-57
[40] zlibbioc_1.40.0        gplots_3.1.3           grid_4.1.2
[43] parallel_4.1.2         crayon_1.5.1           lattice_0.20-45
[46] splines_4.1.2          hms_1.1.1              locfit_1.5-9.5
[49] pillar_1.7.0           boot_1.3-28            codetools_0.2-18
[52] glue_1.6.2             vctrs_0.4.1            nloptr_2.0.0
[55] Rdpack_2.3             foreach_1.5.2          gtable_0.3.0
[58] grr_0.9.5              getopt_1.20.3          purrr_0.3.4
[61] assertthat_0.2.1       rbibutils_2.2.8        broom_0.8.0
[64] lmerTest_3.1-3         iterators_1.0.14       ellipsis_0.3.2
GabrielHoffman commented 1 year ago

Hi Aida, 1.27.2 is a recent version, so that's not the issue. In my experience errors like this are caused small issues with filtering, or a single gene that is problematic. For most errors like this variancePartition functions also give you the gene number where the error was, so you can debug further. I'm not sure why that didn't happen in this case.

But you can iterate through each gene to find where the error is:

for(j in 1:nrow(vobjDream_obj)){
  message(j)
  fit = dream(vobjDream_obj[j,], form_model, metadata_obj, L)
}

A few options for the issue is 1) Is Age coded as numeric? If it is a character or factor it could give this error 2) If you have a gene with no variance 3) If you have a gene with no variance within a category. For example, if gene j is constant in females. 4) if there are missing expression values

dream() should have built-in checks for each of these, so it looks like you found a special case that slips by.

I want to fix this for other users, so please let me know if you find the answer

Gabriel

aidarripoll commented 1 year ago

Hi,

I attach a reproducible example which uses testing data host at Molgenis server:

You can see that the execution stops when we run dream() on the whole vobjDream object (lines 40-41). Then, we have run dream() on each gene separately (lines 43-81) to check which gene/s are giving the following error:

_Error in dream(vobjDream, form, aggregatemetadata, L) : Error evaluating fxn: Error in devfun(rho$pp$theta): Downdated VtV is not positive definite

Lately, we have done a couple of checks (lines 83-115) trying to answer your questions about the potential reasons of the errors, but apparently, the error is not due to any of those cases.

I hope you can find the reason. At the moment, I'm just running dream() on each gene separately and throwing out the genes that give this error (using try()) since there're few cases... However, I think it'd be useful to be able to use try() when running dream() on the whole object, which is not possible right now. I guess you might have to adapt the dream() function.

Thanks a lot! Aida

#!/usr/bin/env Rscript

######## Install and load required packages ######## 
shhh <- suppressPackageStartupMessages
shhh(library(data.table))
shhh(library(lme4))
shhh(library(plyr))
shhh(library(dplyr))
shhh(library(tidyr))
shhh(library(tibble))
shhh(library(reshape2))
shhh(library(stringi))
shhh(library(stringr))
shhh(library(edgeR))
shhh(library(limma))
shhh(library(variancePartition))
shhh(library(BiocParallel))
shhh(library(caret))

######## Read testing input files ######## 
# Example: Gene expression (DGEList object) --> 500 genes x 981 samples
## Gene expression (DGEList object)
geneExpr.url <- 'https://molgenis26.gcc.rug.nl/downloads/marta-transcriptomics/limma_pseudobulk_error/geneExpr.rds'
geneExpr <- readRDS(url(geneExpr.url))

## Donor metadata (data.frame object)
aggregate_metadata.url <- 'https://molgenis26.gcc.rug.nl/downloads/marta-transcriptomics/limma_pseudobulk_error/aggregate_metadata.rds'
aggregate_metadata <- readRDS(url(aggregate_metadata.url))

######## Dream() ########
# Prepare data form dream(): testing Sex differences and correcting for Age (fixed) and date (random) factors
form_vars <- "~Sex+Age+(1|date)"
form <- as.formula(form_vars)
vobjDream = voomWithDreamWeights(geneExpr, form, aggregate_metadata)
comb <- c('Sex1', 'Sex2')
contrast <- 'Sex'
L = getContrast(vobjDream, form, aggregate_metadata, comb[[2]])

# Dream()
## 1. Regular: dream() on the whole vobjDream --> to reproduce the error (we don't know the gene/s)
fit = dream(vobjDream, form, aggregate_metadata, L)

## 2. Alternative: dream() by gene using try() --> to pick the gene/s that gives the error
dream_list <- list()
for(j in 1:nrow(vobjDream)){
  g <- rownames(vobjDream)[j]
  print(paste0(j, ': ', g))
  fit = try(dream(vobjDream[j,], form, aggregate_metadata, L))
  limma_res <- NA
  g_report <- TRUE
  if(class(fit)!='try-error'){
    fit = eBayes(fit)
    limma_res <- topTable(fit, coef=c('L1'), number=length(fit$F.p.value))
    limma_res <- limma_res %>%
      data.frame() %>%
      rownames_to_column(var="gene") %>%
      arrange(adj.P.Val)
  }else{
    fit <- NA
    g_report <- FALSE
  }
  names(g_report) <- g
  dream_list[[g]] <- list(fit = fit, 
                          top_table = limma_res,
                          g_report = g_report)
}

### 2.1 Get only the genes that do not give an error
g_passed <- unlist(unname(lapply(dream_list, function(x) x$g_report)))
g_out.names <- names(which(g_passed==FALSE))
dream_list.passed <- dream_list[g_passed]
g_out.tag <- paste(g_out.names, collapse = ', ')
print(paste0('Genes with an error in dream(): ', g_out.tag))

### 2.2 Get the toptable for all genes (data frame)
limma_res <- do.call("rbind", lapply(dream_list.passed, function(x) x$top_table))
limma_res$adj.P.Val <- p.adjust(limma_res$P.Value, method = "BH")
limma_res <- limma_res[order(limma_res$adj.P.Val),]

### 2.3 Get the fit objects for all genes (list)
fit <- lapply(dream_list.passed, function(x) x$fit)

######## Checks for potential errors ######## 
## Is Age coded as numeric? --> It is not the case
lapply(aggregate_metadata, class)
str(aggregate_metadata)

## If there are missing expression values --> It is not the case
lapply(geneExpr, function(x) table(is.na(x)))

## If you have a gene with no variance
### Variance --> It is not the case
var_by_gene <- apply(vobjDream[[2]], 1, var)
summary(var_by_gene)
min_var <- sort(var_by_gene)[1] #the gene that gives the error --> RPL11
max_var <- sort(var_by_gene,decreasing = T)[1]
hist(apply(vobjDream[[2]], 1, var))
plot(density(apply(vobjDream[[2]], 1, var)))

### Also check nearZeroVar (with caret::nearZeroVar) --> It is not the case
nearZeroVar.idx <- nearZeroVar(t(vobjDream[[2]]))
nearZeroVar.metrics <- nearZeroVar(t(vobjDream[[2]]), saveMetrics=TRUE)

## If you have a gene with no variance within a category. For example, if gene j is constant in females. --> --> It is not the case
check.func <- function(g_id, vobjDream_obj = vobjDream, metadata_obj = aggregate_metadata, contrast_var = contrast){
  print(g_id)
  gene_donor.expr <- data.frame(gene_expr = unname(vobjDream_obj[[2]][g_id,]),
                                donor_id = names(vobjDream_obj[[2]][g_id,]))
  metadata_obj$donor_id <- rownames(metadata_obj)
  metadata_obj.gene <- merge(metadata_obj, gene_donor.expr, by = 'donor_id')
  var_by_contrast <- lapply(split(metadata_obj.gene, metadata_obj.gene[[contrast_var]]), function(x) var(x$gene_expr))
  return(var_by_contrast)
}
check.res <- sapply(g_out.names, function(i) check.func(i), simplify = FALSE)
check.res
GabrielHoffman commented 1 year ago

Thanks for the example. I can reproduce the error now and I am looking in to a solution.

GabrielHoffman commented 1 year ago

There are two ways to deal with an error like this. 1) detect the failure internally and deal with it gracefully. That takes a lot of engineering, and is a medium term solution. 2) fix the input leading to the error. Since my colleagues and I use this function every day and haven't see this error, lets start with (2).

Here I recreate the error and find that the input is the cause of the problem. It's a very pathological case that seems to be extremely rare.

shhh <- suppressPackageStartupMessages
shhh(library(lme4))
shhh(library(plyr))
shhh(library(edgeR))
shhh(library(variancePartition))

######## Read testing input files ######## 
# Example: Gene expression (DGEList object) --> 500 genes x 981 samples
## Gene expression (DGEList object)
geneExpr.url <- 'https://molgenis26.gcc.rug.nl/downloads/marta-transcriptomics/limma_pseudobulk_error/geneExpr.rds'
geneExpr <- readRDS(url(geneExpr.url))

## Donor metadata (data.frame object)
aggregate_metadata.url <- 'https://molgenis26.gcc.rug.nl/downloads/marta-transcriptomics/limma_pseudobulk_error/aggregate_metadata.rds'
aggregate_metadata <- readRDS(url(aggregate_metadata.url))

# make sure your meta-data matches the expression data
# This changes the result, but the error still comes ups
idx = match(colnames(geneExpr), rownames(aggregate_metadata) )
aggregate_metadata = aggregate_metadata[idx,]

######## Dream() ########
form <- ~ Sex + Age + (1|date)
vobjDream = voomWithDreamWeights(geneExpr, form, aggregate_metadata, plot=TRUE)

# fit dream() one gene 336 to get error
fit = dream(vobjDream[336,], form, aggregate_metadata)
# Dividing work into 1 chunks...
# Error: BiocParallel errors
#   1 remote errors, element index: 1
#   0 unevaluated and other errors
#   first remote error:
# Error in devfun(rho$pp$theta): Downdated VtV is not positive definite

But why does this fail? Lets examine the lmer evaluation directly

y = vobjDream$E[336,]
w = vobjDream$weights[336,]
info = aggregate_metadata

fit = lmer( y ~ (1 | date), info, weights=w)
# Error in devfun(rho$pp$theta) : Downdated VtV is not positive definite

So we can reproduce the error at a lower level, that's a start.

Lets examine the weights

 range(w)
# [1] 4.800174e+02 5.958613e+16

That is a HUGE weight. That, combined with the fact that there is no error when I exclude the weights, means that the weights are the issue.

In your case, a different gene had a huge weight (since I sorted the metadata rows). The question is why. Lets look at the mean-variance trend from voom. In the right of the plot, the trend line is a very poor fit to the data.

Screen Shot 2022-10-05 at 9 39 37 AM

More importantly it dives towards zero. Since the voom weights are based on the reciprocal of that line, you get huge weights like 5e16 that cause some numerical instability in the linear algebra in the model fit. If you reduce the span parameter in voomWithDreamWeights() you will get a better fit to the data, and avoid this issue.

So it's important to look at the mean-variance trend from voom to find pathological cases.

Of course, this is test data so the error on your real dataset might be different. Does this fix your issue?

I the short term, I think you can avoid the issue by checking the voom plot and the weights. In the longer term, I will add a check of the weights to detect pathological cases and build in better failure detection.

Gabriel

GabrielHoffman commented 1 year ago

Hi Aida, I'm just following up on this

Gabriel

aidarripoll commented 1 year ago

Hi again,

As we had very few cases showing this type of error, I used the workaround you suggested to catch and skip these cases by passing to dream() the voomWithDreamWeights() output for each gene separately. However, in another smaller dataset (45 samples), if we look at some specific genes (not showing any error, then we can have the output from both approaches) I found some notable differences in some output values (t, P.Value, adj.P.Val, B, and z.std) when running dream() using the voomWithDreamWeights() output for all tested genes, or running it on a gene-by-gene basis. Surprisingly, it does not happen if we use the previous bigger dataset (981 samples). I found the differences in the small dataset quite unexpected since dream() is supposedly fitting a model by gene, so, the gene-specific output values should be the same regardless of the of the dream() input (i.e., gene expression matrix of all genes or gene-specific expression matrix)...

Here you have the code to reproduce the two examples (small and big datasets): (*) notice that I am referring to 'regular' when running dream() using voomWithDreamWeights() output for all tested genes; and to 'alternative' when running dream() on a gene-by-gene basis.

You can see that in the small dataset, for instance, the 'RPS4Y1' gene is showing very different t, P.Value, adj.P.Val, B, and z.std when we compare the regular and the alternative approach. These types of differences are also observed in the other genes.

#!/usr/bin/env Rscript

######## Install and load required packages ########
shhh <- suppressPackageStartupMessages
shhh(library(data.table))
shhh(library(lme4))
shhh(library(plyr))
shhh(library(dplyr))
shhh(library(tidyr))
shhh(library(tibble))
shhh(library(reshape2))
shhh(library(stringi))
shhh(library(stringr))
shhh(library(edgeR))
shhh(library(limma))
shhh(library(variancePartition))
shhh(library(BiocParallel))
shhh(library(caret))

############################################################################
#################### SMALL DATASET (13 genes x 45 samples) #################
############################################################################

######## Read testing input files ########
# Example: Gene expression (DGEList object) --> 13 genes x 45 samples
## Gene expression (DGEList object)
geneExpr.url <- 'https://molgenis26.gcc.rug.nl/downloads/marta-transcriptomics/limma_pseudobulk_error/pilot3_subset.geneExpr.rds'
geneExpr <- readRDS(url(geneExpr.url))

## Donor metadata (data.frame object)
aggregate_metadata.url <- 'https://molgenis26.gcc.rug.nl/downloads/marta-transcriptomics/limma_pseudobulk_error/pilot3_subset.aggregate_metadata.rds'
aggregate_metadata <- readRDS(url(aggregate_metadata.url))

## Make sure your meta-data matches the expression data
idx = match(colnames(geneExpr), rownames(aggregate_metadata))
aggregate_metadata = aggregate_metadata[idx,]

######## Dream() ########
# Prepare data form dream(): testing Sex differences and correcting for Age (fixed) and date (random) factors
form_vars <- "~Sex+Age+(1|date)"
form <- as.formula(form_vars)
vobjDream = voomWithDreamWeights(geneExpr, form, aggregate_metadata)

# Dream()
comb <- c('Sex1', 'Sex2')
contrast <- 'Sex'
L = getContrast(vobjDream, form, aggregate_metadata, comb[[2]])

## 1. Regular: dream() on the whole vobjDream --> to reproduce the error (we don't know the gene/s)
fit = dream(vobjDream, form, aggregate_metadata, L)
fit.reg = eBayes(fit)
limma_res <- topTable(fit.reg, coef=c('L1'), number=length(fit.reg$F.p.value))
limma_res <- limma_res %>%
  data.frame() %>%
  rownames_to_column(var="gene") %>%
  arrange(adj.P.Val)
limma_res.reg <- limma_res[order(limma_res$P.Value),]

## 2. Alternative: dream() by gene using try() --> to pick the gene/s that gives the error
dream_list <- list()
for(j in 1:nrow(vobjDream)){
  g <- rownames(vobjDream)[j]
  print(paste0(j, ': ', g))
  fit = try(dream(vobjDream[j,], form, aggregate_metadata, L))
  limma_res <- NA
  g_report <- TRUE
  if(class(fit)!='try-error'){
    fit = eBayes(fit)
    limma_res <- topTable(fit, coef=c('L1'), number=length(fit$F.p.value))
    limma_res <- limma_res %>%
      data.frame() %>%
      rownames_to_column(var="gene") %>%
      arrange(adj.P.Val)
  }else{
    fit <- NA
    g_report <- FALSE
  }
  names(g_report) <- g
  dream_list[[g]] <- list(fit = fit,
                          top_table = limma_res,
                          g_report = g_report)
}

### 2.1 Get only the genes that do not give an error
g_passed <- unlist(unname(lapply(dream_list, function(x) x$g_report)))
g_out.names <- names(which(g_passed==FALSE))
dream_list.passed <- dream_list[g_passed]
g_out.tag <- paste(g_out.names, collapse = ', ')
print(paste0('Genes with an error in dream(): ', g_out.tag))

### 2.2 Get the toptable for all genes (data frame)
limma_res <- do.call("rbind", lapply(dream_list.passed, function(x) x$top_table))
limma_res$adj.P.Val <- p.adjust(limma_res$P.Value, method = "BH")
limma_res.alt <- limma_res[order(limma_res$adj.P.Val),]
rownames(limma_res.alt) <- NULL
limma_res.alt <- limma_res.alt[match(limma_res.reg$gene, limma_res.alt$gene),]

### 2.3 Get the fit objects for all genes (list)
fit.alt <- lapply(dream_list.passed, function(x) x$fit)

######## Check differences ########
# Overview
limma_res.list <- list(regular = limma_res.reg,
                       alternative = limma_res.alt)
print(limma_res.list)

# Modify P.Value and adj.P.Val
limma_res_mod.list <- lapply(limma_res.list, function(x){
  x$P.Value_log10 <- -log10(x$P.Value)
  x$adj.P.Val_log10 <- -log10(x$adj.P.Val)
  return(x)
})

# Check differences
vars_int <- colnames(limma_res_mod.list[[1]])[colnames(limma_res_mod.list[[1]])!='gene']
vars_int.out <- sapply(vars_int, function(i){
  print(i)
  i_list <- lapply(limma_res_mod.list, function(x) x[[i]])
  cor_r <- Reduce("cor", i_list)
  print(paste0('Pearson correlation (between regular and alternative): ', cor_r))
  j <- names(i_list)[1]
  summary_out <- sapply(names(i_list), function(j){
    print(paste0('Summary --> ', j, ':'))
    print(summary(i_list[[j]]))
  }, simplify = FALSE)
  cat('\n')
  return(NULL)
})

# # Plots
# plot(-log10(limma_res.reg$P.Value),-log10(limma_res.alt$P.Value))
# plot(limma_res.reg$B, limma_res.alt$B)

############################################################################
###################### BIG DATASET (4 genes x 981 samples) #################
############################################################################

######## Read testing input files ########
# Example: Gene expression (DGEList object) --> 13 genes x 45 samples
## Gene expression (DGEList object)
geneExpr.url <- 'https://molgenis26.gcc.rug.nl/downloads/marta-transcriptomics/limma_pseudobulk_error/large_subset.geneExpr.rds'
geneExpr <- readRDS(url(geneExpr.url))

## Donor metadata (data.frame object)
aggregate_metadata.url <- 'https://molgenis26.gcc.rug.nl/downloads/marta-transcriptomics/limma_pseudobulk_error/large_subset.aggregate_metadata.rds'
aggregate_metadata <- readRDS(url(aggregate_metadata.url))

## Make sure your meta-data matches the expression data
idx = match(colnames(geneExpr), rownames(aggregate_metadata))
aggregate_metadata = aggregate_metadata[idx,]

######## Dream() ########
# Prepare data form dream(): testing Sex differences and correcting for Age (fixed) and date (random) factors
form_vars <- "~Sex+Age+(1|date)"
form <- as.formula(form_vars)
vobjDream = voomWithDreamWeights(geneExpr, form, aggregate_metadata)

# Dream()
comb <- c('Sex1', 'Sex2')
contrast <- 'Sex'
L = getContrast(vobjDream, form, aggregate_metadata, comb[[2]])

## 1. Regular: dream() on the whole vobjDream --> to reproduce the error (we don't know the gene/s)
fit = dream(vobjDream, form, aggregate_metadata, L)
fit.reg = eBayes(fit)
limma_res <- topTable(fit.reg, coef=c('L1'), number=length(fit.reg$F.p.value))
limma_res <- limma_res %>%
  data.frame() %>%
  rownames_to_column(var="gene") %>%
  arrange(adj.P.Val)
limma_res.reg <- limma_res[order(limma_res$P.Value),]

## 2. Alternative: dream() by gene using try() --> to pick the gene/s that gives the error
dream_list <- list()
for(j in 1:nrow(vobjDream)){
  g <- rownames(vobjDream)[j]
  print(paste0(j, ': ', g))
  fit = try(dream(vobjDream[j,], form, aggregate_metadata, L))
  limma_res <- NA
  g_report <- TRUE
  if(class(fit)!='try-error'){
    fit = eBayes(fit)
    limma_res <- topTable(fit, coef=c('L1'), number=length(fit$F.p.value))
    limma_res <- limma_res %>%
      data.frame() %>%
      rownames_to_column(var="gene") %>%
      arrange(adj.P.Val)
  }else{
    fit <- NA
    g_report <- FALSE
  }
  names(g_report) <- g
  dream_list[[g]] <- list(fit = fit,
                          top_table = limma_res,
                          g_report = g_report)
}

### 2.1 Get only the genes that do not give an error
g_passed <- unlist(unname(lapply(dream_list, function(x) x$g_report)))
g_out.names <- names(which(g_passed==FALSE))
dream_list.passed <- dream_list[g_passed]
g_out.tag <- paste(g_out.names, collapse = ', ')
print(paste0('Genes with an error in dream(): ', g_out.tag))

### 2.2 Get the toptable for all genes (data frame)
limma_res <- do.call("rbind", lapply(dream_list.passed, function(x) x$top_table))
limma_res$adj.P.Val <- p.adjust(limma_res$P.Value, method = "BH")
limma_res.alt <- limma_res[order(limma_res$adj.P.Val),]
rownames(limma_res.alt) <- NULL
limma_res.alt <- limma_res.alt[match(limma_res.reg$gene, limma_res.alt$gene),]

### 2.3 Get the fit objects for all genes (list)
fit.alt <- lapply(dream_list.passed, function(x) x$fit)

######## Check differences ########
# Overview
limma_res.list <- list(regular = limma_res.reg,
                       alternative = limma_res.alt)
print(limma_res.list)

# Modify P.Value and adj.P.Val
limma_res_mod.list <- lapply(limma_res.list, function(x){
  x$P.Value_log10 <- -log10(x$P.Value)
  x$adj.P.Val_log10 <- -log10(x$adj.P.Val)
  return(x)
})

# Check differences
vars_int <- colnames(limma_res_mod.list[[1]])[colnames(limma_res_mod.list[[1]])!='gene']
vars_int.out <- sapply(vars_int, function(i){
  print(i)
  i_list <- lapply(limma_res_mod.list, function(x) x[[i]])
  cor_r <- Reduce("cor", i_list)
  print(paste0('Pearson correlation (between regular and alternative): ', cor_r))
  j <- names(i_list)[1]
  summary_out <- sapply(names(i_list), function(j){
    print(paste0('Summary --> ', j, ':'))
    print(summary(i_list[[j]]))
  }, simplify = FALSE)
  cat('\n')
  return(NULL)
})

# # Plots
# plot(-log10(limma_res.reg$P.Value),-log10(limma_res.alt$P.Value))
# plot(limma_res.reg$B, limma_res.alt$B)

Thank you a lot in advance!

Aida

GabrielHoffman commented 1 year ago

Hi Aida, Analysis with dream (and limma) have two steps:

1) fit a regression for each gene separately, 2) use and empirical Bayes method in eBayes() to borrow information across genes to produce 'moderated t-statistics' to improve performance in small samples.

dream and limma are both meant to be run on all genes at the same time. Earlier I suggested testing dream on one gene at a time for debugging purposes, but I didn't intend it for real analysis.

The empirical Bayes step only makes sense if you are borrowing information across many genes. Running it on the result of 1 gene is not useful. So the difference you are seeing is due to eBayes() correctly borrowing information across thousands of genes, versus just seeing results of 1 gene.

Also, this explains why there is a difference between the full and gene-wise analyses only in the small dataset. The empirical Bayes step improves performance for small dataset, but as sample size increases its effect becomes very minimal. So for large datasets, full and gene-wise analyses are very similar.

This is based on my understanding of your issue, but I didn't test it out in your code.

Best, Gabriel

aidarripoll commented 1 year ago

Thanks for your quick answer!

I was not considering that eBayes() was gathering info across all genes, so, it makes a lot of sense what you've explained.

Then, I guess one option could be running dream() in a gene-wise manner to detect the few errors, remove those genes from the whole expression matrix and then, run dream() using as input this new gene expression matrix. However, this is very time-consuming... Another option could be running dream() using as input the whole matrix, and only if it crashes, run the gene-wise approach to detect the errors, remove those genes, and run it again with the whole matrix except those genes.

Another option you suggested was to reduce the span parameter from voomWithDreamWeights(). This modification will affect all the genes, but it would only be needed in a few genes. Then, I am not sure whether this would be a good option, also considering that probably it won't be necessary in most of the cases (depending on the gene, the dataset, etc...). Also, I don't know if there would be a way to know if we are over-fitting the model for some genes by changing this parameter...

With all this in mind, do you have any suggestions?

I assume that, ideally, there should be a way to detect this type of errors internally when running dream() with the whole matrix, and then, decide what to do in those cases, for example: 1) re-run voomWithDreamWeights() giving a specific span to each gene, or 2) remove those genes from the whole matrix and keep running dream().

Thanks a lot! Aida

GabrielHoffman commented 1 year ago

When I run the current version of dream() from variancePartition v1.28.4 on your code from Oct 5, 2022 I get this warning:

  The maximum precision weight is 4.30033e+17, suggesting a poor smoothing fit
on the mean-variance plot for large expression values. Such large weights can
have unexpected effects downstream.  Consider examining the mean-variance plot
and reducing the span parameter.

This gives a problematic voom plot we described above.

This means that the voom trend produced precision weights that are nonsense (i.e. 4.30033e+17). You could just detect and remove genes with super high weights.

Running dream for each gene separately is never the answer unless you are debugging something.

As far as changing the span parameter, its default value (inherited from voom()) is really arbitrary. It just happens to work for almost all datasets. In your data, changing span only affects the right tail where there are very few points. It is very robust.

vobjDream = voomWithDreamWeights(geneExpr, form, aggregate_metadata, plot=TRUE, span=.1)

So I have no issue with changing it slightly.

Are you doing this on bulk RNA-seq, or is this single cell RNA-seq converted to pseudobulk? Or something else?

Gabriel

aidarripoll commented 1 year ago

Nice, the variancePartition I was using didn't give this warning, so, this is much more informative :)

Then, the options would be: 1) If we want to remove the cases with high weights, we can detect these few cases by picking the outliers (high weights) from the weights distribution (voomWithDreamWeights() output), remove them from the gene expression matrix, and run dream(). 2) If we want to keep all the genes, we can reduce the span parameter to make sure we won't have this issue anymore. This change won't have an important impact on other genes as it is only affecting the right tail (where there are very few points).

Is it right?

About the type of data, I'm using pseudobulk data. Specifically, I use the sum to aggregate the scRNA-seq data, and then filtering by gene expression + normalising as it was bulkRNA-seq. For filtering, I use isexpr = rowSums(edgeR::cpm(aggregate_countMatrix)>0.1) >= 5 , and for normalising, edgeR::calcNormFactors()). However, I'm now testing if we have the same problem when using the mean to aggregate the previously normalized scRNA-seq data.

Thank you very much, Aida

GabrielHoffman commented 1 year ago

Yes, those are the two options. Either choice is fine.

Concerning pseudobulk from scRNA-seq, this is just one of the issues that comes up. I have a solution that solves these problems and others, and also extends dream. I’ll have public software and a biorxiv preprint in 2 weeks.

Gabriel

aidarripoll commented 1 year ago

Wow, I can't wait to read it!

Btw, when using the pseudobulk mean data for the same subset of genes in the big dataset (from previously normalized scRNA-seq data), instead of the pseudobulk sum data (+bulk normalization), we don't find the issue anymore. Actually, you can see how different is the distribution of the maximum gene weights computed by voomWithDreamWeights() when using sum- or mean-pseudobulk:

# pseudobulk sum
vobjDream_sum.max <- apply(vobjDream_sum, 1, max)
summary(vobjDream_sum.max )
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 2.743   2.743   3.875   4.257   5.094  13.494
head(sort(vobjDream_sum.max, decreasing=T))
# RPL11     RPL22      CD52  SH3BGRL3    LAPTM5      IFI6
# 13.493915 11.961673 11.465821 10.389491  9.699667  9.016967

# pseudobulk mean
vobjDream_mean.max <- apply(vobjDream_mean, 1, max)
summary(vobjDream_mean.max)
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 14.30   14.30   14.32   14.44   14.44   17.42
head(sort(vobjDream_mean.max, decreasing=T))
# RPL11    RPL22     CD52 SH3BGRL3   LAPTM5    CDC42
# 17.41889 16.97915 16.79101 16.24833 15.92449 15.69318

Thanks again,

Aida