glmGamPoi scales cubically with the number of design matrix columns

frederikziebell commented 1 year ago

Hi Constantin,

when the number of design matrix columns increases (and thus the number of samples for a fixed number of replicates), glmGamPoi scales cubically. Consider the following code where 3 genes are fitted with 3 replicates per condition and a varying number of conditions:

library("glmGamPoi")
library("DESeq2")
library("tidyverse")

# adapted from DESeq2
make_example_dds <- function(n_genes, n_replicates, n_conditions){

  dispMeanRel <-  function(x) 4/x + 0.1

  beta <- matrix(rnorm(n_genes*n_conditions, mean = 4, sd = 2), ncol = n_conditions)
  dispersion <- dispMeanRel(2^(beta[, 1]))
  colData <- DataFrame(condition = factor(rep(paste0("cond", 1:n_conditions), n_replicates)))
  x <- model.matrix.default(~colData$condition)
  mu <- t(2^(x %*% t(beta)))

  countData <- matrix(rnbinom(mu, mu = mu, size = 1/dispersion), ncol = ncol(mu))
  mode(countData) <- "integer"

  design <- as.formula("~ condition", env = .GlobalEnv)
  object <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = design)
  object
}

calc_runtime <- function(n_conditions){
  set.seed(1)
  n_replicates <- 3
  dds <- make_example_dds(n_genes = 3, n_replicates = n_replicates, n_conditions = n_conditions)
  time <- system.time(
    fit <- glm_gp(assay(dds), design = design(dds), col_data = colData(dds), size_factors = rep(1, n_replicates*n_conditions))
  )
  as.double(time["elapsed"])
}

n_conditions <- c(10,25,seq(50,600,50))
res_glm_gp <- tibble(
  n_conditions = n_conditions,
  runtime = map_dbl(n_conditions, calc_runtime)
)

ggplot(res_glm_gp, aes(n_conditions, runtime^(1/3))) +
    geom_point() +
    geom_smooth(method="lm", se=F)

Could this be related to the QR decomposition of the design matrix and ultimately not be improved?

const-ae commented 1 year ago

Hey, thanks for documenting the aspect. I am not terribly surprised though, that glmGamPoi has a cubic runtime complexity with the number of covariates because the same is true for regular linear regression (https://math.stackexchange.com/a/84503/492945), and I am internally using an iterative reweighted least squares algorithm.

I would be curious if there is a clever trick to speed this up. I could imagine that there could be interesting pointers in the GWAS literature, because they have to fit gigantic linear models all the time :)

frederikziebell commented 1 year ago

From a practical standpoint, if there are many conditions, maybe one could subdivide the dataset (perhaps in an overlapping way that the control conditions are in every subset) and hope for very similar dispersion estimates? In this way, the linear runtime of estimating on all subsets would outperform the cubic runtime on the whole dataset.

const-ae commented 1 year ago

Sorry, I'm a bit confused and that's probably simply because I worked too long today 🙈 I estimate a single overdispersion for each gene so that is independent of the number of covariates. The cubic scaling comes from the coefficient estimates and I don't understand how subdividing helps there. Best, Constantin

frederikziebell commented 1 year ago

Here is an example, using the make_example_dds function from above. A dataset with 201 conditions is fitted, one time with all conditions simultaneously, and one time in 10 chunks with samples of condition 1 (to always have the same intercept), and 20 additional conditions. By averaging the overdispersion estimates over the 10 chunks, one arrives at a good approximation of the overdispersion estimate of the full model:

set.seed(1)
n_genes <- 10
n_replicates <- 3
n_conditions <- 201
dds <- make_example_dds(n_genes, n_replicates, n_conditions)

fit <- glm_gp(assay(dds), design = design(dds), col_data = colData(dds), size_factors = rep(1, ncol(dds)))

beta_full <- fit$Beta
overdispersions_full <- fit$overdispersions

# split samples in conditions 2 to 201 in 10 chunks of size 20
# and estimate parameters for each chunk separately,
# i.e. chunk 1: samples of conditions 1, 2, ... 20
#      chunk 2: samples of conditions 1, 21, ... 30
chunk_size <- 20

fit_chunk <- 2:201 %>% 
  split(ceiling(seq_along(.) / chunk_size)) %>% 
  map(function(conditions){
    dds_chunk <- dds[,dds$condition %in% c("cond1",str_c("cond",conditions))]
    fit_chunk <- glmGamPoi::glm_gp(assay(dds_chunk), design = design(dds_chunk), col_data = colData(dds_chunk), size_factors = rep(1, ncol(dds_chunk)))    
  })

# matrix of Beta estimates
beta_chunk <- cbind(fit_chunk[[1]]$Beta[,1,drop=F],do.call("cbind",map(fit_chunk,~.x$Beta[,-1])))

# overdispersions by averaging estimates for each gene
# over all chunks
overdispersions_chunk <- map(fit_chunk,~.x$overdispersions) %>% 
  do.call("cbind",.) %>% 
  rowMeans()

# overdispersions of full vs chunked model
tibble(
  full = overdispersions_full,
  chunk = overdispersions_chunk
) %>% 
  ggplot(aes(full, sub)) +
    geom_point() +
    geom_abline() +
    labs(title = "overdispersion")

# beta estimates
reshape2::melt(beta_full, value.name="beta_full") %>% 
  left_join(reshape2::melt(beta_chunk, value.name="beta_chunk"), by=c("Var1","Var2")) %>% 
  filter(beta_full > -1e-7) %>% 
  ggplot(aes(beta_full, beta_chunk)) +
    geom_point() +
    geom_abline() +
    labs(title = "Beta")

The full model takes 10sec on my machine, the chunked version 0.4sec.

I followed up on your hint and looked briefly into GWAS literature. One strategy is to use variable selection to potentially not estimate all covariates (see the time and memory requirements section here, and the referenced paper therein).

const-ae commented 1 year ago

Hey Frederik,

I am very sorry for the late reply, there was too much else going on. You can make the regular function fast by setting do_cox_reid_adjustment = FALSE. The following takes 0.135 seconds on my computer (i.e., a 75x speed-up):

system.time(
  fit <- glm_gp(assay(dds), design = design(dds), col_data = colData(dds), size_factors = rep(1, ncol(dds)), 
                do_cox_reid_adjustment = FALSE)
)

The problem was that I assumed the issue was in estimating the beta coefficients, where I cannot beat the cubic scaling. I did not realize that your design matrix had a convenient form where we know that the coefficients are independent and can thus make it scale linear with the number of coefficients. In fact, the estimating of the beta coefficients was already fast but the overdispersion estimate (which I thought linear) scaled cubically because calculating the Cox-Reid (McCarthy 2012) adjustment is implemented as a simple matrix multiplication ($X^T diag(w) X$).

I could implement a faster version of the Cox-Reid adjustment that uses the known sparsity of the design matrix. However, that would complicate the methods significantly, so I am hesitant. There is a difference between the overdispersion estimates, but it is not too big:

const-ae / glmGamPoi

glmGamPoi scales cubically with the number of design matrix columns #41