MarioniLab / scran

Clone of the Bioconductor repository for the scran package.
https://bioconductor.org/packages/devel/bioc/html/scran.html

Pseudobulk DGE #73

Closed. jennapappalardo closed this issue 3 years ago.

jennapappalardo commented 3 years ago

Hello! I'm interested in doing pseudobulk differential expression analysis to get sample-level rather than cell-level statistics for single-cell data. I was wondering what the rationale is for summing the counts across cells and then normalizing, rather than normalizing at the single-cell level first and then combining. I had initially assumed pseudobulk would be computed from per-cell-normalized data, to reduce cellular variability before combining.

Thanks so much! Jenna

LTLA commented 3 years ago

Nothing too complicated.

The theoretical reason is that adding counts is a more accurate representation of an actual bulk experiment, where you're just pooling the transcripts from all cells regardless of their total RNA content. For example, if a bulk sample contained equal numbers of cells from two types, but one type had higher RNA content per cell, that type would contribute more RNA to the bulk library. Whether this is desirable or not is up for debate, but the pseudo-bulk process simply replicates this effect by adding the counts directly rather than performing any per-cell library size normalization first.
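As a rough numerical illustration of that composition effect (not from the original reply; the numbers are made up), consider two cell types contributing different amounts of RNA per cell:

# Two cell types, equal cell numbers, but type B has 4x the RNA per cell.
total.rna.A <- rep(10, 50)
total.rna.B <- rep(40, 50)

# Summing raw counts mimics a bulk library: type B contributes 80% of the reads.
sum(total.rna.B) / sum(total.rna.A, total.rna.B)

# Scaling every cell to the same total before summing would force each type
# down to a 50% contribution, unlike a real bulk sample.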

The practical reasons are that (i) downstream tools like edgeR and friends work best with counts, rather than a sum of normalized values that probably won't recapitulate the mean-variance relationship properly; and (ii) it's just easier to sum counts and not have to worry about how to normalize them beforehand.
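For (i), here is a minimal sketch of what that downstream workflow might look like, assuming hypothetical objects counts (a gene-by-cell matrix of raw counts), sample.id (the sample of origin for each cell) and condition (the treatment for each pseudo-bulk sample, ordered to match the summed columns). The sums go straight into a standard edgeR analysis, which handles normalization at the pseudo-bulk level:

library(edgeR)

# Sum raw counts across all cells from the same sample.
# 'counts' and 'sample.id' are hypothetical inputs, see above.
pseudobulk <- t(rowsum(t(counts), group=sample.id))

# Standard edgeR analysis on the summed counts; normalization (library
# size, TMM) happens here, at the pseudo-bulk level.
y <- DGEList(pseudobulk)
keep <- filterByExpr(y, group=condition)
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)

design <- model.matrix(~condition)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef=2)
topTags(res)

In practice the summation step is usually done with a helper such as scuttle::aggregateAcrossCells(), but the base-R version above shows the idea.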

jennapappalardo commented 3 years ago

That makes sense, thanks so much for the quick answer!

LTLA commented 3 years ago

For an extra bonus:

It's not always clear that normalization will actually reduce the variability of the sum. For example:

set.seed(13579)

# 100 cells with a range of size factors
ncells <- 100
sf <- 2^rnorm(ncells, sd=2)
sf <- sf/mean(sf)

# Simulating 1000 pseudo-bulk samples, summed with and without scaling. 
iterations <- 1000
collected.raw <- collected.norm <- numeric(iterations)
for (i in seq_len(iterations)) {
    y <- rnbinom(ncells, mu=sf, size=10)
    collected.raw[[i]] <- sum(y)
    collected.norm[[i]] <- sum(y/sf)
}

# Computing the dispersion, according to edgeR
# (assuming that no normalization across samples is necessary).
library(edgeR)
design <- matrix(1, iterations)
offset <- numeric(iterations) 
estimateDisp(rbind(collected.raw), design=design, offset=offset)$common
## [1] 0.003879769
estimateDisp(rbind(collected.norm), design=design, offset=offset)$common
## [1] 0.04159851

This is because, when counts from cells with small sf get scaled up, the normalized values inflate the variance of the sum; remember, scaling a count by some factor increases its variance by the square of that factor.
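To see that square-of-the-scaling-factor effect in isolation, here is a small sketch (not part of the original reply) using a single cell with a low size factor and the same negative binomial setup as above:

# One cell with a small size factor, simulated many times.
set.seed(24680)
small.sf <- 0.1
y <- rnbinom(1e5, mu=small.sf, size=10)

var(y)           # roughly mu + mu^2/size = 0.101 for the raw counts
var(y/small.sf)  # roughly 100-fold larger, since Var(y/sf) = Var(y)/sf^2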

jennapappalardo commented 3 years ago

That's a really great point! Thanks so much!