Closed jennapappalardo closed 3 years ago
Nothing too complicated.
The theoretical reason is that the addition of counts is a more accurate representation of an actual bulk experiment, where you're just pooling the transcripts from all cell types regardless of their total RNA content. For example, if we had a bulk sample with equal numbers of cells from two types, but one type of cell had higher RNA content, the bulk sample would have a greater contribution of RNA from that larger cell type. Whether this is desirable or not is up for debate, but the pseudo-bulk process just replicates this effect by adding the counts directly rather than performing any normalization for library size on a per-cell level.
The practical reasons are that (i) downstream tools like edgeR and friends work best with counts, rather than a sum of normalized values that probably won't recapitulate the mean-variance relationship properly; and (ii) it's just easier to sum counts and not have to worry about how to normalize them beforehand.
That makes sense, thanks so much for the quick answer!
For an extra bonus:
It's not always clear that normalization will actually reduce the variability of the sum. For example:
set.seed(13579)
# 100 cells with a range of size factors
ncells <- 100
sf <- 2^rnorm(ncells, sd=2)
sf <- sf/mean(sf)
# Simulating 1000 pseudo-bulk samples, summed with and without scaling.
iterations <- 1000
collected.raw <- collected.norm <- numeric(iterations)
for (i in seq_len(iterations)) {
y <- rnbinom(ncells, mu=sf, size=10)
collected.raw[[i]] <- sum(y)
collected.norm[[i]] <- sum(y/sf)
}
# Computing the dispersion, according to edgeR
# (assuming that no normalization across samples is necessary).
library(edgeR)
design <- matrix(1, iterations)
offset <- numeric(iterations)
estimateDisp(rbind(collected.raw), design=design, offset=offset)$common
## [1] 0.003879769
estimateDisp(rbind(collected.norm), design=design, offset=offset)$common
## [1] 0.04159851
This is because, when counts for cells with small sf
get scaled up, the normalized values inflate the variance of the sum - remember, scaling up a count will increase the variance by the square of the scaling factor.
That's a really great point! Thanks so much!
Hello! I'm interested in doing pseudobulk differential expression analysis to get sample-level rather than cell-level statistics for single-cell data. I was wondering what the rationale is for adding counts for cells and then normalizing the data rather than normalizing data at the single-cell level and then combining. I initially assumed pseudobulk would be done on data normalizing on a per-cell level to reduce cellular variability before combining.
Thanks so much! Jenna