makeTechTrend() identified many housekeeping/ribosome genes as variable genes

ahy1221 commented 6 years ago

Hi, Following the new updated simpleSingleCell workflow , I use makeTechTrend() to fit mean-variance trend for my 10X data. However, after decomposeVar() from the new fit, many housekeeping genes, which always show highest expression, are detected as variable genes with very large biological component. I wonder is there any way to avoid this ? Or should I trust this result ? The housekeeping gene list I used are sourced from http://science.sciencemag.org/content/352/6282/189

hk.genes <- c("ACTB
B2M
HNRPLL
HPRT
PSMB2
PSMB4
PPIA
PRPS1
PRPS1L1
PRPS1L3
PRPS2
PRPSAP1
PRPSAP2
RPL10
RPL10A
RPL10L
RPL11
RPL12
RPL13
RPL14
RPL15
RPL17
RPL18
RPL19
RPL21
RPL22
RPL22L1
RPL23
RPL24
RPL26
RPL27
RPL28
RPL29
RPL3
RPL30
RPL32
RPL34
RPL35
RPL36
RPL37
RPL38
RPL39
RPL39L
RPL3L
RPL4
RPL41
RPL5
RPL6
RPL7
RPL7A
RPL7L1
RPL8
RPL9
RPLP0
RPLP1
RPLP2
RPS10
RPS11
RPS12
RPS13
RPS14
RPS15
RPS15A
RPS16
RPS17
RPS18
RPS19
RPS20
RPS21
RPS24
RPS25
RPS26
RPS27
RPS27A
RPS27L
RPS28
RPS29
RPS3
RPS3A
RPS4X
RPS5
RPS6
RPS6KA1
RPS6KA2
RPS6KA3
RPS6KA4
RPS6KA5
RPS6KA6
RPS6KB1
RPS6KB2
RPS6KC1
RPS6KL1
RPS7
RPS8
RPS9
RPSA
TRPS1
UBB
")

hk.genes <- strsplit(hk.genes, "\n")[[1]]
keep <- rowSums(scater::calcIsExprs(sce, detection_limit = 0, exprs_values = "counts")) >= 30
sce <- sce[keep,]
new.trend <- makeTechTrend(x=sce)
fit <- trendVar(sce.sub, loess.args = list(span = .05), design = NULL, min.mean = 0.05, use.spikes = F)
fit$trend <- new.trend

dec <- decomposeVar(fit=fit)
 plot(dec$mean, dec$total, pch=16, cex=0.6, xlab="Mean log-expression",ylab="Variance of log-expression")
  points(dec[hk.genes, "mean"], dec[hk.genes, "total"], col="red", pch=16)
  curve(fit$trend(x), col="dodgerblue", add=TRUE, lwd=2)
top.dec <- dec[order(dec$bio, decreasing = T),]
as.data.frame(top.dec[rownames(top.dec) %in% hk.genes, ])

default default

LTLA commented 6 years ago

There is no satisfactory way to accurately decompose biological and technical components of variation with 10X Genomics data. There's no spike-ins and, when dealing with UMI data, it is simply not correct to assume that the trend fitted to the variances of the endogenous genes is a good estimate of the technical noise. (Compare, for example, the trend fitted to the spike-ins in the Zeisel dataset in the third episode of the workflow, which is consistently below the variances of the endogenous genes.) This is not an issue unique to scran; anyone using the gene-fitted trend as an estimate of the technical noise is deluding themselves.

So, in light of this, what can we do? There are two options. The first is to make some assumptions that allow us to get some reasonably appropriate measure of the biological component - this is the approach of makeTechTrend, which assumes that technical noise is Poisson-distributed. I have used this approach in the workflow as an accurate estimate of the biological component is necessary for denoisePCA.

The second option is to forget about getting an accurate estimate of the biological component, and to follow the advice here. This explains how to interpret the results when spike-ins are not available and the trend must be fitted to the endogenous genes. This will probably fit the line through the bulk of high-abundance "house-keeping" genes, which should reduce their detection. (Though at least some of them will still be called as being significantly highly variable, as the p-values are not well-calibrated - I won't go into this.)

There is also the fundamental issue in that highly expressed genes provide the greatest power to distinguish between biological and technical noise. Put bluntly; house-keeping genes will exhibit non-zero biological variation. Do you really believe that the numbers of transcripts are constant in each cell? I don't. For me, a positive biological component is the correct result. And who knows, some cell types may be actively upregulating some of these genes (e.g., GAPDH in cancer). Just because they're "house-keeping" doesn't mean that they have to be constant in every cell, only that they are expressed at a high level.

To conclude: I wouldn't worry about picking up house-keeping genes. The purpose of the trend fitting step is not to call HVGs, as these are not of direct interest anyway. (Think about it - when was the last time a collaborator was satisfied with a list of HVGs?) The real aim is to get the fitted trend, which is then used for feature selection in denoisePCA prior to clustering, dimensionality reduction, etc.

ahy1221 commented 6 years ago

Thanks for your reply. In my experience , the gene selection would be quite impactful for constructing PCA subspace and at the same thime the graph-based clustering is quite sensitivity to the PCA space we used ( I wonder is that true from the computational perspective ?) . In fact, it is quite easy to get some clusters which marker genes are mostly ribosome genes using 10X genomic data (at least for human T cells) . If I understand right, what you think is that do not try not perform much gene selection but to use genes as much as possible to construct a PCA space. Because you think that we can denoise unnecessary gene signatures through selecting PCs . Is that right ? If so ,is there any way to look at which PC is associated with unnecessary gene signatures such as "house-keeping" and decide to discard them .

LTLA commented 6 years ago

I also have seen ribosomal protein genes being markers for distinct clusters (using the human 4K T cell data set from 10X Genomics, see here). However, some closer inspection indicates that this may actually reflect some real biology. Ribosomal protein gene expression seems to be downregulated by ~2-fold in certain T cell subsets that are expressing granzymes and related proteins. Certainly, we see a strong upregulation in biosynthetic pathways (including ribosomal proteins) upon mouse T cell activation. So I would be cautious about deciding a priori that ribosomal proteins are not helpful for characterising heterogeneity.

Now, onto your question. Yes, we aim to (initially) use as many genes as possible, excluding only those genes that are "obviously" uninteresting, i.e., variances below the technical noise. When I first wrote the workflow, I did try to use some more sophisticated algorithms to take only the significant HVGs. I then realized that this was a bit silly, because - due to a quirk of how the null hypothesis needs to be constructed - the genes with the lowest p-values are not always the genes with the largest biological components! So if I was to take the top X genes, how should I order them - by biological component, or by p-value? And what should the value of X be? What happens if I lose an entire class of genes because X is too small? These worries can now all be ignored, because we just take all genes above the trend.

As for the PCA step; it is not intended to discard any "unnecessary" gene signatures, only random noise. These are two different things, as gene signatures are systematic and will not be random. (That said, if the ribosomal proteins are randomly variable, then this would count as noise and get removed in later PCs.) I think that removing certain PCs (other than the ones at the end) is a troublesome procedure. The PCs are not computed independently of each other, and it's rare for one PC to capture all of and only one signature. If you want to get rid of the ribosomal protein gene signature, then just don't use the corresponding genes as input. It is for this reason I have the subset.row arguments in most of the scran functions.

In general, I've come to feel that exploratory data analyses should be performed with as little pre-processing as possible, which includes feature selection. More stringent selection of genes should only be performed when necessary, rather than doing it beforehand without knowing what your data looks like. So while you're welcome to remove ribosomal proteins prior to further analysis, don't complain when you miss out on some potentially interesting biology, e.g., relating to changes in biosynthetic capacity of different cell types.

MarioniLab / scran

makeTechTrend() identified many housekeeping/ribosome genes as variable genes #13