Questions about your paper

Hello!

First I just wanted to say your paper is probably one of my favourite reads so far on scRNA-seq normalization and feature selection, everything was clear, easy to follow, and your use of examples was amazing, very well-written. I just have some questions that I was hoping you could answer:

- In the section Appendix: Limitation of residuals without variance stabilization, you talk about the issues of using analytical Pearson residuals with the following example:

Suppose we have a data set consisting of 10000 cells with a gene (gene A) that is only expressed in 100 cells with identical counts of 100 in each of those cells; the rest of the cells have 0 counts...

The overall variance of the residuals for gene A is approximately 98.03, thus exhibiting significant deviation from the null expectation of 1.

I am wondering as to why this shows an issue with analytical Pearson residuals as isn't this wanted, the gene is a robust marker for the subpopulation of 100 cells and the gene is significantly variable enough to be selected for. Could you elaborate on why this is an issue?

- In the section Piccolo normalization reduces the impact of sampling depth differences between cells while simultaneously ensuring variance stabilization, you look at the variance of residuals comparing your method to others and state:

It is immediately apparent that the residuals obtained from Piccolo exhibit much better variance stabilization compared to the other two residuals-based approaches.

As I understand it the objective of variance stabilization is to eliminate the variance-mean relationship so that variance among low-mean genes are comparable to high-mean genes(because variance strongly depends on gene mean with unnormalized data). Meaning genes with significantly different means should have equal opportunity to reach a certain variance. However looking Fig 3. and Fig. S9, it's evident that higher-mean genes consistently have larger residual variances compared to lower-mean genes with Piccolo. However, Pearson residuals from SCTransform(V2) and less so Analytic Pearson residuals do not appear to have a strong relationship between gene mean and residual variance. Could you elaborate on why Piccolo's residuals(z-scores) exhibit this relationship— why this would be wanted —as I would then assume that higher-mean genes contribute more to cell-cell differences using residuals.

Thank you, this was a fun read!

Hello, thanks a lot for your kind words about the preprint and for bringing up these excellent points! Deeply appreciate such sincere and valuable feedback.

This may sound surprising at first but both the points you raised are intimately connected to how principal components analysis (PCA) works. Recall, that we use PCA to reduce the dimensionality of our data sets after performing normalization. With PCA we seek to explain the variance-covariance structure of a given set of variables through linear combinations of these variables. These linear combinations are referred to as the principal components (PCs). The weights (or loadings) for each variable in the PCs depend on the variance of the variables - variables with larger variances have larger loading values. This can lead to a strong bias in favor of variables with larger variances. For example, if we have gene A that exhibits counts in the range of 1000 - 50000 and another gene B that exhibits counts in the range of 1 - 100, then the contribution to total variation as assessed through PCA will be mostly due to gene A. This will be reflected in the form of a top PC with a large loading value for gene A. As you very accurately pointed out, the objective of variance stabilization is to stabilize the variance-mean relationship so that the variances of genes with low means and variances of genes with high means are comparable. Put in the context of PCA, through variance stabilization we ensure that genes with different mean expression levels get to contribute similarly in the PCs. Given this background, I'll attempt to respond to your queries in turn now.

As highlighted with that example you mentioned, with raw counts based residuals (for instance the residuals obtained with the analyticPearson method or SCTransform) marker genes have large residual variances. While that is indeed desirable from a feature selection perspective, when viewed in the context of PCA this is a drawback since these genes will end up with disproportionately large loadings. In contrast, in Piccolo's normalization we include a variance stabilizing transformation which ensures that the variances of the resultant residuals even for marker genes are not significantly larger. Moreover, the monotonic nature of the variance stabilizing transformation ensures that genes with larger variances of their raw counts also have larger variances of the residuals. Thus, the information about higher variability exhibited by marker genes is preserved while computing the residuals with Piccolo normalization.
Indeed, that is a very good observation. With Piccolo, we do observe that genes with higher mean expression levels consistently have larger residual variances compared to genes with lower mean expression levels. This is an issue that arises when applying log transformation on small counts (Piccolo normalization includes a log transformation as the default variance stabilization transformation). Notice, that this pattern of behavior of the residual variances closely mirrors what we observe with the simple log normalization (see bottom left panel in Fig 3B that you shared above). This is definitely not desirable. However, in the context of PCA the residuals obtained with Piccolo are quite favorable since despite genes with larger means generally exhibiting larger residual variances, these variances do not significantly exceed the residual variances of genes with lower means (notice in Fig 3B top left panel that the maximum residual variance obtained with Piccolo does not even exceed 2). In contrast, as pointed out above, often the residuals obtained for some genes (especially marker genes) using the analytic Pearson method or SCTransform exceed the null expectation of 1 by several orders of magnitude leading to much larger loadings for these genes in the PCA. This tends to have an adverse impact on the evaluation of cell-cell similarities based on the PCs.

I hope you find this helpful. If you find that some things are not very clear, I'll be happy to respond and elaborate further.

Once again, thank you for your kind words and for reaching out with such good observations.

Thank you so much for the response, this cleared some things up for me I’m just curious now about a new few things

As highlighted with that example you mentioned, with raw counts based residuals (for instance the residuals obtained with the analyticPearson method or SCTransform) marker genes have large residual variances. While that is indeed desirable from a feature selection perspective, when viewed in the context of PCA this is a drawback since these genes will end up with disproportionately large loadings.

This makes sense however as I understand it prior to PCA for raw counts-based residuals(SCTransform and analyticPearson), residuals for selected variable genes are centred and scaled to unit variance to prevent disproportionately large loadings for genes with larger variances. I do agree that the small range in variance of the residuals based on log-transformed counts may limit an unexpected relationship between gene mean and the feature loadings of the top few PCs, however it may be useful to centre and scale the residuals to unit variance if this does become an issue with a particular dataset if the residual variances span too large of a range, especially for the other transformations you include in Piccolo(although I have not checked to see what order of magnitude their residual variances span).

...genes with higher mean expression levels consistently have larger residual variances compared to genes with lower mean expression levels. This is an issue that arises when applying log transformation on small counts.

If this is an issue then after the counts are corrected for their size-factors would it be possible to get residuals without applying a log-transformation, similar to how SCTransform(V2) and analyticPearson compute residuals with:

r_{g,c} = \frac{ X_{g,c} - p_{g}n_{c} }{ \sqrt{p_{g}n_{c} + \alpha_{NB\ g}(p_{g}n_{c})^2} } = \frac{ X_{g,c} - \hat{\mu}_{g,c} }{ \sqrt{\hat{\mu}_{g,c} + \alpha_{NB\ g}(\hat{\mu}_{g,c})^2}} = \frac{X_{g,c} - expected\ count_{g,c}} {\sqrt{expected\ variance_{g,c}}}

Where $p{g}$ represents the expected fraction of counts a gene takes up in any cell, $n{c}$ represents the total counts of a cell, thus giving $\hat{\mu}_{g,c}$ representing the expected count for a given gene in a given cell, and the denominator showing the expression for variance with a negative binomial model.

As a side note, i've noticed a few concerns that SCTransform(V2) may overcorrect for overdispersion if there is an abundance of biologically variable genes. Since it estimates the gene-specific overdispersion factor $\alpha_{NB\ g}$(which they then flip to the inverse overdispersion factor $\theta$) on the dataset, if the data consists of a biologically variable sample this may skew the expected variance for a given count to be larger than it would be if it were just accounting for technical variance.

In your paper you describe the qualities of a stable gene, where it has a low $\alpha{QP\ g}$ in comparison to other nearby genes in a given "expression range", telling us these genes only(or at the least mostly) vary due to technical sources(equivalent to the gene coming from a single population or source, like repeatedly sampling one cell or in a negative control dataset). As I understand it, you use these genes to calculate size factors for each cell and correct their counts, then allowing you to calculate $\hat{\mu{g}}$, the expected count assuming a single source and a poisson-like model, simply as the mean of the gene across all size-corrected cells— instead of estimating an expected fraction parameter $p_{g}$ like above.

I think you may also be able to use these stable genes to compute the expected technical variance of a count(in this case the variance of a gene assuming a single source $\hat{\sigma}_{g}^2$ ). Since you define them as genes that only vary due to technical reasons, their variance would truly represent technical variation, thus bypassing the issue SCTransform(V2) may have. Finding enough of these genes across the range of means would then allow you to do this for all genes regardless of the "expression level" they are in, thus giving:

r_{g,c} = \frac{ \hat{X}_{g,c} - \hat{\mu}_{g} }{ \sqrt{\hat{\sigma}_{Stable\ g}^2}} = \frac{X_{g,c} - expected\ count_{g,c}} {\sqrt{expected\ variance_{g,c}}}

Where $\hat{X}$ represents the size-corrected data matrix.

Then centring and scaling the residuals to unit variance should prevent the issue you mentioned above with larger residual variances disproportionately contributing to principal components.

I hope this comes across clearly, and please let me know what you think, I love the idea of identifying stable genes and I think it can help solve the issues some datasets present when there is a large amount of biological variation.

Thank you for the discussion!

Apologies for this delayed response, was away for a bit. Thanks once again for such a careful consideration of all the points.

Actually, the residuals obtained from SCTransform (including v2) as well as analytic Pearson are typically not standardized before PCA; they are centered but not scaled by their standard deviations. There is a compelling reason not to do so. By forcing all the genes to have unit variance we often end up amplifying noise, especially for genes that have small counts. My understanding of why this happens is that while the magnitude of technical noise can be assumed to be roughly the same for all the genes, the relative magnitude of it is higher for genes with smaller counts. When we standardize, the higher relative noise for these genes gets amplified leading to a reduction in the ability to distinguish between different cell-types/groups. Thus, standardization is not advisable in such circumstances.

While there is indeed conceptual appeal in computing the residuals using the observed or corrected counts, there is another advantage to employing variance stabilization transformations (such as log) to our counts data. We know that the counts distributions of genes (especially the highly variable ones) exhibit significant skew. A monotonic non-linear transformation of the counts helps reduce the skew significantly which again leads to an improvement in dimensionality reduction using PCA. I realize that I haven't highlighted or emphasized this aspect very much in my manuscript, although I do mention that often for genes with small counts the individual residuals as calculated for the raw/observed counts with SCTransform or analytic Pearson residuals methods are abnormally large requiring heuristic approaches such as clipping to smaller values in order to tackle this problem.

I think the suggestion you've made about using just the stable genes to arrive at estimates for technical variance is extremely appealing. I will try to see if it'll be possible to come up with a practical implementation. A potential hiccup I see right now is that while the stable genes are determined for the "expression range" within each bin, the actual range of mean expression levels can vary a bit within the bin, in particular for bins that contain genes with high mean expressions. But, it is indeed a nice idea.

I will mention here that currently in Piccolo the premise on which the estimates of means and variances for each gene in each cell are arrived at is very simple. Basically, I use the corrected counts (after scaling the observed counts with the cell-specific size factors) to estimate the mean expression level for each gene. We consider this to be an estimate of the underlying mean expression level of the gene that we would have measured had there not been any technical biases. Similarly, we estimate the variance of the corrected counts (around the corrected mean, $\tilde\mu_g$) and consider this to be an estimate of the variance of the counts of the given gene had there not been any technical biases. Thereafter, we arrive at estimates for the expected mean and expected variance of the given gene in each cell in the presence of technical biases. The estimates of expected means and variances for each cell are calculated based on the assumption that if we had a pool of identical copies of this cell influenced almost identically by the technical bias, then the mean would need to be scaled appropriately (by the size-factor for the given cell) and similarly with the estimates for the variances. Thus, the estimates of mean and variance for each gene in each cell are the means and variances we would expect assuming the technical biases are sufficiently captured by the cell-specific size factors.

I hope I have been able to address your questions and concerns to some extent. Happy to discuss further. Again, very grateful for this fantastic feedback and discussion. Please feel free to drop me an email (as2197 at scarletmail dot rutgers dot edu) for any queries, I'm generally more prompt in my responses there.

Amartya101 / Piccolo

Questions about your paper #2