davismcc / archive-scater

An archived version of the scater repository, see https://github.com/davismcc/scater for the active version.

What is the difference between built-in normalization in scater 1.0.4 and 1.2.0? #100

Closed: sozzznanie closed this issue 7 years ago

sozzznanie commented 7 years ago

Hello,

I analyse single-cell RNA-seq data. My count matrix of UMIs (1100 cells and about 20500 detected genes) contains a lot of zeros and, on average, looks like this:

[image: excerpt of the UMI count matrix]

When I create an SCESet object using only countData, phenoData, and featureData, exprsData is created automatically.

sce_full = newSCESet(countData = as.matrix(counts_cells), phenoData = pheno, featureData = feature)
expr_mat = exprs(sce_full)

I noticed that this expr_mat differed depending on the scater version (1.0.4 or 1.2.0). In version 1.0.4, zeros were transformed to zeros:

[image: exprs values from scater 1.0.4, with zeros remaining zero]

In version 1.2.0, zeros were transformed into a positive constant (0 -> 7.20), which reduced the contrast between zero and nonzero expression on heatmaps and made them look suspicious.

[image: exprs values from scater 1.2.0, with zeros mapped to ~7.20]

Then I replaced exprs(sce_full) explicitly with the scran-normalized matrix. I chose the sizes vector so as to avoid zero size factors:

cl = scran::quickCluster(sce_full)
sce_full_lg = scran::computeSumFactors(sce_full, clusters = cl, sizes = c(10, 25, 50, 75, 100), positive = TRUE)
sce_full_lg = scater::normalize(sce_full_lg)
expr_mat_lg = exprs(sce_full_lg)

scran transformed zeros into a very small (almost zero) negative constant but compressed the data more strongly, which also made the heatmaps less bright in comparison to 1.0.4:

[image: exprs values after scran normalization]

I've read the scran (Aaron T. L. Lun et al., Genome Biology, 2016) and scater (Davis J. McCarthy et al., Bioinformatics, 2017) articles. The scater article says that scater uses scran normalization, yet I see different results every time. I find the visualization using the normalization from scater 1.0.4 more appealing, but I am afraid to use it because it is an old package version, and maybe you improved the normalization algorithm significantly in 1.2.0. So for now I use scran normalization. But I am really curious to know why these three matrices have different zero offsets, and which one is the best.

Thank you a lot for your help.

LTLA commented 7 years ago

There are a couple of points to make here. Firstly, we switched to edgeR's cpm function in scater 1.2.0, which is why the values from normalise differ between versions. However, the fact that zero counts get converted to non-zero log-expression values shouldn't really matter. You should always be mean-centring each row before making a heatmap anyway; you should not be looking at the absolute expression values, because then the differences in abundance between genes dominate the heatmap. Once you mean-centre, I suspect you will find that the relative differences between observations are still visible, regardless of the added "7.20" for zero counts.
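
For example, a minimal sketch of what I mean (assuming expr_mat is your log-expression matrix with genes in rows and cells in columns; base R's heatmap is just one of many plotting options):

centred = expr_mat - rowMeans(expr_mat)  # mean-centre each gene (row)
heatmap(centred, scale = "none")         # values are already centred, so no extra row scaling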

Secondly, the small negative values from scran's normalisation are a quirk of cpm. This has some justification but it was too annoying to explain every time, so in the latest version of scater we've modified normalise to return zero values from zero counts (if the pseudo-count from the log-transformation is unity). This makes the log-expression values easier to use, e.g., in NMF.
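
As a rough, hypothetical illustration of where a constant like 7.20 comes from (this is not the exact cpm formula, which also adjusts the library size for the prior count): with a pseudo-count of 1 and a library of roughly 7000 UMIs, a zero count on the log2-CPM scale becomes

log2((0 + 1) / 7000 * 1e6)  # ~7.16: zero counts map to a positive constant
log2(0 + 1)                 # 0: a plain log2(count + 1) keeps zeros at zero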

Finally, positive=TRUE doesn't protect against zero size factors. It only prevents negative size factor estimates for some cells from distorting the estimates for the other cells. There could be some instances in which this setting is useful, but in most scenarios, observing negative or zero size factors indicates that you have not properly applied cell-level quality control or filtered the genes sufficiently. It is better to go back and fix your QC and filtering steps to avoid the need for positive=TRUE in the first place.
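
For what it's worth, a rough sketch of the kind of cell- and gene-level filtering meant here (the total_counts and total_features columns come from scater's calculateQCMetrics; the nmads and abundance thresholds are arbitrary examples, not recommendations):

sce_full = scater::calculateQCMetrics(sce_full)
low_lib = scater::isOutlier(sce_full$total_counts, nmads = 3, type = "lower", log = TRUE)
low_genes = scater::isOutlier(sce_full$total_features, nmads = 3, type = "lower", log = TRUE)
sce_filt = sce_full[, !(low_lib | low_genes)]  # drop low-quality cells
keep_gene = rowMeans(counts(sce_filt)) >= 0.1  # drop very lowly expressed genes
sce_filt = sce_filt[keep_gene, ]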

P.S. Questions like these are best posted on the Bioconductor support site.

sozzznanie commented 7 years ago

Hello Aaron,

Thank you a lot for the helpful remarks. Sorry that I used the wrong forum; next time I will post my questions there. If you allow me, I would like to continue this thread here, as it is more convenient.

Yes, I repeated my heatmap with mean-centering, and it looks cleaner now because mean-centering masks the differences in abundance between genes. Should I also do mean-centering before t-SNE, PCA, or a diffusion map?

mat_mean = rowMeans(mat)  # mean expression of each gene (row)
mat = mat - mat_mean      # centre each gene at zero

I even tried centering by the mean and scaling by the variance, but it only smeared the picture. Fortunately, all clusters and patterns are reproducible in all three cases.

mat_var = apply(mat, 1, var)      # variance of each gene (row)
mat_mean = rowMeans(mat)
mat = (mat - mat_mean) / mat_var  # note: divides by the variance, not the standard deviation
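
As a side note, base R's scale() does the same row-wise operation in one step, dividing by the standard deviation rather than the variance (the more usual z-score convention):

mat_scaled = t(scale(t(mat)))  # centre each gene and divide by its standard deviation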

Thank you for noticing positive=TRUE; I forgot to delete it. What I meant is that the sizes vector influences the number of zero size factors. I have already done cell pre-filtering and removed about 270 cells (1370 - 270 = 1100 cells). So far I have not used calculateQCMetrics; I just removed outliers based on technical and biological characteristics. Maybe removing 20% of the cells is too much, and later I may be able to get reproducible results with fewer cells removed. And calculateQCMetrics will probably be a good alternative :)

I checked the whole dataset: I don't get any zero size factors with the default sizes vector:

> cl = scran::quickCluster(sce_full)
> cl_len = as.vector(table(cl))
> cl_len # vector with sizes of every cluster
[1] 301 296 265 238

> sizes = c(20, 40, 60, 80, 100) # default
> sce_full_lg = scran::computeSumFactors(sce_full, clusters = cl, sizes = sizes)
> summary(sizeFactors(sce_full_lg))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.07909 0.64180 0.92060 1.00000 1.22800 4.82200 

Interestingly, when I take a subset of this dataset, I get near-zero or negative size factors depending on the sizes vector:

> cl = scran::quickCluster(sce_full)
Warning message:
In .local(x, ...) : 63 cells were not assigned to any cluster
> cl_len = as.vector(table(cl))
> cl_len # vector with sizes of every cluster
[1]  63 203

> sizes = c(5,15,25,30)
> sce_full_lg = scran::computeSumFactors(sce_full, clusters = cl, sizes = sizes)
> summary(sizeFactors(sce_full_lg))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01653 0.74460 0.98950 1.00000 1.21800 2.26000 

> sizes = c(5,15,20,30)
> sce_full_lg = scran::computeSumFactors(sce_full, clusters = cl, sizes = sizes)
Warning message:
In .local(x, ...) : negative factor estimates, re-run with 'positive=TRUE'
> summary(sizeFactors(sce_full_lg))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.06597  0.72370  0.98610  0.99600  1.25000  2.26200 

This confused me a little, but it seems that it is not a problem of insufficient filtering; rather, the parameters need to be tuned a little each time to increase accuracy and allow the linear system to be solved for each cell.
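
A quick sanity check before calling scater::normalize() is to count the non-positive size factors:

sum(sizeFactors(sce_full_lg) <= 0)  # should be 0 before normalizing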

Best regards, Polina

LTLA commented 7 years ago

You don't need to mean-centre before these techniques, as it is done automatically in prcomp (for PCA) and doesn't matter for the others, which are based on distances between cells. As for sizes: if you set it too small, it defeats the whole purpose of the pooling and deconvolution procedure, i.e., to add expression values together to avoid zero/low counts. It needs to be reasonably large (at least 20).
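
For example, something along these lines follows that advice; capping the pool sizes at the smallest cluster size is an added precaution so that every pool is feasible, not a requirement stated above:

cl = scran::quickCluster(sce_full)
sizes = seq(20, min(100, min(table(cl))), by = 20)  # pools of at least 20 cells each
sce_full_lg = scran::computeSumFactors(sce_full, clusters = cl, sizes = sizes)
summary(sizeFactors(sce_full_lg))                   # all size factors should be positive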

sozzznanie commented 7 years ago

Aaron, thank you for the explanation and clear threshold.

davismcc commented 7 years ago

Thanks, Polina and Aaron. I'll close this now (and, as Aaron did, encourage queries such as these to be posted to the Bioconductor support site).