bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells
Other
166 stars 17 forks source link

BPCells operations fail #4

Closed yuhanH closed 1 year ago

yuhanH commented 1 year ago

There is a bug in BPCells after several operations in the IterableMatrix matrix. After row subset, row centering, row standardization, the matrix cannot be converted sparse matrix (crash R session) but matrix multiplication still can work. With additional min_scalar step, the matrix multiplication doesn't work. Here is an reproducible example:

library(Seurat)
library(BPCells)
library(dplyr)
library(SeuratData)
data('pbmc3k')
pbmc3k <- UpdateSeuratObject(pbmc3k)
pbmc3k <- FindVariableFeatures(pbmc3k)%>%ScaleData()%>%RunPCA()
mat <- write_matrix_memory(pbmc3k[['RNA']]@data) 
mat <- mat[VariableFeatures(pbmc3k),]

feature_stat <- matrix_stats(matrix = mat, row_stats = c('variance'))
feature.mean <- feature_stat$row_stats['mean',]
feature.sd <- sqrt(feature_stat$row_stats['variance',])
mat.scale <- (mat - feature.mean)/feature.sd

# work
proj_1 <- t(mat.scale) %*% pbmc3k[['pca']]@feature.loadings[VariableFeatures(pbmc3k),]

# not work
mat.sparse_1 <- as(object = mat.scale, Class = 'dgCMatrix')

mat.scale <- min_scalar(mat = mat.scale, val = 10)

# not work
mat.sparse_2 <- as(object = mat.scale, Class = 'dgCMatrix')

# not work
proj_2 <-t(mat.scale) %*% pbmc3k[['pca']]@feature.loadings[VariableFeatures(pbmc3k),]
bnprks commented 1 year ago

Thanks for the reproducible example. This was a memory bug triggered by the combination of reordered rows and writing out a dense matrix transformation element-by-element (with the as("dgCMatrix") call). It should be fixed in the main branch now -- please reopen and let me know if you have any further issues.

I'd also have a performance suggestion for the code it looks like you're trying to implement:

Usage example for replicating Seurat's default normalization:

library(BPCells)
library(Seurat)
library(dplyr)
library(SeuratData)
data('pbmc3k')
pbmc3k <- UpdateSeuratObject(pbmc3k)
pbmc3k <- FindVariableFeatures(pbmc3k)%>%ScaleData()%>%RunPCA()
mat <- write_matrix_memory(pbmc3k[['RNA']]@data) 
mat <- mat[VariableFeatures(pbmc3k),]

feature_stat <- matrix_stats(matrix = mat, row_stats = c('variance'))
feature.mean <- feature_stat$row_stats['mean',]
feature.sd <- sqrt(feature_stat$row_stats['variance',])

# Derive a per-row cap from the inequality  (X - mean)/sd <= 10
mat.capped <- min_by_row(mat, 10*feature.sd + feature.mean)
mat.scale <- (mat.capped - feature.mean)/feature.sd

ans <- pbmc3k@assays$RNA@scale.data[VariableFeatures(pbmc3k),]
all.equal(
    ans, 
    as.matrix(mat.scale)
)
# TRUE

loadings <- pbmc3k[['pca']]@feature.loadings[VariableFeatures(pbmc3k),]
all.equal(
    t(mat.scale) %*% loadings,
    t(ans) %*% loadings
)
# TRUE
yuhanH commented 1 year ago

Makes sense. Thank you!

yuhanH commented 1 year ago

Two minor issues. 'min by row' doesn't appear to be exported. To call it, I must use 'BPCells:::min by row'. When I show this object in R after running "min by row," it then prints some error messages even though it has no effect on other operations on the object.

mat.capped <- min_by_row(mat, 10*feature.sd + feature.mean)
mat.capped
Error in (function (cl, name, valueClass)  : 
  assignment of an object of class “numeric” is not valid for @‘row_params’ in an object of class “TransformMinByRow”; is(value, "matrix") is not TRUE