MarioniLab / scran

Clone of the Bioconductor repository for the scran package.
https://bioconductor.org/packages/devel/bioc/html/scran.html
40 stars 22 forks source link

support the abstract DelayedArray backend #64

Closed mikejiang closed 4 years ago

mikejiang commented 4 years ago

I've created DelayedArray extension : CytoArray by wrapping the existing cytoframe object into a seed, i.e. https://github.com/RGLab/flowWorkspace/blob/se/R/CytoArraySeed.R

And then use it in cytoexperiment, which is a SummarizedExperiment object. Everything seems to work out-of-box at R level e.g. https://rpubs.com/rglab/642636

Until I start to experiment it with scran

suppressPackageStartupMessages(library(flowWorkspace))
suppressPackageStartupMessages(library(SummarizedExperiment))

#' ## Load cytoframe
dataDir <- system.file("extdata",package="flowWorkspaceData")
fcs_file <- list.files(dataDir, "Cyto", full.names = TRUE)[1]
cf <- load_cytoframe_from_fcs(fcs_file)
#lock cytoframe to prevent it from being accidentally modified
cf_lock(cf)

#' ## Wrap it into SummarizedExperiment
ce <- cytoexperiment(cf)
# scan expect a count assay
names(ce@assays) <- "counts"

library(scran)
clusters <- quickCluster(ce)
sce <- computeSumFactors(ce, clusters=clusters)
Error in subset_and_divide(x, subset.row - 1L, curdex - 1L, scaling) : 
  Evaluation error: object of type 'S4' is not subsettable.

It appears to me that scran is relying on beachmat c++ code to directly access the underlying matrix from the DelayedArray through its pointer as Rcpp::RObject , apparently beachmat doesn't know about cytoframe -backed DelayedArray .

If it is the case, then would it be worth to consider an alternative R-version of subset_and_divide function (maybe also other functions) to handle the non-beachmat compatible DelayedArray? So that it is agnostic about the backend and any DelayedArray can work as it is without necessarily need to extend beachmat for any arbitrary matrix type?

LTLA commented 4 years ago

It should already do that. When beachmat encounters a DelayedMatrix that is not easily comprehensible (i.e., has a backend that is neither ordinary or sparse and/or has operations beyond subsetting, transposition or replacement of dimnames), it will fall back to R to realize a block of the matrix into memory; this effectively provides a C++ wrapper to the block processing mechanism.

The error you're observing is in fact on the R side when the subsetting fails. You can see the code here:

https://github.com/LTLA/beachmat/blob/master/R/setupUnknownMatrix.R

Your object just gets passed through as mat; for comparison, the same code works for HDF5Matrix objects.

mikejiang commented 4 years ago

Ok. Here is the traceback

> traceback()
13: stop(e)
12: value[[3L]](cond)
11: tryCatchOne(expr, names, parentenv, handlers[[1L]])
10: tryCatchList(expr, classes, parentenv, handlers)
9: tryCatch({
       FUN(...)
   }, error = handle_error)
8: withCallingHandlers({
       tryCatch({
           FUN(...)
       }, error = handle_error)
   }, warning = handle_warning)
7: FUN(...)
6: FUN(X[[i]], ...)
5: lapply(X, FUN_, ...)
4: bplapply(indices, FUN = .per_cluster_normalize, x = x, sizes = sizes, 
       subset.row = subset.row, min.mean = min.mean, positive = positive, 
       scaling = scaling, BPPARAM = BPPARAM)
3: bplapply(indices, FUN = .per_cluster_normalize, x = x, sizes = sizes, 
       subset.row = subset.row, min.mean = min.mean, positive = positive, 
       scaling = scaling, BPPARAM = BPPARAM)
2: .calculate_sum_factors(assay(x, i = assay.type), ...)
1: computeSumFactors(ce, clusters = clusters)

and here is where Rcpp code throws

> debugonce(scran:::subset_and_divide)
> sce <- computeSumFactors(ce, clusters=clusters)
debugging in: subset_and_divide(x, subset.row - 1L, curdex - 1L, scaling)
debug: {
    .Call("_scran_subset_and_divide", PACKAGE = "scran", matrix, 
        row_subset, col_subset, scaling)
}
Browse[2]>  .Call("_scran_subset_and_divide", PACKAGE = "scran", matrix, 
+     row_subset, col_subset, scaling)
Error in subset_and_divide(x, subset.row - 1L, curdex - 1L, scaling) : 
  Evaluation error: object of type 'S4' is not subsettable.
LTLA commented 4 years ago

The traceback is expected given the error; the question is why it's throwing. beachmat:::realizeByRange and related functions only require a functional dim and [ operator. For comparison:

library(BiocSingular)
L <- matrix(runif(1000), ncol=5)
R <- matrix(runif(500), ncol=5)
mat <- LowRankMatrix(L, R)
stuff <- scran:::subset_and_divide(mat, 1:20 - 1L, seq_len(ncol(mat)) - 1, scaling=rep(1, ncol(mat)))

You can see that mat is a very non-ordinary matrix representation but it passes through without problems.

mikejiang commented 4 years ago

As you can see below, the matrix passed in conforms to all the conventions of DelayedArray

Browse[2]> class(matrix)
[1] "DelayedMatrix"
attr(,"package")
Browse[2]> dim(matrix)
[1]     12 119531
Browse[2]> matrix[2:3,3:4]
<2 x 2> matrix of class DelayedMatrix and type "double":
           [,1]      [,2]
FSC-H  51594.00 103613.00
FSC-W  81667.89  81210.08
Browse[2]> matrix[0, ,drop = F]
<0 x 119531> matrix of class DelayedMatrix and type "double"
Browse[2]> matrix[,0 ,drop = F]
<12 x 0> matrix of class DelayedMatrix and type "double"

But subset_and_divide just doesn't like it

Browse[2]> .Call("_scran_subset_and_divide", PACKAGE = "scran", matrix, 
+     row_subset, col_subset, scaling)
Error in subset_and_divide(x, subset.row - 1L, curdex - 1L, scaling) : 
  Evaluation error: object of type 'S4' is not subsettable.

Let me know what else I can provide to further diagnose the cause. BTW, the code is fully reproducible (if you'd install the development stack of cytoverse and make sure to get se branch for RGLab/flowWorkspace)

LTLA commented 4 years ago

This section may be instructive:

https://github.com/LTLA/beachmat/blob/8e57d4f5c109fa52a7d13ba4af5a4f09ccddb470/R/setupDelayedMatrix.R#L20-L24

beachmat expects that your DA backend has the format of XXXSeed yielding an XXXMatrix. In fact, the seed class doesn't even have to end with Seed, but if you wrap it in a DA, it should return a class other than DelayedMatrix.

If a wrapped seed yields a DelayedMatrix, beachmat assumes that the seed is a valid matrix in its own right (e.g., when you wrap a matrix in a DelayedArray) and will try to operate on the seed directly. This is done so that upstream pipelines can wrap ordinary and sparse matrices in DelayedArrays without worrying about loss of efficiency from falling back to block processing.

So in your case, just complete the process of creating your DA backend by giving the actual matrix a name that is not DelayedMatrix.

mikejiang commented 4 years ago

Turned out that I am missing these two lines in my customized CytoArray machinery

setClass("CytoMatrix", contains=c("CytoArray", "DelayedMatrix"))
setMethod("matrixClass", "CytoArray", function(x) "CytoMatrix")

which are responsible (along with new_DelayedArray(seed, Class="CytoArray") call) to label the CytoArray class other than default DelayedMatrix, So it now should passe this particular barier

> class(CytoArray(cf))
[1] "CytoMatrix"
attr(,"package")
[1] "flowWorkspace"