Bioconductor / DelayedArray

A unified framework for working transparently with on-disk and in-memory array-like datasets
https://bioconductor.org/packages/DelayedArray
24 stars 9 forks source link

Feature request: make `seedApply` as a generic function #111

Open Yunuuuu opened 7 months ago

Yunuuuu commented 7 months ago

It would be awosome if seedApply can recursely run based on the methods, for classes cannot be integrated with DelayedArray class, we cannot make seedApply work as usual as in the DelayedArray package.

BPCells provides three DelayedUnaryOp-like class, and five DelayedNaryOp-like class, but the slot names don't comply with DelayedArray class. The differences are as follows:

  1. For DelayedUnaryOp-like class in BPCells, it use matrix slot as the seed slot of DelayedUnaryOp class in DelayedArray.
  2. For DelayedNaryOp-like class in BPCells, it use both slots (left and right), or use matrix_list to indicates the seeds slot of DelayedNaryOp-class in DelayedArray.

so I must regard all these classes as a single seed object and wrap them it into seed slot of BPCellsMatrix class (a DelayedArray object, which also is a DelayedUnaryOp object).

Now I have created another new_seedApply function and substitute the seedApply function in DelayedArray package when loaded. the codes are below:

new_seedApply <- function(x, .fn, ...) {
    if (methods::is(x, "BPCellsMatrix")) {
        x <- entity(x)
    }
    if (methods::is(x, "BPCellsUnaryOpsSeed")) {
        return(Recall(entity(x), .fn, ...))
    }
    if (methods::is(x, "BPCellsNaryOpsSeed")) {
        x <- entity(x)
    }
    if (is.list(x)) {
        ans <- lapply(x, FUN = new_seedApply, .fn = .fn, ...)
        return(unlist(ans, recursive = FALSE, use.names = FALSE))
    } else {
        list(.fn(x, ...))
    }
}
rebind <- function(sym, value, ns) {
    if (rlang::is_string(ns)) {
        Recall(sym, value, getNamespace(ns))
        pkg <- paste0("package:", ns)
        if (pkg %in% search()) {
            Recall(sym, value, as.environment(pkg))
        }
    } else if (is.environment(ns)) {
        if (bindingIsLocked(sym, ns)) {
            unlockBinding(sym, ns)
            on.exit(lockBinding(sym, ns))
        }
        assign(sym, value, ns)
    } else {
        stop("ns must be a string or environment")
    }
}
.onLoad <- function(libname, pkgname) {
    old_seedApply <- DelayedArray::seedApply
    rebind("seedApply", function(x, FUN, ...) {
        assert_(FUN, is.function, "a function")
        if (methods::is(x, c("BPCellsMatrix", "BPCellsSeed"))) {
            new_seedApply(x, FUN, ...)
        } else {
            old_seedApply(x, FUN, ...)
        }
    }, "DelayedArray")
}

I want something like this:

methods::setGeneric("seedApply", function(x, FUN, ...) {
    standardGeneric("seedApply")
})
methods::setMethod("seedApply", "DelayedUnaryOp", function(x, FUN, ...) {
    x <- x@seed
    methods::callGeneric()
})
methods::setMethod("seedApply", "DelayedNaryOp", function(x, FUN, ...) {
    x <- x@seeds
    methods::callGeneric()
})
methods::setMethod("seedApply", "list", function(x, FUN, ...) {
    ans <- lapply(x, seedApply, FUN, ...)
    unlist(ans, recursive = FALSE, use.names = FALSE)
})
methods::setMethod("seedApply", "ANY", function(x, FUN, ...) {
    list(FUN(x, ...))
})
Yunuuuu commented 7 months ago

And by the way, Can ... be added into seed generic function like path function. I want to allow seed function to return a list of seed object directly and let user control whether to do unlist. Something like this:

# BPCellsNaryOpsSeed is a DelayedNaryOp-like object
methods::setMethod("seed", "BPCellsNaryOpsSeed", function(object, unlist = TRUE) {
    out <- lapply(object@seeds, seed)
    if (isTRUE(unlist)) out <- unlist(out, recursive = FALSE, use.names = FALSE)
    out
})
hpages commented 7 months ago

I just took a quick look at the work you are doing in BPCellsArray and I'm surprised by the number of *Seed classes (38!) that you are defining there.

Not sure what you are trying to achieve but if the goal is to implement a MatrixDir backend for DelayedArray objects (that is, to be able to wrap a MatrixDir object inside a DelayedArray shell), then all you need to do is make sure that dim(), dimnames(), and extract_array() work on these objects. In other words, MatrixDir objects need to honor "the seed contract". See here for more information.

Seed contract:

dim() and dimnames() already work on these objects so we just need to make sure that extract_array() also works. Two approaches for this: either we define a new extract_array() method, or we implement what's missing to make the default extract_array() method work on a MatrixDir object. Turns out that the only thing that's missing for the latter to work is an as.array() method. So we add it:

as.array.IterableMatrix <- function(x) suppressMessages(as.matrix(x))

Notes:

  1. This is a simple wrapper to the as.matrix() method for IterableMatrix objects defined in the BPCells package.
  2. We define an S3 method instead of an S4 method to avoid import/dispatch issues.
  3. We define the method for IterableMatrix objects even though all we really care about is that it works for MatrixDir objects.

Then this works:

library(BPCells)
mat_raw <- open_matrix_dir("path/to/pbmc_3k_rna_raw")  # a MatrixDir object

library(DelayedArray)
extract_array(mat_raw, list(1:5, 1:6))
type(mat_raw)

So now we can wrap mat_raw in a DelayedArray shell:

DelayedArray(mat_raw)

Note that we can also do this with any IterableMatrix derivative since they all comply with the seed contract.

Additional things:

Operating on DelayedArray(mat_raw) won't be optimal if we don't define a few additional things. In particular block processing will perform very poorly if it doesn't "know" the ondisk layout of the data. We address this by defining a chunkdim() method for MatrixDir objects:

setMethod(chunkdim, "MatrixDir",
    function(x) if (x@transpose) c(1L, ncol(x)) else c(nrow(x), 1L)
)

We also "tell" block processing that these objects are sparse, and we define an OLD_extract_sparse_array() method for them:

setMethod(is_sparse, "MatrixDir", function(x) TRUE)

setMethod(OLD_extract_sparse_array, "MatrixDir",
    function(x, index) {
        slice <- S4Arrays:::subset_by_Nindex(x, index)
        as(as(slice, "dgCMatrix"), "SparseArraySeed")
    }
)

These additions allow things like colSums(DelayedArray(mat_raw)) or rowVars(DelayedArray(mat_raw)), which use block processing internally, to perform in "reasonable" time. They'll still be quite slow compared to something like Matrix::colSums(mat_raw) but this situation will improve slightly once we switch from SparseArraySeed to SVT_SparseArray objects for block processing of sparse DelayedArray objects. Note that this switch should happen soon e.g. in the next 3 or 4 weeks (I was originally planning for it to happen for BioC 3.18 but got distracted by other things).

What else?

There are other things that could be done to improve how objects like DelayedArray(mat_raw) perform. For example we could reduce memory footprint of block processing by modifying extract_array() and OLD_extract_sparse_array() so that they preserve the native type. However none of these things need the introduction of a *Seed class.

Just wanted to clarify before you go too far into this.

Yunuuuu commented 7 months ago

Hi, thank you for the thorough guidance, I firstly only implement MatrixDir class as the seed object.

The main reaseon I implement all class of BPCells is to utilize the methods defined in BPCells instead of directly using the methods from DelayedArray, I don't know whether this is a good method for DelayedArray backend.

I am uncertain of my ability to tell the difference accurately as my proficiency in English is limited and I was learning.

Like cbind and rbind, if I only implement MatrixDir, let's assume the DelayedArray class for MatrixDir is BPCellsDirArray, both function will return another object class ColBindMatrices or RowBindMatrices which cannot be put into the slot of BPCellsDirArray class. The same for [, dimnames<-, %*% and other methods defined in BPCells. They will all return another class and do delayed operations. Although we can directly utilize the same method defined in DelayedArray, but I think they will not give the same performance of the method defined in BPCells. These methods defined in BPCells return the object with a classe live in the same level of DelayedNaryOp or DelayedUnaryOp class (Let's assume the level from lower to higher: basic seed class, DelayedArray class, and the higher level DelayedNaryOp or DelayedUnaryOp class representing the delayed operations). I don't know how to deal with these classes well as DelayedNaryOp class utilizes seeds slot to save the list of processed objects while the same-level class defined in BPCells uses another slot (some use two slots, some use single slot but with different slot name).

Yunuuuu commented 7 months ago

Or I shouldn't create a new seed class and directly use IterableMatrix object as the seed, then dispatch methods into BPCells method. I have tried following code, but it failed

library(BPCells)
library(DelayedArray)
set.seed(1L)
path <- tempfile("BPCells")
sce <- scuttle::mockSCE(2000L, 3000L)
mat <- SummarizedExperiment::assay(sce, "counts")
# A `IterableMatrix` object
obj <- BPCells::write_matrix_dir(as(mat, "dgCMatrix"), path)
get_class <- function(name) {
    asNamespace("BPCells")[[paste0(".__C__", name)]]
}
empty_seed <- function() {
    mat <- matrix(integer(), nrow = 0L, ncol = 0L)
    mat <- methods::as(mat, "dgCMatrix")
    methods::new(
        get_class("Iterable_dgCMatrix_wrapper"),
        dim = dim(mat),
        dimnames = dimnames(mat),
        transpose = FALSE,
        mat = mat
    )
}
setClass("BPCellsMatrix",
    contains = "DelayedMatrix",
    slots = list(seed = get_class("IterableMatrix")),
    prototype = list(seed = empty_seed())
)
setClass("BPCellsArray",
    contains = "DelayedArray",
    slots = c(seed = get_class("IterableMatrix"))
)
setMethod("matrixClass", "BPCellsArray", function(x) {
    "BPCellsMatrix"
})
setMethod(
    "DelayedArray", "IterableMatrix",
    function(seed) new_DelayedArray(seed, Class = "BPCellsArray")
)
setMethod(
    "chunkdim", "IterableMatrix",
    function(x) if (x@transpose) c(1L, ncol(x)) else c(nrow(x), 1L)
)
setMethod("is_sparse", "IterableMatrix", function(x) TRUE)
setMethod("type", "IterableMatrix", function(x) {
    switch(BPCells:::matrix_type(x),
        uint32_t = "integer",
        float = "double",
        double = "double"
    )
})
setMethod(
    "extract_array", "IterableMatrix",
    function(x, index) {
        slice <- S4Arrays:::subset_by_Nindex(x, index)
        slice <- as(slice, "dgCMatrix")
        slice <- as.matrix(slice) # always return double type
        # we should keep the original type
        storage.mode(slice) <- type(x)
        slice
    }
)
setMethod(
    "OLD_extract_sparse_array", "IterableMatrix",
    function(x, index) {
        slice <- S4Arrays:::subset_by_Nindex(x, index)
        as(as(slice, "dgCMatrix"), "SparseArraySeed")
    }
)
extract_array(obj, list(integer(), integer()))
#> <0 x 0 matrix>
DelayedArray(obj)
#> Error in ext[[class2]]: invalid subscript type 'S4'

Created on 2024-01-28 with reprex v2.0.2

hpages commented 1 month ago

Hi @Yunuuuu,

I'm sorry that I lost track of this. Are you still working on this?

The reason the above doesn't work is because BPCellsMatrix needs to extend both BPCellsArray and DelayedMatrix. See for example how the HDF5Array and HDF5Matrix classes are defined in the HDF5Array package.

But, more importantly, it seems to me that your BPCellsArray objects will always be 2D objects (the seed is always an IterableMatrix object). So why bother with the BPCellsArray class at all? You only need to define the BPCellsMatrix class. See for example how the TENxMatrix class is defined in the HDF5Array package.

Note that seedApply(x, FUN) applies FUN to all the terminal seeds in the tree representation of the DelayedArray object, that is, to the leaf nodes of the DelayedOp tree. If you plug an IterableMatrix object into a DelayedArray object, the IterableMatrix object will be considered a terminal seed. This means that seedApply() won't recurse passed the IterableMatrix object, even if the IterableMatrix object itself contains some kind of tree representation with its own notion of seeds. In other words, FUN will be called on the IterableMatrix object, and not on the leaves of its internal tree. This is a feature, and I think it's important for the robustness/stability of the DelayedArray framework to keep things that way.

One possible workaround is that you introduce your own generic for recursing on a BPCellsMatrix object. Or, if you already have a function for recursing on an IterableMatrix object, you could extend it to recurse on a BPCellsMatrix object.

Hope this helps, and sorry again for the late response.

H.