Open Yunuuuu opened 7 months ago
And by the way, Can ...
be added into seed
generic function like path
function. I want to allow seed
function to return a list of seed
object directly and let user control whether to do unlist
. Something like this:
# BPCellsNaryOpsSeed is a DelayedNaryOp-like object
methods::setMethod("seed", "BPCellsNaryOpsSeed", function(object, unlist = TRUE) {
out <- lapply(object@seeds, seed)
if (isTRUE(unlist)) out <- unlist(out, recursive = FALSE, use.names = FALSE)
out
})
I just took a quick look at the work you are doing in BPCellsArray and I'm surprised by the number of *Seed classes (38!) that you are defining there.
Not sure what you are trying to achieve but if the goal is to implement a MatrixDir backend for DelayedArray objects (that is, to be able to wrap a MatrixDir object inside a DelayedArray shell), then all you need to do is make sure that dim()
, dimnames()
, and extract_array()
work on these objects. In other words, MatrixDir objects need to honor "the seed contract". See here for more information.
Seed contract:
dim()
and dimnames()
already work on these objects so we just need to make sure that extract_array()
also works. Two approaches for this: either we define a new extract_array()
method, or we implement what's missing to make the default extract_array()
method work on a MatrixDir object. Turns out that the only thing that's missing for the latter to work is an as.array()
method. So we add it:
as.array.IterableMatrix <- function(x) suppressMessages(as.matrix(x))
Notes:
as.matrix()
method for IterableMatrix objects defined in the BPCells package.Then this works:
library(BPCells)
mat_raw <- open_matrix_dir("path/to/pbmc_3k_rna_raw") # a MatrixDir object
library(DelayedArray)
extract_array(mat_raw, list(1:5, 1:6))
type(mat_raw)
So now we can wrap mat_raw
in a DelayedArray shell:
DelayedArray(mat_raw)
Note that we can also do this with any IterableMatrix derivative since they all comply with the seed contract.
Additional things:
Operating on DelayedArray(mat_raw)
won't be optimal if we don't define a few additional things. In particular block processing will perform very poorly if it doesn't "know" the ondisk layout of the data. We address this by defining a chunkdim()
method for MatrixDir objects:
setMethod(chunkdim, "MatrixDir",
function(x) if (x@transpose) c(1L, ncol(x)) else c(nrow(x), 1L)
)
We also "tell" block processing that these objects are sparse, and we define an OLD_extract_sparse_array()
method for them:
setMethod(is_sparse, "MatrixDir", function(x) TRUE)
setMethod(OLD_extract_sparse_array, "MatrixDir",
function(x, index) {
slice <- S4Arrays:::subset_by_Nindex(x, index)
as(as(slice, "dgCMatrix"), "SparseArraySeed")
}
)
These additions allow things like colSums(DelayedArray(mat_raw))
or rowVars(DelayedArray(mat_raw))
, which use block processing internally, to perform in "reasonable" time. They'll still be quite slow compared to something like Matrix::colSums(mat_raw)
but this situation will improve slightly once we switch from SparseArraySeed to SVT_SparseArray objects for block processing of sparse DelayedArray objects. Note that this switch should happen soon e.g. in the next 3 or 4 weeks (I was originally planning for it to happen for BioC 3.18 but got distracted by other things).
What else?
There are other things that could be done to improve how objects like DelayedArray(mat_raw)
perform. For example we could reduce memory footprint of block processing by modifying extract_array()
and OLD_extract_sparse_array()
so that they preserve the native type. However none of these things need the introduction of a *Seed class.
Just wanted to clarify before you go too far into this.
Hi, thank you for the thorough guidance, I firstly only implement MatrixDir
class as the seed object.
The main reaseon I implement all class of BPCells is to utilize the methods defined in BPCells
instead of directly using the methods from DelayedArray
, I don't know whether this is a good method for DelayedArray
backend.
I am uncertain of my ability to tell the difference accurately as my proficiency in English is limited and I was learning.
Like cbind
and rbind
, if I only implement MatrixDir
, let's assume the DelayedArray
class for MatrixDir
is BPCellsDirArray
, both function will return another object class ColBindMatrices
or RowBindMatrices
which cannot be put into the slot of BPCellsDirArray
class. The same for [
, dimnames<-
, %*%
and other methods defined in BPCells
. They will all return another class and do delayed operations. Although we can directly utilize the same method defined in DelayedArray
, but I think they will not give the same performance of the method defined in BPCells
. These methods defined in BPCells
return the object with a classe live in the same level of DelayedNaryOp
or DelayedUnaryOp
class (Let's assume the level from lower to higher: basic seed
class, DelayedArray
class, and the higher level DelayedNaryOp
or DelayedUnaryOp
class representing the delayed operations). I don't know how to deal with these classes well as DelayedNaryOp
class utilizes seeds
slot to save the list of processed objects while the same-level class defined in BPCells
uses another slot (some use two slots, some use single slot but with different slot name).
Or I shouldn't create a new seed class and directly use IterableMatrix
object as the seed, then dispatch methods into BPCells method. I have tried following code, but it failed
library(BPCells)
library(DelayedArray)
set.seed(1L)
path <- tempfile("BPCells")
sce <- scuttle::mockSCE(2000L, 3000L)
mat <- SummarizedExperiment::assay(sce, "counts")
# A `IterableMatrix` object
obj <- BPCells::write_matrix_dir(as(mat, "dgCMatrix"), path)
get_class <- function(name) {
asNamespace("BPCells")[[paste0(".__C__", name)]]
}
empty_seed <- function() {
mat <- matrix(integer(), nrow = 0L, ncol = 0L)
mat <- methods::as(mat, "dgCMatrix")
methods::new(
get_class("Iterable_dgCMatrix_wrapper"),
dim = dim(mat),
dimnames = dimnames(mat),
transpose = FALSE,
mat = mat
)
}
setClass("BPCellsMatrix",
contains = "DelayedMatrix",
slots = list(seed = get_class("IterableMatrix")),
prototype = list(seed = empty_seed())
)
setClass("BPCellsArray",
contains = "DelayedArray",
slots = c(seed = get_class("IterableMatrix"))
)
setMethod("matrixClass", "BPCellsArray", function(x) {
"BPCellsMatrix"
})
setMethod(
"DelayedArray", "IterableMatrix",
function(seed) new_DelayedArray(seed, Class = "BPCellsArray")
)
setMethod(
"chunkdim", "IterableMatrix",
function(x) if (x@transpose) c(1L, ncol(x)) else c(nrow(x), 1L)
)
setMethod("is_sparse", "IterableMatrix", function(x) TRUE)
setMethod("type", "IterableMatrix", function(x) {
switch(BPCells:::matrix_type(x),
uint32_t = "integer",
float = "double",
double = "double"
)
})
setMethod(
"extract_array", "IterableMatrix",
function(x, index) {
slice <- S4Arrays:::subset_by_Nindex(x, index)
slice <- as(slice, "dgCMatrix")
slice <- as.matrix(slice) # always return double type
# we should keep the original type
storage.mode(slice) <- type(x)
slice
}
)
setMethod(
"OLD_extract_sparse_array", "IterableMatrix",
function(x, index) {
slice <- S4Arrays:::subset_by_Nindex(x, index)
as(as(slice, "dgCMatrix"), "SparseArraySeed")
}
)
extract_array(obj, list(integer(), integer()))
#> <0 x 0 matrix>
DelayedArray(obj)
#> Error in ext[[class2]]: invalid subscript type 'S4'
Created on 2024-01-28 with reprex v2.0.2
Hi @Yunuuuu,
I'm sorry that I lost track of this. Are you still working on this?
The reason the above doesn't work is because BPCellsMatrix needs to extend both BPCellsArray and DelayedMatrix. See for example how the HDF5Array and HDF5Matrix classes are defined in the HDF5Array package.
But, more importantly, it seems to me that your BPCellsArray objects will always be 2D objects (the seed is always an IterableMatrix object). So why bother with the BPCellsArray class at all? You only need to define the BPCellsMatrix class. See for example how the TENxMatrix class is defined in the HDF5Array package.
Note that seedApply(x, FUN)
applies FUN
to all the terminal seeds in the tree representation of the DelayedArray object, that is, to the leaf nodes of the DelayedOp tree. If you plug an IterableMatrix object into a DelayedArray object, the IterableMatrix object will be considered a terminal seed. This means that seedApply()
won't recurse passed the IterableMatrix object, even if the IterableMatrix object itself contains some kind of tree representation with its own notion of seeds. In other words, FUN
will be called on the IterableMatrix object, and not on the leaves of its internal tree. This is a feature, and I think it's important for the robustness/stability of the DelayedArray framework to keep things that way.
One possible workaround is that you introduce your own generic for recursing on a BPCellsMatrix object. Or, if you already have a function for recursing on an IterableMatrix object, you could extend it to recurse on a BPCellsMatrix object.
Hope this helps, and sorry again for the late response.
H.
It would be awosome if seedApply can recursely run based on the methods, for classes cannot be integrated with DelayedArray class, we cannot make
seedApply
work as usual as in the DelayedArray package.BPCells provides three
DelayedUnaryOp
-like class, and fiveDelayedNaryOp
-like class, but the slot names don't comply with DelayedArray class. The differences are as follows:DelayedUnaryOp
-like class in BPCells, it usematrix
slot as theseed
slot ofDelayedUnaryOp
class inDelayedArray
.DelayedNaryOp
-like class in BPCells, it use both slots (left and right), or usematrix_list
to indicates theseeds
slot ofDelayedNaryOp
-class inDelayedArray
.so I must regard all these classes as a single seed object and wrap them it into
seed
slot ofBPCellsMatrix
class (a DelayedArray object, which also is aDelayedUnaryOp
object).Now I have created another new_seedApply function and substitute the seedApply function in DelayedArray package when loaded. the codes are below:
I want something like this: