Bioconductor / DelayedArray

A unified framework for working transparently with on-disk and in-memory array-like datasets
https://bioconductor.org/packages/DelayedArray
24 stars 9 forks source link

Feature request: add `BPCell` backend #110

Closed Yunuuuu closed 7 months ago

Yunuuuu commented 8 months ago

BPCell brings a more effecient on-disk matrix attached with a series operation for single cell. Can this be integrated into Delayed Array ?

https://bnprks.github.io/BPCells/articles/web-only/benchmarks.html

LTLA commented 8 months ago

Usually, new file-backed formats are handled by DelayedArray subclasses in separate packages, e.g., see HDF5Array. This avoids adding dependencies to the core DelayedArray package, which would require installation of those dependencies by all downstream packages (which is a lot). So if you have a new file format you'd like to wrap, you can create your own package that implements the relevant methods (e.g., extract_array, dim, is_sparse).

That said, wrapping BPCell files in a DelayedArray may not provide the performance improvements that you'd expect, as any reads via the DelayedArray wrapper will involve calling R code. IIRC BPCell's performance is at least partially based on doing most of their operations in C++, so going through an R layer may dilute some of that. This could be overcome by writing bindings to tatami, which is my unofficial "C++ API for DelayedArrays"; then downstream tatami-compliant packages will be able to call the C++ code directly, without going through R. There's not a lot of these packages yet, though.

If you want to give this a go, have a look at the TileDBArray package (and the associated tatami bindings) for a demonstration. Incidentally, TileDB also positions itself as a successor to HDF5, particularly in the single-cell space, so that might be worth checking out if you just want something better than HDF5.

Yunuuuu commented 7 months ago

Hi, I'm implementing the BPCells backend, but there are some errors confusing me. Why DelayedArray function relies on extract_array methods which always return a dense matrix (I have tried to return a sparse matrix, but it failed) causing error: cannot allocate vector of size 177.3 Gb?

"pbmc_3k_rna_raw" is the path of BPCellsDir file created BPCells::write_matrix_dir

DelayedArray::DelayedArray(BPCellsDirSeed("pbmc_3k_rna_raw"))
>Error in h(simpleError(msg, call)) : 
>error in evaluating the argument 'x' in selecting a method for function 'type': cannot allocate vector of size 177.3 Gb

The codes I used are as follows:

#' @importClassesFrom BPCells MatrixDir
methods::setClass("BPCellsDirSeed", contains = "MatrixDir")

BPCellsDirSeed <- function(x, buffer_size = 8192L) {
    if (rlang::is_string(x)) {
        matrix_dir <- BPCells::open_matrix_dir(
            dir = x, buffer_size = buffer_size
        )
    } else if (methods::is(x, "MatrixDir")) {
        matrix_dir <- x
    } else {
        cli::cli_abort(
            "{.arg x} must be a {.cls string} or {.cls MatrixDir} object"
        )
    }
    methods::as(matrix_dir, "BPCellsDirSeed")
}

#' @importClassesFrom DelayedArray DelayedArray
methods::setClass("BPCellsDirArray",
    contains = "DelayedArray",
    slots = c(seed = "BPCellsDirSeed"),
    prototype = list(seed = methods::new("BPCellsDirSeed"))
)

#' @importMethodsFrom DelayedArray elayedArray
#' @importFrom DelayedArray new_DelayedArray
methods::setMethod(
    "DelayedArray", "BPCellsDirSeed",
    function(seed) {
        new_DelayedArray(seed, Class = "BPCellsDirArray")
    }
)

BPCellsDirArray <- function(x, ...) {
    DelayedArray(BPCellsDirSeed(x, ...))
}

#' @export
methods::setMethod("type", "BPCellsDirSeed", function(x) {
    # it seems R don't differentiate 32-bit and 64-bit real number
    switch(x@type,
        uint32_t = "integer",
        float = "double", # 32-bit real number
        double = "double" # 64-bit real number
    )
})

#' @export
methods::setMethod(
    "extract_array", "BPCellsDirSeed",
    function(x, index) {
        out <- as.matrix(extract_bpcells_array(x, index))
        storage.mode(out) <- type(x)
        out
    }
)

extract_bpcells_array <- function(x, index) {
    i <- index[[1L]]
    j <- index[[2L]]
    if (!length(i) && !length(j)) {
        out <- x
    } else if (length(i) && length(j)) {
        out <- x[i, j]
    } else if (length(i)) {
        out <- x[i, ]
    } else {
        out <- x[, j]
    }
    methods::as(out, "dgCMatrix")
}

#' @export
#' @importFrom DelayedArray path
methods::setMethod("path", "BPCellsDirSeed", function(object, ...) {
    object@dir
})
hpages commented 7 months ago

Hi @Yunuuuu, I'll take a look. Are you working on a BPCellArray package or somthing like that? It's going to be easier if the BPCell backend for DelayedArray objects is implemented in a dedicated package.

Yunuuuu commented 7 months ago

Yes, I have created a pacakge named BPCellsArray deposited in https://github.com/Yunuuuu/BPCellsArray, I finally figured it out, the problem exist in extract_bpcells_array function.

Yunuuuu commented 7 months ago

Now, two basic class MatrixSubset and ConvertMatrixType has been implemented, additionally, MatrixDir object using Directory of files format has also been implemented with DelayedArray. I'll implement Hdf5 file format tomorrow

https://bnprks.github.io/BPCells/articles/web-only/bitpacking-format.html

hpages commented 7 months ago

Thanks for the link. If you are now working on a BPCellsArray package, can we close this issue? Feel free to ask any question in a new issue if you run into other problems related to the implementation of this new backend.

Yunuuuu commented 7 months ago

Of course, you can close this thread. It is better to ask any questions on the Bioconductor support site. Apologies for any inconvenience caused.

hpages commented 7 months ago

No worries