Closed Yunuuuu closed 7 months ago
Usually, new file-backed formats are handled by DelayedArray
subclasses in separate packages, e.g., see HDF5Array. This avoids adding dependencies to the core DelayedArray package, which would require installation of those dependencies by all downstream packages (which is a lot). So if you have a new file format you'd like to wrap, you can create your own package that implements the relevant methods (e.g., extract_array
, dim
, is_sparse
).
That said, wrapping BPCell
files in a DelayedArray
may not provide the performance improvements that you'd expect, as any reads via the DelayedArray
wrapper will involve calling R code. IIRC BPCell
's performance is at least partially based on doing most of their operations in C++, so going through an R layer may dilute some of that. This could be overcome by writing bindings to tatami, which is my unofficial "C++ API for DelayedArrays"; then downstream tatami-compliant packages will be able to call the C++ code directly, without going through R. There's not a lot of these packages yet, though.
If you want to give this a go, have a look at the TileDBArray package (and the associated tatami bindings) for a demonstration. Incidentally, TileDB also positions itself as a successor to HDF5, particularly in the single-cell space, so that might be worth checking out if you just want something better than HDF5.
Hi, I'm implementing the BPCells backend, but there are some errors confusing me.
Why DelayedArray
function relies on extract_array
methods which always return a dense matrix (I have tried to return a sparse matrix, but it failed) causing error: cannot allocate vector of size 177.3 Gb?
"pbmc_3k_rna_raw" is the path of BPCellsDir file created BPCells::write_matrix_dir
DelayedArray::DelayedArray(BPCellsDirSeed("pbmc_3k_rna_raw"))
>Error in h(simpleError(msg, call)) :
>error in evaluating the argument 'x' in selecting a method for function 'type': cannot allocate vector of size 177.3 Gb
The codes I used are as follows:
#' @importClassesFrom BPCells MatrixDir
methods::setClass("BPCellsDirSeed", contains = "MatrixDir")
BPCellsDirSeed <- function(x, buffer_size = 8192L) {
if (rlang::is_string(x)) {
matrix_dir <- BPCells::open_matrix_dir(
dir = x, buffer_size = buffer_size
)
} else if (methods::is(x, "MatrixDir")) {
matrix_dir <- x
} else {
cli::cli_abort(
"{.arg x} must be a {.cls string} or {.cls MatrixDir} object"
)
}
methods::as(matrix_dir, "BPCellsDirSeed")
}
#' @importClassesFrom DelayedArray DelayedArray
methods::setClass("BPCellsDirArray",
contains = "DelayedArray",
slots = c(seed = "BPCellsDirSeed"),
prototype = list(seed = methods::new("BPCellsDirSeed"))
)
#' @importMethodsFrom DelayedArray elayedArray
#' @importFrom DelayedArray new_DelayedArray
methods::setMethod(
"DelayedArray", "BPCellsDirSeed",
function(seed) {
new_DelayedArray(seed, Class = "BPCellsDirArray")
}
)
BPCellsDirArray <- function(x, ...) {
DelayedArray(BPCellsDirSeed(x, ...))
}
#' @export
methods::setMethod("type", "BPCellsDirSeed", function(x) {
# it seems R don't differentiate 32-bit and 64-bit real number
switch(x@type,
uint32_t = "integer",
float = "double", # 32-bit real number
double = "double" # 64-bit real number
)
})
#' @export
methods::setMethod(
"extract_array", "BPCellsDirSeed",
function(x, index) {
out <- as.matrix(extract_bpcells_array(x, index))
storage.mode(out) <- type(x)
out
}
)
extract_bpcells_array <- function(x, index) {
i <- index[[1L]]
j <- index[[2L]]
if (!length(i) && !length(j)) {
out <- x
} else if (length(i) && length(j)) {
out <- x[i, j]
} else if (length(i)) {
out <- x[i, ]
} else {
out <- x[, j]
}
methods::as(out, "dgCMatrix")
}
#' @export
#' @importFrom DelayedArray path
methods::setMethod("path", "BPCellsDirSeed", function(object, ...) {
object@dir
})
Hi @Yunuuuu, I'll take a look. Are you working on a BPCellArray package or somthing like that? It's going to be easier if the BPCell backend for DelayedArray objects is implemented in a dedicated package.
Yes, I have created a pacakge named BPCellsArray
deposited in https://github.com/Yunuuuu/BPCellsArray, I finally figured it out, the problem exist in extract_bpcells_array
function.
Now, two basic class MatrixSubset
and ConvertMatrixType
has been implemented, additionally, MatrixDir
object using Directory of files format has also been implemented with DelayedArray. I'll implement Hdf5 file format tomorrow
https://bnprks.github.io/BPCells/articles/web-only/bitpacking-format.html
Thanks for the link. If you are now working on a BPCellsArray package, can we close this issue? Feel free to ask any question in a new issue if you run into other problems related to the implementation of this new backend.
Of course, you can close this thread. It is better to ask any questions on the Bioconductor support site. Apologies for any inconvenience caused.
No worries
BPCell
brings a more effecient on-disk matrix attached with a series operation for single cell. Can this be integrated into Delayed Array ?https://bnprks.github.io/BPCells/articles/web-only/benchmarks.html