Bioconductor / DelayedArray

A unified framework for working transparently with on-disk and in-memory array-like datasets
https://bioconductor.org/packages/DelayedArray
24 stars 9 forks source link

Thoughts on migrating from `read_sparse_block()` to SVTs #105

Closed LTLA closed 1 month ago

LTLA commented 1 year ago

Currently, read_sparse_block() returns the old SparseArraySeed, so I figured I would try updating some of my packages to the new SVT world. It seems SparseArray::extract_sparse_array() or SparseArray::read_block_as_sparse() are the replacements, but I ran into a few hiccups as I tried to switch alabaster.matrix to use them.

1. read_sparse_block() works with not just DelayedArrays, but also the underlying delayed "layers". This is no longer the case with either extract_sparse_array() or read_block_as_sparse().

library(DelayedArray)
X <- DelayedArray(Matrix::rsparsematrix(100, 100, 0.1))
Y <- X * 1

g <- colAutoGrid(Y)
ok <- read_sparse_block(Y@seed, g[[1L]]) # ok
str(ok)
## Formal class 'SparseArraySeed' [package "DelayedArray"] with 4 slots
##   ..@ dim     : int [1:2] 100 100
##   ..@ nzindex : int [1:1000, 1:2] 5 35 53 77 78 86 6 10 35 51 ...
##   ..@ nzdata  : num [1:1000] 1.3 -1.1 -0.47 0.64 -0.84 1.1 -0.45 0.48 -0.86 -0.35 ...
##   ..@ dimnames:List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL

read_block_as_sparse(Y@seed, g[[1L]])
## Error in .Primitive("[")(new("DelayedUnaryIsoOpStack", OPS = list(function (a)  :
##   object of type 'S4' is not subsettable

extract_sparse_array(Y@seed, makeNindexFromArrayViewport(g[[1L]]))
## Error in .Primitive("[")(new("DelayedUnaryIsoOpStack", OPS = list(function (a)  :
##   object of type 'S4' is not subsettable

This is an issue as I sometimes have to "unwrap" delayed objects if I know a more efficient operation can be applied in the presence of certain delayed operations, avoiding the need to use general-purpose block processing.

The current workaround is to just re-wrap everything in a DelayedArray before calling extract_sparse_array(). But it would be nice to have everything just work off the bat, which would save me a little bit of boilerplate.

2. The new functions discard structural elements with a value of zero.

x <- Matrix::rsparsematrix(100, 100, 0.1) * 0
str(x@x)
##  num [1:1000] 0 0 0 0 0 0 0 0 0 0 ...

g <- colAutoGrid(x)
ok <- read_sparse_block(x, g[[1L]]) # ok
str(nzdata(ok))
##  num [1:1000] 0 0 0 0 0 0 0 0 0 0 ...

out <- read_block_as_sparse(x, g[[1L]])
nzcount(out)
## [1] 0

out <- extract_sparse_array(x, makeNindexFromArrayViewport(g[[1L]]))
nzcount(out)
## [1] 0

This causes some back-compatibility problems with old code that previously expected zeros to be counted, e.g., I have a piece of code that uses blockApply() (which calls read_sparse_block() IIUC) and thus counts the zeros, causing mismatching numbers of zeros when interacting with new code that uses extract_sparse_array() and omits those zeros.

More conceptually; it is not impossible to imagine an application that encodes some information in the explicit presence of zeros, e.g., a sparse matrix of log-fold changes for SNPs where an explicit structural zero value indicates no difference and an absent entry indicates that the SNP was not measured in a particular sample. So perhaps it would be better to hold onto to zeros.

FWIW the various sparse Matrix classes preserve zeros, so being consistent with them would be worthwhile if nothing else.

Session information ``` R version 4.3.0 Patched (2023-05-04 r84398) Platform: aarch64-apple-darwin22.3.0 (64-bit) Running under: macOS Ventura 13.2.1 Matrix products: default BLAS: /Users/luna/Software/R/R-4-3-branch/lib/libRblas.dylib LAPACK: /Users/luna/Software/R/R-4-3-branch/lib/libRlapack.dylib; LAPACK version 3.11.0 locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 time zone: America/Los_Angeles tzcode source: internal attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] HDF5Array_1.29.3 rhdf5_2.45.0 DelayedArray_0.27.7 [4] SparseArray_1.1.10 S4Arrays_1.1.4 IRanges_2.35.2 [7] S4Vectors_0.39.1 MatrixGenerics_1.13.0 matrixStats_1.0.0 [10] BiocGenerics_0.47.0 Matrix_1.5-4.1 alabaster.matrix_1.1.3 [13] testthat_3.1.9 loaded via a namespace (and not attached): [1] jsonlite_1.8.7 compiler_4.3.0 crayon_1.5.2 [4] brio_1.1.3 Rcpp_1.0.10 rhdf5filters_1.13.3 [7] lattice_0.21-8 R6_2.5.1 XVector_0.41.1 [10] alabaster.base_1.1.0 desc_1.4.2 rprojroot_2.0.3 [13] pillar_1.9.0 rlang_1.1.1 utf8_1.2.3 [16] pkgload_1.3.2 cli_3.6.1 withr_2.5.0 [19] magrittr_2.0.3 Rhdf5lib_1.23.0 zlibbioc_1.47.0 [22] grid_4.3.0 alabaster.schemas_1.1.2 lifecycle_1.0.3 [25] vctrs_0.6.3 waldo_0.5.1 glue_1.6.2 [28] fansi_1.0.4 tools_4.3.0 ```
hpages commented 1 year ago

None of this is really expected to work properly at the moment. There's a long road before we can get there (see https://github.com/Bioconductor/DelayedArray/blob/devel/TODO). I'll coordinate with you when it's time to make the switch from SparseArraySeed to SVT_SparseArray. In the meantime, please ignore extract_sparse_array() and read_block_as_sparse().

LTLA commented 1 year ago

Oh. I started using it in BioC-devel a few weeks ago. Seemed okay.

More specifically, beachmat now uses extract_sparse_array() to extract SVT_SparseMatrix objects for abstract R matrices that the new tatami interface doesn't otherwise know how to handle. SingleR - the only package currently updated to use the new interface - seems pretty happy with it, so I assumed we were ready to go.

Guess I'll just leave it in there for now. Looks mature enough that it'll be in good shape for the next BioC release.

hpages commented 1 month ago

The migration from SparseArraySeed to SVT_SparseArray/COO_SparseArray objects happened in S4Arrays 1.5.3 and DelayedArray 0.31.5.