Bioconductor / SparseArray

High-performance sparse data representation and manipulation in R

`SparseArray::nzvals` orders the values but `nzcoo` does not #14

Closed LTLA closed 2 months ago

LTLA commented 2 months ago
library(SparseArray)
set.seed(1000)
basic <- matrix(rpois(1000, 0.1), ncol=10)

# Transposing so that the coordinates are not ordered by (column, row) anymore.
y <- t(as(basic, "COO_SparseArray"))

z <- matrix(0L, nrow(y), ncol(y))
z[nzcoo(y)] <- nzvals(y)

# Whoops!
all.equal(z, t(basic))
## [1] "Mean relative difference: 0.6666667"

Looks like nzvals() does some work in .normalize_COO_SparseArray() that may not match up with nzcoo()'s output.

Session information
```
R version 4.4.0 Patched (2024-05-20 r86569)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS: /home/luna/Software/R/R-4-4-branch/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-4-branch/lib/libRlapack.so; LAPACK version 3.12.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] SparseArray_1.5.16 S4Arrays_1.5.3 IRanges_2.39.1
[4] abind_1.4-5 S4Vectors_0.43.1 MatrixGenerics_1.17.0
[7] matrixStats_1.3.0 BiocGenerics_0.51.0 Matrix_1.7-0

loaded via a namespace (and not attached):
[1] zlibbioc_1.51.1 compiler_4.4.0 tools_4.4.0 XVector_0.45.0
[5] crayon_1.5.3 grid_4.4.0 lattice_0.22-6
```
hpages commented 2 months ago

Right, you need to use nzwhich() instead of nzcoo():

z[nzwhich(y)] <- nzvals(y)
identical(z, t(basic))
# [1] TRUE

nzcoo() and nzdata() are slot accessors for COO_SparseArray objects.

nzwhich() and nzvals() are generic functions whose behavior is independent of the internal representation of the sparse array, e.g. they work on SVT_SparseArray, dgCMatrix, and lgCMatrix objects, etc.

Only when a COO_SparseArray object x is normalized (i.e. its nzcoo slot is strictly ordered and its nzdata slot contains no zeros) will nzwhich(x, arr.ind=TRUE) and nzvals(x) return the same things as nzcoo(x) and nzdata(x).
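
A minimal base R sketch of why the pairing matters. It does not use SparseArray at all: `coo` and `vals` are hypothetical stand-ins for the nzcoo and nzdata slots of an unnormalized object, and `coo_sorted` and `vals_sorted` stand in for what a normalized view would return, assuming normalization means sorting in column-major order.

```r
# Small dense reference matrix with three nonzero entries.
dense <- matrix(c(0L, 3L, 0L,  0L, 1L, 0L,  2L, 0L, 0L), nrow = 3)

# Hypothetical unnormalized COO parts: (row, col) pairs in arbitrary order,
# with the values listed in that same arbitrary order.
coo <- rbind(c(1L, 3L), c(2L, 1L), c(2L, 2L))
vals <- dense[coo]

# Hypothetical normalized parts: both reordered by column-major linear index.
lin <- (coo[, 2] - 1L) * nrow(dense) + coo[, 1]
ord <- order(lin)
coo_sorted <- coo[ord, , drop = FALSE]
vals_sorted <- vals[ord]

# Rebuild a dense matrix from a set of coordinates and values.
fill <- function(coords, values) {
    z <- matrix(0L, nrow(dense), ncol(dense))
    z[coords] <- values
    z
}

ok_raw    <- identical(fill(coo, vals), dense)               # matching pair
ok_sorted <- identical(fill(coo_sorted, vals_sorted), dense) # matching pair
mixed     <- identical(fill(coo, vals_sorted), dense)        # mismatched pair
```

Either matching pair rebuilds `dense`; mixing unsorted coordinates with sorted values does not, which is the same kind of mismatch as in the original report.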

H.

hpages commented 2 months ago

Or use nzdata() instead of nzvals():

z[nzcoo(y)] <- nzdata(y)
identical(z, t(basic))
# [1] TRUE
LTLA commented 2 months ago

Ah, okay. I'll switch whichNonZero() to just use nzwhich() and nzvals() under the hood, then.

That said, some of my tests still fail:

library(SparseArray)
library(DelayedArray)
stuff <- Matrix::rsparsematrix(1000, 1000, density=0.01)
wrapped <- DelayedArray(stuff)
nzwhich(wrapped)
## Error in extract_sparse_array(x@seed, index) : NOT IMPLEMENTED YET!
## In addition: Warning message:
## In which(is_nonzero | is.na(is_nonzero), arr.ind = arr.ind, useNames = FALSE) :
##  'useNames' is ignored when 'x' is a DelayedArray object or derivative

I would have expected nzwhich() to work on any sparse array-ish object. Interestingly, it behaves as expected for a dense DelayedArray, albeit with a noisy warning:

dwrapped <- DelayedArray(as.matrix(stuff))
str(nzwhich(dwrapped))
##  int [1:10000] 63 133 338 452 563 579 731 844 912 1398 ...
## Warning message:
## In which(is_nonzero | is.na(is_nonzero), arr.ind = arr.ind, useNames = FALSE) :
##   'useNames' is ignored when 'x' is a DelayedArray object or derivative
Session information
```
R version 4.4.1 Patched (2024-06-20 r86796)
Platform: aarch64-apple-darwin22.6.0
Running under: macOS Ventura 13.6.7

Matrix products: default
BLAS: /Users/luna/Software/R/R-4-4-branch/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-4-4-branch/lib/libRlapack.dylib; LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] DelayedArray_0.31.6 SparseArray_1.5.16 S4Arrays_1.5.3
[4] IRanges_2.39.1 abind_1.4-5 S4Vectors_0.43.1
[7] MatrixGenerics_1.17.0 matrixStats_1.3.0 BiocGenerics_0.51.0
[10] Matrix_1.7-0

loaded via a namespace (and not attached):
[1] zlibbioc_1.51.1 lattice_0.22-6 XVector_0.45.0 grid_4.4.1
[5] compiler_4.4.1 tools_4.4.1 crayon_1.5.3
```
hpages commented 2 months ago

Hmm, so it seems that two components are missing for nzwhich() to work on a sparse DelayedArray object: (1) an extract_sparse_array() method for DelayedNaryIsoOp objects, and (2) a which() method for SVT_SparseArray objects. Working on it now.
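
For context, the `which()` call quoted in the warning above suggests roughly what the default method does. Below is a base R sketch of that logic on an ordinary matrix; `nzwhich_sketch` is a hypothetical name, and defining `is_nonzero` as `x != 0` is an assumption, not the package's actual code.

```r
# Hypothetical re-creation of the default logic, inferred from the warning text.
nzwhich_sketch <- function(x, arr.ind = FALSE) {
    is_nonzero <- x != 0   # assumption: elementwise comparison against zero
    # NA entries are counted as nonzero, as in the quoted call.
    which(is_nonzero | is.na(is_nonzero), arr.ind = arr.ind, useNames = FALSE)
}

m <- matrix(c(0, 2, NA, 0, 0, 5), nrow = 2)
idx <- nzwhich_sketch(m)                      # linear indices
idx_arr <- nzwhich_sketch(m, arr.ind = TRUE)  # one (row, col) pair per row
```

On a DelayedArray, `x != 0` and the `|` would presumably become delayed operations, which would explain why the call ends up needing extract_sparse_array() for a DelayedNaryIsoOp.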

hpages commented 2 months ago

In the end I went for a dedicated nzwhich() method for DelayedArray objects. It should be slightly more efficient than relying on the default nzwhich() method:

library(DelayedArray)
set.seed(2009)
stuff <- Matrix::rsparsematrix(1000, 1000, density=0.01)
wrapped <- DelayedArray(stuff)
str(nzwhich(wrapped))
# int [1:10000] 76 124 157 250 554 812 985 1123 1298 1320 ...
identical(nzwhich(wrapped), nzwhich(stuff))
# [1] TRUE

This is in DelayedArray 0.31.7 (see https://github.com/Bioconductor/DelayedArray/commit/cf7427b72cd6df7715b0d962927869a6da2fee15).
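
Code that depends on this behavior could guard on the installed version. This is only a sketch: `min_version` and `has_fix` are illustrative names, and what to do when the check fails is left open.

```r
# Version that introduced the dedicated nzwhich() method, per the comment above.
min_version <- "0.31.7"

# TRUE only if DelayedArray is installed and at least that version.
# Note that requireNamespace() loads the namespace as a side effect.
has_fix <- requireNamespace("DelayedArray", quietly = TRUE) &&
    utils::packageVersion("DelayedArray") >= min_version
```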

The two methods I mentioned above (extract_sparse_array(<DelayedNaryIsoOp>) and which(<SVT_SparseArray>)) are still missing, but they will have to wait for now.