Bioconductor / SparseArray

High-performance sparse data representation and manipulation in R
8 stars 2 forks source link

Unexpected errors during subassignment; not mimicking arrays #17

Open taylorpetty opened 4 months ago

taylorpetty commented 4 months ago

I'm unable to assign 1- or 2D-slices to vectors. I can loop through every element of an array with nested for-loops, but it is incredibly slow to do. I get a variety of error messages, demonstrated below. All errors were identical whether using = or <-, on a Windows laptop and a Linux server.

library(SparseArray)
a = array(1:45, dim = c(3,3,5))
s = SparseArray(a)

a[2,,4] = 7:9
s[2,,4] = 7:9 # error
# Error in .subassign_SVT_with_Rarray(x, index, value) : 
#   is.array(Rarray) is not TRUE

a[,,3] = matrix(1:9,nrow=3)
s[,,3] = matrix(1:9,nrow=3) # error; can also add as.array() outside to get same error message
# Error in .subassign_SVT_SparseArray_by_index(x, index, value) : 
#   the selection and supplied value must have the same dimensions

The errors differ whether subsetting by column or by row, and whether it's the entire column or a subset of it:

a = array(1:10, dim = c(5,2))
s = SparseArray(a)
a[1:5,1] = 8:12
s[1:5,1] = 8:12 # *no* error
s[1:4,1] = 8:11 # error
# Error in .Call2("C_subassign_SVT_with_short_Rvector", x@dim, x@type, x@SVT,  : 
#                   SparseArray internal error in init_left_bufs():
#                   invalid short Rvector length

a[1,1:2] = 8:9
s[1,1:2] = 8:9 # error
# Error in .subassign_SVT_with_Rarray(x, index, value) : 
#   is.array(Rarray) is not TRUE
Windows sessionInfo() ``` R version 4.3.1 (2023-06-16 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 11 x64 (build 22631) Matrix products: default locale: [1] LC_COLLATE=English_United States.utf8 [2] LC_CTYPE=English_United States.utf8 [3] LC_MONETARY=English_United States.utf8 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.utf8 time zone: America/New_York tzcode source: internal attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] SparseArray_1.2.4 S4Arrays_1.2.1 IRanges_2.36.0 [4] abind_1.4-5 S4Vectors_0.40.2 MatrixGenerics_1.14.0 [7] matrixStats_1.3.0 BiocGenerics_0.48.1 Matrix_1.6-5 loaded via a namespace (and not attached): [1] zlibbioc_1.48.0 lattice_0.22-5 parallel_4.3.1 XVector_0.42.0 [5] this.path_2.3.1 grid_4.3.1 compiler_4.3.1 tools_4.3.1 [9] crayon_1.5.2 ```
Linux sessionInfo() ``` R version 4.3.1 (2023-06-16) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: Rocky Linux 9.2 (Blue Onyx) Matrix products: default BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP; LAPACK version 3.9.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] SparseArray_1.2.4 S4Arrays_1.2.1 IRanges_2.36.0 [4] abind_1.4-5 S4Vectors_0.40.2 MatrixGenerics_1.14.0 [7] matrixStats_1.2.0 BiocGenerics_0.48.1 Matrix_1.6-4 loaded via a namespace (and not attached): [1] zlibbioc_1.48.0 compiler_4.3.1 tools_4.3.1 XVector_0.42.0 [5] crayon_1.5.2 grid_4.3.1 lattice_0.22-5 ```
hpages commented 4 months ago

Yeah subassignment is still a work-in-progress. In particular using a multidimensional index is not ready in general. It's on the TODO list but it might take a few more weeks before I get to it.

In the meantime, the workaround is to use a linear index i.e. a numeric vector or N-col matrix where N is the number of dimensions. In this case the right value must be a vector. So:

library(SparseArray)
a <- array(1:45, dim = c(3,3,5))
s <- SparseArray(a)

a[2,,4] <- 7:9
s[cbind(2,1:3,4)] <- 7:9
identical(as.array(s), a)  # TRUE

a[,,3] <- matrix(1:9,nrow=3)
s[19:27] <- as.vector(matrix(1:9,nrow=3))
identical(as.array(s), a)  # TRUE

a <- array(1:10, dim = c(5,2))
s <- SparseArray(a)
a[1,1:2] <- 8:9
s[cbind(1,1:2)] <- 8:9
identical(as.array(s), a)  # TRUE

Not always super easy/convenient to translate a multidimensional index [x,y,z] into a linear index though, sorry. However this form of subassignment is well tested and quite efficient.

Hopefully that'll get you going for now.

taylorpetty commented 4 months ago

This is helpful. Thank you for the quick response. If anyone is reading this and wants a quick and dirty workaround solution for now, here's the helper function I came up with:

mygrid <- function(...) {
  as.matrix(data.table::CJ(...))
}

Then you can write things like this:

a = array(0L, dim = 5:3)
s = SparseArray(a)
a[1:2,1:3,2] = 2L
s[mygrid(1:2,1:3,2)] = 2L
identical(a, as.array(s)) # TRUE

x = poissonSparseArray(5:3, lambda=0)
y = x
x[rbind(cbind(1,1:3,1), cbind(2,1:3,1))] <- 10L
y[mygrid(1:2,1:3,1)] <- 10L
identical(x,y) # TRUE
hpages commented 4 months ago

Nice. Good to know about data.table::CJ().

I don't know the details of your use case but note that another way to build a big multidimensional sparse array is to construct slices of lower dimensions and abind() them together. For example, to build a big 3D sparse array, construct the individual 2D slices with something like this:

library(SparseArray)
my_2D_slices <- lapply(1:20, function(i) poissonSparseMatrix(8000, 2500, density=0.15))

or load them from a file, then stack them together with:

my_3D_svt <- do.call(abind, c(my_2D_slices, list(along=3)))
my_3D_svt
# <8000 x 2500 x 20 SparseArray> of type "integer" [nzcount=60008425 (15%)]:
# ,,1
#            [,1]    [,2]    [,3]    [,4] ... [,2497] [,2498] [,2499] [,2500]
#    [1,]       0       0       0       0   .       0       0       0       0
#    [2,]       0       0       0       0   .       0       0       0       0
#     ...       .       .       .       .   .       .       .       .       .
# [7999,]       0       0       0       0   .       0       0       0       1
# [8000,]       0       0       0       0   .       0       0       0       1
# 
# ...
# 
# ,,20
#            [,1]    [,2]    [,3]    [,4] ... [,2497] [,2498] [,2499] [,2500]
#    [1,]       1       0       0       0   .       0       0       0       1
#    [2,]       0       0       0       0   .       0       0       0       0
#     ...       .       .       .       .   .       .       .       .       .
# [7999,]       1       0       0       0   .       0       0       0       0
# [8000,]       0       2       0       0   .       0       0       0       0

Note that stacking SVT_SparseArray objects of identical dimensions like this is a very cheap operation (almost instantaneous!). In particular it will be a lot more efficient than creating a big allzero array and subassigning slices to it.

See ?S4Arrays::abind for more information.

Also be aware that the row/col summarization methods for SVT_SparseArray objects (a.k.a. matrixStats methods) can handle multidimensional objects and support the dims argument:

cv23 <- colVars(my_3D_svt)  # takes < 0.2s
dim(cv23)
# [1] 2500   20

rv12 <- rowVars(my_3D_svt, dims=2)  # takes < 4s
dim(rv12)
# [1] 8000 2500

See ?SparseArray::colSums for more information.

Hope this helps, H.