Bioconductor / SparseArray

High-performance sparse data representation and manipulation in R

`nzwhich()` has integer overflow for a large `dgCMatrix` #16

LTLA closed this issue 2 months ago

LTLA commented 2 months ago
```r
x <- Matrix::rsparsematrix(1e6, 1e6, 0.000001)
nzwhich(x, arr.ind=TRUE)
## Error in .Call2("C_Lindex2Mindex", Lindex, dim, use.names, PACKAGE = "S4Arrays") :
##   Lindex[2147] is NA
## In addition: Warning message:
## In x_nrow * seq_len(x_ncol) : NAs produced by integer overflow
```

By comparison, `which(x != 0, arr.ind=TRUE)` works as expected.
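The warning above points at an integer product, `x_nrow * seq_len(x_ncol)`. A minimal sketch of that arithmetic in isolation, assuming the same 1e6 x 1e6 dimensions as in the report (the variable names are borrowed from the warning message; this is not the package's code, and promoting to double is just one way to avoid the overflow, not necessarily the fix that was applied):

```r
# Assumption: dimensions mirror the 1e6 x 1e6 matrix from the report.
x_nrow <- 1000000L
x_ncol <- 1000000L

# Integer * integer: any product above .Machine$integer.max (2147483647)
# becomes NA, with an "NAs produced by integer overflow" warning.
int_offsets <- suppressWarnings(x_nrow * seq_len(x_ncol))
first_na <- which(is.na(int_offsets))[1L]
first_na  # 2148, since 1e6 * 2148 exceeds 2147483647

# Promoting one operand to double keeps every product exactly representable.
dbl_offsets <- as.double(x_nrow) * seq_len(x_ncol)
anyNA(dbl_offsets)  # FALSE
```

Doubles represent integers exactly up to 2^53, which comfortably covers the 1e12 linear positions of a matrix this size.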

Session information

```
R version 4.4.1 Patched (2024-06-20 r86796)
Platform: aarch64-apple-darwin22.6.0
Running under: macOS Ventura 13.6.7

Matrix products: default
BLAS:   /Users/luna/Software/R/R-4-4-branch/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-4-4-branch/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] beachmat_2.21.3       SparseArray_1.5.17    S4Arrays_1.5.3
 [4] IRanges_2.39.1        abind_1.4-5           S4Vectors_0.43.1
 [7] MatrixGenerics_1.17.0 matrixStats_1.3.0     BiocGenerics_0.51.0
[10] Matrix_1.7-0

loaded via a namespace (and not attached):
[1] zlibbioc_1.51.1     lattice_0.22-6      XVector_0.45.0
[4] grid_4.4.1          DelayedArray_0.31.6 compiler_4.4.1
[7] tools_4.4.1         Rcpp_1.0.12         crayon_1.5.3
```

hpages commented 2 months ago

Thanks. Fixed in SparseArray 1.5.18.

> By comparison, `which(x != 0, arr.ind=TRUE)` works as expected.

Not if `x` contains NAs. FWIW, the default `nzwhich()` method does something like that:

```r
> SparseArray:::default_nzwhich
function (x, arr.ind = FALSE) 
{
    if (!isTRUEorFALSE(arr.ind)) 
        stop(wmsg("'arr.ind' must be TRUE or FALSE"))
    zero <- vector(type(x), length = 1L)
    is_nonzero <- x != zero
    which(is_nonzero | is.na(is_nonzero), arr.ind = arr.ind, 
        useNames = FALSE)
}
<bytecode: 0x60cef6bfdaa0>
<environment: namespace:SparseArray>
```

However, it's very inefficient on a big `dg[C|R]Matrix` object, hence the dedicated methods for `CsparseMatrix` and `RsparseMatrix` objects.
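To illustrate why a dedicated method can be cheaper, here is a rough sketch that reads the (row, col) pairs straight from the `@i`, `@p`, and `@x` slots of a `dgCMatrix` instead of evaluating `x != zero` over the whole matrix. The helper name `sketch_nzwhich_dgC` is hypothetical and this is not SparseArray's actual implementation; it only assumes the standard compressed-column layout from the Matrix package.

```r
library(Matrix)

# Hypothetical helper for illustration only; not part of SparseArray.
sketch_nzwhich_dgC <- function(x) {
    stopifnot(is(x, "dgCMatrix"))
    # A dgCMatrix may store explicit zeros, so keep only stored values
    # that are nonzero or NA (matching the default method's NA handling).
    keep <- is.na(x@x) | x@x != 0
    row <- x@i + 1L                              # @i holds 0-based row indices
    col <- rep.int(seq_len(ncol(x)), diff(x@p))  # expand column pointers
    cbind(row = row[keep], col = col[keep])
}

# Small example with one NA entry.
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(5, NA, 7),
                  dims = c(3, 4))
res <- sketch_nzwhich_dgC(m)
res
```

The work here is proportional to the number of stored entries rather than to `nrow * ncol`, and because rows and columns are produced directly, no linear index is formed along the way.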