Bioconductor / DelayedArray

A unified framework for working transparently with on-disk and in-memory array-like datasets
https://bioconductor.org/packages/DelayedArray
24 stars 9 forks source link

Coercion from SparseArraySeed to sparse matrix fails with duplicates #88

Closed LTLA closed 1 month ago

LTLA commented 3 years ago

Using an example in ?SparseArraySeed to illustrate:

library(DelayedArray)
set.seed(10)

## A big very sparse DelayedMatrix object:
nzindex4 <- cbind(sample(25000, 600000, replace=TRUE),
                   sample(195000, 600000, replace=TRUE))
nzdata4 <- runif(600000)
sas4 <- SparseArraySeed(c(25000, 195000), nzindex4, nzdata4)
X <- as(sas4, "sparseMatrix")
## Error in validObject(.Object) : 
##   invalid class “dgCMatrix” object: slot i is not *strictly* increasing inside a column

The problem seems to be caused by duplicate non-zero coordinates (i.e., same i and j) in the SAS itself. One could argue that they should not be allowed to exist in the SAS at all - what does it even mean to have a duplicate entry?

The constructor should probably warn/error/remove duplicates, though this may be an expensive operation. Perhaps add a check=TRUE option that can be set to FALSE for all internal uses where duplicates cannot exist.

Session information ``` R Under development (unstable) (2021-01-24 r79876) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.5 LTS Matrix products: default BLAS: /home/luna/Software/R/trunk/lib/libRblas.so LAPACK: /home/luna/Software/R/trunk/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] DelayedArray_0.17.7 IRanges_2.25.6 S4Vectors_0.29.7 [4] MatrixGenerics_1.3.1 matrixStats_0.58.0 BiocGenerics_0.37.1 [7] Matrix_1.3-2 loaded via a namespace (and not attached): [1] compiler_4.1.0 grid_4.1.0 lattice_0.20-41 ```
hpages commented 3 years ago

Only reason I don't check for duplicate entries in the nzindex slot at the moment is because it's kind of expensive. Also these duplicates should not occur in the context of "normal operations", only when you manually craft your own SparseArraySeed object.

The meaning of duplicate entries, if we were to support them, would be that the corresponding values in the nzdata slot add up.

There's also the question of what to do with zeroes in the nzdata slot. Whether we allow them or not is tightly related with the question of duplicate entries in the nzindex slot. If we allow these things, arithmetic operations on SparseArraySeed objects (not supported yet, but on the TODO list) can be made as fast as they can be because we don't need to clean the combined nzindex and nzdata slots after each operation.

I've actually postponed the decision of what to do with duplicate entries in the nzindex slot and zeroes in the nzdata slot until I start to work on these arithmetic operations (and other operations) on SparseArraySeed objects.

hpages commented 1 month ago

SparseArraySeed objects are now deprecated in favor of COO_SparseArray objects (in DelayedArray >= 0.31.5), but it seems that COO_SparseArray objects had the same issue. The issue is now fixed in SparseArray 1.5.27:

library(SparseArray)
set.seed(10)

## A big very sparse DelayedMatrix object:
nzcoo4 <- cbind(sample(25000, 600000, replace=TRUE),
                sample(195000, 600000, replace=TRUE))
nzdata4 <- runif(600000)
coo4 <- COO_SparseArray(c(25000, 195000), nzcoo4, nzdata4)
X <- as(coo4, "sparseMatrix")

validObject(X)  # TRUE

See https://github.com/Bioconductor/SparseArray/commit/5f06e0caa2a717f5dcd3f9b6490ef64ddf343819

Note that duplicate non-zero coordinates in the nzcoo slot of a COO_SparseArray object are still allowed, at least for now:

nzcoo <- rbind(c(1,1), c(2,2), c(1,1), c(1,1))
coo <- COO_SparseArray(5:4, nzcoo, 11:14)
coo@nzcoo
#      [,1] [,2]
# [1,]    1    1
# [2,]    2    2
# [3,]    1    1
# [4,]    1    1

In this case, the identical entries are ignored except the last one:

coo
# <5 x 4 SparseMatrix> of type "integer" [nzcount=3 (15%)]:
#      [,1] [,2] [,3] [,4]
# [1,]   14    0    0    0
# [2,]    0   12    0    0
# [3,]    0    0    0    0
# [4,]    0    0    0    0
# [5,]    0    0    0    0

as(coo, "sparseMatrix")
# 5 x 4 sparse Matrix of class "dgCMatrix"
#
# [1,] 14  . . .
# [2,]  . 12 . .
# [3,]  .  . . .
# [4,]  .  . . .
# [5,]  .  . . .

H.