Bioconductor / DelayedArray

A unified framework for working transparently with on-disk and in-memory array-like datasets
https://bioconductor.org/packages/DelayedArray
24 stars 9 forks source link

Combining DelayedArrays coerces later objects to the first backend #77

Closed LTLA closed 3 years ago

LTLA commented 3 years ago

The following behavior is more or less expected:

library(DelayedArray)
X <- DelayedArray(rbind(runif(10)))
Y <- rbind(X, X)
showtree(Y)
## 2x10 double: DelayedMatrix object
## └─ 2x10 double: Abind (along=1)
##    ├─ 1x10 double: [seed] matrix object
##    └─ 1x10 double: [seed] matrix object

But if we throw in another DA subclass, rbind() converts the later arguments to that class:

library(HDF5Array)
Z <- as(X, "HDF5Array")
AA <- rbind(Z, X)
showtree(AA)
## 2x10 double: DelayedMatrix object
## └─ 2x10 double: Abind (along=1)
##    ├─ 1x10 double: [seed] HDF5ArraySeed object
##    └─ 1x10 double: [seed] HDF5ArraySeed object

It is easy to see how this could be problematic if the conversion cost is high. The problem lies in DelayedArray:::stash_DelayedAbind, which runs S4Vectors:::prepare_objects_to_bind, which in turn runs coerce2. It seems like it would be possible to not do this coercion if we see that all entities are DAs.

FWIW, my actual use case is that of my new ConstantMatrix class. If you were to install mumosa, you could do:

library(mumosa)
big.missing <- ConstantMatrix(c(1e5, 1e6), NA)
X <- DelayedArray(rbind(runif(1e6)))

rbind(X, big.missing)
## <100001 x 1000000> matrix of class DelayedMatrix and type "double":
##                 [,1]       [,2]       [,3] ...  [,999999] [,1000000]
##      [1,]  0.3237279  0.5398988  0.5436987   .  0.1774165  0.1595351
##      [2,]         NA         NA         NA   .         NA         NA
##      [3,]         NA         NA         NA   .         NA         NA
##      [4,]         NA         NA         NA   .         NA         NA
##      [5,]         NA         NA         NA   .         NA         NA
##       ...          .          .          .   .          .          .
##  [99997,]         NA         NA         NA   .         NA         NA
##  [99998,]         NA         NA         NA   .         NA         NA
##  [99999,]         NA         NA         NA   .         NA         NA
## [100000,]         NA         NA         NA   .         NA         NA
## [100001,]         NA         NA         NA   .         NA         NA

rbind(big.missing, X)
## Error in checkSlotAssignment(object, name, value) : 
##   c("assignment of an object of class “matrix” is not valid for slot ‘seed’ in an object of class “ConstantMatrix”; is(value, \"ConstantMatrixSeed\") is not TRUE", "assignment of an object of class “array” is not valid for slot ‘seed’ in an object of class “ConstantMatrix”; is(value, \"ConstantMatrixSeed\") is not TRUE")

The last is a coercion error, which seems unnecessary to me; I'm happy with the Abind wrapper from the first rbind call.

hpages commented 3 years ago

Fixed in DelayedArray 0.17.4 (see commit f1279e072d94793add8ba65f281def5a967a4406).

With this change cbind()/rbind() can now combine different types of DelayedArray derivatives without trying to coerce them to the type of the first object:

library(HDF5Array)
M1 <- as(matrix(1:12, nrow=4), "HDF5Array")
M2 <- DelayedArray(as(matrix(sample(c(0,1), 20, replace=TRUE), nrow=4), "dgCMatrix"))
showtree(cbind(M1, M2))
# 4x8 double: DelayedMatrix object
# └─ 4x8 double: Abind (along=2)
#    ├─ 4x3 integer: [seed] HDF5ArraySeed object
#    └─ 4x5 double, sparse: [seed] dgCMatrix object

Try this with your mumosa (no idea what that is but it sounds refreshing) and let me know if that fixes the problem.

LTLA commented 3 years ago

Thanks, looks good on my end.