Combining DelayedArrays coerces later objects to the first backend

LTLA commented 3 years ago

The following behavior is more or less expected:

library(DelayedArray)
X <- DelayedArray(rbind(runif(10)))
Y <- rbind(X, X)
showtree(Y)
## 2x10 double: DelayedMatrix object
## └─ 2x10 double: Abind (along=1)
##    ├─ 1x10 double: [seed] matrix object
##    └─ 1x10 double: [seed] matrix object

But if we throw in another DA subclass, rbind() converts the later arguments to that class:

library(HDF5Array)
Z <- as(X, "HDF5Array")
AA <- rbind(Z, X)
showtree(AA)
## 2x10 double: DelayedMatrix object
## └─ 2x10 double: Abind (along=1)
##    ├─ 1x10 double: [seed] HDF5ArraySeed object
##    └─ 1x10 double: [seed] HDF5ArraySeed object

It is easy to see how this could be problematic if the conversion cost is high. The problem lies in DelayedArray:::stash_DelayedAbind, which runs S4Vectors:::prepare_objects_to_bind, which in turn runs coerce2. It seems like it would be possible to not do this coercion if we see that all entities are DAs.

FWIW, my actual use case is that of my new ConstantMatrix class. If you were to install mumosa, you could do:

library(mumosa)
big.missing <- ConstantMatrix(c(1e5, 1e6), NA)
X <- DelayedArray(rbind(runif(1e6)))

rbind(X, big.missing)
## <100001 x 1000000> matrix of class DelayedMatrix and type "double":
##                 [,1]       [,2]       [,3] ...  [,999999] [,1000000]
##      [1,]  0.3237279  0.5398988  0.5436987   .  0.1774165  0.1595351
##      [2,]         NA         NA         NA   .         NA         NA
##      [3,]         NA         NA         NA   .         NA         NA
##      [4,]         NA         NA         NA   .         NA         NA
##      [5,]         NA         NA         NA   .         NA         NA
##       ...          .          .          .   .          .          .
##  [99997,]         NA         NA         NA   .         NA         NA
##  [99998,]         NA         NA         NA   .         NA         NA
##  [99999,]         NA         NA         NA   .         NA         NA
## [100000,]         NA         NA         NA   .         NA         NA
## [100001,]         NA         NA         NA   .         NA         NA

rbind(big.missing, X)
## Error in checkSlotAssignment(object, name, value) : 
##   c("assignment of an object of class “matrix” is not valid for slot ‘seed’ in an object of class “ConstantMatrix”; is(value, \"ConstantMatrixSeed\") is not TRUE", "assignment of an object of class “array” is not valid for slot ‘seed’ in an object of class “ConstantMatrix”; is(value, \"ConstantMatrixSeed\") is not TRUE")

The last is a coercion error, which seems unnecessary to me; I'm happy with the Abind wrapper from the first rbind call.

hpages commented 3 years ago

Fixed in DelayedArray 0.17.4 (see commit f1279e072d94793add8ba65f281def5a967a4406).

With this change cbind()/rbind() can now combine different types of DelayedArray derivatives without trying to coerce them to the type of the first object:

library(HDF5Array)
M1 <- as(matrix(1:12, nrow=4), "HDF5Array")
M2 <- DelayedArray(as(matrix(sample(c(0,1), 20, replace=TRUE), nrow=4), "dgCMatrix"))
showtree(cbind(M1, M2))
# 4x8 double: DelayedMatrix object
# └─ 4x8 double: Abind (along=2)
#    ├─ 4x3 integer: [seed] HDF5ArraySeed object
#    └─ 4x5 double, sparse: [seed] dgCMatrix object

Try this with your mumosa (no idea what that is but it sounds refreshing) and let me know if that fixes the problem.

LTLA commented 3 years ago

Thanks, looks good on my end.

Bioconductor / DelayedArray

Combining DelayedArrays coerces later objects to the first backend #77