Bioconductor / DelayedArray

A unified framework for working transparently with on-disk and in-memory array-like datasets
https://bioconductor.org/packages/DelayedArray
24 stars 9 forks source link

Adding a ConstantArray class #87

Closed LTLA closed 3 years ago

LTLA commented 3 years ago

Nothing too fancy here, this class just represents an array with a constant value for all of its cells.

This was originally implemented as part of SingleCellExperiment, which contains a primitive implementation of what I hope will be the combineRows implementation for SummarizedExperiment objects. The idea is to create all-NA matrices for assays that are not present in some SE objects, thus allowing them to be efficiently combined (without pretending that they're zero, which they're not).

I thought about using an RleArray for this purpose but it breaks pretty quickly with:

RleArray(Rle(NA, 1e10), c(1e5, 1e5))
## Error in .make_RleArraySeed_from_Rle(data, dim, dimnames, chunksize) : 
##   Input data is too long (>= 2^31). Please supply an ordinary list of Rle
##   objects instead, or an RleList object, or a DataFrame object where all
##   the columns are Rle objects.

I guess I could make a list of Rle objects, but it's a bit of a chore to have to create the chunks manually. Probably like:

NR <- 1e5
NC <- 1e5
chunk.size <- floor(.Machine$integer.max/NR)
nchunks <- ceiling(NC/chunk.size)
chunks <- rep(list(Rle(NA, NR * chunk.size)), nchunks)
chunks[[nchunks]] <- Rle(NA, NR*(NC - chunk.size * (nchunks - 1)))
RleArray(chunks, c(NR, NC))

By comparison, the proposed class is simple to understand and use (and maintain) by just calling:

ConstantArray(c(1e5, 1e5), NA)

Data extraction also seems faster based on some cursory timings, but this isn't a major consideration here.

hpages commented 3 years ago

Sounds good but does this work?

> ConstantArray(5:7, value=NA)
Error in new_DelayedArray(seed, Class = "ConstantMatrix") : 
  the supplied seed must have exactly 2 dimensions when the specified
  class (ConstantMatrix) extends DelayedMatrix

Seems like some unit tests are needed.

LTLA commented 3 years ago

... only for matrices, apparently. Oops. Fixed.

hpages commented 3 years ago

Thanks. A few more things:

  1. ConstantMatrix needs to contain ConstantArray so the following returns TRUE:

    > is(ConstantArray(5:6, value=NA), "ConstantArray")
    [1] FALSE

    Another benefit of doing this is that ConstantMatrix will then inherit the ConstantArray reprresentation so you don't need to specify it again in the ConstantMatrix class definition. It will just be:

    setClass("ConstantMatrix", contains=c("ConstantArray", "DelayedMatrix"))

    Unfortunately, after doing this, you need to define coercion methods from ConstantMatrix to ConstantArray and from ConstantArray to ConstantMatrix to prevent some bad things to happen. See how these coercions are defined for RleArray/RleMatrix in RleArray-class.R and do the same thing.

  2. Validity method: Use setValidity2 instead of setValidity. Use msg <- validate_dim_slot(object, "dim") to validate the dim slot (it will also make sure that the slot doesn't contain NAs). Don't try to return all the problems in msg but return only the first one. See validity method for SparseArraySeed for an example.

  3. In addition to all the atomic types, the DelayedArray framework also supports arrays of type list so this would need to be supported:

    ConstantArray(4:3, value=list(55))
    ConstantArray(4:3, value=list(letters))

    I suggest you replace value="ANY" with value="vector" in the class definition (that's how the nzdata slot is specified for SparseArraySeed). Then just remove the !is.atomic(object@value) test in the validity method and you should be good.

  4. The extract_array() method should just return array(x@value, get_Nindex_lengths(index, dim(x))). Your .get_constant_dim helper is not needed.

  5. What's considered a zero depends on the type of a DelayedArray object. For example, if the type is character, the zero value is the empty string. If the type is list, it's NULL. If you use:

    zero <- vector(type(x), length=1L)
    identical(x@value, zero)

    in your is_sparse() method then it should do the right thing whatever the type. (However I just discovered that dense2sparse() is broken on arrays of type list so avoid this in your tests.)

  6. Export the ConstantArraySeed() constructor function, especially since you document it.

LTLA commented 3 years ago

Done... I think I got all of the above.

hpages commented 3 years ago

Thanks. Sorry to insist but you don't need to specify the representation in the class definition of ConstantMatrix 'cause you inherit it from ConstantArray (hey, I gave you above the exact setClass statement to use for this class). Other than that, everything looks good.

LTLA commented 3 years ago

ooops.

hpages commented 3 years ago

Just for fun, here's one way to make a delayed constant matrix without using the new ConstantMatrix stuff:

out2 <- DelayedArray(matrix(NA_real_))[rep(1, 1e6), rep(1, 1e6)]

However, it's 10x slower than:

out <- ConstantArray(c(1e6, 1e6), value=NA_real_)

and also the memory footprint of the resulting object is 5000x bigger.