Memory issue with subassignment

mikejiang commented 6 years ago

The new feature introduced by #32 seems to have memory leak issue, as the example shown below the DelayedArray object gets bigger after each iteration of subassignment

library(HDF5Array)

M <- matrix(1:100, nrow=10)
tmp <- tempfile()
h5write(M, tmp, "data")
hm <- HDF5Array(tmp, "data")
object.size(hm)
#> 2144 bytes
set.seed(123)
for(k in 1:10)
{
  i <- sample(10,10)
  j <- sample(10,10)
  hm[i, j][hm[i, j] >0] <- 0
}

object.size(hm)
#> 10593240 bytes

^{Created on 2018-10-01 by the reprex package (v0.2.1)}

PeteHaitch commented 6 years ago

Take a look at showtree(hm) - you build up a large and complicated DelayedArray using this code. Is this something you actually want to do in practice?

FWIW, I wouldn't call this a memory leak, rather I'd say this subassignment is something you want to avoid doing without realizing the result fairly soon afterward. Each subassignment has to keep a bunch of stuff in-memory and I think it can't simplify this (although you can of course explicitly realize one of the intermediate results).

hpages commented 6 years ago

As Pete said, this is not memory leak, just growing an object. The more delayed operations a DelayedArray object is going to carry, the bigger the object is going to be in memory.

Delayed ops are represented by DelayedOp objects. There are currently 8 types of DelayedOp (see ?DelayedOp). In general these are small objects except for DelayedSubset and DelayedSubassign objects (which are DelayedOp derivatives for representing subsetting and subassignment, respectively). These objects need to store the index of the subsetting or the "left index" of the subassignment. This index is a list of NULLs or integer vectors, one per dimension in the object, so it can be big.

Furthermore, when various subsetting operations are applied to a DelayedArray object (consecutively like in M[1:2, ][ , -5], or separated by other operations like in log(t(M[1:2, ]))[-5, ]), they can be merged into a single delayed subsetting operation that summarizes them all. This simplification of the tree of delayed operations is performed by simplify(). But there is no such thing for subassignments. These operations will accumulate in the tree of delayed operations carried by the object and cannot in general be merged into a single subassignment operation. Although there are situations where they could be merged, simplify() doesn't detect those situations at the moment (and it's not clear that the cost of detecting them wouldn't outweigh the benefit of the simplification).

To summarize, all delayed operations have 2 costs: (1) the cost of representing them (i.e. the memory footprint of the DelayedOp objects that represent them), and (2) the cost of realizing them. (1) is paid immediately and (2) is paid later at realization time. (1) is generally small but can become significant if the DelayedArray object carries many delayed operations that cannot be simplified. For example, doing M <- 1 / (M^2 + 1) in a loop would also grow the tree of delayed ops carried by M and there would be no way to simplify that tree.

So the advice is: even though a DelayedArray object supports delayed operations, one needs to be careful to not accumulate too many operations on it. Especially if the operations are subassignments, which can be expensive to represent (if the "left index" is big) and cannot be simplified.

Hope this makes sense.

mikejiang commented 6 years ago

Unfortunately this is the real use case where the matrix needs to be partially updated multiple time within the loop. After the entire process is finished, the memory usage of the DelayedArray object grows to 30G! (I've confirmed that the results are correct though). I guess the solution will be realize it immediately after each iteration (at cost of computing time). Since I am using Bigmemory as the backend, I wonder if you can start to write a general documentation on how to implement realization. Or you can point me to the sources of HDF5Array that I need to replicate (I guess I need the minimum specifications in order for realize to work)

PeteHaitch commented 6 years ago

Does Bigmemory support in-place replacement of values? This might be better than delaying the subassignment, for this use case.

hpages commented 6 years ago

It would indeed. Because each realization will copy the entire data set!

Anyway FWIW I've started to put together some notes for how to implement a realization backend here: https://github.com/Bioconductor/DelayedArray/issues/37

For your use case, if Bigmemory supports in-place replacement of values and you feel like taking the BigmemRealizationSink route, then you might consider implementing an extract_array() and replace_array() methods for BigmemRealizationSink objects. This would allow you to do all those replacements in-place by using something like this:

sink <- BigmemRealizationSink(dim(M), dimnames(M), type(M))
DelayedArray:::BLOCK_write_to_sink(M, sink)
for (k in ...) {
    ## Make sure that 'i' and 'j' are "short subscripts" so that extract_array() returns
    ## a "small" matrix 'm' (i.e. a matrix that fits in memory) at each iteration.
    i <- ...
    j <- ... 
    m <- extract_array(sink, list(i, j))
    m[m > 0] <- 0
    replace_array(sink, list(i, j), m)
}
close(sink)
M2 <- as(sink, "DelayedArray")

Note that there is no replace_array() generic in DelayedArray so you could either come up with your own generic or just implement this as a regular function.

hpages commented 4 years ago

@mikejiang Can we close this one too?

Bioconductor / DelayedArray

Memory issue with subassignment #36