Bioconductor / DelayedArray

A unified framework for working transparently with on-disk and in-memory array-like datasets
https://bioconductor.org/packages/DelayedArray

Allowing extract_array calls to use pre-indexed grid information? #91

Open LTLA opened 3 years ago

LTLA commented 3 years ago

The recent conversation in theislab/zellkonverter#34 reminded me of some work I did in LTLA/beachmat#20. Briefly, the idea was to speed up row-based block processing of a dgCMatrix by performing a single pass over the non-zero elements beforehand to identify the start and end of each row block in each column. This avoids the need for costly per-column binary searches when each row block is extracted in the usual way, and gives a ~10-fold speed-up in row-based processing of dgCMatrix objects.
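
For concreteness, here is a minimal sketch of that pre-indexing idea (not beachmat's actual code; the function name and the block_starts argument are made up for illustration). It makes one pass over the non-zero elements of a dgCMatrix and records, for every (row block, column) pair, where that block's non-zeros begin in the @i/@x slots:

library(Matrix)

## block_starts: 1-based first row of each row block, sorted increasing.
## Returns an nblocks x ncol matrix of 1-based offsets into x@i / x@x.
precompute_row_block_starts <- function(x, block_starts) {
    nblocks <- length(block_starts)
    p <- x@p                                    # 0-based column pointers
    rows <- x@i + 1L                            # 1-based row of each non-zero
    cols <- rep.int(seq_len(ncol(x)), diff(p))  # column of each non-zero
    blks <- findInterval(rows, block_starts)    # row block of each non-zero
    ## counts[b, j] = number of non-zeros of column j that fall in row block b
    counts <- matrix(as.integer(table(factor(blks, levels=seq_len(nblocks)),
                                      factor(cols, levels=seq_len(ncol(x))))),
                     nrow=nblocks)
    ## within each column, block b starts after the non-zeros of blocks 1..b-1
    within_col <- matrix(
        apply(counts, 2, function(cnt) cumsum(c(0L, cnt))[seq_len(nblocks)]),
        nrow=nblocks)
    sweep(within_col, 2, p[-length(p)], "+") + 1L
}

x <- rsparsematrix(1000, 200, density=0.05)
idx <- precompute_row_block_starts(x, block_starts=c(1L, 251L, 501L, 751L))
dim(idx)  # 4 row blocks x 200 columns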

Now I'm wondering whether this approach can be generalized somehow so that other DelayedArray backends can benefit. Perhaps functions like rowAutoGrid() can decorate the grid object with extra information that allows extract_array to efficiently obtain the necessary bits and pieces, if a suitable object like a dgCMatrix is passed?

Happy to give this - or other ideas - a crack with a PR if there is some interest.

hpages commented 3 years ago

FWIW the extract_sparse_array() and extract_array() methods for H5SparseMatrixSeed objects use %in% to identify the selected non-zero elements in the selected columns (see https://github.com/Bioconductor/HDF5Array/blob/f791304316d13bdfc560b1d7674aa6224dad7ce8/R/H5SparseMatrixSeed-class.R#L385).

%in% is based on base::match(), and base::match() uses a hashing algorithm that tends to be faster than a binary search.
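
A toy illustration of that pattern (not the actual H5SparseMatrixSeed code): keep only the non-zero elements whose row falls in the requested rows.

nzrow <- c(3L, 7L, 12L, 15L, 20L)  # row indices of the non-zeros in one selected column
user_rows <- 10:16                 # rows requested by extract_array()
keep <- nzrow %in% user_rows       # hash-based lookup via base::match()
nzrow[keep]                        # 12 15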

Anyway, last time I checked, the bottleneck for extract_sparse_array(TENxMatrix) was I/O, and I don't remember the cost of identifying the selected non-zero elements in the selected columns being a big deal, but maybe I missed something. Do you have any timings?

LTLA commented 3 years ago

Oh, this doesn't have anything to do with the HDF5 representations; the discussion in LTLA/beachmat#20 only involves in-memory matrices. You can see some timings there comparing subsetting with Matrix's [ method to create row chunks against the same operation done with precomputed indices.

(The precomputed index basically involves running through the set of non-zero values and their row indices, figuring out the start/stop positions for each row chunk in each column, and then using that information to extract the row chunk without having to do a search of some kind, which is what I assume Matrix's CHOLMOD code is doing.)

My implementation of this precomputed index gives a 10-fold speedup when simply iterating over the row-wise blocks of the in-memory dgCMatrix. This can be quite a noticeable fraction of the total function time when the matrix has many row blocks.
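
For reference, a rough reproduction (with made-up dimensions and chunk size) of the baseline side of that comparison: extracting each row chunk with Matrix's "[" repeats a search over the non-zero pattern for every chunk.

library(Matrix)
y <- rsparsematrix(20000, 10000, density=0.01)
chunks <- split(seq_len(nrow(y)), ceiling(seq_len(nrow(y)) / 1000))
system.time(for (rows in chunks) y[rows, , drop=FALSE])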

hpages commented 3 years ago

I see, nice! What other sparse in-memory DelayedArray backends could benefit from something like that?

This kind of precomputed index seems pretty heavy (almost half the size of y, the big dgCMatrix object in your example), so decorating the grid with it will make the grid a lot bigger. A nice property of ArrayGrid objects is that they are super thin, which is particularly good in the context of parallelization.

Furthermore, it seems that the size of y + the size of the precomputed index generated by beachmat:::.prepare_sparse_row_subset() is greater than the size of the list you would get by simply extracting all the sub-matrices in advance. So if you're going to precompute something, you could almost go for that, then walk on this list directly, and throw away y.

More seriously, if we had a matrix-like container that supports chunking, where the chunks are small dgCMatrix objects, you could put y in that container, say z. The precomputing would then happen when constructing z from y, and you'd end up with an object z that is ready to walk on at no cost. Note that this could be made to work with arbitrary chunking, not just chunks made of full rows or full columns. It reminds me of what I was trying to achieve with chunked RleArray objects: a huge Rle object and a huge dgCMatrix object actually have the same performance issue when repeatedly extracting small parts of them.

Note that one way to construct something analog to z at the moment is to pre-extract the small dgCMatrix objects from y, wrap them in DelayedArray objects, and rbind() everything together.
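
A rough sketch of that construction (dimensions and chunk size are made up): pre-extract the small dgCMatrix row chunks, wrap each one in a DelayedArray, and rbind() them back together.

library(DelayedArray)
library(Matrix)
y <- rsparsematrix(10000, 2000, density=0.05)
row_chunks <- split(seq_len(nrow(y)), ceiling(seq_len(nrow(y)) / 1000))
z <- do.call(rbind, lapply(row_chunks,
    function(rows) DelayedArray(y[rows, , drop=FALSE])))
## walking over 'z' in row blocks of 1000 now only touches the small chunks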

BTW I'm still not clear why we need to do block processing at all on in-memory objects. What's the use case?

LTLA commented 3 years ago

Responding in reverse order:

BTW I'm still not clear why we need to do block processing at all on in-memory objects. What's the use case?

The strongest use case is when the operation discards sparsity. For example, I might be computing ranks, in which case I want to process by blocks so that the peak memory usage is never too high. (Case in point: aertslab/AUCell#16.)

Another use case is that I want to split the job across multiple workers. This requires me to split the matrix up somehow, and the block processing machinery is a natural and easy way to do that; just set BPPARAM= and you're done.
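
A toy sketch of both use cases (all parameters made up): ranking destroys sparsity, so processing column blocks keeps the peak memory bounded, and BPPARAM= spreads the blocks across workers.

library(DelayedArray)
library(BiocParallel)
library(Matrix)
y <- DelayedArray(rsparsematrix(10000, 2000, density=0.05))
res <- blockApply(y, function(block) {
    ## 'block' is an ordinary dense matrix (as.sparse=FALSE), so the dense
    ## intermediate created by rank() is only block-sized
    colMeans(apply(block, 2, rank))
}, grid=colAutoGrid(y), as.sparse=FALSE, BPPARAM=MulticoreParam(2))
head(unlist(res))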

So if you're going to precompute something, you could almost go for that, then walk on this list directly, and throw away y.

Indeed, this is actually what I do in beachmat::rowBlockApply when the parallelization backend does not use shared memory, so as to avoid serializing the entire matrix to each worker when each worker only needs a submatrix.

Now you have an object z that is ready to walk on at no cost. Note that this could be made to work with arbitrary chunking, not just chunks made of full rows or full columns.

This makes sense, but it is less appealing if it requires the user to do the transformation explicitly. In the context of the rest of the analysis, many other functions are optimized for dgCMatrix objects (e.g., irlba()), and I wouldn't want the user to have to convert between different objects depending on which function they're using.

This kind of precomputed index seems pretty heavy (almost half the size of y, the big dgCMatrix object in your example), so decorating the grid with it will make the grid a lot bigger.

Yes, unfortunately. I suspect this example is slightly pathological in that the 1% density of non-zero values is unusually low (because I couldn't be bothered waiting for rsparsematrix() to generate an example with greater density). Typically I'd expect densities upwards of 5%, so the relative memory usage of the indexing would not be so egregious.

That said, I suspect it is possible to halve the size of the index with some extra work:

In practice, it may be more convenient to store both the start and end positions for each viewport, rather than having to jump to the next viewport to retrieve the end positions. This is still okay because the end positions of one viewport are literally the same vector as the start positions of the next viewport, allowing us to exploit R's copy-on-write (COW) semantics to avoid an actual copy.
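
A tiny illustration of the copy-on-write point (the values are hypothetical): the "end" vector of one viewport can simply be another name for the "start" vector of the next viewport, so no extra memory is used until either vector is modified.

starts_vp2 <- c(1L, 4L, 9L)   # hypothetical per-column start positions of viewport 2
ends_vp1 <- starts_vp2        # same allocation, not a copy
identical(tracemem(ends_vp1), tracemem(starts_vp2))  # TRUE: both names share one object
untracemem(ends_vp1)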

What other sparse in-memory DelayedArray backends could benefit from something like that?

Potentially, everything I work on that uses an in-memory sparse matrix as the seed. The most obvious ones are ResidualMatrix and ScaledMatrix, where row processing could be accelerated if extract_sparse_array() were able to use some kind of pre-indexing information to extract the sparse chunks.

LTLA commented 3 years ago

Was fiddling with something else and happened across some timings that might be helpful to distill the problem:

library(Matrix)
x <- rsparsematrix(50000, 20000, density=0.1)
system.time(colSums(x))
##    user  system elapsed
##   0.262   0.001   0.264

y <- DelayedArray(x)
system.time(beachmat::rowBlockApply(y, colSums, grid=TRUE)) # final reduce is negligible
##    user  system elapsed
##   2.468   0.371   2.849

system.time(colSums(y))
##    user  system elapsed
##  22.413   2.089  25.034

system.time(blockApply(y, colSums, grid=rowAutoGrid(x))) # for a strict like-for-like comparison
##    user  system elapsed
##  35.076   3.796  39.089

The differences from the plain colSums() timing can be interpreted as the iteration overhead. (I need to set grid=TRUE to force grid processing; otherwise beachmat strips away the DA layer.) The ~2-second overhead is a bit annoying and is due to the reallocations needed to make each new block; however, it's better than the 20-30 second overhead of direct DA block applying.