Closed LTLA closed 3 years ago
Happy to go along with whatever is easiest to maintain in the long term and to defer to your and others expertise on the technical approach.
Nudging @hpages on this. Good? Bad? Crazy?
The necessary changes are done in my fork. The changes are fairly self-contained:
HDF5Matrix
objects, localized to src/beachmat
.longtests/testthat
.R/beachmat.R
.The only changes that really step on toes is the conversion of R_init_HDF5Array.c
into a .cpp
file, which was necessary to handle registration of C++ functions; some changes to the Makevars
to compile in a subdirectory; and some changes to the dependencies in DESCRIPTION
, as I intimated above.
Hi Aaron,
Sorry for the slow response. Happy to look at this. Do you think you can turn this into a PR?
@stephaniehicks and @drisso asked me to take a look at the poor performance of mbkmeans::mini_batch()
on HDF5Matrix inputs and it's true that this seems to be due to beachmat being extremely slow on these objects at the moment. Shouldn't be a surprise (you told us about this here) but somehow when Stephanie talked to me about this in NYC I didn't connect the dots.
More precisely, and FWIW, mbkmeans::mini_batch()
seems to be slowed down by the sluggishness of the subset_matrix_random()
function (https://github.com/drisso/mbkmeans/blob/7d2c1e5c17d268d639e7755e35f9f5fc4a88890a/src/mini_batch.cpp#L222-L244) which uses beachmat's get_row()
to extract random rows from the HDF5Matrix object. Comparatively, doing the same thing directly in R on a 1000x5000 HDF5Matrix object (with something like as.matrix(x[sample(1000, 50), ])
) is 40x times faster (550 ms for subset_matrix_random()
vs 13.6 ms for as.matrix(x[...])
).
I wonder why get_row()
is so slow and how migrating beachmat's HDF5-related code to HDF5Array will address that.
Thanks, H.
The slow-down is probably due to the following sequence of events:
HDF5Matrix
is given to beachmat via mkbmeans. beachmat no longer recognizes this as a natively supported matrix representation, so instead uses block processing via calls back to R.get_row()
call, beachmat attempts to realize the entire block in which that row is contained, based on the DelayedArray definition of the blocking pattern. The block is then cached to expedite future queries to that row or to surrounding rows.The solution is to either (i) use the get_rows()
method, which will call into R to only extract the rows of interest (i.e., do exactly as.matrix(x[sample(1000, 50), ])
), or (ii) re-enable native access to HDF5 matrices via C++ code, which will avoid block processing if the HDF5 file is chunked appropriately.
Thanks Aaron. We'll go for (i) (i.e. get_rows()
) for now. Isn't it a little bit overkill to use beachmat just for that when the C++ code in mbkmeans could just call extract_array(x, list(i, NULL))
(or as.matrix(x[i, , drop=FALSE])
, which is equivalent to extract_array(x, list(i, NULL))
) to extract rows in a way that is agnostic to the matrix representation?
@stephaniehicks and @drisso: I'll work on a patch and will send you a PR when it's ready.
Isn't it a little bit overkill to use beachmat just for that
In this case - yes, it is. It would be easier just to call out to R if you need to get a one-time matrix subset.
For natively supported matrices, get_rows()
should be more efficient as it avoids having to go back into R from C++. But it would have to be done very frequently for it to have a major impact on performance.
I see. Another advantage maybe of using get_rows()
in mbkmeans::mini_batch()
is that it will benefit from whatever improvement happens to native support for HDF5Matrix objects in the future. We'll probably stick to that. Thanks!
@LTLA not sure the HDF5Array mention in your last post on the Slack osca-review channel refers to this issue but do you think you can provide a PR for this change? Thanks
I'm happy to do the PR, but it wouldn't help with the inefficiency, it's a fundamental issue with the fact that we've densified a sparse matrix in HDF5 and that just makes everything... slow... down.
In many cases, we have taken to realizing the matrices in memory and operating from that, rather than using the on-disk representations directly. Maybe the TENxMatrix would be better.
I see. Thx for the clarification. Looking forward to a PR that adds beachmat's HDF5-related code to HDF5Array.
beachmat now supports native C++ access to third-party matrix representations via external linkage, as discussed here. This provides an opportunity to improve the parts of the beachmat codebase related to accessing
HDF5Matrix
objects.Specifically, I would like to migrate beachmat's HDF5-related C++ code to HDF5Array and then use beachmat's external linkage to achieve C++-level access from
HDF5Matrix
objects. This has the philosophical benefit of ensuring that HDF5Array is responsible for the code required to access its own objects. Practically, it allows beachmat to become a header-only library (something that is not possible right now due to static linkage to Rhdf5lib), which enables users to simplyLinkingTo: beachmat
to use it. This avoids the difficulties of manually managing static/dynamic linkage to beachmat in client packages.The work involved should be minimal, as it will be simply be a transplant of C++ code (plus some exporting of routines). HDF5Array will need to
Imports
beachmat to flag thatHDF5Matrix
objects have native access capabilities, but this should not be too onerous as HDF5Array depends on almost everything that beachmat depends on anyway. There will need to be some more tests but these can be dumped into thelongtests
as they are done in beachmat already.Tagging @PeteHaitch, the only other user of beachmat, for usage opinions; and @grimbough, for general C++ thoughts about this proposed strategy.
Edit: Okay, maybe not so minimal, as I'll have to modify beachmat's external linkage for writing to arbitrary output matrices. But I should have done that anyway, and it would be a good feature to have.