Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array
9 stars 13 forks source link

Migrating beachmat's HDF5-related code to HDF5Array #15

Closed LTLA closed 3 years ago

LTLA commented 5 years ago

beachmat now supports native C++ access to third-party matrix representations via external linkage, as discussed here. This provides an opportunity to improve the parts of the beachmat codebase related to accessing HDF5Matrix objects.

Specifically, I would like to migrate beachmat's HDF5-related C++ code to HDF5Array and then use beachmat's external linkage to achieve C++-level access from HDF5Matrix objects. This has the philosophical benefit of ensuring that HDF5Array is responsible for the code required to access its own objects. Practically, it allows beachmat to become a header-only library (something that is not possible right now due to static linkage to Rhdf5lib), which enables users to simply LinkingTo: beachmat to use it. This avoids the difficulties of manually managing static/dynamic linkage to beachmat in client packages.

The work involved should be minimal, as it will be simply be a transplant of C++ code (plus some exporting of routines). HDF5Array will need to Imports beachmat to flag that HDF5Matrix objects have native access capabilities, but this should not be too onerous as HDF5Array depends on almost everything that beachmat depends on anyway. There will need to be some more tests but these can be dumped into the longtests as they are done in beachmat already.

Tagging @PeteHaitch, the only other user of beachmat, for usage opinions; and @grimbough, for general C++ thoughts about this proposed strategy.

Edit: Okay, maybe not so minimal, as I'll have to modify beachmat's external linkage for writing to arbitrary output matrices. But I should have done that anyway, and it would be a good feature to have.

PeteHaitch commented 5 years ago

Happy to go along with whatever is easiest to maintain in the long term and to defer to your and others expertise on the technical approach.

LTLA commented 5 years ago

Nudging @hpages on this. Good? Bad? Crazy?

LTLA commented 5 years ago

The necessary changes are done in my fork. The changes are fairly self-contained:

The only changes that really step on toes is the conversion of R_init_HDF5Array.c into a .cpp file, which was necessary to handle registration of C++ functions; some changes to the Makevars to compile in a subdirectory; and some changes to the dependencies in DESCRIPTION, as I intimated above.

hpages commented 5 years ago

Hi Aaron,

Sorry for the slow response. Happy to look at this. Do you think you can turn this into a PR?

@stephaniehicks and @drisso asked me to take a look at the poor performance of mbkmeans::mini_batch() on HDF5Matrix inputs and it's true that this seems to be due to beachmat being extremely slow on these objects at the moment. Shouldn't be a surprise (you told us about this here) but somehow when Stephanie talked to me about this in NYC I didn't connect the dots.

More precisely, and FWIW, mbkmeans::mini_batch() seems to be slowed down by the sluggishness of the subset_matrix_random() function (https://github.com/drisso/mbkmeans/blob/7d2c1e5c17d268d639e7755e35f9f5fc4a88890a/src/mini_batch.cpp#L222-L244) which uses beachmat's get_row() to extract random rows from the HDF5Matrix object. Comparatively, doing the same thing directly in R on a 1000x5000 HDF5Matrix object (with something like as.matrix(x[sample(1000, 50), ])) is 40x times faster (550 ms for subset_matrix_random() vs 13.6 ms for as.matrix(x[...])).

I wonder why get_row() is so slow and how migrating beachmat's HDF5-related code to HDF5Array will address that.

Thanks, H.

LTLA commented 5 years ago

The slow-down is probably due to the following sequence of events:

The solution is to either (i) use the get_rows() method, which will call into R to only extract the rows of interest (i.e., do exactly as.matrix(x[sample(1000, 50), ])), or (ii) re-enable native access to HDF5 matrices via C++ code, which will avoid block processing if the HDF5 file is chunked appropriately.

hpages commented 5 years ago

Thanks Aaron. We'll go for (i) (i.e. get_rows()) for now. Isn't it a little bit overkill to use beachmat just for that when the C++ code in mbkmeans could just call extract_array(x, list(i, NULL)) (or as.matrix(x[i, , drop=FALSE]), which is equivalent to extract_array(x, list(i, NULL))) to extract rows in a way that is agnostic to the matrix representation?

@stephaniehicks and @drisso: I'll work on a patch and will send you a PR when it's ready.

LTLA commented 5 years ago

Isn't it a little bit overkill to use beachmat just for that

In this case - yes, it is. It would be easier just to call out to R if you need to get a one-time matrix subset.

For natively supported matrices, get_rows() should be more efficient as it avoids having to go back into R from C++. But it would have to be done very frequently for it to have a major impact on performance.

hpages commented 5 years ago

I see. Another advantage maybe of using get_rows() in mbkmeans::mini_batch() is that it will benefit from whatever improvement happens to native support for HDF5Matrix objects in the future. We'll probably stick to that. Thanks!

hpages commented 4 years ago

@LTLA not sure the HDF5Array mention in your last post on the Slack osca-review channel refers to this issue but do you think you can provide a PR for this change? Thanks

LTLA commented 4 years ago

I'm happy to do the PR, but it wouldn't help with the inefficiency, it's a fundamental issue with the fact that we've densified a sparse matrix in HDF5 and that just makes everything... slow... down.

In many cases, we have taken to realizing the matrices in memory and operating from that, rather than using the on-disk representations directly. Maybe the TENxMatrix would be better.

hpages commented 4 years ago

I see. Thx for the clarification. Looking forward to a PR that adds beachmat's HDF5-related code to HDF5Array.