LTLA opened this issue 1 year ago
Hey Aaron,
Very kind of you to think of me as your guinea pig 😄 I'm currently writing up my thesis, so won't have time for the next few weeks to work on it, but in July I could give it a shot. But just glancing over the docs, it sounds pretty straightforward.
Best, Constantin
Great. The timeframe is fine, I'll probably need to sort out the hiccups on the BBS anyway.
Hey,
I read through the documentation of tatami and rewrote the `fitBeta_one_group` function using it (https://github.com/const-ae/glmGamPoi/commit/6476773f8766f70ccd2b39828ca02050453385fb). It seems to work fine, but there are some bits where I am not sure that I did them optimally:

- Why do I have to call `beachmat::initializeCpp` in R before calling the C++ function?
- There appears to be a lot of boilerplate code in C++ to extract a row from a matrix (making a `Rtatami::BoundNumericPointer` -> creating an `Extractor` -> allocating a buffer -> calling `fetch`, which returns a `double*`). Is that intentional?
- `fetch` returns a `const double*`, which is fine while working on each entry (https://github.com/const-ae/glmGamPoi/commit/6476773f8766f70ccd2b39828ca02050453385fb?diff=split#diff-57115f0ddf68b868ee0015ff6e29def5dfdf467691c36da419470edf147b8dffR393). However, I could not find an efficient way to cast it to a `NumericVector` (or `std::vector`) to pass to other functions (https://github.com/const-ae/glmGamPoi/commit/6476773f8766f70ccd2b39828ca02050453385fb?diff=split#diff-57115f0ddf68b868ee0015ff6e29def5dfdf467691c36da419470edf147b8dffR415). Am I missing something?

Nice.
> Why do I have to call `beachmat::initializeCpp` in R before calling the C++ function?
Technically, it is possible to do this inside the C++ function - and in fact, previous versions of the tatami_r library did have a `parse()` function to convert a matrix-like `Rcpp::RObject` into a pointer to a `tatami::NumericMatrix`. However, it was very difficult to unpack some of the delayed operations in R, as this requires inspection of function environments to identify the generic being called, e.g., in the delayed arithmetic operations. In the end, it was just easier for me to do the class inspection in R and then create a pointer to pass to C++.
From a package developer's perspective, `initializeCpp()` lets you do the conversion once inside an R function and then pass the same pointer to multiple C++ functions. This avoids repaying the cost of the conversion, which - while cheap - still involves some work through the S4 dispatch mechanism and memory allocations. It also means that you avoid the burden of compiling all of the header-only libraries for the delayed operations in your package's C++ code, as you only need to compile against the interface; the concrete implementations are compiled in beachmat itself.
> There appears to be a lot of boilerplate code in C++ to extract a row from a matrix (making a `Rtatami::BoundNumericPointer` -> creating an `Extractor` -> allocating a buffer -> calling `fetch`, which returns a `double*`). Is that intentional?
The amount of boilerplate should be equivalent to the old beachmat3 code with workspaces. I'm not sure if you were using workspaces previously, but the idea is to cache information between repeated `fetch()` calls to improve efficiency for certain matrix representations. For example, if you're accessing consecutive rows of a CSC matrix, you can cache the position of the index pointers from one row and increment them when fetching the next row.
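That pointer-caching trick can be sketched in a self-contained mock (hypothetical classes, not tatami's actual implementation): each column of a CSC matrix keeps a cursor into its non-zero entries, and consecutive row fetches only ever advance the cursors instead of re-scanning each column from the top.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the cached-pointer workspace idea; not tatami code.
struct CscMatrix {
    int nrow, ncol;
    std::vector<double> values;  // non-zero values, stored column-major
    std::vector<int> row_index;  // row of each non-zero value
    std::vector<int> col_ptrs;   // size ncol + 1; start of each column in 'values'
};

struct ConsecutiveRowExtractor {
    const CscMatrix& mat;
    std::vector<int> cursors;  // per-column position; only ever moves forward

    explicit ConsecutiveRowExtractor(const CscMatrix& m)
        : mat(m), cursors(m.col_ptrs.begin(), m.col_ptrs.end() - 1) {}

    // Densely fetch row 'r' into 'buffer' (length >= ncol). Rows must be
    // requested in increasing order for the cached cursors to stay valid.
    const double* fetch(int r, double* buffer) {
        for (int c = 0; c < mat.ncol; ++c) {
            int& cur = cursors[c];
            while (cur < mat.col_ptrs[c + 1] && mat.row_index[cur] < r) {
                ++cur;  // skip entries from rows we have already fetched
            }
            bool hit = cur < mat.col_ptrs[c + 1] && mat.row_index[cur] == r;
            buffer[c] = hit ? mat.values[cur] : 0.0;
        }
        return buffer;
    }
};
```

Across a full sweep of the rows, each cursor traverses its column exactly once, so the total scan cost is proportional to the number of non-zero entries rather than growing with nnz × nrow.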
The new extractors are analogous to the old workspaces but with certain enhancements during their construction:

- The subset of indices of interest is specified when the extractor is constructed, rather than at each `row/column()` call. (In fact, requesting a subset of indices wasn't even supported.)
- A workspace created for `row()` could previously be used by mistake with `column()`, or vice versa. Now, the information about whether you want row/column or dense/sparse output is determined during the construction of the extractor, which reduces the potential for mix-ups.
- The extractors can be specified with oracles (e.g., `tatami::consecutive_extractor`) if you know that the subsequent `fetch()` iterations will follow a certain order. This improves efficiency for certain matrix types that can benefit from pre-fetching of future elements, e.g., loading multiple rows/columns from disk in HDF5-backed matrices.
- Extractors are mandatory for accessing `tatami::Matrix` objects. This encourages more efficient downstream code, as opposed to the workspaces, which were optional (and thus occasionally omitted by accident, to the detriment of the performance of the caller).

Technically, you don't need to allocate a buffer; you can just call `fetch()` directly on the index and this will return a `std::vector<Data_>` to you. But if you're planning to loop, it makes more sense to allocate once outside the loop and just re-use the buffer as you iterate through the rows/columns of the matrix.
If it helps, I think of the extractors as being (kind of, philosophically) like iterators for the usual `std` containers. So it's about the same amount of boilerplate as iterator-based C++ code.
> `fetch` returns a `const double*`, which is fine while working on each entry (6476773?diff=split#diff-57115f0ddf68b868ee0015ff6e29def5dfdf467691c36da419470edf147b8dffR393). However, I could not find an efficient way to cast it to a `NumericVector` (or `std::vector`) to pass to other functions (6476773?diff=split#diff-57115f0ddf68b868ee0015ff6e29def5dfdf467691c36da419470edf147b8dffR415). Am I missing something?
Do you just want to `fetch()` directly into your `Rcpp::NumericVector`? If so, this should be straightforward:
```cpp
counts_worker->fetch_copy(gene_idx, static_cast<double*>(counts_vec.begin()));
```
This will fetch and copy into `counts_vec`. Note that we use `fetch_copy()` to force the copying of matrix contents into the buffer. This is necessary as `fetch()` itself will sometimes elide the copy, e.g., if the underlying matrix is dense and a no-copy pointer can be directly returned. Again, it makes more sense to allocate the `Rcpp::NumericVector` objects outside of the loop and just re-use them. This should put less pressure on the allocator + GC.
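The difference between the two calls can be illustrated with a self-contained mock (hypothetical struct, not the real tatami classes): for a dense row-major matrix, `fetch()` can elide the copy and return a pointer into the matrix's own storage, leaving the buffer untouched, while `fetch_copy()` always lands in the caller's buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the fetch()/fetch_copy() distinction; not tatami itself.
struct DenseRowMatrix {
    int nrow, ncol;
    std::vector<double> data;  // row-major storage

    // May ignore 'buffer' entirely: returns a no-copy pointer into 'data'.
    const double* fetch(int r, double* /* buffer */) const {
        return data.data() + static_cast<std::size_t>(r) * ncol;
    }

    // Guaranteed to fill 'buffer', so the caller may safely mutate it or
    // hand it to code (e.g. an R function) that expects its own copy.
    const double* fetch_copy(int r, double* buffer) const {
        const double* src = data.data() + static_cast<std::size_t>(r) * ncol;
        std::copy(src, src + ncol, buffer);
        return buffer;
    }
};
```

This is why the buffer passed to plain `fetch()` cannot be blindly handed to other code: depending on the matrix representation, it may never have been filled.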
Having said all that... the use of R data structures does make it a bit harder to parallelize, along with the fact that the loop involves possibly calling back into an R function. It can be done but does require some care - LMK if you are interested.
Thanks for the thorough explanation.
> The amount of boilerplate should be equivalent to the old beachmat3 code with workspaces
No, I don't think I used workspaces. Previously I did:
```cpp
auto Y_bm = beachmat::create_matrix<NumericType>(Y_SEXP);
typename NumericType::vector counts(n_samples);
Y_bm->get_row(gene_idx, counts.begin());
```
Now the equivalent code appears to be:
```cpp
Rtatami::BoundNumericPointer Y_parsed(Y);
const auto& Y_bm = Y_parsed->ptr;
auto counts_worker = Y_bm->dense_row();
std::vector<double> counts_buffer(n_samples);
const double* counts = counts_worker->fetch(gene_idx, counts_buffer.data());
```
Or was the original code actually problematic and copied the data from the matrix into `counts`?
> the extractors can be specified with oracles (e.g., `tatami::consecutive_extractor`) if you know that the subsequent `fetch()` iterations will follow a certain order. This improves efficiency for certain matrix types
True, that is pretty cool :)
> Do you just want to `fetch()` directly into your `Rcpp::NumericVector`? If so, this should be straightforward:
>
> `counts_worker->fetch_copy(gene_idx, static_cast<double*>(counts_vec.begin()));`
>
> This will fetch and copy into `counts_vec`
Well, ideally I would want `fetch` to return a data structure of the same type as its second argument. That would allow me, for example, to allocate an `Rcpp::NumericVector` once, then keep refilling it, and if necessary pass it on to some other function.
> Having said all that... the use of R data structures does make it a bit harder to parallelize, along with the fact that the loop involves possibly calling back into an R function. It can be done but does require some care
The possibility to parallelize execution sounds really neat. How does the current implementation play with `BiocParallel`? Is it problematic to call `checkUserInterrupt()`?
> ```cpp
> Rtatami::BoundNumericPointer Y_parsed(Y);
> const auto& Y_bm = Y_parsed->ptr;
> auto counts_worker = Y_bm->dense_row();
> std::vector<double> counts_buffer(n_samples);
> const double* counts = counts_worker->fetch(gene_idx, counts_buffer.data());
> ```

Technically, it's just one extra line if you do `auto counts_worker = Y_parsed->ptr->dense_row();`. Not too bad IMO.
> Or was the original code actually problematic and copied the data from the matrix into `counts`?
I think it was fine, it was just kinda inefficient in certain cases, e.g., row-wise iteration through a sparse matrix.
> Well, ideally I would want `fetch` to return a data structure of the same type as its second argument. That would allow me for example to allocate an `Rcpp::NumericVector` once, then keep refilling it, and if necessary pass it on to some other function.
Yeah, that's what the `static_cast` does. You can convert an atomic `Rcpp::Vector` iterator to its raw pointer (`double` for `NumericVector`s, `int` for `IntegerVector`s and `LogicalVector`s), and accessing the pointer will just reference the underlying `Rcpp::Vector`. So indeed, you can allocate the vector once and then just continually refill it.
(Note that `fetch()` is virtual, so I can't just template the second argument. I'd have to create hard-coded overloads for every pointer-like type that I expect to get, and given that the tatami interface is R-agnostic, I wouldn't be able to easily justify support for Rcpp iterators. But it doesn't matter if you just convert them to pointers.)
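A self-contained sketch of that constraint (mock classes, not the actual tatami hierarchy): virtual member functions cannot be templates, so `fetch()` is pinned to `double*`, but any contiguous buffer - a `std::vector` here, an `Rcpp::NumericVector` in the real code - can be refilled through its raw pointer.

```cpp
#include <cassert>
#include <vector>

// Hypothetical mock of a virtual fetch() interface; not the real tatami classes.
struct NumericMatrix {
    virtual ~NumericMatrix() = default;
    // Virtual, so it cannot be templated over the buffer's iterator type;
    // a raw double* works for any contiguous container.
    virtual const double* fetch(int r, double* buffer) const = 0;
};

// Toy implementation where every entry equals a constant.
struct ConstantMatrix : NumericMatrix {
    int ncol;
    double value;
    ConstantMatrix(int nc, double v) : ncol(nc), value(v) {}
    const double* fetch(int /* r */, double* buffer) const override {
        for (int c = 0; c < ncol; ++c) buffer[c] = value;
        return buffer;
    }
};

// Allocate one contiguous buffer outside the loop and keep refilling it
// through its raw pointer (the real code would use an Rcpp::NumericVector
// and static_cast its begin() iterator).
double sum_rows(const NumericMatrix& mat, int nrow, int ncol) {
    std::vector<double> buffer(ncol);  // allocated once
    double total = 0;
    for (int r = 0; r < nrow; ++r) {
        const double* row = mat.fetch(r, buffer.data());
        for (int c = 0; c < ncol; ++c) total += row[c];
    }
    return total;
}
```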
> How does the current implementation play with `BiocParallel`?
Not at all. The parallelization is done using C++11's `<thread>`, so it will work on every compliant OS/compiler. I no longer use BiocParallel for performance-critical parallelization because (i) forking doesn't work on Windows and spinning up a separate process is too slow, and (ii) I keep getting OOM problems because the child processes think they have more free memory than is present. C++-level parallelization directly shares memory between threads, so there's no duplication or copying or whatever. I haven't had OOM problems since I switched parallelization schemes.
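The scheme can be sketched with a plain `std::thread` splitter (a hypothetical helper in the spirit of, but not identical to, tatami's parallelization): rows are divided into contiguous blocks and every thread reads and writes the same shared vectors, so nothing is serialized or duplicated across processes.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical row-range parallelizer, not tatami's actual implementation:
// splits [0, ntasks) into one contiguous block per thread and runs
// fun(thread_id, start, length) on each block.
template <typename Fun>
void parallelize(Fun fun, int ntasks, int nthreads) {
    std::vector<std::thread> workers;
    workers.reserve(nthreads);
    int per_thread = (ntasks + nthreads - 1) / nthreads;  // ceiling division
    for (int t = 0; t < nthreads; ++t) {
        int start = t * per_thread;
        int length = std::min(per_thread, ntasks - start);
        if (length <= 0) break;  // more threads than tasks
        workers.emplace_back(fun, t, start, length);
    }
    for (auto& w : workers) w.join();  // all memory is shared; just join
}
```

Each thread writes to a disjoint slice of the shared output, so no locking is needed for this access pattern.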
In theory, we could also use OpenMP, but it's not supported on older versions of Clang, and the R-from-C++ evaluation is not protected with OpenMP because I don't know how to do the relevant shenanigans.
I just realized that the documentation for this functionality is not that great, but in your case, it would be something like:
```cpp
// Do R-related allocations in serial section OR inside mexec.run().
std::vector<Rcpp::NumericVector> buffers;
buffers.reserve(num_threads);
for (int t = 0; t < num_threads; ++t) {
    buffers.emplace_back(n_samples);
}

std::vector<double*> buffer_ptrs;
buffer_ptrs.reserve(num_threads);
for (int t = 0; t < num_threads; ++t) {
    buffer_ptrs.emplace_back(static_cast<double*>(buffers[t].begin()));
}

// The executor ensures that R-related operations run on the main thread.
auto& mexec = tatami_r::executor();

tatami_r::parallelize([&](size_t thread, int start, int length) -> void {
    auto ext = mat->dense_row();
    // OR, for the advanced: request extraction of [start, start + length) dense rows in this thread.
    // auto ext = tatami::consecutive_extractor</* row = */ true, /* sparse = */ false>(mat.get(), start, length);
    auto& buffer = buffers[thread];
    auto bptr = buffer_ptrs[thread];
    for (int r = start, end = start + length; r < end; ++r) {
        auto ptr = ext->fetch_copy(r, bptr); // Do something non-R related with ptr/bptr, they're the same here.
        mexec.run([&]() -> void { // Now for something potentially R-related.
            some_kind_of_R_operation(buffer);
        });
    }
}, mat->nrow(), num_threads);
```
Ooof. I'll just say that parallel tatami code is usually much cleaner; it's just that your use case needs to interact with the R API and that requires some gymnastics to protect R from simultaneous activity on multiple threads.
> Is it problematic to call `checkUserInterrupt()`?
You'd have to call that from a serial section, so the code would be a little more complicated. For example, you'd need to parallelize 100 rows at a time in each thread, then return to the serial section to check the interrupt, and then parallelize the next `100 * num_threads` rows and so on. I'm not sure I can easily catch the interrupt signal in the threads.
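A minimal sketch of that chunked pattern, assuming a hypothetical flag in place of R's actual user-interrupt check (which may only be consulted from the main thread): each batch runs `num_threads * chunk` rows in parallel, then the serial section checks for an interrupt before launching the next batch.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Stand-in for R's user-interrupt check; hypothetical, here it fires after
// 200 rows so the test below can observe an early stop.
inline bool interrupt_requested(int rows_done) {
    return rows_done >= 200;
}

// Hypothetical chunked runner: process chunk_per_thread rows per thread,
// join, check for an interrupt in the serial section, then continue.
template <typename Fun>
int run_interruptible(Fun fun, int ntasks, int nthreads, int chunk_per_thread) {
    int done = 0;
    while (done < ntasks) {
        std::vector<std::thread> workers;
        for (int t = 0; t < nthreads && done + t * chunk_per_thread < ntasks; ++t) {
            int start = done + t * chunk_per_thread;
            int length = std::min(chunk_per_thread, ntasks - start);
            workers.emplace_back(fun, t, start, length);
        }
        for (auto& w : workers) w.join();
        done = std::min(done + nthreads * chunk_per_thread, ntasks);
        if (interrupt_requested(done)) break;  // serial section: safe to ask R
    }
    return done;  // rows processed before finishing or being interrupted
}
```

The trade-off is extra thread launch/join overhead per batch, which is why the chunk size should be large enough to amortize it.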
Nudge - any progress? Need help?
Hey, sorry for the silence. I was traveling last week and am currently at ICML. I didn't manage to spend any more time on it. Maybe we can discuss next week in person :)
Just a heads up that, during the next release, I will be slowly removing some of the tests related to the old beachmat API (not any functionality, just the tests) in order to reduce beachmat's build time on the BioC machines.
This may be a good opportunity to switch to the new tatami API, which will be much more comprehensively tested than beachmat's old API, especially once I start trimming back the test suite.
I think you're still using the old-school beachmat interface, which is fine and will still be around.
But I wonder if you'll consider experimenting with the new tatami interface, which will hopefully struggle its way onto BioC-devel via beachmat (v2.17.3) in the next few days.
This does away with the need for integer/numeric templates, it makes it easier to parallelize at the C++ level, and some delayed operations are captured natively (i.e., some DelayedArrays are translated to their C++ equivalents). Other R matrices are still supported, of course, but are handled more slowly by calling back into `DelayedArray::extract_array()`.

It also greatly simplifies compilation. In the coolest and most experimental part of this update, beachmat will generate external pointers to matrix instances that can be passed to glmGamPoi's shared libraries. This means that glmGamPoi doesn't actually need to compile support for the various matrix implementations in its own `*.so` file, which makes it easier to extend tatami's C++ support to, e.g., HDF5-, TileDB-, or Zarr-backed matrices.

I've already switched SingleR to this latest paradigm, hopefully it'll pass BioC-devel in the next few days 🤞 If it works, I'd like a non-Aaron package developer to test out this new approach, and you're the lucky one.