const-ae / glmGamPoi

Fit Gamma-Poisson Generalized Linear Models Reliably

Upgrading from beachmat to tatami #52

Open LTLA opened 1 year ago

LTLA commented 1 year ago

I think you're still using the old-school beachmat interface, which is fine and will still be around.

But I wonder if you'll consider experimenting with the new tatami interface, which will hopefully struggle its way onto BioC-devel via beachmat (v2.17.3) in the next few days.

This does away with the need for integer/numeric templates, it makes it easier to parallelize at the C++ level, and some delayed operations are captured natively (i.e., some DelayedArrays are translated to their C++ equivalents). Other R matrices are still supported, of course, but are handled more slowly by calling back into DelayedArray::extract_array().

It also greatly simplifies compilation. In the coolest and most experimental part of this update, beachmat will generate external pointers to matrix instances that can be passed to glmGamPoi's shared libraries. This means that glmGamPoi doesn't actually need to compile support for the various matrix implementations into its own *.so file, which makes it easier to extend tatami's C++ support to, e.g., HDF5-, TileDB- or Zarr-backed matrices.

I've already switched SingleR to this latest paradigm; hopefully it'll pass BioC-devel in the next few days :crossed_fingers: If it works, I'd like a non-Aaron package developer to test out this new approach, and you're the lucky one.

const-ae commented 1 year ago

Hey Aaron,

Very kind of you to think of me as your guinea pig 😄 I'm currently writing up my thesis, so I won't have time to work on this for the next few weeks, but I could give it a shot in July. Just glancing over the docs, it sounds pretty straightforward.

Best, Constantin

LTLA commented 1 year ago

Great. The timeframe is fine; I'll probably need to sort out the hiccups on the BBS anyway.

const-ae commented 1 year ago

Hey,

I read through the documentation of tatami and rewrote fitBeta_one_group using it (https://github.com/const-ae/glmGamPoi/commit/6476773f8766f70ccd2b39828ca02050453385fb).

It seems to work fine, but there are a few bits that I'm not sure I did optimally:

  1. Why do I have to call beachmat::initializeCpp in R before calling the C++ function?
  2. There appears to be a lot of boilerplate code in C++ to extract a row from a matrix (creating an Rtatami::BoundNumericPointer -> creating an Extractor -> allocating a buffer -> calling fetch, which returns a double*). Is that intentional?
  3. fetch returns a const double*, which is fine while working on each entry (https://github.com/const-ae/glmGamPoi/commit/6476773f8766f70ccd2b39828ca02050453385fb?diff=split#diff-57115f0ddf68b868ee0015ff6e29def5dfdf467691c36da419470edf147b8dffR393). However, I could not find an efficient way to cast it to a NumericVector (or std::vector) to pass to other functions (https://github.com/const-ae/glmGamPoi/commit/6476773f8766f70ccd2b39828ca02050453385fb?diff=split#diff-57115f0ddf68b868ee0015ff6e29def5dfdf467691c36da419470edf147b8dffR415). Am I missing something?

LTLA commented 1 year ago

Nice.

  • Why do I have to call beachmat::initializeCpp in R before calling the C++ function?

Technically, it is possible to do this inside the C++ function - and in fact, previous versions of the tatami_r library did have a parse() function to convert a matrix-like Rcpp::RObject into a pointer to a tatami::NumericMatrix. However, it was very difficult to unpack some of the delayed operations in R, as this requires inspection of function environments to identify the generic being called, e.g., in the delayed arithmetic operations. In the end, it was just easier for me to do the class inspection in R and then create a pointer to pass to C++.

From a package developer perspective, initializeCpp() allows you to call it once inside an R function and then pass the same pointer to multiple C++ functions. This avoids repaying the cost of the conversion, which - while cheap - still involves some work through the S4 dispatch mechanism and memory allocations. It also means that you avoid the burden of compiling all of the header-only libraries for the delayed operations in your package C++ code, as you only need to compile against the interface; the concrete implementations are compiled in beachmat itself.
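
For concreteness, here's a rough sketch of what that looks like from the C++ side (the helper names are invented for illustration; this just assumes beachmat's Rtatami.h header and the same BoundNumericPointer usage as in your commit):

// R side, run once per matrix:
//   ptr <- beachmat::initializeCpp(mat)
//   n <- tatami_nrow(ptr)   # hypothetical exported helpers,
//   m <- tatami_ncol(ptr)   # both re-using the same pointer

#include "Rtatami.h"
#include "Rcpp.h"

// [[Rcpp::export]]
int tatami_nrow(SEXP parsed) {
    Rtatami::BoundNumericPointer bound(parsed); // rebind the external pointer from initializeCpp()
    return bound->ptr->nrow();                  // bound->ptr points to a tatami::NumericMatrix
}

// [[Rcpp::export]]
int tatami_ncol(SEXP parsed) {
    Rtatami::BoundNumericPointer bound(parsed); // no repeated S4 dispatch or matrix parsing here
    return bound->ptr->ncol();
}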

  • There appears to be a lot of boilerplate code in C++ to extract a row from a matrix (making a Rtatami::BoundNumericPointer -> creating an Extractor -> allocating a buffer -> calling fetch which returns a double*). Is that intentional?

The amount of boilerplate should be equivalent to the old beachmat3 code with workspaces. I'm not sure if you were using workspaces previously, but the idea is to cache information between repeated fetch() calls to improve efficiency for certain matrix representations. For example, if you're accessing consecutive rows of a CSC matrix, you can cache the position of the index pointers from one row and increment them when fetch()ing the next row.

The new extractors are analogous to the old workspaces but with certain enhancements during their construction:

  • the extractors can be specified with oracles (e.g., tatami::consecutive_extractor) if you know that the subsequent fetch() iterations will follow a certain order. This improves efficiency for certain matrix types.

Technically, you don't need to allocate a buffer; you can just call fetch() directly on the index, and this will return a std::vector<Data_> to you. But if you're planning to loop, it makes more sense to allocate once outside the loop and just re-use the buffer as you iterate through the rows/columns of the matrix.

If it helps, I think of the extractors as being (kind of, philosophically) like iterators for the usual std containers. So it's about the same amount of boilerplate as iterator-based C++ code.
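
As a rough sketch of that loop pattern (sum_by_row is an invented example; it just assumes the tatami::NumericMatrix interface and the fetch() signature from your diff):

#include "tatami/tatami.hpp"
#include <vector>

// Invented example: sum every entry by iterating over dense rows.
double sum_by_row(const tatami::NumericMatrix& mat) {
    auto ext = mat.dense_row();              // plays the role of the old beachmat workspace
    std::vector<double> buffer(mat.ncol());  // allocated once, outside the loop
    double total = 0;
    for (int r = 0, nr = mat.nrow(); r < nr; ++r) {
        // fetch() may return a no-copy pointer into the matrix itself, or a pointer into 'buffer'.
        const double* row = ext->fetch(r, buffer.data());
        for (int c = 0, nc = mat.ncol(); c < nc; ++c) {
            total += row[c];
        }
    }
    return total;
}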

  • fetch returns a const double*, which is fine while working on each entry (6476773?diff=split#diff-57115f0ddf68b868ee0015ff6e29def5dfdf467691c36da419470edf147b8dffR393). However, I could not find an efficient way to cast it to a NumericVector (or std::vector) to pass to other functions (6476773?diff=split#diff-57115f0ddf68b868ee0015ff6e29def5dfdf467691c36da419470edf147b8dffR415). Am I missing something?

Do you just want to fetch() directly into your Rcpp::NumericVector? If so, this should be straightforward:

counts_worker->fetch_copy(gene_idx, static_cast<double*>(counts_vec.begin()));

This will fetch and copy into counts_vec. Note that we use fetch_copy() to force the copying of matrix contents into the buffer. This is necessary as fetch() itself will sometimes elide the copy, e.g., if the underlying matrix is dense and a no-copy pointer can be directly returned. Again, it makes more sense to allocate the Rcpp::NumericVector objects outside of the loop and just re-use them. This should put less pressure on the allocator + GC.

Having said all that... the use of R data structures does make it a bit harder to parallelize, along with the fact that the loop involves possibly calling back into an R function. It can be done but does require some care - LMK if you are interested.

const-ae commented 1 year ago

Thanks for the thorough explanation.

The amount of boilerplate should be equivalent to the old beachmat3 code with workspaces

No, I don't think I used workspaces. Previously I did:

auto Y_bm = beachmat::create_matrix<NumericType>(Y_SEXP);
typename NumericType::vector counts(n_samples);
Y_bm->get_row(gene_idx, counts.begin());

Now the equivalent code appears to be:

Rtatami::BoundNumericPointer Y_parsed(Y);
const auto& Y_bm = Y_parsed->ptr;
auto counts_worker = Y_bm->dense_row();
std::vector<double> counts_buffer(n_samples);
const double* counts = counts_worker->fetch(gene_idx, counts_buffer.data());

Or was the original code actually problematic, in that it copied the data from the matrix into counts?

the extractors can be specified with oracles (e.g., tatami::consecutive_extractor) if you know that the subsequent fetch() iterations will follow a certain order. This improves efficiency for certain matrix types

True, that is pretty cool :)

Do you just want to fetch() directly into your Rcpp::NumericVector? If so, this should be straightforward:

counts_worker->fetch_copy(gene_idx, static_cast<double*>(counts_vec.begin()));

This will fetch and copy into counts_vec

Well, ideally I would want fetch() to return a data structure of the same type as its second argument. That would allow me, for example, to allocate an Rcpp::NumericVector once, keep refilling it, and, if necessary, pass it on to some other function.

Having said all that... the use of R data structures does make it a bit harder to parallelize, along with the fact that the loop involves possibly calling back into an R function. It can be done but does require some care

The possibility of parallelizing execution sounds really neat. How does the current implementation play with BiocParallel? Is it problematic to call checkUserInterrupt()?

LTLA commented 1 year ago

Rtatami::BoundNumericPointer Y_parsed(Y);
const auto& Y_bm = Y_parsed->ptr;
auto counts_worker = Y_bm->dense_row();
std::vector<double> counts_buffer(n_samples);
const double* counts = counts_worker->fetch(gene_idx, counts_buffer.data());

Technically, it's just one extra line if you do auto counts_worker = Y_parsed->ptr->dense_row();. Not too bad IMO.

Or was the original code actually problematic and copied the data from the matrix into counts?

I think it was fine, it was just kinda inefficient in certain cases, e.g., row-wise iteration through a sparse matrix.

Well, ideally I would want fetch to return a data structure of the same type as its second argument. That would allow me for example to allocate an Rcpp::NumericVector once, then keep refilling it, and if necessary pass it on to some other function.

Yeah, that's what the static_cast does. You can convert an atomic Rcpp::Vector iterator to its raw pointer (double* for NumericVectors, int* for IntegerVectors and LogicalVectors), and accessing the pointer will just reference the underlying Rcpp::Vector. So indeed, you can allocate the vector once and then just continually refill it.

(Note that fetch() is virtual, so I can't just template the second argument. I'd have to create hard-coded overloads for every pointer-like type that I expect to get, and given that the tatami interface is R-agnostic, I wouldn't be able to easily justify support for Rcpp iterators. But it doesn't matter if you just convert them to pointers.)
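
So the allocate-once pattern would look roughly like this (a sketch only; process_counts() is a stand-in for whatever downstream function expects an Rcpp::NumericVector):

#include "Rtatami.h"
#include "Rcpp.h"

// Stand-in for a downstream function that wants an Rcpp vector.
double process_counts(const Rcpp::NumericVector& counts) {
    return Rcpp::sum(counts);
}

// [[Rcpp::export]]
void loop_over_genes(SEXP Y) {
    Rtatami::BoundNumericPointer Y_parsed(Y);
    const auto& Y_bm = Y_parsed->ptr;
    auto counts_worker = Y_bm->dense_row();

    // Allocate the R vector once; each iteration refills the same memory.
    Rcpp::NumericVector counts_vec(Y_bm->ncol());
    double* buf = static_cast<double*>(counts_vec.begin());

    for (int gene_idx = 0, nr = Y_bm->nrow(); gene_idx < nr; ++gene_idx) {
        counts_worker->fetch_copy(gene_idx, buf); // always copies into counts_vec's memory
        process_counts(counts_vec);               // pass the refilled vector along
    }
}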

How does the current implementation play with BiocParallel?

Not at all. The parallelization is done using C++11's <thread>, so it will work on every compliant OS/compiler. I no longer use BiocParallel for performance-critical parallelization because (i) forking doesn't work on Windows and spinning up a separate process is too slow, and (ii) I keep getting OOM problems because the child processes think they have more free memory than is present. C++-level parallelization directly shares memory between threads so there's no duplication or copying or whatever. I've never had OOM problems ever since I switched parallelization schemes.

In theory, we could also use OpenMP, but it's not supported on older versions of Clang, and the R-from-C++ evaluation is not protected with OpenMP because I don't know how to do the relevant shenanigans.

I just realized that the documentation for this functionality is not that great, but in your case, it would be something like:

// Do R-related allocations in serial section OR inside mexec.run().
std::vector<Rcpp::NumericVector> buffers;
buffers.reserve(num_threads);
for (int t = 0; t < num_threads; ++t) {
    buffers.emplace_back(n_samples);
}

std::vector<double*> buffer_ptrs;
buffer_ptrs.reserve(num_threads);
for (int t = 0; t < num_threads; ++t) {
    buffer_ptrs.emplace_back(static_cast<double*>(buffers[t].begin()));
}

// The executor ensures that R-related operations run on the main thread.
auto& mexec = tatami_r::executor();

tatami_r::parallelize([&](size_t thread, int start, int length) -> void {
    auto ext = mat->dense_row();

    // OR, for the advanced: request extraction of [start, start + length) dense rows in this thread.
    // auto ext = tatami::consecutive_extractor</* row = */ true, /* sparse = */ false>(mat.get(), start, length);

    auto& buffer = buffers[thread];
    auto bptr = buffer_ptrs[thread];

    for (int r = start, end = start + length; r < end; ++r) {
        auto ptr = ext->fetch_copy(r, bptr); // Do something non-R related with ptr/bptr, they're the same here.

        mexec.run([&]() -> void { // Now for something potentially R-related.
            some_kind_of_R_operation(buffer);
        });
    }
}, mat->nrow(), num_threads);

Ooof. I'll just say that parallel tatami code is usually much cleaner; it's just that your use case needs to interact with the R API and that requires some gymnastics to protect R from simultaneous activity on multiple threads.

Is it problematic to call checkUserInterrupt()?

You'd have to call that from a serial section, so the code would be a little more complicated. For example, you'd need to parallelize 100 rows at a time in each thread, then return to the serial section to check the interrupt, and then parallelize the next 100 * num_threads rows and so on. I'm not sure I can easily catch the interrupt signal in the threads.
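
For example, a rough sketch of that chunked pattern, reusing the setup from the snippet above (the chunk size is arbitrary, and the inner loop body is whatever non-R work you need):

const int block_size = 100;            // rows per thread per round, arbitrary
const int total_rows = mat->nrow();
const int per_round = block_size * num_threads;

for (int offset = 0; offset < total_rows; offset += per_round) {
    int rounded = per_round;
    if (offset + rounded > total_rows) {
        rounded = total_rows - offset; // the last round may be smaller
    }

    tatami_r::parallelize([&](size_t thread, int start, int length) -> void {
        auto ext = mat->dense_row();
        std::vector<double> buffer(mat->ncol());
        for (int r = offset + start, end = offset + start + length; r < end; ++r) {
            const double* row = ext->fetch(r, buffer.data());
            // ... do something non-R related with 'row' ...
        }
    }, rounded, num_threads);

    // Back in the serial section on the main thread, so touching the R API is safe.
    Rcpp::checkUserInterrupt();
}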

LTLA commented 1 year ago

Nudge - any progress? Need help?

const-ae commented 1 year ago

Hey, sorry for the silence. I was traveling last week and am currently at ICML. I didn't manage to spend any more time on it. Maybe we can discuss next week in person :)

LTLA commented 1 year ago

Just a heads up that, during the next release, I will be slowly removing some of the tests related to the old beachmat API (not any functionality, just the tests) in order to reduce beachmat's build time on the BioC machines.

This may be a good opportunity to switch to the new tatami API, which will be much more comprehensively tested than beachmat's old API, especially once I start trimming back the test suite.