const-ae / glmGamPoi

Fit Gamma-Poisson Generalized Linear Models Reliably
105 stars 15 forks source link

Switch from the old beachmat API to the new tatami library. #66

Closed LTLA closed 1 week ago

LTLA commented 2 weeks ago

I'm looking to remove the old beachmat API, so I've updated your code to use the latest tatami library.

Most of the changes should be obvious, but I'll mention a few things:

I have omitted some further optimizations, e.g., parallelization of the loops is a little bit complicated because there are a few references to strictly serial R code and data structures. However, it can be done with a bit more effort.

And yes, it is C++17, but beachmat has been relying on C++17 for a while so a user would have needed it in the stack anyway to install glmGamPoi.

const-ae commented 1 week ago

Hi Aaron, thank you so much for making the PR. This would have taken me days. I agree that it would be very cool to enable parallelization (and it's amazing that tatami makes this possible), but as it would require additional parameters to user-facing functions, I will wait with this.

And yes, it is C++17, but beachmat has been relying on C++17 for a while so a user would have needed it in the stack anyway to install glmGamPoi.

That is reassuring to hear. My fear was always that being an indirect dependency of Seurat (through sctransform) meant that if I asked a for a too recent C++ version, would break the installations for many users. But I just saw that sctransform also requires C++17, so is no longer a concern :)

Each fetch() call returns a pointer to the requested data, which is the real "result" of the call; any modifications to the supplied buffer is just a side-effect. For ordinary dense matrices, this distinction is useful as a pointer to the underlying array can be returned directly, avoiding an unnecessary copy into the supplied buffer. However, this also means that the buffer may not contain the contents of the desired row/column after fetch() returns. Normally this is not an issue, as callers can just use the returned pointer for data access. But if the buffer must be populated (e.g., for later Armadillo operations), the caller should use tatami::copy_n() to guarantee so.

I don't quite understand what you mean here. Do you mean that unlike the old beachmat get_row function, the tatami fetch function does not necessarily fill the pointer that is supplied as an argument? The documentation of tatami's fetch method for dense data, mentions that one can compare the returned pointer with the provided pointer; would this make sense in fitBeta_fisher_scoring_impl? Or is there some other way to detect that for dense matrices I can use the feature of Armadillo to wrap external memory?


Beachmat used to distinguish if the matrix was filled with doubles or integers. Do I understand correctly that in the new code everything is simply cast to double?

LTLA commented 1 week ago

Do you mean that unlike the old beachmat get_row function, the tatami fetch function does not necessarily fill the pointer that is supplied as an argument?

Correct.

The documentation of tatami's fetch method for dense data, mentions that one can compare the returned pointer with the provided pointer; would this make sense in fitBeta_fisher_scoring_impl?

Indeed, tatami::copy_n does the pointer comparison and fills the buffer if the pointers are not the same. After this point, you can consider the buffers to be populated as they were previously.

Or is there some other way to detect that for dense matrices I can use the feature of Armadillo to wrap external memory?

That would be another option if you don't want to perform a copy. Skipping the copy might be faster in the short term, but on the other hand, the copy might improve performance if you're doing more operations down the line, e.g., if the Armadillo vectors are properly aligned for SIMD operations. Probably doesn't matter; I'd just do whatever is easier to understand.

Beachmat used to distinguish if the matrix was filled with doubles or integers. Do I understand correctly that in the new code everything is simply cast to double?

Correct. Technically, tatami can be templated to support integer interfaces, but this would double the compilation time and binary size of beachmat, so I didn't bother.

const-ae commented 1 week ago

Excellent, then I believe I understand what the changes do and will merge :)