Closed LTLA closed 1 week ago
Hi Aaron, thank you so much for making the PR. This would have taken me days. I agree that it would be very cool to enable parallelization (and it's amazing that tatami makes this possible), but as it would require additional parameters to user-facing functions, I will wait with this.
And yes, it is C++17, but beachmat has been relying on C++17 for a while so a user would have needed it in the stack anyway to install glmGamPoi.
That is reassuring to hear. My fear was always that being an indirect dependency of Seurat (through sctransform) meant that if I asked a for a too recent C++ version, would break the installations for many users. But I just saw that sctransform also requires C++17, so is no longer a concern :)
Each fetch() call returns a pointer to the requested data, which is the real "result" of the call; any modifications to the supplied buffer is just a side-effect. For ordinary dense matrices, this distinction is useful as a pointer to the underlying array can be returned directly, avoiding an unnecessary copy into the supplied buffer. However, this also means that the buffer may not contain the contents of the desired row/column after fetch() returns. Normally this is not an issue, as callers can just use the returned pointer for data access. But if the buffer must be populated (e.g., for later Armadillo operations), the caller should use tatami::copy_n() to guarantee so.
I don't quite understand what you mean here. Do you mean that unlike the old beachmat get_row
function, the tatami fetch
function does not necessarily fill the pointer that is supplied as an argument? The documentation of tatami's fetch
method for dense data, mentions that one can compare the returned pointer with the provided pointer; would this make sense in fitBeta_fisher_scoring_impl
? Or is there some other way to detect that for dense matrices I can use the feature of Armadillo to wrap external memory?
Beachmat used to distinguish if the matrix was filled with doubles or integers. Do I understand correctly that in the new code everything is simply cast to double?
Do you mean that unlike the old beachmat
get_row
function, the tatamifetch
function does not necessarily fill the pointer that is supplied as an argument?
Correct.
The documentation of tatami's
fetch
method for dense data, mentions that one can compare the returned pointer with the provided pointer; would this make sense infitBeta_fisher_scoring_impl
?
Indeed, tatami::copy_n
does the pointer comparison and fills the buffer if the pointers are not the same. After this point, you can consider the buffers to be populated as they were previously.
Or is there some other way to detect that for dense matrices I can use the feature of Armadillo to wrap external memory?
That would be another option if you don't want to perform a copy. Skipping the copy might be faster in the short term, but on the other hand, the copy might improve performance if you're doing more operations down the line, e.g., if the Armadillo vectors are properly aligned for SIMD operations. Probably doesn't matter; I'd just do whatever is easier to understand.
Beachmat used to distinguish if the matrix was filled with doubles or integers. Do I understand correctly that in the new code everything is simply cast to double?
Correct. Technically, tatami can be templated to support integer interfaces, but this would double the compilation time and binary size of beachmat, so I didn't bother.
Excellent, then I believe I understand what the changes do and will merge :)
I'm looking to remove the old beachmat API, so I've updated your code to use the latest tatami library.
Most of the changes should be obvious, but I'll mention a few things:
tatami::Matrix
created bybeachmat::initializeCpp()
. This handles the usual dense and sparse matrices as well as the newSVT_SparseMatrix
objects and many delayed operations in DelayedArray. Unlike before, this is true native support, e.g., the delayed operations are executed in C++ rather than calling back into R. It also supports extension by beachmat.hdf5 and the upcoming beachmat.tiledb.tatami::consecutive_extractor()
calls createExtractor
instances that promise to iterate through all the genes in consecutive order. This enables optimizations for some matrix representations that benefit from knowledge of the future, e.g., to pre-fetch data from the filesystem for file-backed matrices. However, it does mean that the subsequent loop must uphold that promise.fetch()
must be called even if we don't intend to use the values, which motivates the shifting of thefetch()
calls beforecontinue
infitBeta_one_group()
.fetch()
call returns a pointer to the requested data, which is the real "result" of the call; any modifications to the supplied buffer is just a side-effect. For ordinary dense matrices, this distinction is useful as a pointer to the underlying array can be returned directly, avoiding an unnecessary copy into the supplied buffer. However, this also means that the buffer may not contain the contents of the desired row/column afterfetch()
returns. Normally this is not an issue, as callers can just use the returned pointer for data access. But if the buffer must be populated (e.g., for later Armadillo operations), the caller should usetatami::copy_n()
to guarantee so.I have omitted some further optimizations, e.g., parallelization of the loops is a little bit complicated because there are a few references to strictly serial R code and data structures. However, it can be done with a bit more effort.
And yes, it is C++17, but beachmat has been relying on C++17 for a while so a user would have needed it in the stack anyway to install glmGamPoi.