Closed ycli1995 closed 4 months ago
Hi @ycli1995, this looks really good, thanks!
I'll probably have a few small suggestions once I've done a full review, likely around adding some detail to the documentation and thinking about the best function name and parameter names for R. I'll add those as comments via the github review interface as I come up with them.
I do have one additional suggestion which is optional, but would massively speed up running PCA on these matrices: if you also implement vecMultiply(Right/Left)
and maybe even denseMultiply(Right/Left)
on the LinearResidual
class. I think Shift.cpp
will be a useful reference here again for the useful Eigen functions to accomplish this.
This will speed up PCA a bunch because explicitly calculating X - Q * Qty
is pretty slow as it's a dense matrix. But if we calculate (X - Q * Qty)v
as (X * v) - Q * (Qty * v)
then we'll have a sparse multiply with X (fast), and two fast multiplies with the relatively-small Q
and Qty
matrices. I'd estimate at 10-20x speedups since RNA matrices are usually around 95% sparse.
Hi, @bnprks.
Thank you so much for those suggestions! I've push new commits for them.
About changing latent_data
to features
, currently the word "features" usually means "genes", "peaks" or something as row names of a single-cell matrix, especially in R (mainly in Seurat
). Therefore, "features" may be not the right name for those confounding data in linear regression. In Seurat
, confounding data to be regressed out are named as latent data
, and in scanpy
are named as regressors
. I think keeping consistent with Seurat
could be a good idea currently.
Hi, @bnprks
expect_equal
but not expect_identical
, although I think all.equal
is restrict enough. Therefore I added an option test_func = expect_identical
for test_dense_multiply_ops
.vars_to_regress
has been added to TransformLinearResidual
.latent_data
, I think it's a good idea that we keep it as simple as possible currently:
latent_data
is empty. In other words, if one doesn't want to regress out anything, just don't need to call the function. Therefore I decided to keep latent_data
as a compulsive parameter. latent_data
may lead to unnecessary confusions. We just need to keep latent_data
as what it is documented.col_params
and row_params
are defined based on, as well as how the model_mat
is constructed from latent_data
. I've tried to make it more clear in the documents.Thanks again for those great suggestions! I think after we add the linear regression feature to BPCells
, we can finally make almost all the general data pre-processing for large scale scRNA-seq on-disk.
Hi @ycli1995, thank you for those additional changes. For (1), using expect_equal
is fine here. For (2) looks good. For (3) your choice for a simple solution seems good for now. If people want to pass their own model matrix in the future it will be easy to update to allow that.
I double-checked the whole test suite passes on my end, and made a couple additional changes reflected in my commits above:
short_description
messageLet me know if this all looks okay to you, in which case I'll merge this in!
Hi, @bnprks
Thanks for the double-checks. Those improvements look great!
For your concerns about mean and variance calculations, I think users can take advantages of setting threads
in matrix_stat()
to speed them up. In fact, I roughly tested regress_out
on the 10x Genomics dataset of 1.3 million mouse brain cells with 2000 highly variable genes, and it just took several minutes to run matrix_stats
with thread = 8L
. And the memory usage was still keep within 10GB, what a wonderful package BPCells
is!
Great, fully merged in now and the docs website is updated now as well. Thanks for your work on this!
Hi, @bnprks,
I finally get
LinearResidual
works correctly. This idea was generally inspired from ResidualMatrix package.A new test has been added to
test-matrix_transforms.R
to make surelinear_regress_out
get the same results compared toSeurat:::RegressOutMatrix
.