Add linear regression for `IterableMatrix`

ycli1995 commented 4 months ago

Hi, @bnprks,

I finally get LinearResidual works correctly. This idea was generally inspired from ResidualMatrix package.

A new test has been added to test-matrix_transforms.R to make sure linear_regress_out get the same results compared to Seurat:::RegressOutMatrix.

bnprks commented 4 months ago

Hi @ycli1995, this looks really good, thanks!

I'll probably have a few small suggestions once I've done a full review, likely around adding some detail to the documentation and thinking about the best function name and parameter names for R. I'll add those as comments via the github review interface as I come up with them.

I do have one additional suggestion which is optional, but would massively speed up running PCA on these matrices: if you also implement vecMultiply(Right/Left) and maybe even denseMultiply(Right/Left) on the LinearResidual class. I think Shift.cpp will be a useful reference here again for the useful Eigen functions to accomplish this.

This will speed up PCA a bunch because explicitly calculating X - Q * Qty is pretty slow as it's a dense matrix. But if we calculate (X - Q * Qty)v as (X * v) - Q * (Qty * v) then we'll have a sparse multiply with X (fast), and two fast multiplies with the relatively-small Q and Qty matrices. I'd estimate at 10-20x speedups since RNA matrices are usually around 95% sparse.

ycli1995 commented 4 months ago

Hi, @bnprks.

Thank you so much for those suggestions! I've push new commits for them.

About changing latent_data to features, currently the word "features" usually means "genes", "peaks" or something as row names of a single-cell matrix, especially in R (mainly in Seurat). Therefore, "features" may be not the right name for those confounding data in linear regression. In Seurat, confounding data to be regressed out are named as latent data, and in scanpy are named as regressors. I think keeping consistent with Seurat could be a good idea currently.

ycli1995 commented 4 months ago

Hi, @bnprks

New commits have been added to ungroup the matrix multiplications and add corresponded R tests. Note that the results of multiplications passed expect_equal but not expect_identical, although I think all.equal is restrict enough. Therefore I added an option test_func = expect_identical for test_dense_multiply_ops.
I also agree that the variables to be regressed out should be printed to users. A new slot vars_to_regress has been added to TransformLinearResidual.
For latent_data, I think it's a good idea that we keep it as simple as possible currently:
- I agree that we should not silently do nothing when latent_data is empty. In other words, if one doesn't want to regress out anything, just don't need to call the function. Therefore I decided to keep latent_data as a compulsive parameter.
- A model matrix and a bundle of variables (stored in a table) are totally different things. Treating both them as a single parameter latent_data may lead to unnecessary confusions. We just need to keep latent_data as what it is documented.
- Should we allow users to pass their own model matrix? In our current implementation, we mainly focus on regressing out the effects of provided variables with a linear model. That is what the calculations and usages for col_params and row_params are defined based on, as well as how the model_mat is constructed from latent_data. I've tried to make it more clear in the documents.
- In the future, if we want to add supports to other methods, the implementions can be carried by independent functions.

Thanks again for those great suggestions! I think after we add the linear regression feature to BPCells, we can finally make almost all the general data pre-processing for large scale scRNA-seq on-disk.

bnprks commented 4 months ago

Hi @ycli1995, thank you for those additional changes. For (1), using expect_equal is fine here. For (2) looks good. For (3) your choice for a simple solution seems good for now. If people want to pass their own model matrix in the future it will be easy to update to allow that.

I double-checked the whole test suite passes on my end, and made a couple additional changes reflected in my commits above:

Add explicit parentheses back to LinearResidual (just the opposite grouping to how they were originally). Without the parentheses still half of the ops would have the wrong grouping due to C++ multiplication operator associativity.
Edit function documentation, focusing on adding a "details" section
Tweak the short_description message
Slightly reduce the test data dimension since I think they are already more than enough to catch errors.
Add news and update the website docs

Let me know if this all looks okay to you, in which case I'll merge this in!

ycli1995 commented 4 months ago

Hi, @bnprks

Thanks for the double-checks. Those improvements look great!

For your concerns about mean and variance calculations, I think users can take advantages of setting threads in matrix_stat() to speed them up. In fact, I roughly tested regress_out on the 10x Genomics dataset of 1.3 million mouse brain cells with 2000 highly variable genes, and it just took several minutes to run matrix_stats with thread = 8L. And the memory usage was still keep within 10GB, what a wonderful package BPCells is!

bnprks commented 4 months ago

Great, fully merged in now and the docs website is updated now as well. Thanks for your work on this!

bnprks / BPCells

Add linear regression for `IterableMatrix` #110