immanuelazn opened this pull request 2 weeks ago (status: Open)
Looks like a good start! A couple high-level thoughts/comments while this is still in the draft stage:
One of the main goals of the `matrix_stats()` function is to allow collection of multiple row and column stats in a single pass over the matrix. There are a couple of spots where calls to `rowSums()` and `colSums()`, or even separate calls to `matrix_stats()`, could be combined into a single call to reduce read passes over the matrix.
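For example, here's a minimal sketch of what that could look like, assuming the current `matrix_stats()` interface (`row_stats`/`col_stats` arguments, and the behaviour where requesting a higher-order stat also returns the lower-order ones); `mat` is a placeholder `IterableMatrix`:

```r
library(BPCells)

# One read pass over the matrix collects both row and column statistics.
stats <- matrix_stats(
  mat,
  row_stats = "variance",  # per-feature stats: nonzero, mean, variance
  col_stats = "mean"       # per-cell stats: nonzero, mean
)

feature_means <- stats$row_stats["mean", ]
feature_vars  <- stats$row_stats["variance", ]
cell_counts   <- stats$col_stats["mean", ] * nrow(mat)  # equivalent to colSums(mat)
```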
For highly variable features, what you've got is perfectly good, though I personally find the dplyr interfaces the easiest way to express this kind of logic. An underutilized feature is being able to do a `group_by()` followed by `mutate()` (rather than `group_by()` + `summarize()`). It would look something like this:

```r
library(dplyr)

features_df %>%
  mutate(bin = cut(mean, breaks = n_bins, labels = FALSE)) %>%
  group_by(bin) %>%
  mutate(normalized_dispersion = (dispersion - mean(dispersion)) / sd(dispersion))
```
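From there, picking the top variable features would just be one more verb on the ungrouped result (sketch only; `normalized_df` and `n_features` are placeholder names, not anything from the PR):

```r
# Assuming `normalized_df` is the result of the group_by/mutate chain above:
top_features <- normalized_df %>%
  ungroup() %>%                                     # drop the bin grouping first
  slice_max(normalized_dispersion, n = n_features)  # keep the n most dispersed features
```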
It's a good thought to save intermediate data to avoid re-calculating normalizations. I think saving to an in-memory matrix is not the right approach for BPCells because it re-introduces a memory bottleneck unnecessarily, but saving to disk is a reasonable option to offer. The options I used when benchmarking for the paper are:

- `stage_none` -- don't save any intermediates. This is totally fine for almost everything downstream except for PCA, which will be slower than needed.
- `stage_int` -- stage the counts matrix subset to just the features of interest for PCA. When utilizing a lot of parallelization, it's better to do this than to save normalized data, as disk I/O becomes the real bottleneck rather than CPU.
- `stage_float` -- save the latest normalized copy of the data just prior to any dense transformations like z-scoring (rough sketch below). This is best for PCA speed at low core counts, but at high core counts it will be surpassed by `stage_int`. You can save PCA disk bandwidth by saving as `float` rather than `double` and keeping compression enabled (gives about 30% total space savings on `float`, which is not nothing).

I'm not sure what the best options are to offer to end-users or how to explain them. If we got a fancier PCA solver, we might also be able to make `stage_none` work fine in all cases by doing fewer dense matrix multiplies rather than more vector multiplies.
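For what it's worth, here's roughly what the `stage_float` idea could look like as a sketch (the `stage_float()` helper below is hypothetical, not the PR's implementation; it assumes BPCells' `convert_matrix_type()` and `write_matrix_dir()` plus a caller-supplied normalized matrix):

```r
library(BPCells)

# Hypothetical staging helper: persist the normalized (pre-z-score) matrix to a
# temporary directory as float with compression, so later passes read the
# staged copy from disk instead of re-computing the normalization.
stage_float <- function(mat_normalized, dir = tempfile("lsi_stage_")) {
  mat_float <- convert_matrix_type(mat_normalized, type = "float")  # float vs. double halves bandwidth
  write_matrix_dir(mat_float, dir = dir, compress = TRUE)           # compression saves ~30% more on float
}
```

Calling something like this just before the dense z-score step and continuing from its return value would give the staged-float behaviour; `stage_int` would be the same pattern applied to the feature-subset counts matrix before normalization.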
I'm leaving the memory-saving piece as just saving to a temporary file for now. Let's discuss further in person, then settle on the optimal way of implementing it!
Description
As discussed previously, we would like to have a built-in function for creating latent representations of input matrices. The most common way to do this is LSI. We implemented the functions `lsi()` and `highly_variable_features()` for this purpose.

Tests
`lsi()` and `highly_variable_features()` each have a rudimentary test that checks the output shapes, to determine whether they look reasonable.

For `lsi()`, the `.computeLSI()` function was taken as the reference, which does not include the iterative steps and extra bells and whistles. As low-rank SVD approximation is non-deterministic, we re-project from the latent space back to the log(tf-idf) matrix, then compare against both packages.

For `highly_variable_features()`, we compare against the scanpy function `_highly_variable_genes.highly_variable_genes()`. Top features, (non-)normalized dispersions, and means were all compared against each other.
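To illustrate the re-projection comparison (sketch only; `svd_res` and `reference_logtfidf` are placeholder objects, not names from the PR):

```r
# Reconstruct the log(tf-idf) matrix from the truncated SVD factors and compare
# it against the reference implementation's matrix; sign flips in u and v cancel
# in the product, which sidesteps the ambiguity of comparing latent coordinates directly.
reconstructed <- svd_res$u %*% diag(svd_res$d) %*% t(svd_res$v)
all.equal(reconstructed, as.matrix(reference_logtfidf), tolerance = 1e-6)
```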
Details

We are not looking for a one-to-one implementation of what has been done in ArchR, which is an iterative LSI approach with clustering, variable feature selection, etc. Instead, we implement feature selection as a separate step, and the LSI procedure itself as log(tf-idf) -> z-score normalization -> SVD.

As for projection into LSI space, that will be built in a follow-up PR for the sklearn interface.
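For reference, a rough sketch of the log(tf-idf) -> z-score -> SVD pipeline described above, assuming BPCells-style deferred transforms, `matrix_stats()`, and the `svds()` helper (`mat`, `k`, and the scale factor are placeholders, the idf formula is one common choice, and the real implementation may differ):

```r
library(BPCells)

lsi_sketch <- function(mat, k = 50) {
  # Gather per-feature nonzero counts and per-cell means in a single read pass
  stats <- matrix_stats(mat, row_stats = "nonzero", col_stats = "mean")
  cell_counts   <- stats$col_stats["mean", ] * nrow(mat)  # == colSums(mat)
  feature_cells <- stats$row_stats["nonzero", ]            # cells where each feature is seen

  # log(tf-idf): term frequency per cell, inverse document frequency per feature
  tf  <- multiply_cols(mat, 1 / cell_counts)
  idf <- ncol(mat) / feature_cells
  logtfidf <- log1p(multiply_rows(tf, idf) * 10000)

  # Z-score each feature using stats gathered in a single pass, then truncated SVD
  norm_stats <- matrix_stats(logtfidf, row_stats = "variance")
  means <- norm_stats$row_stats["mean", ]
  sds   <- sqrt(norm_stats$row_stats["variance", ])
  svds((logtfidf - means) / sds, k = k)
}
```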