immanuelazn opened this pull request 2 weeks ago (status: Open)
Looks like a good start! A couple high-level thoughts/comments while this is still in the draft stage:
One of the main goals of the `matrix_stats()` function is to allow collection of multiple row and column stats in a single pass over the matrix. There are a couple of spots where calls to `rowSums()` and `colSums()`, or even separate calls to `matrix_stats()`, could be combined into a single call to reduce read passes over the matrix.
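For example, here's a minimal sketch of what that could look like, assuming the current `matrix_stats()` interface (`row_stats`/`col_stats` arguments, and the behaviour where requesting a higher-order stat also returns the lower-order ones); `mat` is a placeholder `IterableMatrix`:

```r
library(BPCells)

# One read pass over the matrix collects both row and column statistics.
stats <- matrix_stats(
  mat,
  row_stats = "variance",  # per-feature stats: nonzero, mean, variance
  col_stats = "mean"       # per-cell stats: nonzero, mean
)

feature_means <- stats$row_stats["mean", ]
feature_vars  <- stats$row_stats["variance", ]
cell_counts   <- stats$col_stats["mean", ] * nrow(mat)  # equivalent to colSums(mat)
```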
For highly variable features, what you've got is perfectly good, though I personally find the dplyr interfaces the easiest way to express this kind of logic. An underutilized feature is being able to do a `group_by()` followed by `mutate()` (rather than `group_by()` + `summarize()`). It would look something like this:

```r
library(dplyr)

features_df %>%
  mutate(bin = cut(mean, breaks = n_bins, labels = FALSE)) %>%
  group_by(bin) %>%
  mutate(normalized_dispersion = (dispersion - mean(dispersion)) / sd(dispersion))
```
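From there, picking the top variable features would just be one more verb on the ungrouped result (sketch only; `normalized_df` and `n_features` are placeholder names, not anything from the PR):

```r
# Assuming `normalized_df` is the result of the group_by/mutate chain above:
top_features <- normalized_df %>%
  ungroup() %>%                                     # drop the bin grouping first
  slice_max(normalized_dispersion, n = n_features)  # keep the n most dispersed features
```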
It's a good thought to save intermediate data to avoid re-calculating normalizations. I think saving to an in-memory matrix is not the right approach for BPCells because it re-introduces a memory bottleneck unnecessarily, but saving to disk is a reasonable option to offer. The options I used when benchmarking for the paper are:

- `stage_none` -- don't save any intermediates. This is totally fine for almost everything downstream except for PCA, which will be slower than needed.
- `stage_int` -- stage the counts matrix subset to just the features of interest for PCA. When utilizing a lot of parallelization, it's better to do this than to save normalized data, as disk I/O becomes the real bottleneck rather than CPU.
- `stage_float` -- save the latest normalized copy of the data just prior to any dense transformations like z-scoring (rough sketch below). This is best for PCA speed at low core counts, but at high core counts it will be surpassed by `stage_int`. You can save PCA disk bandwidth by saving as `float` rather than `double` and keeping compression enabled (gives about 30% total space savings on `float`, which is not nothing).

I'm not sure what the best options are to offer to end-users or how to explain them. If we got a fancier PCA solver, we might also be able to make `stage_none` work fine in all cases by doing fewer dense matrix multiplies rather than more vector multiplies.
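For what it's worth, here's roughly what the `stage_float` idea could look like as a sketch (the `stage_float()` helper below is hypothetical, not the PR's implementation; it assumes BPCells' `convert_matrix_type()` and `write_matrix_dir()` plus a caller-supplied normalized matrix):

```r
library(BPCells)

# Hypothetical staging helper: persist the normalized (pre-z-score) matrix to a
# temporary directory as float with compression, so later passes read the
# staged copy from disk instead of re-computing the normalization.
stage_float <- function(mat_normalized, dir = tempfile("lsi_stage_")) {
  mat_float <- convert_matrix_type(mat_normalized, type = "float")  # float vs. double halves bandwidth
  write_matrix_dir(mat_float, dir = dir, compress = TRUE)           # compression saves ~30% more on float
}
```

Calling something like this just before the dense z-score step and continuing from its return value would give the staged-float behaviour; `stage_int` would be the same pattern applied to the feature-subset counts matrix before normalization.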
I'm leaving the memory-saving piece as just saving to a temporary file for now. Let's discuss further in person, then settle on the optimal way of implementing it!
Description
As discussed previously, we would like to have a built-in function for creating latent representations of input matrices. The most common way to do this is LSI. We implemented the functions `lsi()` and `highly_variable_features()` for this purpose.

Tests
`lsi()` and `highly_variable_features()` each have a rudimentary test that checks the output shapes, to determine whether they look reasonable.

For `lsi()`, the `.computeLSI()` function was taken as the reference, which does not include the iterative steps and extra bells and whistles. As low-rank SVD approximation is non-deterministic, we re-project from the latent space back to the log(tf-idf) matrix, then compare against both packages.

For `highly_variable_features()`, we compare against the scanpy function `_highly_variable_genes.highly_variable_genes()`. Top features, (non-)normalized dispersions, and means were all compared against each other.
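To illustrate the re-projection comparison (sketch only; `svd_res` and `reference_logtfidf` are placeholder objects, not names from the PR):

```r
# Reconstruct the log(tf-idf) matrix from the truncated SVD factors and compare
# it against the reference implementation's matrix; sign flips in u and v cancel
# in the product, which sidesteps the ambiguity of comparing latent coordinates directly.
reconstructed <- svd_res$u %*% diag(svd_res$d) %*% t(svd_res$v)
all.equal(reconstructed, as.matrix(reference_logtfidf), tolerance = 1e-6)
```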
Details

We are not looking for a one-to-one implementation of what has been done in ArchR, which is an iterative LSI approach with clustering, variable feature selection, etc. Instead, we implement feature selection as a separate step, and the LSI procedure itself as log(tf-idf) -> z-score normalization -> SVD.

As for projection into LSI space, that will be built in a follow-up PR for the sklearn interface.
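For reference, a rough sketch of the log(tf-idf) -> z-score -> SVD pipeline described above, assuming BPCells-style deferred transforms, `matrix_stats()`, and the `svds()` helper (`mat`, `k`, and the scale factor are placeholders, the idf formula is one common choice, and the real implementation may differ):

```r
library(BPCells)

lsi_sketch <- function(mat, k = 50) {
  # Gather per-feature nonzero counts and per-cell means in a single read pass
  stats <- matrix_stats(mat, row_stats = "nonzero", col_stats = "mean")
  cell_counts   <- stats$col_stats["mean", ] * nrow(mat)  # == colSums(mat)
  feature_cells <- stats$row_stats["nonzero", ]            # cells where each feature is seen

  # log(tf-idf): term frequency per cell, inverse document frequency per feature
  tf  <- multiply_cols(mat, 1 / cell_counts)
  idf <- ncol(mat) / feature_cells
  logtfidf <- log1p(multiply_rows(tf, idf) * 10000)

  # Z-score each feature using stats gathered in a single pass, then truncated SVD
  norm_stats <- matrix_stats(logtfidf, row_stats = "variance")
  means <- norm_stats$row_stats["mean", ]
  sds   <- sqrt(norm_stats$row_stats["variance", ])
  svds((logtfidf - means) / sds, k = k)
}
```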