chanzuckerberg / cellxgene

An interactive explorer for single-cell transcriptomics data
https://chanzuckerberg.github.io/cellxgene/
MIT License
614 stars 116 forks source link

diffexp scikit-learn optimization #2468

Open atolopko-czi opened 2 years ago

atolopko-czi commented 2 years ago

From Slack: Alexander Tarashansky: In backend/common/compute/diffexp_generic.py, (lines 122-133):

    with np.errstate(divide="call", invalid="call", call=fp_err_set):
        n = X.shape[0]
        if sparse.issparse(X):
            mean = X.mean(axis=0).A1
            dfm = X - mean # This is going to be really really slow for big datasets
            sumsq = np.sum(np.multiply(dfm, dfm), axis=0).A1
            v = sumsq / (n - 1)
        else:
            mean = X.mean(axis=0)
            dfm = X - mean
            sumsq = np.sum(np.multiply(dfm, dfm), axis=0)
            v = sumsq / (n - 1)

I added a comment next to the problem line. For extremely huge, sparse matrices (like tabula sapiens), this is going to create an intermediate, totally dense copy of the sparse data. I'd recommend using this utility function from sklearn instead: http://scikit-learn.org/stable/modules/generated/sklearn.utils.sparsefuncs.mean_variance_axis.html

import sklearn.utils.sparsefuncs as sf
def mean_var_n(X):
    """
    Two-pass variance calculation.  Numerically (more) stable 
    (I removed the infinite/nan-checking since that's handled internally by numpy/sklearn)
    than naive methods (and same method used by numpy.var())
    https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Two-pass
    """
    n = X.shape[0]
    if sparse.issparse(X):
        mean,v = sf.mean_variance_axis(X,axis=0)
    else:
        mean,v = X.mean(0),X.var(0)

    return mean, v, n

for ex, finding marker genes for a cluster of 20 cells in tabula sapiens took 8 minutes before the above change and 40s after

atarashansky commented 2 years ago

If the above solution seems reasonable, I can go ahead and make the PR. Let me know!

atolopko-czi commented 2 years ago

If the above solution seems reasonable, I can go ahead and make the PR. Let me know!

Please do! I'm not sure when the next cellxgene release will be cut, but this may be reason enough to do so.