chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/

Address `pinv` performance #941

Closed atolopko-czi closed 7 months ago

atolopko-czi commented 7 months ago

As of commit 2601348ba5ab7c04c1ce69166640a3e5ee4e0f04, the `numpy.linalg.pinv` call is very expensive and accounts for effectively all of the run time. See the `de_wls_stats_pinv` timing line below:

/usr/bin/time -v python -O -m memento.diff_expr "tissue_general_ontology_term_id in ['UBERON:0002405']" sex_ontology_term_id /mnt/census/estimators-cube-70a7705-float32/ 1 10
DEBUG:root:computing for 145 obs groups (49874 cells) and 10 features using 1 processes, 10 features/process
DEBUG:root:design shape: (145, 6231)
DEBUG:root:computing for feature group 0, n=10, ENSG00000183878..ENSG00000259707...
DEBUG:root:computing feature
DEBUG:root:pinv m.shape=(6231, 6231)
DEBUG:root:computing feature
DEBUG:root:pinv m.shape=(6231, 6231)
DEBUG:root:computed for feature group 0, ENSG00000183878..ENSG00000259707, result size=2
[timing 22904] compute_for_feature: cum_time=53.466328713009716 sec; avg_time=26.733; calls=2
[timing 22904] de_wls: cum_time=53.458166673008236 sec; avg_time=26.729; calls=2
[timing 22904] de_wls_stats: cum_time=53.35123460300383 sec; avg_time=26.676; calls=2
[timing 22904] de_wls_stats_pinv: cum_time=53.12332922501082 sec; avg_time=26.562; calls=2
[timing 22904] de_wls_stats_matmul: cum_time=0.22296729500521906 sec; avg_time=0.111; calls=2
[timing 22904] de_wls_fit: cum_time=0.10688159501296468 sec; avg_time=0.053; calls=2
[timing 22904] query_estimators: cum_time=0.10440379299689084 sec; avg_time=0.104; calls=1
[timing 22904] fill_missing_data: cum_time=0.005330480009433813 sec; avg_time=0.003; calls=2
[timing 22904] drop_invalid_data: cum_time=0.0005337570037227124 sec; avg_time=0.001; calls=1
[timing 22904] transform_to_log_space: cum_time=0.00011587899643927813 sec; avg_time=0.000; calls=2
[timing 22904] de_wls_stats_W: cum_time=4.28729981649667e-05 sec; avg_time=0.000; calls=2

`pinv` receives a square matrix whose dimension is determined by the number of columns in the design (dummy-encoded) matrix. That column count, in turn, is a function of the number of distinct values across all of the covariate dimensions. For example, if cell type were a covariate, each distinct cell type would add a column to the design matrix, and thus enlarge the matrix argument passed to `pinv`.
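
A minimal sketch of that scaling (hypothetical covariate names and a placeholder weight vector; interpreting `m` as a WLS normal-equations matrix is an assumption, not taken from the memento code):

```python
import numpy as np
import pandas as pd

# Toy obs-group table with two covariates of differing cardinality
# (hypothetical column names, not the actual cube schema).
obs = pd.DataFrame(
    {
        "sex": ["male", "female"] * 72 + ["male"],
        "cell_type": [f"cell_type_{i % 50}" for i in range(145)],
    }
)

# One-hot encode the covariates to build the design matrix X.
X = pd.get_dummies(obs, dtype=float).to_numpy()
print(X.shape)  # (145, 52): 2 sex levels + 50 cell_type levels

# Assuming a WLS-style normal-equations matrix, the pinv argument is
# (n_cols, n_cols), so every additional covariate level enlarges it.
W = np.ones(X.shape[0])           # placeholder per-group weights
m = X.T @ (W[:, None] * X)
print(m.shape)                    # (52, 52)
m_pinv = np.linalg.pinv(m)
```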

The current computation, which relies on `pinv`, may need to be replaced with a more efficient approach.
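
One possible direction, sketched under the assumption that `pinv(m)` is only used to obtain the WLS coefficients; if the pseudoinverse itself is needed downstream (e.g. for standard errors), the trade-off differs:

```python
# Assumption: beta = pinv(X.T @ W @ X) @ X.T @ W @ y is all that pinv is used
# for. Solving the weighted system directly with lstsq operates on the
# (n_obs, n_cols) design instead of the (n_cols, n_cols) normal-equations
# matrix, which is much cheaper when n_obs << n_cols, as in the log above.
import numpy as np

def wls_coefficients(X: np.ndarray, W: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Minimum-norm WLS solution without forming pinv(X.T @ W @ X) explicitly."""
    sw = np.sqrt(W)
    # lstsq returns the minimum-norm least-squares solution, which matches the
    # pinv-based solution of the normal equations for rank-deficient designs.
    beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
    return beta
```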

Note, however, that the matrix passed to `pinv` should be much smaller for real use cases, where not all cube dimensions are wanted as covariates. Currently, all dimensions are used as covariates, but this will change with #942. Once that is implemented and the user can specify particular covariates, the `m` matrix should be much smaller, which will improve performance. In particular, `pinv` cost grows superlinearly with the matrix dimension (roughly cubic for an SVD-based pseudoinverse), so performance may be acceptable for most use cases with a limited set of covariates. It's probably worth implementing #942 first and then testing with "real world" (scientifically useful) queries.
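
The scaling is easy to observe with a rough, illustrative timing loop (numbers will vary by machine and BLAS build):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
for n in (250, 500, 1000, 2000):
    m = rng.standard_normal((n, n))
    start = time.perf_counter()
    np.linalg.pinv(m)
    print(f"n={n}: {time.perf_counter() - start:.2f}s")
```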

atolopko-czi commented 7 months ago

The `pinv` performance problem was caused by a bug in how `pandas.get_dummies()` was being called. Because categorical columns were passed in, `get_dummies()` produced a column for every category, rather than only for the values actually appearing in the rows' data. This caused the returned matrix to be massive (roughly 6000×6000), which in turn made `pinv` very slow. Fixing the input data passed to `get_dummies()` rectified the performance problem.
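
A minimal illustration of that behavior (not the actual change in the referenced commit): a categorical column yields one dummy column per category, even for categories absent from the data, whereas encoding only the observed values keeps the matrix small.

```python
import pandas as pd

# A categorical Series whose dtype knows about 6000 categories,
# of which only 2 appear in the data.
all_cell_types = [f"cell_type_{i}" for i in range(6000)]
subset = pd.Series(
    pd.Categorical(["cell_type_0", "cell_type_1"], categories=all_cell_types)
)

print(pd.get_dummies(subset).shape)              # (2, 6000): one column per category
print(pd.get_dummies(subset.astype(str)).shape)  # (2, 2): only observed values

# Dropping unused categories before encoding has the same effect.
print(pd.get_dummies(subset.cat.remove_unused_categories()).shape)  # (2, 2)
```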

Fixed in 861641a65ae19dd297b2168b6f4e89e924e120b6.