chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Execution Time for highly_variable_genes Function #1242

Open ckrilow opened 1 month ago

ckrilow commented 1 month ago

Describe the bug

When executing `highly_variable_genes` (imported via `from cellxgene_census.experimental.pp import highly_variable_genes`) on 3,898 cells across ~60K genes, the call takes 38 seconds, and I am not sure if this is expected.

To Reproduce

```python
import cellxgene_census
import tiledbsoma
from cellxgene_census.experimental.pp import highly_variable_genes

census = cellxgene_census.open_soma()

cell_filter = (
    "is_primary_data == True "
    "and tissue_general == 'lung' "
    "and cell_type == 'T cell' "
    "and disease == 'small cell lung carcinoma'"
)

query = census["census_data"]["homo_sapiens"].axis_query(
    measurement_name="RNA",
    obs_query=tiledbsoma.AxisQuery(value_filter=cell_filter),
)

hvg_df = highly_variable_genes(
    query,
    n_top_genes=200,
    # batch_key=["dataset_id"]
)
hvg_df.query("highly_variable == True")
```

Expected behavior

I would expect the result to come back in less than 38 seconds.

Environment

ckrilow commented 1 month ago

I am working on a better reprex and will post more information shortly.

ivirshup commented 1 month ago

I see similar timings running on my laptop. Some results:

Setup:

```python
In [1]: %%time
   ...: import cellxgene_census
   ...: import tiledbsoma
   ...: from cellxgene_census.experimental.pp import highly_variable_genes
CPU times: user 2.92 s, sys: 2.32 s, total: 5.24 s
Wall time: 2.59 s

In [2]: %%time
   ...: cell_filter = (
   ...:     "is_primary_data == True "
   ...:     "and tissue_general == 'lung' "
   ...:     "and cell_type == 'T cell' "
   ...:     "and disease == 'small cell lung carcinoma'"
   ...: )
CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 6.91 µs

In [3]: %%time
   ...: census = cellxgene_census.open_soma()
The "stable" release is currently 2024-07-01. Specify 'census_version="2024-07-01"' in future calls to open_soma() to ensure data consistency.
CPU times: user 144 ms, sys: 51.9 ms, total: 196 ms
Wall time: 1.36 s

In [4]: %%time
   ...: query = census["census_data"]["homo_sapiens"].axis_query(
   ...:     measurement_name="RNA",
   ...:     obs_query=tiledbsoma.AxisQuery(value_filter=cell_filter),
   ...: )
CPU times: user 131 ms, sys: 41.1 ms, total: 172 ms
Wall time: 1.66 s
```
```python
In [5]: %%time
   ...: hvg_df = highly_variable_genes(
   ...:     query,
   ...:     n_top_genes=200,
   ...: )
   ...: hvg_df.query("highly_variable == True")
CPU times: user 1min, sys: 14.8 s, total: 1min 15s
Wall time: 54.6 s
Out[5]: 
                means   variances  highly_variable_rank  variances_norm  highly_variable
soma_joinid                                                                             
179          0.015136    0.057508                 162.0        2.909841             True
295          0.084146    0.350629                 152.0        2.969717             True
...               ...         ...                   ...             ...              ...
32293        0.842996   36.706675                   3.0        9.840940             True
32713        0.319651    1.664284                 191.0        2.761724             True

[200 rows x 5 columns]
```

```python
In [7]: %%time
   ...: adata = query.to_anndata(X_name="raw")
CPU times: user 30.5 s, sys: 6.08 s, total: 36.6 s
Wall time: 28.2 s
```

These times are cut by two thirds when I run this on AWS.

For comparison:

```python
In [8]: import scanpy as sc
   ...: %time sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=200)
CPU times: user 279 ms, sys: 17 ms, total: 296 ms
Wall time: 299 ms
```

The implementation of `highly_variable_genes` here assumes you have a lot of data, so it definitely isn't optimized for the small-data use case. But it seems that about half the time of this computation is just pulling down the data.

I think there's also a pretty good workaround: just use scanpy here, since there's fairly comprehensive testing that the results are the same.