chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License
84 stars 20 forks source link

get_anndata is slow for large gene lists #36

Closed bkmartinjr closed 11 months ago

bkmartinjr commented 1 year ago

Due to an issue with TileDB query conditions, var/obs queries with very large number of values used in an in expression will be very slow (time is roughly linear to the number of items in the list).

For example, if this query has a very large list of lung_genes, it will be very slow.

lung_genes = list(lung_var["feature_id"])
lung_adata = cell_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_query={"tissue_general": "lung", "is_primary_data": True},
    var_query={"feature_id": lung_genes},
)

This expands to a value_filter on the var DataFrame that looks like feature_id in ["gene1", "gene2", ..., "geneN"], which currently has performance roughly O(N), where N is the number of possible matches (right side of the in operator).

a MUCH faster alternative is to directly use the soma_joinids (coordinates) and skip the table scan.

The filter issue has been reported to TileDB. Still considering an appropriate work-around for the get_anndata API. It may be helpful to expose the coords that the underlying experiment query supports.


Update: ETA from TileDB for a fix to the underlying query condition is late Q1. Tracking id 24310

Update [2023-07-11] this work did not make the TileDB embedded 2.16 release train, and is now slated for 2.17 (ETA Q3 2023)

Update [2023-09-29] this is now slated for early Q4 2023 in a 1.5.X release.

Update [2023-10-12] this has been fixed via single-cell-data/TileDB-SOMA#1756 Should be in the next release.


performance test case:

"""
Test case for cellxgene-census#36
"""

import sys
import time

import cellxgene_census
import tiledbsoma

def main():
    print(tiledbsoma.show_package_versions())
    organism = "homo_sapiens"
    with cellxgene_census.open_soma(census_version="latest") as census:
        exp = census["census_data"][organism]
        var_df = exp.ms["RNA"].var.read().concat().to_pandas()
        test_cases = [
            slice(0, 10),
            slice(0, 100),
            slice(0, 1000, 10),
            slice(0, 250),
            slice(0, 500),
            slice(0, 1000),
            slice(0, 2000),
            slice(0, 4000),
        ]

        print("test,                   coords,    filter")
        for test_slc in test_cases:
            genes = var_df[test_slc]

            # Coordinates test
            var_coords = genes.soma_joinid.to_numpy()
            t_start = time.perf_counter()
            exp.ms['RNA'].var.read(coords=(var_coords,)).concat()
            t_coords = time.perf_counter() - t_start

            # Value filter test
            var_value_filter = f"feature_id in {genes.feature_id.to_list()}"
            t_start = time.perf_counter()
            exp.ms['RNA'].var.read(value_filter=var_value_filter).concat()
            t_value_filter = time.perf_counter() - t_start

            print(f"{str(test_slc):>20}, {t_coords:8.2f}, {t_value_filter:8.2f}")

if __name__ == "__main__":
    sys.exit(main())

Results of above test case as of 20230927, on main branch and previous release:

tiledbsoma.__version__        1.4.3
TileDB-Py tiledb.version()    (0, 22, 3)
TileDB core version           2.16.3
libtiledbsoma version()       libtiledb=2.16.2
python version                3.10.12.final.0
OS version                    Linux 6.2.0-1012-aws
None
The "latest" release is currently 2023-09-25. Specify 'census_version="2023-09-25"' in future calls to open_soma() to ensure data consistency.
test,                   coords,    filter
  slice(0, 10, None),     0.29,     0.33
 slice(0, 100, None),     0.30,     0.70
  slice(0, 1000, 10),     0.25,     0.62
 slice(0, 250, None),     0.36,     1.01
 slice(0, 500, None),     0.24,     1.71
slice(0, 1000, None),     0.24,     3.25
slice(0, 2000, None),     0.29,     6.28
slice(0, 4000, None),     0.29,    12.81

tiledbsoma.__version__        1.5.0rc0.post9.dev3876950273
TileDB-Py tiledb.version()    (0, 23, 0)
TileDB core version           2.17.0
libtiledbsoma version()       libtiledb=2.17.0
python version                3.10.12.final.0
OS version                    Linux 6.2.0-1012-aws
None
The "latest" release is currently 2023-09-25. Specify 'census_version="2023-09-25"' in future calls to open_soma() to ensure data consistency.
test,                   coords,    filter
  slice(0, 10, None),     0.28,     0.44
 slice(0, 100, None),     0.29,     0.68
  slice(0, 1000, 10),     0.24,     0.67
 slice(0, 250, None),     0.34,     1.09
 slice(0, 500, None),     0.26,     2.42
slice(0, 1000, None),     0.28,     3.23
slice(0, 2000, None),     0.25,     6.32
slice(0, 4000, None),     0.31,    13.51
atolopko-czi commented 1 year ago

@bkmartinjr: This issue is now just waiting for the TileDB fix for the underlying query condition, right? Unless we want to add performance tests it seems this isn't actionable. Close?

Or maybe we have the API issue a warning or error on a large value set for the var query.

bkmartinjr commented 1 year ago

I don't think we should add any work-arounds.

I was leaving it open to track the issue on our side, as we don't have access to their tracking system. I didn't want to lose sight of our need for this to be fixed soon. Do you have an alternative preference for tracking these types of dependencies?

atolopko-czi commented 1 year ago

I don't think we should add any work-arounds.

I was leaving it open to track the issue on our side, as we don't have access to their tracking system. I didn't want to lose sight of our need for this to be fixed soon. Do you have an alternative preference for tracking these types of dependencies?

Added a (new) blocked label for now.

brianraymor commented 1 year ago

Update: ETA from TileDB for a fix to the underlying query condition is late Q1. Tracking id 24310

@ebezzi to follow up with TileDB to understand if this is unblocked. In the future, we will request that TileDB open a corresponding github issue that can block issues like this one.

bkmartinjr commented 1 year ago

No need to ask - I already did two weeks ago :-).

The enhancement is slated for TileDB core v 2.16, which is imminent. After that, it simply requires incorporation into tiledbsoma.

bkmartinjr commented 1 year ago

Update: verified that this is resolved by the tiledbsoma 1.5RC. Awaiting the actual 1.5 release, after which the cellxgene-census package will release with an updated dependency pin.