Closed bkmartinjr closed 11 months ago
@bkmartinjr: This issue is now just waiting for the TileDB fix for the underlying query condition, right? Unless we want to add performance tests it seems this isn't actionable. Close?
Or maybe we have the API issue a warning or error on a large value set for the var query.
I don't think we should add any work-arounds.
I was leaving it open to track the issue on our side, as we don't have access to their tracking system. I didn't want to lose sight of our need for this to be fixed soon. Do you have an alternative preference for tracking these types of dependencies?
Added a (new) `blocked` label for now.
Update: ETA from TileDB for a fix to the underlying query condition is late Q1. Tracking id 24310
@ebezzi to follow up with TileDB to understand if this is unblocked. In the future, we will request that TileDB open a corresponding github issue that can block issues like this one.
No need to ask - I already did two weeks ago :-).
The enhancement is slated for TileDB core v 2.16, which is imminent. After that, it simply requires incorporation into tiledbsoma.
Update: verified that this is resolved by the tiledbsoma 1.5RC. Awaiting the actual 1.5 release, after which the cellxgene-census package will release with an updated dependency pin.
Due to an issue with TileDB query conditions, var/obs queries with a very large number of values used in an `in` expression will be very slow (time is roughly linear in the number of items in the list).

For example, if this query has a very large list of `lung_genes`, it will be very slow.

This expands to a `value_filter` on the `var` DataFrame that looks like `feature_id in ["gene1", "gene2", ..., "geneN"]`, which currently has performance roughly O(N), where N is the number of possible matches (right side of the `in` operator).

A MUCH faster alternative is to directly use the `soma_joinids` (coordinates) and skip the table scan.

The filter issue has been reported to TileDB. Still considering an appropriate work-around for the `get_anndata` API. It may be helpful to expose the coords that the underlying experiment query supports.

Update: ETA from TileDB for a fix to the underlying query condition is late Q1. Tracking id 24310

Update [2023-07-11] this work did not make the TileDB embedded 2.16 release train, and is now slated for 2.17 (ETA Q3 2023)

Update [2023-09-29] this is now slated for early Q4 2023 in a 1.5.X release.
Update [2023-10-12] this has been fixed via single-cell-data/TileDB-SOMA#1756. Should be in the next release.
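For illustration only (this is not the original test case from this issue): a minimal sketch of the two access patterns discussed above. The helper names `build_in_filter`, `fetch_with_value_filter`, and `fetch_with_joinids` are hypothetical, as are the `"Homo sapiens"` / `"RNA"` / `"raw"` arguments. It assumes `cellxgene_census.get_anndata` accepts `var_value_filter`, and that a `tiledbsoma` experiment supports `axis_query` with `AxisQuery(coords=...)`.

```python
from typing import Sequence


def build_in_filter(column: str, values: Sequence[str]) -> str:
    """Build a filter string such as: feature_id in ['gene1', 'gene2'].

    Hypothetical helper: it only formats the expression. The resulting
    string grows with len(values), which is the N in the O(N) cost above.
    """
    if not values:
        raise ValueError("values must be non-empty")
    return f"{column} in {list(values)!r}"


def fetch_with_value_filter(census, lung_genes: Sequence[str]):
    """Slow path from the issue: a large `in` list evaluated as a value_filter."""
    import cellxgene_census  # imported lazily so the helper above has no dependencies

    return cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",  # assumed organism, for illustration
        var_value_filter=build_in_filter("feature_id", lung_genes),
    )


def fetch_with_joinids(experiment, var_joinids: Sequence[int]):
    """Faster path (sketch): select var rows by soma_joinid coordinates."""
    import tiledbsoma  # imported lazily, as above

    # Passing coords selects rows directly by soma_joinid, so no filter
    # expression has to be evaluated against the var DataFrame.
    with experiment.axis_query(
        measurement_name="RNA",  # assumed measurement name
        var_query=tiledbsoma.AxisQuery(coords=(list(var_joinids),)),
    ) as query:
        return query.to_anndata(X_name="raw")  # assumed X layer name
```

The second function presumes the caller already knows the relevant `soma_joinid` values, for example from an earlier, smaller read of the `var` DataFrame.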
performance test case:
Results of above test case as of 20230927, on main branch and previous release: