chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Embeddings search experimental API #1164

Closed mlin closed 1 month ago

mlin commented 4 months ago

Adds two new functions to cellxgene_census.experimental:

  1. find_nearest_obs uses TileDB-Vector-Search indexes of Census embeddings to find nearest neighbors of given embedding vectors (in an AnnData obsm layer). #1114
  2. predict_obs_metadata uses the nearest neighbors to predict metadata attributes like cell_type and tissue_general for the query cells. The naive initial implementation is just a starting point for experimentation. #1115
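
To make the two steps concrete, here is a minimal, self-contained sketch of the general idea: find the nearest reference cells for each query embedding, then predict a label from those neighbors. It is not the code in this PR. The function names `nearest_neighbors` and `predict_by_majority_vote` are hypothetical, the search is brute force over toy data rather than a TileDB-Vector-Search index, and majority voting is only an assumption about what a "naive" predictor might look like.

```python
import math
from collections import Counter


def nearest_neighbors(query, reference, k):
    """Brute-force k nearest neighbors by Euclidean distance (hypothetical helper).

    Returns, for each query vector, a list of (distance, reference_index)
    pairs sorted from nearest to farthest.
    """
    results = []
    for q in query:
        scored = sorted((math.dist(q, r), i) for i, r in enumerate(reference))
        results.append(scored[:k])
    return results


def predict_by_majority_vote(neighbors, reference_labels):
    """Predict one label per query cell from its neighbors (hypothetical helper).

    Returns (label, confidence) pairs, where confidence is the fraction of
    neighbors that carry the winning label.
    """
    predictions = []
    for hits in neighbors:
        votes = Counter(reference_labels[i] for _, i in hits)
        label, count = votes.most_common(1)[0]
        predictions.append((label, count / len(hits)))
    return predictions


# Toy 2-D "embeddings" standing in for Census embedding vectors.
reference = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
labels = ["T cell", "T cell", "B cell", "neuron", "neuron"]
query = [(0.05, 0.05), (5.0, 5.1)]

neighbors = nearest_neighbors(query, reference, k=3)
predictions = predict_by_majority_vote(neighbors, labels)
for label, confidence in predictions:
    print(f"{label} (confidence {confidence:.2f})")
```

Splitting the search from the prediction mirrors the split between the two functions described above: the neighbor results can be reused to predict several metadata columns without searching again.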

The TileDB-Vector-Search query speed seems to be very sensitive to S3 latency, even more so than typical Census queries. It's many times faster to run from within AWS us-west-2 than externally.

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 96.82540% with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.41%. Comparing base (eb8f449) to head (5c0668e). Report is 2 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...cellxgene_census/experimental/_embedding_search.py | 96.72% | 2 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1164      +/-   ##
==========================================
+ Coverage   91.26%   91.41%   +0.15%
==========================================
  Files          80       82       +2
  Lines        6329     6463     +134
==========================================
+ Hits         5776     5908     +132
- Misses        553      555       +2
```

| [Flag](https://app.codecov.io/gh/chanzuckerberg/cellxgene-census/pull/1164/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=chanzuckerberg) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/chanzuckerberg/cellxgene-census/pull/1164/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=chanzuckerberg) | `91.41% <96.82%> (+0.15%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=chanzuckerberg#carryforward-flags-in-the-pull-request-comment) to find out more.


mlin commented 3 months ago

@ebezzi Putting this up for initial review since it's working well, but we still need to plan action on #1181 -- this still copies the approach of hard-coding the base S3 URI.

mlin commented 3 months ago

@ebezzi @pablo-gar @ivirshup Updated this to resolve indexes through the mirrors/contributions JSON and remove the need for the caller to call get_embedding_metadata_by_name() themselves. Please take another pass, including the prior discussion. Unfortunately, we currently have known CI issues, but I've run the new test cases locally. 🙏

mlin commented 1 month ago

@ivirshup I split out the perf optimization to #1257 since I was still getting an error; I'll write more there -- hope you don't mind, it's only because I need to triage desperately right now!