chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Census cell similarity search: finalize how to deploy TileDB vector indexes #1116

Open mlin opened 2 months ago

mlin commented 2 months ago

The Census cell similarity search is backed by TileDB-Vector-Search indexes of the embeddings. These indexes are themselves TileDB arrays, which we will store on S3. We need to finalize where they should be stored on S3 and the procedures we'll use to build and publish them there for each Census LTS release. The experimental Python APIs should then use the finalized locations.

mlin commented 1 month ago

The S3 folder with the final-for-now indexes is currently in a private bucket (details on Slack).

The folder structure there starts with a 2023-12-15 subfolder (census version).

The goal is to copy it into a suitable public location under s3://cellxgene-contrib-public, similar to the CENSUS_EMBEDDINGS_LOCATION_BASE_URI we use to resolve the embedding arrays themselves. The index location would likewise be resolved by appending the census version and embedding ID to a base URI.

Again, the staging folder already has the desired structure, so we just need to preserve it when copying to the public bucket.
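
For illustration, the resolution scheme described above (base URI, then census version, then embedding ID) might look roughly like the sketch below. This is not the actual Census API: the function name `resolve_index_uri`, the constant `HYPOTHETICAL_INDEXES_BASE_URI`, its path under the public bucket, and the embedding ID `example-embedding` are all assumptions made up for the example, since the final location has not been decided.

```python
# Hypothetical base location for the published indexes. The bucket name comes
# from the discussion above; the path under it is a placeholder, not a decision.
HYPOTHETICAL_INDEXES_BASE_URI = "s3://cellxgene-contrib-public/hypothetical/indexes"


def resolve_index_uri(base_uri: str, census_version: str, embedding_id: str) -> str:
    """Build an index URI by appending the census version and embedding ID.

    Mirrors the layout of the staging folder: <base>/<census_version>/<embedding_id>.
    """
    parts = [
        base_uri.rstrip("/"),       # avoid a doubled slash after the base
        census_version.strip("/"),
        embedding_id.strip("/"),
    ]
    if not all(parts):
        raise ValueError("base_uri, census_version, and embedding_id must be non-empty")
    return "/".join(parts)


# Example with the census version mentioned above and a made-up embedding ID.
example_uri = resolve_index_uri(
    HYPOTHETICAL_INDEXES_BASE_URI, "2023-12-15", "example-embedding"
)
print(example_uri)
```

Keeping the base URI as a single constant means the experimental Python APIs would only need one value updated once the public location is finalized.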

mlin commented 4 weeks ago

@ebezzi @metakuni To close out this ticket, is there somewhere we're documenting the Census/LTS release process where we could include information about the indexes?

metakuni commented 3 weeks ago

I believe @ebezzi had put together a doc.