chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Census cell similarity search: pipeline to build TileDB Vector Indexes of cell embeddings #1112

Closed mlin closed 3 months ago

mlin commented 5 months ago

Develop a productionizable pipeline to build TileDB-Vector-Search indexes from the stored Census embeddings (starting with scVI, but also UCE, Geneformer, etc.). The pipeline consists of Python code that reads the sparse embeddings arrays and builds the indexes (which are themselves TileDB arrays), packaged up for cloud deployment. Each set of embeddings is expected to take a few hours to process, and the different sets can be processed in parallel.
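To make the shape of that pipeline concrete, here is a rough stdlib-only sketch of the two steps described above: densifying sparse embeddings and building one index per embedding set in parallel. Everything in it is an assumption for illustration. The `(cell_id, dim, value)` triples stand in for the stored sparse arrays, and `densify`, `build_index`, and `build_all` are hypothetical names. `build_index` is only a placeholder that counts vectors; it does not call TileDB-Vector-Search.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Tuple

# Hypothetical in-memory stand-in for one sparse embeddings array:
# (cell_id, dimension_index, value) triples in COO form.
Triple = Tuple[int, int, float]


def densify(triples: Iterable[Triple], n_dims: int) -> Dict[int, List[float]]:
    """Turn COO triples into one dense vector per cell; absent entries stay 0.0."""
    vectors: Dict[int, List[float]] = {}
    for cell_id, dim, value in triples:
        vectors.setdefault(cell_id, [0.0] * n_dims)[dim] = value
    return vectors


def build_index(name: str, triples: List[Triple], n_dims: int) -> Tuple[str, int]:
    """Placeholder for the index build step (assumption, not the real builder)."""
    vectors = densify(triples, n_dims)
    # A real pipeline would pass the dense vectors to the index builder here.
    return name, len(vectors)


def build_all(embeddings: Dict[str, List[Triple]], n_dims: int) -> Dict[str, int]:
    """Build each embedding set independently; returns vectors indexed per set."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(build_index, name, triples, n_dims)
            for name, triples in embeddings.items()
        ]
        return dict(f.result() for f in futures)


if __name__ == "__main__":
    toy = {
        "scvi": [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)],
        "uce": [(5, 0, 0.5)],
    }
    print(build_all(toy, n_dims=2))
```

In a cloud deployment the per-set parallelism would more likely come from the workflow runner (one task per embedding set) than from threads in a single process; the thread pool here only illustrates that the sets are independent.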

Unless suggested otherwise, I'll package this as a dockerized WDL pipeline since that's most familiar to me (@mlin).