chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: Large collection impacting performance of other collections #2769

Closed grosjeang closed 2 weeks ago

grosjeang commented 3 weeks ago

It seems like large collections impact the retrieval speed of other collections for a given chroma.sqlite3 database.

I ran a few tests starting from an empty Chroma database, adding a collection of 50k embeddings at each step. Every time I add a new collection, I run a retrieval on the first collection. The retrieval time on that first collection (and on all the others) grows almost linearly with the number of embeddings stored in the other collections.

As an example, here is the number of chunks versus the retrieval time I get (default ChromaDB and HNSW parameters, persistent client):

Is this expected behavior in the ChromaDB implementation? Shouldn't all collections be independent?
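A minimal sketch of the test described above, assuming the standard `chromadb` Python client (`create_collection`, `add`, `query`). The function name `run_benchmark`, the `bucket` metadata key, the embedding dimension, and the batch size are illustrative assumptions, not details from the original test. Embeddings are random, and inserts are batched because Chroma limits how many records a single `add` call accepts.

```python
import random
import time


def run_benchmark(client, n_collections=5, n_per_collection=50_000, dim=64,
                  batch_size=1_000, where=None, n_queries=20):
    """Add collections one at a time; after each, time queries on the first one.

    Returns a list of (total_embeddings_stored, mean_seconds_per_query).
    """
    rng = random.Random(0)
    results = []
    first = None
    for c in range(n_collections):
        collection = client.create_collection(name=f"bench_{c}")
        if first is None:
            first = collection
        # Insert random embeddings in batches to stay under the max batch size.
        for start in range(0, n_per_collection, batch_size):
            size = min(batch_size, n_per_collection - start)
            collection.add(
                ids=[f"c{c}-{start + i}" for i in range(size)],
                embeddings=[[rng.random() for _ in range(dim)] for _ in range(size)],
                metadatas=[{"bucket": (start + i) % 10} for i in range(size)],
            )
        # Always query the FIRST collection, with or without a where filter.
        query = [rng.random() for _ in range(dim)]
        t0 = time.perf_counter()
        for _ in range(n_queries):
            first.query(query_embeddings=[query], n_results=10, where=where)
        elapsed = (time.perf_counter() - t0) / n_queries
        results.append(((c + 1) * n_per_collection, elapsed))
    return results


# Usage against a real database (requires `pip install chromadb`):
#   import chromadb
#   client = chromadb.PersistentClient(path="./bench_db")
#   for total, seconds in run_benchmark(client):
#       print(total, seconds)
```

Running it once with `where=None` and once with a filter such as `where={"bucket": 3}`, each against a fresh database path, would help separate the cost of the vector search from the cost of metadata filtering.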

Versions

Chroma 0.5.3, Python 3.11.7, Oracle Linux 8

tazarov commented 2 weeks ago

@grosjeang, can you share some more info on the type of queries you run? HNSW, which is the foundation of the vector query, has O(log N) complexity, so I suspect you have a where filter somewhere in your queries. If that is the case, then sqlite3, which handles the where filtering, may indeed bump up the latency at scale. Indices were recently added to speed up these queries in #2623 (to try it out you'll need to use the latest main).
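To illustrate why an index matters for this kind of filtering, here is a small sketch using only Python's built-in `sqlite3` module. The `metadata` table, its columns, and the index name are hypothetical and deliberately simplified; this is not Chroma's actual schema, nor the change made in #2623. It assumes metadata rows for several collections share one table, so an unindexed filter on one collection has to scan rows belonging to all of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical shared metadata table: rows from every collection live together.
conn.execute(
    "CREATE TABLE metadata (collection TEXT, item_id INTEGER, key TEXT, value TEXT)"
)
rows = [
    (f"col{c}", i, "topic", f"t{i % 50}")
    for c in range(4)
    for i in range(5_000)
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", rows)

QUERY = "SELECT item_id FROM metadata WHERE collection = ? AND key = ? AND value = ?"
ARGS = ("col0", "topic", "t7")


def query_plan(connection):
    """Return SQLite's query plan for QUERY as a single string."""
    plan = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ARGS).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in plan)


# Without an index, SQLite scans the whole table, including other collections.
plan_before = query_plan(conn)

# With a matching index, SQLite can seek directly to the relevant rows.
conn.execute("CREATE INDEX idx_metadata_lookup ON metadata (collection, key, value)")
plan_after = query_plan(conn)

matches = conn.execute(QUERY, ARGS).fetchall()
print(plan_before)
print(plan_after)
print(len(matches))
```

The first plan reports a full table scan, whose cost grows with the total number of rows across all collections, while the second uses the index and touches only the matching rows.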

grosjeang commented 2 weeks ago

@tazarov Thank you for your answer. It was indeed a problem with my where filter, which was a bit too complex!