[Bug]: Large collection impacting performance of other collections

grosjeang commented 3 weeks ago

It seems like large collections impact the retrieval speed of other collections for a given chroma.sqlite3 database.

I ran a few tests starting from an empty Chroma database, adding a collection of 50k embeddings at each step. Every time I add a new collection, I run a retrieval on the first collection. The retrieval time on that first collection (and on all the others) grows almost linearly with the number of embeddings stored in the other collections.

As an example, here is the number of chunks versus the retrieval time I get (default ChromaDB and HNSW parameters, persistent client):

50 000 chunks -> 22s
100 000 chunks -> 38s
150 000 chunks -> 51s
200 000 chunks -> 68s
250 000 chunks -> 81s
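
To check the "almost linearly" claim against the numbers above, here is a minimal least-squares sketch in plain Python. It uses only the five measurements reported in this comment; the variable names are illustrative and nothing here comes from ChromaDB itself.

```python
# Least-squares line through the reported (chunks, retrieval time) pairs.
chunks_k = [50, 100, 150, 200, 250]   # thousands of chunks, from the list above
seconds = [22, 38, 51, 68, 81]        # reported retrieval times in seconds

n = len(chunks_k)
mean_x = sum(chunks_k) / n
mean_y = sum(seconds) / n

# Sums of squares and cross-products around the means.
sxx = sum((x - mean_x) ** 2 for x in chunks_k)
syy = sum((y - mean_y) ** 2 for y in seconds)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(chunks_k, seconds))

slope = sxy / sxx                      # seconds per additional 1k chunks
intercept = mean_y - slope * mean_x    # fitted time at zero chunks
r_squared = sxy ** 2 / (sxx * syy)     # share of variance explained by the line

print(f"slope={slope:.3f} s per 1k chunks, intercept={intercept:.1f} s, r^2={r_squared:.3f}")
```

The fit gives roughly 0.3 s of extra retrieval time per additional 1,000 chunks, with an r^2 close to 1, which matches the near-linear growth described.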

Is this expected behavior in the ChromaDB implementation? Shouldn't all collections be independent?

Versions

Chroma 0.5.3, Python 3.11.7, Oracle Linux 8

tazarov commented 2 weeks ago

@grosjeang, can you share some more info on the type of queries you run? HNSW, which is the foundation of the vector query, has O(log N) complexity, so I suspect you have a where filter somewhere in your queries. If that is the case, then sqlite3, which performs the filtering, may indeed bump up the latency at scale. Indices were recently added to speed up these queries in #2623 (to try it out you'll need to use the latest main).
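
As a rough illustration of why filtering in SQLite can scale with the total number of rows, and why an index helps, here is a minimal sketch using only the standard-library `sqlite3` module. The `metadata` table, its columns, and the `idx_meta` index are hypothetical and are not Chroma's actual schema or the indices from #2623. It only assumes that rows from several collections share one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one metadata table shared by every collection.
conn.execute(
    "CREATE TABLE metadata (collection TEXT, doc_id INTEGER, key TEXT, value TEXT)"
)
rows = [
    (f"col{c}", i, "lang", "en" if i % 2 else "fr")
    for c in range(5)          # 5 collections
    for i in range(2000)       # 2000 documents each
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", rows)

QUERY = "SELECT doc_id FROM metadata WHERE collection = ? AND key = ? AND value = ?"
PARAMS = ("col0", "lang", "en")


def plan(connection):
    """Return SQLite's query plan for QUERY as a single string."""
    cur = connection.execute("EXPLAIN QUERY PLAN " + QUERY, PARAMS)
    return " | ".join(row[3] for row in cur)  # column 3 holds the plan detail


# Without an index, SQLite scans every row, including other collections' rows.
plan_without_index = plan(conn)

# With a matching index, SQLite seeks directly to the rows of interest.
conn.execute("CREATE INDEX idx_meta ON metadata (collection, key, value)")
plan_with_index = plan(conn)

matches = conn.execute(QUERY, PARAMS).fetchall()
print(plan_without_index)
print(plan_with_index)
print(len(matches))
```

In this toy setup the first plan is a full table scan, so the cost of filtering one collection grows with the rows stored for all collections. The second plan uses the index and no longer depends on the size of the other collections.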

grosjeang commented 2 weeks ago

@tazarov Thank you for your answer. It was indeed a problem with my where filter, which was a bit too complex!

chroma-core / chroma

[Bug]: Large collection impacting performance of other collections #2769
