chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: query after add does not use updated db #2399

Closed DuijnO closed 5 days ago

DuijnO commented 1 week ago

What happened?

I add items to a chromadb instance located on my filesystem with a celery worker. In my flask API I use collection.query(query=query, ef=ef), but it does not return the newly added items.

If I use collection.get(), chromadb does return the newly added items.

If I restart the application and load the collection with get_or_create_collection("mycollection"), collection.query can find the previously added items.

I would expect either .query to return the items or .get to not return them, since I'm essentially opening the database file twice. But since .get does return the items from the flask instance, I would expect .query to behave likewise.

Versions

chromadb 0.5.3 python 3.12.2

Relevant log output

No response

tazarov commented 1 week ago

@DuijnO, can you elaborate a little on your Chroma setup? How are you running Chroma? Is that a docker or CLI?

DuijnO commented 1 week ago

I'm setting it up in python to create a database chroma.sqlite3 on the filesystem:

    persistent_client = chromadb.PersistentClient(path='my/db/path')
    collection = persistent_client.get_or_create_collection("mycollection")

tazarov commented 1 week ago

@DuijnO, is it possible that you have multiple processes (e.g. each celery task) accessing the same persistent dir? Chroma is not process-safe.

Is this issue easily reproducible? Can you share some bits of your code to use to reproduce it?

DuijnO commented 1 week ago

I have multiple processes accessing the same persistent dir. I don't think it's easily reproducible, as it consists of a flask API with a celery worker. I could try to create a reproducible version on Monday if you're really interested.

tazarov commented 1 week ago

@DuijnO, can you try reducing the number of workers to 1 and see if the same behaviour occurs?

DuijnO commented 1 week ago

It is actually already 1, but I am also accessing the same persistent dir from app.py in flask. This might be classed as not using it correctly, but get still returns everything from the db while query does not, so a fix may be possible.

tazarov commented 1 week ago

Some background on the issue you are potentially facing.

Chroma uses two types of vector indices: a bruteforce (BF) index, which acts as a temporary buffer, and an HNSW index. New vectors go to the BF index first, and only after the threshold is reached (100 by default) do they get added to HNSW. The BF index is an in-memory index owned by a single process. So what I imagine is happening in your case is that one process holds these newly added vectors in its own BF index, and when you query, they are not visible to the process executing the query. get works because it uses sqlite3 (the metadata index), which is synchronously updated on disk by all processes. Additionally, get will not return the vectors unless you specify include=["embeddings"]. You can read a bit more about this here: https://cookbook.chromadb.dev/core/advanced/wal/
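
To make the visibility problem concrete, here is a toy model in plain Python. It is not Chroma's implementation: the ToyProcess class, its method names, and the dict standing in for the on-disk sqlite3 store are all made up for illustration. Each "process" shares the on-disk store but keeps a private in-memory index that is only built at startup.

```python
class ToyProcess:
    """Hypothetical stand-in for one OS process using the same persistent dir."""

    def __init__(self, shared_disk):
        # Shared by every process; plays the role of the sqlite3 file on disk.
        self.disk = shared_disk
        # Private in-memory index, built once from disk when the process starts.
        self.index = dict(shared_disk)

    def add(self, item_id, vector):
        self.disk[item_id] = vector    # written to disk: visible to everyone
        self.index[item_id] = vector   # buffered in memory: visible to this process only

    def get_ids(self):
        # Like get: answers from the shared on-disk store.
        return sorted(self.disk)

    def query_ids(self):
        # Like query: answers from this process's own in-memory index.
        return sorted(self.index)


shared_disk = {}
api = ToyProcess(shared_disk)      # e.g. the flask app
worker = ToyProcess(shared_disk)   # e.g. the celery worker

worker.add("doc1", [0.1, 0.2])

api_get = api.get_ids()                          # ['doc1']
api_query = api.query_ids()                      # []
restarted_query = ToyProcess(shared_disk).query_ids()  # ['doc1']
```

In this sketch the API process sees the new item through get_ids but not through query_ids until it is restarted, which mirrors the three observations in the original report.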

DuijnO commented 1 week ago

yeah, that sounds very likely

tazarov commented 1 week ago

@DuijnO, any chance you could run Chroma as a standalone server, e.g. via docker or the CLI? (See here for options: https://cookbook.chromadb.dev/running/running-chroma/)
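
As a rough sketch of what that could look like: start a single server process with the CLI (for example `chroma run --path my/db/path`), then have both the flask app and the celery worker connect to it over HTTP instead of each opening the persistent dir themselves. The helper name get_shared_collection, the host, and the port below are assumptions for illustration; adjust them to wherever the server is actually listening.

```python
def get_shared_collection(name="mycollection", host="localhost", port=8000):
    """Return a collection served by one standalone Chroma server.

    Hypothetical helper: host and port are assumed values, not taken
    from the thread. Call this from both the API and the worker.
    """
    import chromadb  # imported lazily so the module loads without a server

    # Every process talks to the same server, so only the server process
    # owns the in-memory index and all clients see the same state.
    client = chromadb.HttpClient(host=host, port=port)
    return client.get_or_create_collection(name)
```

With this arrangement only one process touches the persistent dir, so add and query go through the same index.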

DuijnO commented 5 days ago

Yes, I'm running chroma via the CLI now and this resolves the issue. Thank you very much.