Closed DuijnO closed 5 days ago
@DuijnO, can you elaborate a little on your Chroma setup? How are you running Chroma? Is that a docker or CLI?
I'm setting it up in Python to create a chroma.sqlite3 database on the filesystem: persistent_client = chromadb.PersistentClient(path='my/db/path'); collection = persistent_client.get_or_create_collection("mycollection")
@DuijnO, is it possible that you have multiple processes (e.g. each celery task) accessing the same persistent dir? Chroma is not process-safe.
Is this issue easily reproducible? Can you share some bits of your code to use to reproduce it?
i have multiple processes accessing the same persistent dir. it's not easily reproducible i don't think as it consists of a flask api with celery worker. i could try and create a version monday if you're really interested
@DuijnO can you try to reduce the number of workers to 1 and see if the same behaviour persists?
It's actually already 1, but I'm also accessing the same persistent dir from the app.py in flask. I think this might be classed as not using it correctly, but getting all items from the db still works while query does not, so it could be that a fix is possible
Some background on the issue you are potentially facing.
Chroma uses two types of vector indices - bruteforce (a temporary buffer index) and HNSW index. New vectors (up to 100 by default) go to the bruteforce (BF) index first, and only after reaching the default threshold of 100 do they get added to HNSW. The BF index is an in-memory index that is owned by a process. So what I imagine is happening in your case is that one process holds these newly added vectors in its own BF index, and when you query, they are not visible to the process executing the query. The get works because it uses the sqlite3 (metadata index), which is synchronously updated on disk by all processes. Additionally, get will not return the vectors unless you specify include=["embeddings"]. You can read a bit more about this here - https://cookbook.chromadb.dev/core/advanced/wal/
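To make the explanation above concrete, here is a toy model of the situation, not Chroma's actual implementation. It assumes a shared JSON file standing in for the on-disk sqlite3 store and a per-object dict standing in for the process-owned in-memory index; the names ToyClient and demo are hypothetical and exist only for this illustration.

```python
import json
import os
import tempfile


class ToyClient:
    """Toy stand-in for one process. This is NOT Chroma's implementation."""

    def __init__(self, path):
        self.path = path
        if not os.path.exists(path):
            self._save({})
        # Private in-memory index, loaded from disk once at start-up.
        self.index = dict(self._load())

    def _load(self):
        with open(self.path) as f:
            return json.load(f)

    def _save(self, data):
        with open(self.path, "w") as f:
            json.dump(data, f)

    def add(self, item_id, vector):
        data = self._load()
        data[item_id] = vector
        self._save(data)               # shared: every process can read this
        self.index[item_id] = vector   # private: only this process sees it

    def get(self):
        # Reads the shared on-disk store, so it sees writes from any process.
        return sorted(self._load())

    def query(self, vector):
        # Searches only this process's in-memory index.
        if not self.index:
            return None

        def squared_distance(item_id):
            return sum((a - b) ** 2 for a, b in zip(self.index[item_id], vector))

        return min(self.index, key=squared_distance)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "toy.json")
    worker, api = ToyClient(path), ToyClient(path)  # two "processes"
    worker.add("doc1", [1.0, 0.0])
    seen_by_get = api.get()                  # reads disk
    seen_by_query = api.query([1.0, 0.0])    # reads api's stale memory
    after_restart = ToyClient(path).query([1.0, 0.0])  # fresh load from disk
    return seen_by_get, seen_by_query, after_restart
```

In this toy, demo() shows the same pattern as the report: get on the second client sees the new item, query on it does not, and a freshly constructed client (a "restart") finds it again.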
yeah, that sounds very likely
@DuijnO, any chance to run Chroma as a standalone server e.g. via docker or CLI - (see here for options - https://cookbook.chromadb.dev/running/running-chroma/)
Yes, I'm running chroma via CLI now and this resolves the issue, Thank you very much
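For anyone landing here with the same setup, a minimal sketch of the client side once Chroma runs as a standalone server (for example started with chroma run --path my/db/path). The host and port defaults below are assumptions about a local server, and the helper name open_shared_collection is hypothetical; both the flask app and the celery worker would call it instead of creating their own PersistentClient.

```python
def open_shared_collection(host="localhost", port=8000, name="mycollection"):
    """Connect to a running Chroma server and return the named collection.

    host and port are assumed values for a local server; adjust to your setup.
    """
    # Imported lazily so the module can be loaded without chromadb installed.
    import chromadb

    client = chromadb.HttpClient(host=host, port=port)
    return client.get_or_create_collection(name)
```

With this arrangement only the server process owns the in-memory index, so writes from the worker and queries from the API go through the same state.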
What happened?
I add items to a chromadb instance located on my filesystem from a celery worker. I use collection.query(query=query, ef=ef) in my flask api. chromadb does not return the newly added items.
If I use collection.get(), chromadb does return the newly added items.
If I restart the application, loading the collection with get_or_create_collection("mycollection"), collection.query can find the previously added items.
I would expect .query to return the items or .get to not return the items since I'm essentially opening the database file twice. But since .get does return the items from the flask instance i would expect .query to behave likewise.
Versions
chromadb 0.5.3 python 3.12.2
Relevant log output
No response