chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: Non-deterministic query results in a local db query #2675

Open naddeoa opened 1 month ago

naddeoa commented 1 month ago

What happened?

Minimal repro here: https://github.com/naddeoa/chromadb_determinism_issue. Check it out and run make install run (with Poetry installed).

The same query will yield different results between database loads. As long as the database stays loaded, the query results are consistent, but if you unload the database (just let the script end and restart it), you'll randomly get different results. For example, the following are two sample runs using the repro code that differ. Nothing changed in the database (besides the apparent mutations that chromadb always performs even though I'm only reading, which makes things very hard to reason about).

[[[0.7512401342391968, 0.7793669104576111, 0.8091723322868347]],
 [[0.7512401342391968, 0.7793669104576111, 0.8091723322868347]],
 [[0.7512401342391968, 0.7793669104576111, 0.8091723322868347]],
 [[0.7512401342391968, 0.7793669104576111, 0.8091723322868347]],
 [[0.7512401342391968, 0.7793669104576111, 0.8091723322868347]],
 [[0.7512401342391968, 0.7793669104576111, 0.8091723322868347]],
 [[0.7512401342391968, 0.7793669104576111, 0.8091723322868347]],
 [[0.7512401342391968, 0.7793669104576111, 0.8091723322868347]],
 [[0.7512401342391968, 0.7793669104576111, 0.8091723322868347]],
 [[0.7512401342391968, 0.7793669104576111, 0.8091723322868347]]]

# Values only change in between runs, not within runs
[[[0.7512401342391968, 0.8091723322868347, 0.8145251274108887]],
 [[0.7512401342391968, 0.8091723322868347, 0.8145251274108887]],
 [[0.7512401342391968, 0.8091723322868347, 0.8145251274108887]],
 [[0.7512401342391968, 0.8091723322868347, 0.8145251274108887]],
 [[0.7512401342391968, 0.8091723322868347, 0.8145251274108887]],
 [[0.7512401342391968, 0.8091723322868347, 0.8145251274108887]],
 [[0.7512401342391968, 0.8091723322868347, 0.8145251274108887]],
 [[0.7512401342391968, 0.8091723322868347, 0.8145251274108887]],
 [[0.7512401342391968, 0.8091723322868347, 0.8145251274108887]],
 [[0.7512401342391968, 0.8091723322868347, 0.8145251274108887]]]

My Chroma instance was created with roughly the following code, using the same chromadb version (0.5.5):

import chromadb
import pandas as pd
from chromadb.config import Settings
from typing import List, Sequence

client = chromadb.PersistentClient(path=db_path, settings=Settings(anonymized_telemetry=False))
collection = client.create_collection(name="collection", metadata={"hnsw:space": "cosine"})
df = pd.read_csv(precomputed_embeddings)
metadatas = df[[...]].to_dict("records")
ids = df["id"].astype(str).tolist()
embeddings: List[Sequence[float]] = df["embeddings"].tolist()
collection.add(documents=None, metadatas=metadatas, ids=ids, embeddings=embeddings)

I'm precomputing embeddings with my own encoder (still sentence-transformers/all-MiniLM-L6-v2 for this repro), so I'm not using any documents and I'm querying by embedding values. There isn't really any mention of determinism in the chromadb docs, so I'm not sure what I should be expecting.
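For reference, the query side of the repro looks roughly like this (a sketch, not the exact repro code; query_embedding stands in for one of my precomputed embeddings):

import chromadb
from chromadb.config import Settings

# Sketch: reload the persisted DB and run the same query repeatedly.
# query_embedding is a placeholder for one of the precomputed embeddings.
client = chromadb.PersistentClient(path=db_path, settings=Settings(anonymized_telemetry=False))
collection = client.get_collection(name="collection")

results = []
for _ in range(10):
    res = collection.query(query_embeddings=[query_embedding], n_results=3)
    results.append(res["distances"])
print(results)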

Versions

Chromadb 0.5.5, python 3.10.12

Full dependencies exported from poetry lock file: https://github.com/naddeoa/chromadb_determinism_issue/blob/master/requirements.txt

Relevant log output

No response

tazarov commented 4 weeks ago

@naddeoa,

you are right in your observation about the determinism of HNSW, which Chroma relies on for vector storage and search. HNSW uses an RNG when constructing the initial connections of the graph, which is normally controlled via the random_seed index parameter. However, experimentation shows that index construction still has a level of randomness (it can be mitigated, though).

You may also be wondering why an index that has already been initialized would return different results for the same query; the answer lies in how Chroma creates and persists HNSW indices (short overview here). The gist is this: if your dataset is >=100 and <1000 elements (100 and 1000 are the Chroma defaults for hnsw:batch_size and hnsw:sync_threshold, and both can be changed), then the HNSW index gets reinitialized every time you either recreate the DB from scratch or restart the DB.
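Both of those thresholds can be set as collection metadata when the collection is created. A quick sketch, assuming the metadata keys behave as described above (the values are illustrative only):

import chromadb

client = chromadb.PersistentClient(path="./db")
collection = client.create_collection(
    name="collection",
    metadata={
        "hnsw:space": "cosine",
        "hnsw:batch_size": 100,      # default
        "hnsw:sync_threshold": 200,  # default is 1000
    },
)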

Here's some code to reproduce your observation in a more concise manner:

import numpy as np
import hnswlib

# init dataset of 1000 items
labels = [i for i in range(1000)]
vectors = np.random.uniform(-1, 1, (1000, 384))

# utility functions

def create_index(space='cosine', dim=384, search_ef=10):
    index = hnswlib.Index(space=space, dim=dim)
    index.init_index(
        random_seed=42,
        max_elements=1000,
        ef_construction=100,
        M=16,
    )
    index.set_ef(search_ef)
    return index

def insert(index, labels, vectors):
    index.add_items(vectors, labels)

def query(index, vector, k=10):
    return index.knn_query(vector, k=k)

results = []
# create 100 indices of 1000 elements each with EF search = 10 (default in Chroma)
for i in range(100):
    index = create_index(search_ef=10)
    insert(index, labels, vectors)
    # query with the first vector from the dataset and keep only the labels,
    # which are easier to compare than the actual vectors
    search_labels = query(index, vectors[0])[0].tolist()[0]
    results.append(search_labels)

# Check the query results for all 100 indices and report discrepancies
last_result = None
for r in results:
    if last_result is None:
        last_result = r
    else:
        if last_result != r:
            print('Different results', last_result, r)

The above code simulates 100 fresh index builds, queries each one with the same vector, and then compares the results and reports any discrepancies. As you'll observe if you run it, discrepancies are reported most of the time, but occasionally there are none.

To mitigate this RNG behavior, HNSW offers search_ef (or simply ef), which controls how many candidates are explored during the KNN search. The parameter defaults to 10 but can be changed via collection metadata, both at creation time and after the collection is created. Bumping this number will give you the consistency in results you want. It is difficult to say exactly how to pick its value, but experimentation shows that higher values yield consistent results more frequently.

For instance, take the above code and change search_ef to 1000 (i.e., the number of elements in the index), and you'll get consistent results most of the time (I'd like to say 100% of the time, but I haven't run enough experiments to be that confident):

results = []

for i in range(100):
    index = create_index(search_ef=1000)
    insert(index, labels, vectors)
    search_labels = query(index, vectors[0])[0].tolist()[0]
    results.append(search_labels)

last_result = None
for r in results:
    if last_result is None:
        last_result = r
    else:
        if last_result != r:
            print('Different results', last_result, r)

naddeoa commented 4 weeks ago

@tazarov thank you for the detailed explanation! I'm going to try this out later today and report the results. Is there any way for me to set those variables with the current interface? The top-level Python interface to clients and data sets doesn't support a random seed as far as I can tell.

tazarov commented 4 weeks ago

@naddeoa have a look here https://cookbook.chromadb.dev/core/configuration/#hnsw-configuration

You need to pass parameters as collection metadata.

coll = client.get_or_create_collection("mycoll", metadata={"hnsw:search_ef":1000})

naddeoa commented 4 weeks ago

@tazarov I see that for most options, but I can't find it for the random seed, unless there's an undocumented hnsw:seed param?

Also, it's hard to tell whether those are actually working or not. You can put nonsense in there and it doesn't seem to mind. For example, if you do

collection = client.get_or_create_collection("collection", metadata={"hnsw:num_threads": "fish"})

Then Chroma will happily write that directly to the SQLite database without doing any validation, and I have no idea what to make of that. Are these things actually used?

sqlite> select * from collection_metadata;
7f40ea05-4de0-43c3-9e6c-6dd941933ccd|hnsw:num_threads|fish|||

atroyn commented 4 weeks ago

@naddeoa this is a legacy of maintaining HNSW params as ordinary metadata. We have a PR in progress that is intended to fix this conclusively by removing these settings and validating them separately.

https://github.com/chroma-core/chroma/pull/2495

Setting search_ef to ~50 or so ought to resolve the nondeterminism you are experiencing here, because HNSW will explore further in the graph.

naddeoa commented 4 weeks ago

@atroyn Hmm, it definitely doesn't resolve it in my repro. I just updated the client to do this:

collection = client.get_or_create_collection(
    "collection", metadata={"hnsw:search_ef": 50}
)

And I still get different results if I keep on trying. Does this depend on the size of the database and the number of results you request in the query? Even setting it to 1 doesn't make a difference.

tazarov commented 4 weeks ago

@naddeoa,

We do validate the params:

import chromadb

client = chromadb.Client()

collection = client.get_or_create_collection("name",metadata={"hnsw:num_threads": "fish"})

results in:

ValueError: Invalid value for HNSW parameter: hnsw:num_threads = fish

Although, this works only the first time the parameter is initialized. These soon-to-be-legacy (as @atroyn mentioned) parameters have dual semantics: as HNSW config params and as regular collection metadata, which can be anything arbitrary. Unfortunately, none of the params can be changed beyond their initial configuration.

Doing this won't work:

collection = client.get_or_create_collection("name",metadata={"hnsw:search_ef":50})

# this will not take effect beyond modifying the metadata
collection = client.get_or_create_collection("name",metadata={"hnsw:search_ef":100})

To check the actual values:

sqlite> select * from segment_metadata;

atroyn commented 4 weeks ago

If the collection already exists, hnsw:search_ef will get updated on the metadata, but not on the index itself - this is pretty unexpected, and we are fixing it.

For now it has to be set on collection creation. The configuration changes we are due to ship soon will fix this to work as expected.

naddeoa commented 4 weeks ago

Ok, just to summarize: I can't actually work around this in the current version, and I'll need to wait for the next release that includes the changes you're talking about?

Is there any super-hacky way of doing it temporarily through monkey patching that you can imagine? Or can I avoid the top-level Chroma APIs and use the index directly, like you were doing in these examples, to get the configuration to take effect?

naddeoa commented 4 weeks ago

I think I may have answered my own question. @tazarov is right that I can't update that metadata through the metadata API in the current version, but I can update it with sqlite directly:

INSERT INTO segment_metadata (segment_id, key, int_value)
VALUES ('e5ad0deb-1f54-4233-8fb5-4f15fb7d30e3', 'hnsw:search_ef', 100);

From my tests it actually seems to work. It doesn't seem like it's any slower (at least at n_results=14). Is there a nicer way to set that hnsw param? I'm happy to do a temporary workaround, and I'd be even happier to avoid temporarily using sqlite for the fix.

EDIT: Realized I can just set this upfront to avoid the hacky sqlite write later. It's easy enough for me to regenerate the instance. With that, I think my problem is solved. I guess I'll leave it up to the team to decide what to do with this issue. Not sure if you want to keep this open to track the gaps in determinism guidance in the docs or the issues with updating config values post-creation. Thanks for the help either way.
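For anyone else who lands here, the collection creation now looks roughly like this (a sketch; the search_ef value is just what I'm currently testing with, not a verified recommendation):

collection = client.create_collection(
    name="collection",
    metadata={
        "hnsw:space": "cosine",
        "hnsw:search_ef": 100,  # set upfront, since changing it after creation doesn't reach the index
    },
)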

naddeoa commented 4 weeks ago

Just a note: I tried 50 and still saw randomness, but it took a few hundred tests for it to show up. I'm running longer tests with 100, and that seems much safer so far.

atroyn commented 3 weeks ago

Yes, the workaround is to set the parameter on create_collection.