chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: chromadb.api.configuration.InvalidConfigurationError: batch_size must be less than or equal to sync_threshold #2574

Open dddxst opened 2 months ago

dddxst commented 2 months ago

What happened?

from typing import List

import chromadb
from chromadb.api.configuration import HNSWConfiguration
from chromadb.api.models.Collection import Collection
from chromadb.utils.embedding_functions.sentence_transformer_embedding_function import \
    SentenceTransformerEmbeddingFunction

from read_word import extract_titles

# m3_model is not defined in the posted snippet; this value is copied from
# the docstring below. Adjust the path for your machine.
m3_model = "D:/models/BGE_models"


class EmbeddingDB:
    def __init__(self, db, embedding_function=None):
        """
        docker pull chromadb/chroma
        docker run -p 8000:8000 chromadb/chroma

        m3_model = "D:/models/BGE_models"
        model = SentenceTransformer(m3_model)
        client = chromadb.HttpClient(host='localhost', port=8000)
        :param db:
        :param embedding_function:
        """
        self.db = db
        self.embedding_function = embedding_function

    def get_or_create_collection(self, name) -> Collection:
        # Currently unused: the configuration argument is commented out below.
        configuration = HNSWConfiguration(batch_size=100, sync_threshold=100)
        if self.embedding_function:
            collection = self.db.get_or_create_collection(
                name=name,
                # embedding_function=self.embedding_function,
                # configuration=configuration
            )
        else:
            collection = self.db.get_or_create_collection(name=name)

        return collection

    def add(self, collection_name: str, string: List[str]):
        """

        :param collection_name: name of the collection
        :param string:
        :return:
        """
        collection = self.get_or_create_collection(collection_name)
        collection.add(
            embeddings=self.embedding_function(string),
            documents=string,
            ids=[f"id{num}" for num in range(len(string))]
        )
        return collection

    def delete_collection(self, name: str) -> None:
        self.db.delete_collection(name=name)


embedding_function1 = SentenceTransformerEmbeddingFunction(model_name=m3_model)
client = chromadb.HttpClient(host='xx.xx.xx.xx', port=8000)

eDB = EmbeddingDB(client, embedding_function1)
titles, docs = extract_titles('wt.docx')


def load_data():
    eDB.delete_collection('docs')
    # eDB.delete_collection('titles')
    eDB.add("docs", docs)
    eDB.add("titles", titles)


if __name__ == '__main__':
    load_data()

The error occurs on Ubuntu, but it does not occur on Windows.

Versions

v0.5.4, Ubuntu 22 (or CentOS 7.9), Python 3.11.9

Relevant log output

File "/root/proj/datautils.py", line 72, in load_data
    eDB.add("docs", docs)
  File "/root/proj/datautils.py", line 47, in add
    collection = self.get_or_create_collection(collection_name)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/proj/datautils.py", line 30, in get_or_create_collection
    collection = self.db.get_or_create_collection(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/client.py", line 166, in get_or_create_collection
    model = self._server.get_or_create_collection(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 146, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/fastapi.py", line 247, in get_or_create_collection
    return self.create_collection(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 146, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/fastapi.py", line 206, in create_collection
    model = CollectionModel.from_json(resp_json)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/types.py", line 139, in from_json
    configuration = CollectionConfigurationInternal.from_json(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 217, in from_json
    return cls(parameters=parameters)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 115, in __init__
    parameter.value = child_type.from_json(parameter.value)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 217, in from_json
    return cls(parameters=parameters)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 130, in __init__
    self.configuration_validator()
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 286, in configuration_validator
    raise InvalidConfigurationError(
chromadb.api.configuration.InvalidConfigurationError: batch_size must be less than or equal to sync_threshold
mikethemerry commented 2 months ago

I've just spent three evenings tracking down the same bug and have managed to figure this out in the last half hour or so.

I think this is a regression introduced by https://github.com/chroma-core/chroma/pull/2526/files

I'm still figuring out the reproduction steps, but I think the process is

  1. Deploy Chroma and create a collection using <=0.5.4 with metadata={"hnsw:space": "cosine"} or similar. Specifically, for me:

    self.collection = self.vdb.get_or_create_collection(
            name=collection_name,
            embedding_function=self.embedding_function,
            metadata={"hnsw:space": "cosine"},
        )

     This will create the collection with the defaults in 0.5.4, where sync_threshold=100 and batch_size=1000.

  2. Upgrade your client to 0.5.5.
  3. It is now checking the sync_threshold and batch_size against the existing defaults and throwing the error.
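
To make the failing check in the last step concrete, here is a minimal sketch of the rule named in the error message. The function name `validate_hnsw` and the local `InvalidConfigurationError` class are stand-ins written for illustration only; they are not chromadb's actual implementation, just the same comparison applied to the values described above.

```python
class InvalidConfigurationError(ValueError):
    """Local stand-in for the chromadb error of the same name (illustrative only)."""


def validate_hnsw(batch_size: int, sync_threshold: int) -> None:
    # The rule stated in the error message: batch_size <= sync_threshold.
    if batch_size > sync_threshold:
        raise InvalidConfigurationError(
            "batch_size must be less than or equal to sync_threshold"
        )


# Values stored by a 0.5.4 deployment, as described above: rejected.
stored = {"batch_size": 1000, "sync_threshold": 100}
try:
    validate_hnsw(**stored)
except InvalidConfigurationError as exc:
    print(f"rejected: {exc}")

# The other way round passes without raising.
validate_hnsw(batch_size=100, sync_threshold=1000)
```

This is why a collection that was created without any explicit HNSW settings can still fail: the stored values themselves violate the rule.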

I haven't read through all of the other changes to the HNSW work in 0.5.5, but it looks like there are some changes to persistent properties and similar. I was actually trying to change the configured properties with different metadata definitions, but was having a lot of trouble. Specifically, this was not fixed by changing that code to

    self.collection = self.vdb.get_or_create_collection(
            name=collection_name,
            embedding_function=self.embedding_function,
            metadata={"hnsw:space": "cosine", "sync_threshold":1000, "batch_size":100},
        )

As a short-term workaround, I would suggest downgrading to 0.5.4 (this has worked for me) and waiting for a patch, as 0.5.5 is still in pre-release.

tazarov commented 2 months ago

@dddxst and @mikethemerry, thanks for reporting and investigating this. Indeed, it was a bug (#2338) released with 0.5.4 which was fixed (#2526) in 0.5.5. The issue is that any DB created with 0.5.4 triggers the validation error you reported.

To fix the problem (ideally, we should've added a migration script to do that, but alas):

If you are running in Docker, connect to your Docker container and run:

apt update && apt install sqlite3
sqlite3 /chroma/chroma/chroma.sqlite3 "update collections set config_json_str=json_set(config_json_str,'$.hnsw_configuration.batch_size',100,'$.hnsw_configuration.sync_threshold',1000) where name='test';"
# you don't have to run the below, but for consistency reasons:
sqlite3 /chroma/chroma/chroma.sqlite3 "update collection_metadata set int_value = 100 where key='hnsw:batch_size' and collection_id in (select id from collections where name='test');"
sqlite3 /chroma/chroma/chroma.sqlite3 "update collection_metadata set int_value = 1000 where key='hnsw:sync_threshold' and collection_id in (select id from collections where name='test');"
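
For anyone who prefers to script the first update rather than install the sqlite3 CLI, the same repair can be sketched with Python's standard library. This is only a sketch under assumptions: it relies on the `collections` table and `config_json_str` column used in the commands above, and on the JSON layout implied by the `$.hnsw_configuration.*` paths. The helper name `fix_hnsw_config` is made up for this example, and the demo runs against a throwaway in-memory table that is much simpler than the real schema.

```python
import json
import sqlite3


def fix_hnsw_config(conn, collection_name, batch_size=100, sync_threshold=1000):
    """Rewrite the stored HNSW settings for one collection (hypothetical helper)."""
    row = conn.execute(
        "SELECT config_json_str FROM collections WHERE name = ?",
        (collection_name,),
    ).fetchone()
    if row is None:
        raise KeyError(f"no collection named {collection_name!r}")
    config = json.loads(row[0])
    # Same two fields the json_set call above touches.
    hnsw = config.setdefault("hnsw_configuration", {})
    hnsw["batch_size"] = batch_size
    hnsw["sync_threshold"] = sync_threshold
    conn.execute(
        "UPDATE collections SET config_json_str = ? WHERE name = ?",
        (json.dumps(config), collection_name),
    )
    conn.commit()
    return config


# Demo on an in-memory database with a simplified, assumed table layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collections (name TEXT PRIMARY KEY, config_json_str TEXT)")
bad = {"hnsw_configuration": {"batch_size": 1000, "sync_threshold": 100}}
conn.execute("INSERT INTO collections VALUES (?, ?)", ("test", json.dumps(bad)))

fixed = fix_hnsw_config(conn, "test")
print(fixed["hnsw_configuration"])
```

Against a real deployment you would open the file from the commands above (`/chroma/chroma/chroma.sqlite3`) instead of `:memory:`, ideally on a copy of the file first.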
dodeeric commented 2 months ago

@mikethemerry, thanks to you it took me only three minutes, not three evenings, to solve my problem...

dddxst commented 2 months ago

Thanks, it works after updating to 0.5.5, but an error occurs on Windows ...

dddxst commented 2 months ago

Thanks.

tazarov commented 2 months ago

Can you share the error you get on Windows?

codetheweb commented 2 months ago

Hey everyone--I believe this is caused by a version mismatch; this shouldn't happen if your client and server are on the same version. Please make sure that your server and client are both on 0.5.5 and let us know if this is still happening.
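
A quick way to rule out a mismatch is to compare the two version strings before doing anything else. The sketch below is deliberately independent of chromadb so it runs anywhere; the helper names `same_release` and `check_versions` are hypothetical.

```python
def same_release(client_version: str, server_version: str) -> bool:
    """Return True when both sides report exactly the same version string."""
    return client_version.strip() == server_version.strip()


def check_versions(client_version: str, server_version: str) -> None:
    # Fail early with a readable message instead of a validation traceback.
    if not same_release(client_version, server_version):
        raise RuntimeError(
            f"client is {client_version}, server is {server_version}; "
            "align both before opening collections"
        )


check_versions("0.5.5", "0.5.5")  # same version on both sides, no error
try:
    check_versions("0.5.5", "0.5.4")
except RuntimeError as exc:
    print(exc)
```

In a real setup the two arguments would presumably come from `chromadb.__version__` on the client side and `client.get_version()` for the server; check that both are available in your installed release before relying on them.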