chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Feature Request]: auto increment option for ids #979

Open Tijana37 opened 11 months ago

Tijana37 commented 11 months ago

Describe the problem

What I find really helpful in other databases is the option to set auto_increment on the primary key, so that the database generates increasing integers as IDs.

Describe the proposed solution

What I would like to see is an auto_increment option when creating a collection, so that I don't have to set IDs when adding items because Chroma creates them automatically.

Example code:

collection = vector_db_client.create_collection(
    name=chroma_collection_name,
    auto_increment=True,  # setting something like this
)
collection.add(
    documents=labels,
    metadatas=metadatas,
    ids=ids_entities,  # --> in order to NOT need this line!
)

Alternatives considered

No response

Importance

would make my life easier

Additional Information

No response

tazarov commented 11 months ago

@Tijana37, thank you for the feature request. This perhaps makes sense, as it would improve the developer experience by letting users abstract away ID generation.

There are still some considerations to be made here, but this seems like a clear-cut case for the strategy pattern.

Here are some starting ideas:

import random
from abc import ABC, abstractmethod

ID = str  # assumption: IDs are strings, as in Chroma's API types


# Step 1: Define the ID generation strategy interface
class IDGenerationStrategy(ABC):
    @abstractmethod
    def generate_id(self) -> ID:
        pass


# Step 2: Implement concrete ID generation strategies
class IncrementalIDGenerationStrategy(IDGenerationStrategy):
    def __init__(self):
        self.last_id = 0

    def generate_id(self) -> ID:
        self.last_id += 1
        return str(self.last_id)  # convert so the result matches the ID type


class RandomIDGenerationStrategy(IDGenerationStrategy):
    def generate_id(self) -> ID:
        return str(random.randint(1000, 9999))  # Just an example; collisions are likely

Then, when the collection is created:

def create_collection(
    name: str,
    metadata: Optional[CollectionMetadata] = None,
    embedding_function: Optional[EmbeddingFunction] = ef.DefaultEmbeddingFunction(),
    get_or_create: bool = False,
    # A default is needed here: a parameter without one cannot follow defaulted parameters.
    id_strategy: Optional[IDGenerationStrategy] = None,
) -> Collection:
    ...
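
To make the proposal above more concrete, here is a minimal sketch of how a collection might consult such a strategy when `ids` is omitted. It is a toy, not Chroma code: `SketchCollection` and `UUIDStrategy` are hypothetical names, and the strategy interface is redefined locally so the example runs on its own.

```python
import uuid
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class IDGenerationStrategy(ABC):
    """Strategy interface, mirroring the one proposed in this thread."""

    @abstractmethod
    def generate_id(self) -> str:
        ...


class UUIDStrategy(IDGenerationStrategy):
    """Hypothetical strategy that returns random UUID4 strings."""

    def generate_id(self) -> str:
        return str(uuid.uuid4())


class SketchCollection:
    """Toy in-memory collection; NOT Chroma's Collection class."""

    def __init__(self, name: str, id_strategy: Optional[IDGenerationStrategy] = None) -> None:
        self.name = name
        self.id_strategy = id_strategy
        self.docs: Dict[str, str] = {}

    def add(self, documents: List[str], ids: Optional[List[str]] = None) -> List[str]:
        if ids is None:
            if self.id_strategy is None:
                raise ValueError("ids are required when no id_strategy is configured")
            # Ask the strategy for one ID per document.
            ids = [self.id_strategy.generate_id() for _ in documents]
        self.docs.update(zip(ids, documents))
        return ids


col = SketchCollection("demo", id_strategy=UUIDStrategy())
generated = col.add(documents=["first", "second"])
print(generated)
```

Returning the generated IDs from `add` is a design choice made for this sketch, so the caller can refer to the records later.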

Considerations:

@HammadB @jeffchuber, maybe you have more items to add to the above list.

HammadB commented 11 months ago

@tazarov @Tijana37 I think supporting auto-generated IDs on behalf of the user makes sense and is something we have thought about before. However, I don't think we can support arbitrary pluggable strategies, as it becomes quite difficult to maintain correctness in a distributed setting. If there are N chroma-servers and you use an incremental ID generation strategy, then correctness becomes difficult to maintain without complex coordination.

I could see us supporting UUID-based generation though. Is that something that would solve your use case, @Tijana37? Or do you need the IDs to be monotonically increasing integers?
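
The coordination problem described above can be illustrated with a small simulation. This is only a sketch, not Chroma code: `LocalCounter`, `ids_from_servers`, and `uuids_from_servers` are hypothetical names, and each "server" is just an object with its own counter.

```python
import uuid
from typing import List


class LocalCounter:
    """Hypothetical per-server incremental ID generator with no coordination."""

    def __init__(self) -> None:
        self.last_id = 0

    def generate_id(self) -> str:
        self.last_id += 1
        return str(self.last_id)


def ids_from_servers(num_servers: int, ids_per_server: int) -> List[str]:
    """Collect IDs generated independently by each simulated server."""
    servers = [LocalCounter() for _ in range(num_servers)]
    return [s.generate_id() for s in servers for _ in range(ids_per_server)]


def uuids_from_servers(num_servers: int, ids_per_server: int) -> List[str]:
    """Same setup, but every server draws random UUID4 values instead."""
    return [str(uuid.uuid4()) for _ in range(num_servers * ids_per_server)]


incremental = ids_from_servers(num_servers=3, ids_per_server=4)
random_ids = uuids_from_servers(num_servers=3, ids_per_server=4)

# Every uncoordinated counter emits "1" to "4", so the 12 IDs contain duplicates.
print(len(incremental), len(set(incremental)))
# UUID4 values carry 122 random bits, so a collision is extremely unlikely.
print(len(random_ids), len(set(random_ids)))
```

Making the counters agree would require a shared sequence or reserved ID ranges per server, which is the kind of coordination the comment above wants to avoid.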

tazarov commented 11 months ago

@HammadB, can the generation be brought all the way down to the SQLite segment API? We could still offer several collision-free ID strategies, controlled via collection metadata.

Tijana37 commented 11 months ago

Absolutely, UUIDs will work for me and, I believe, for most similar cases! There is no need for monotonically increasing integers.

HammadB commented 11 months ago

@tazarov the problem is that the distributed architecture is async, and if we assign the ids I'd expect the client to return them immediately.

@Tijana37 that makes sense - uuid is definitely possible
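
One way to reconcile an asynchronous write path with returning IDs immediately is to assign UUIDs at the point where requests are ingested, before the write is queued. The following is a minimal sketch under that assumption; `write_queue` and `ingest` are hypothetical names, and the queue merely stands in for whatever asynchronous mechanism the real system uses.

```python
import queue
import uuid
from typing import List, Tuple

# Hypothetical stand-in for an asynchronous write path (e.g. a log or message queue).
write_queue: "queue.Queue[Tuple[str, str]]" = queue.Queue()


def ingest(documents: List[str]) -> List[str]:
    """Assign a UUID4 to each document, enqueue the writes, and return the IDs at once."""
    ids = [str(uuid.uuid4()) for _ in documents]
    for doc_id, doc in zip(ids, documents):
        write_queue.put((doc_id, doc))  # a worker would persist this later
    return ids


returned_ids = ingest(["first document", "second document"])
print(returned_ids)
```

Because the IDs do not depend on any stored state, the caller gets them back without waiting for the queued writes to complete.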

tazarov commented 11 months ago

@HammadB, then do it in FastAPI, or whatever ingests the requests. That still gives control to the backend, which is better than doing it client-side, as you pointed out.

But in the end, as of today, Chroma lets users generate their IDs prior to sending. Even if there are collisions, the semantics of the respective operations (add/upsert) should prevent or allow collisions in the namespace. I'm not too sure about the distributed case, though.
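
The distinction between the two operations can be sketched with a toy in-memory model. This is an illustration only, not Chroma's implementation: `ToyCollection` is a hypothetical class, and the choice to raise an error when `add` meets an existing ID is an assumption made for this sketch (a real system might ignore the duplicate instead).

```python
from typing import Dict, List


class ToyCollection:
    """In-memory toy model of ID semantics; NOT Chroma's implementation."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def add(self, ids: List[str], documents: List[str]) -> None:
        # Assumption for this sketch: add refuses IDs that already exist.
        clashes = [i for i in ids if i in self._docs]
        if clashes:
            raise ValueError(f"IDs already exist: {clashes}")
        self._docs.update(zip(ids, documents))

    def upsert(self, ids: List[str], documents: List[str]) -> None:
        # upsert inserts new IDs and overwrites existing ones.
        self._docs.update(zip(ids, documents))

    def get(self, doc_id: str) -> str:
        return self._docs[doc_id]


col = ToyCollection()
col.add(ids=["a"], documents=["original"])

try:
    col.add(ids=["a"], documents=["duplicate"])
    add_rejected = False
except ValueError:
    add_rejected = True

col.upsert(ids=["a"], documents=["replaced"])
print(add_rejected, col.get("a"))
```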

tazarov commented 11 months ago

@HammadB,

Here's an idea about implementing this in chromadb.api.fastapi.FastAPI and/or chromadb.api.segment.SegmentAPI (making it applicable for both client and client/server modes):


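import uuid
from typing import List, Sequence, Union


class UUIDGenerator(List[str]):
    """A list pre-filled with random UUID4 strings, one per item to add."""

    def __init__(self, count: Union[int, Sequence]):
        # Accept either an explicit count or a sequence whose length gives the count.
        n = count if isinstance(count, int) else len(count)
        # Fill the list up front: json.dumps serializes the stored list items
        # and would see an empty list if IDs were only produced in __next__.
        super().__init__(str(uuid.uuid4()) for _ in range(n))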
...
    @override
    def _add(
        self,
        ids: Optional[IDs],
        collection_id: UUID,
        embeddings: Embeddings,
        metadatas: Optional[Metadatas] = None,
        documents: Optional[Documents] = None,
    ) -> bool:
        """
        Adds a batch of embeddings to the database
        - pass in column oriented data lists
        """
        resp = self._session.post(
            self._api_url + "/collections/" + str(collection_id) + "/add",
            data=json.dumps(
                {
                    # Size the generated IDs from embeddings, which are required; documents may be None.
                    "ids": ids if ids is not None else UUIDGenerator(embeddings),
                    "embeddings": embeddings,
                    "metadatas": metadatas,
                    "documents": documents,
                }
            ),
        )

        raise_chroma_error(resp)
        return True

Note: For chromadb.api.segment.SegmentAPI we'll need to update _add and _record.
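
Since the same fallback would be needed in both places, one option is a small shared helper that each `_add` calls first. The sketch below is an assumption about how that could look: `resolve_ids` is a hypothetical function that does not exist in Chroma, and the length check is an extra safeguard added for illustration.

```python
import uuid
from typing import List, Optional, Sequence


def resolve_ids(ids: Optional[List[str]], records: Sequence[object]) -> List[str]:
    """Return the caller's IDs, or one fresh UUID4 string per record when none are given."""
    if ids is not None:
        if len(ids) != len(records):
            raise ValueError("number of ids must match number of records")
        return ids
    return [str(uuid.uuid4()) for _ in records]


embeddings = [[0.1, 0.2], [0.3, 0.4]]
generated = resolve_ids(None, embeddings)        # IDs are created automatically
explicit = resolve_ids(["x", "y"], embeddings)   # caller-supplied IDs pass through
print(generated, explicit)
```

Returning a plain list keeps the result directly serializable with `json.dumps`, with no custom iterator involved.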