Open Tijana37 opened 11 months ago
@Tijana37, thank you for the feature request. This probably makes sense, as it would improve the developer experience by letting users abstract away ID generation.
There are still some considerations to be made here, but this seems like a clear-cut case for the strategy pattern.
Here are some starting ideas:
from abc import ABC, abstractmethod
import random

# Step 1: Define the ID generation strategy interface
class IDGenerationStrategy(ABC):
    @abstractmethod
    def generate_id(self) -> ID:
        pass

# Step 2: Implement concrete ID generation strategies
class IncrementalIDGenerationStrategy(IDGenerationStrategy):
    def __init__(self):
        self.last_id = 0

    def generate_id(self) -> ID:
        self.last_id += 1
        return self.last_id

class RandomIDGenerationStrategy(IDGenerationStrategy):
    def generate_id(self) -> ID:
        return random.randint(1000, 9999)  # Just an example
Then, when a collection is created:
def create_collection(name: str,
                      metadata: Optional[CollectionMetadata] = None,
                      embedding_function: Optional[EmbeddingFunction] = ef.DefaultEmbeddingFunction(),
                      get_or_create: bool = False,
                      id_strategy: Optional[IDGenerationStrategy] = None) -> Collection
Considerations:
@HammadB @jeffchuber, maybe you have more items to add to the above list.
@tazarov @Tijana37 I think supporting auto-generated IDs on behalf of the user makes sense and is something we have thought about before. However, I don't think we can support arbitrary pluggable strategies, as it becomes quite difficult to maintain correctness in a distributed setting. If there are N chroma-servers and you use an incremental ID generation strategy, then correctness becomes difficult to maintain without complex coordination.
I could see us supporting UUID-based generation though. Is that something that would solve your use case, @Tijana37? Or do you need the IDs to be monotonically increasing integers?
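To illustrate the coordination problem, here is a minimal sketch (not Chroma code) that simulates two independent servers. Each server holds its own in-process counter in one run and uses `uuid.uuid4()` in the other. The names `IncrementalGenerator`, `UUID4Generator`, `simulate`, and `count_collisions` are hypothetical and exist only for this example.

```python
import itertools
import uuid
from typing import Iterable, List


class IncrementalGenerator:
    """Per-process counter; each server instance starts from 1 on its own."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)

    def generate_id(self) -> str:
        return str(next(self._counter))


class UUID4Generator:
    """Random UUIDs; needs no shared state between servers."""

    def generate_id(self) -> str:
        return str(uuid.uuid4())


def simulate(generators: Iterable, ids_per_server: int) -> List[str]:
    """Collect the IDs that each independent server would hand out."""
    issued: List[str] = []
    for generator in generators:
        issued.extend(generator.generate_id() for _ in range(ids_per_server))
    return issued


def count_collisions(ids: List[str]) -> int:
    """Number of IDs that duplicate an ID issued earlier."""
    return len(ids) - len(set(ids))


incremental_ids = simulate([IncrementalGenerator(), IncrementalGenerator()], 3)
uuid_ids = simulate([UUID4Generator(), UUID4Generator()], 3)
```

Without coordination, both counters issue the same sequence, so every ID from the second server collides with one from the first. The UUID variant avoids this because uniqueness does not depend on shared state.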
@HammadB, can the generation be brought all the way down to the SQLite segment API? We could still offer several collision-free ID strategies, controlled via collection metadata.
Absolutely, UUID will help me, and I believe it will cover most similar cases! There is no need for monotonically increasing integers.
@tazarov the problem is that the distributed architecture is async, and if we assign the ids I'd expect the client to return them immediately.
@Tijana37 that makes sense - uuid is definitely possible
@HammadB, then FastAPI, or whatever ingests the requests, could do it. That still gives control to the backend, which, as you pointed out, is better than doing it client-side.
But in the end, as of today, Chroma lets users generate their IDs prior to sending. Even if there are collisions, the semantics of the respective operations (add/upsert) should either prevent or allow collisions in the namespace. I'm not too sure about the distributed case though.
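The add/upsert distinction mentioned above can be sketched with a toy in-memory model. This is not Chroma's implementation: `ToyCollection` is hypothetical, and having `add` skip duplicates and report them (rather than raise) is an assumption made for the example.

```python
from typing import Dict, List


class ToyCollection:
    """In-memory stand-in for a collection namespace (hypothetical)."""

    def __init__(self) -> None:
        self._records: Dict[str, str] = {}

    def add(self, ids: List[str], documents: List[str]) -> List[str]:
        """Insert only unseen IDs; return the IDs skipped as duplicates."""
        skipped: List[str] = []
        for id_, document in zip(ids, documents):
            if id_ in self._records:
                skipped.append(id_)
            else:
                self._records[id_] = document
        return skipped

    def upsert(self, ids: List[str], documents: List[str]) -> None:
        """Insert new IDs and overwrite existing ones."""
        for id_, document in zip(ids, documents):
            self._records[id_] = document

    def get(self, id_: str) -> str:
        return self._records[id_]


collection = ToyCollection()
first_skipped = collection.add(["a"], ["first"])
second_skipped = collection.add(["a"], ["second"])  # collision: kept out
after_add = collection.get("a")
collection.upsert(["a"], ["third"])                 # collision: overwrites
after_upsert = collection.get("a")
```

In this model a colliding `add` leaves the stored record untouched, while a colliding `upsert` replaces it, which is the sense in which the operation's semantics decide whether a collision is prevented or allowed.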
@HammadB, here's an idea about implementing this in chromadb.api.fastapi.FastAPI and/or chromadb.api.segment.SegmentAPI (making it applicable for both client and client/server modes):
class UUIDGenerator(List[str]):
    def __init__(self, max: Union[int, List]):
        super().__init__()
        if isinstance(max, int):
            self._max = max
        else:
            self._max = len(max)

    def __iter__(self):
        return self

    def __next__(self):
        if self._max > 0:
            self._max -= 1
            return str(uuid.uuid4())
        else:
            raise StopIteration
...
    @override
    def _add(
        self,
        ids: Optional[IDs],
        collection_id: UUID,
        embeddings: Embeddings,
        metadatas: Optional[Metadatas] = None,
        documents: Optional[Documents] = None,
    ) -> bool:
        """
        Adds a batch of embeddings to the database
        - pass in column oriented data lists
        """
        resp = self._session.post(
            self._api_url + "/collections/" + str(collection_id) + "/add",
            data=json.dumps(
                {
                    "ids": ids if ids is not None else UUIDGenerator(documents),
                    "embeddings": embeddings,
                    "metadatas": metadatas,
                    "documents": documents,
                }
            ),
        )
        raise_chroma_error(resp)
        return True
Note: For chromadb.api.segment.SegmentAPI we'll need to update _add and _record.
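As a possible simplification of the idea above, the missing IDs could be filled in with a plain list before the payload is built. This is only a sketch: `ensure_ids` is a hypothetical helper, not part of chromadb. It takes the count from `embeddings` rather than `documents`, since `documents` is optional in the `_add` signature shown above, and it returns an ordinary list so that `json.dumps` does not have to deal with a list subclass.

```python
import json
import uuid
from typing import List, Optional, Sequence


def ensure_ids(ids: Optional[List[str]], embeddings: Sequence[object]) -> List[str]:
    """Return the caller's IDs, or one fresh UUID4 string per embedding."""
    if ids is not None:
        if len(ids) != len(embeddings):
            raise ValueError("ids and embeddings must have the same length")
        return list(ids)
    return [str(uuid.uuid4()) for _ in embeddings]


embeddings = [[0.1, 0.2], [0.3, 0.4]]
generated = ensure_ids(None, embeddings)
provided = ensure_ids(["x", "y"], embeddings)

# The result is a plain list, so it serializes like any other payload field.
payload = json.dumps({"ids": generated, "embeddings": embeddings})
```

Returning the generated list also gives the caller a way to hand the IDs back to the user immediately, which matters given the earlier point about the async distributed architecture.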
Describe the problem
What I find really helpful in other databases is the option to set auto_increment on the primary key, so that the database generates increasing integers as IDs.
Describe the proposed solution
What I would like to see is an auto_increment option when creating a collection, so that IDs do not need to be set when adding items because Chroma creates them automatically.
Example code:
collection = vector_db_client.create_collection(name=chroma_collection_name, auto_increment=True)  # setting something like this
collection.add(
    documents=labels,
    metadatas=metadatas,
    ids=ids_entities,  # <-- in order to NOT need this line!
)
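As a caller-side workaround in the meantime, a small wrapper can generate the IDs before calling `add`. This is a sketch under assumptions: `add_with_generated_ids` is a hypothetical helper, and it only assumes that the collection object exposes `add(ids=..., documents=..., metadatas=...)` as in the example above. It produces UUID strings rather than increasing integers.

```python
import uuid
from typing import Any, Dict, List, Optional


def add_with_generated_ids(
    collection: Any,
    documents: List[str],
    metadatas: Optional[List[Dict[str, Any]]] = None,
) -> List[str]:
    """Generate one UUID4 string per document, add the batch, return the IDs."""
    ids = [str(uuid.uuid4()) for _ in documents]
    collection.add(ids=ids, documents=documents, metadatas=metadatas)
    return ids
```

The helper returns the generated IDs so the caller can still reference the records later, for example to update or delete them.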
Alternatives considered
No response
Importance
would make my life easier
Additional Information
No response