chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0
15.69k stars, 1.32k forks

[Feature Request]: Smoother API for .delete() #3207

Open mr-infty opened 1 week ago

mr-infty commented 1 week ago

Describe the problem

At the moment, there is no convenient way to delete all entries in a collection (without deleting the collection itself). Even though .delete() accepts None as an argument to ids, there is no "wildcard filter" that could be given to where as an argument.

Describe the proposed solution

Either make .delete() delete all entries in the collection or make it possible to pass where={} as a wild-card filter matching all documents.

Alternatives considered

No response

Importance

would make my life easier

Additional Information

No response

tazarov commented 1 week ago

@mr-infty, you can use this:

import uuid

import chromadb
import numpy as np

# 500 random 384-dimensional embeddings to populate the collection
data = np.random.uniform(-1, 1, (500, 384))

client = chromadb.PersistentClient("delete_all")
collection = client.get_or_create_collection("test_collection")
ids = [str(uuid.uuid4()) for _ in range(data.shape[0])]
documents = [f"document {i}" for i in range(data.shape[0])]
collection.add(ids=ids, embeddings=data, documents=documents)

print("Collection count", collection.count())

# No entry sets the metadata key "__bastion_key__", so this $ne filter ends up matching every entry
collection.delete(where={"__bastion_key__": {"$ne": 1}})

print("Collection count after delete", collection.count())

Works like a charm. However, note that because of how the HNSW index works, it is recommended to delete and recreate the collection instead. The caveat is that HNSW grows without bound: deleted embeddings are only flagged as deleted, not removed.
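The delete-and-recreate route could be sketched roughly as follows. The helper name recreate_collection is hypothetical (it is not part of Chroma); it relies only on the client's get_collection, delete_collection and create_collection calls, and it assumes that carrying over the collection's metadata is all your setup needs. The recreated collection gets a new ID, and a custom embedding function would have to be passed to create_collection again.

```python
def recreate_collection(client, name):
    """Drop the collection called `name` and create an empty one in its place.

    Hypothetical helper, not part of the Chroma API. `client` is assumed to be
    a chromadb client such as chromadb.PersistentClient("delete_all").
    """
    old = client.get_collection(name)

    # Empty metadata dicts have been rejected by Chroma (see the discussion in
    # this thread), so an empty dict is normalised to None before re-creating.
    metadata = old.metadata or None

    # Deleting the collection also discards its HNSW index, so no entries
    # that were merely flagged as deleted are left behind.
    client.delete_collection(name)
    return client.create_collection(name=name, metadata=metadata)
```

With the example above this would be called as collection = recreate_collection(client, "test_collection"). Any existing handle to the old collection should be discarded afterwards, since it refers to a collection that no longer exists.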

mr-infty commented 6 days ago

Okay, I guess that collection.delete(where={"__bastion_key__": {"$ne":1}}) is a usable workaround, but surely something as simple as deleting all items in the collection should have a simple interface? Moreover, it appears that the API's unwillingness to accept empty objects (metadata or where filters) has caused trouble elsewhere.

It seems to me that allowing empty metadata and empty filters would streamline the API a lot.

HammadB commented 1 day ago

We are actually somewhat opposed to letting people easily delete everything in their collection; it's too easy a footgun to trigger accidentally.

Maybe we could add a safety override, i.e.

collection.delete(all=True) deletes everything, whereas collection.delete() is a no-op. But this creates other confusing states.
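To make the safety-override idea concrete, here is a minimal user-side sketch of what such behaviour might look like as a wrapper. Nothing here is an existing Chroma feature: guarded_delete and its delete_all flag are hypothetical names (delete_all is used instead of all to avoid shadowing the Python builtin). The wrapper assumes that collection.get(include=[]) returns the ids of every entry, and raising when a selector is combined with delete_all=True is just one possible way to handle one of the confusing states.

```python
def guarded_delete(collection, ids=None, where=None, where_document=None,
                   *, delete_all=False):
    """Delete entries, wiping the whole collection only on explicit request.

    Hypothetical wrapper, not part of the Chroma API.
    """
    has_selector = any(arg is not None for arg in (ids, where, where_document))

    if has_selector and delete_all:
        # Ambiguous request: refuse rather than guess what the caller meant.
        raise ValueError("pass either a selector or delete_all=True, not both")

    if has_selector:
        # Ordinary targeted delete, forwarded unchanged.
        collection.delete(ids=ids, where=where, where_document=where_document)
        return

    if not delete_all:
        # Bare call without the override: deliberately do nothing.
        return

    # Explicit override: fetch ids only (no documents/embeddings), then delete them.
    all_ids = collection.get(include=[])["ids"]
    if all_ids:
        collection.delete(ids=all_ids)
```

Called as guarded_delete(collection) this does nothing, while guarded_delete(collection, delete_all=True) empties the collection. For very large collections the ids would probably need to be deleted in batches, and the HNSW caveat about flagged-but-not-removed embeddings still applies.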

tazarov commented 1 day ago

@mr-infty, we have a similar mechanism for deleting everything: reset(). However, reset(), much like delete() with no params, throws an error unless a flag is explicitly configured (and that flag is off by default, of course).

MySQL has something similar with SET SQL_SAFE_UPDATES = 1, so perhaps a similar flag could make sense here.

Regarding empty params, that does not feel very ergonomic to me. Wouldn't it make more sense for the absence of parameters to be treated as empty params, rather than forcing callers to pass empty params? It also introduces confusion: is a delete whose filter specifies nothing a delete of nothing or a delete of everything? Much like the example I showed you above, it is ugly and confusing as hell (it does the job, though).

Going down that 🐰 hole, you might as well make the argument for a completely separate method that conveys in unambiguous terms what it does, e.g. collection.truncate(). I think it is no coincidence that the SQL standard defines it. Furthermore, we could look for opportunities to make truncate more efficient: not just deleting everything in the collection, but also making sure you start with a fresh, empty collection. Today, if you apply the workaround above, or if we implement delete() with no params the same way we implement deletions with params, you end up with an HNSW index that is full of "dead" labels and tons of data you don't need or want. Instead, what we could do with truncate is recreate the index and make sure the metadata is properly scrubbed, as if you were calling delete + create (but without changing collection characteristics like ID, HNSW config, etc.).