chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: "Cannot submit more than 5,461 embeddings at once. Please submit your embeddings in batches of size 5,461 or less." but on running *.delete* #2181

Open niceblue88 opened 6 months ago

niceblue88 commented 6 months ago

What happened?

I'm having some serious usability problems with this embedding limit. I first hit it on upsert, where I think it is somewhat understandable, and chunking as described in #1049 (https://github.com/chroma-core/chroma/issues/1049) fixed that case. HOWEVER, I am inexplicably hitting the same error on delete too, where it is much harder to understand why the limit should apply. The error is triggered by: collection.delete(where={"dochash": dochash}), where dochash is a single simple hash string. I would think this is an extremely common use case, and not something that can be chunked.

Versions

Chroma v0.5.0. Running on Windows. Python 3.10

Relevant log output

No response

tazarov commented 6 months ago

@niceblue88, sorry you're facing this issue. Let me try to explain why it is happening and we can then explore options to fix this.

The sqlite3 library on your system is compiled with certain limits. One of these is MAX_VARIABLE_NUMBER, the maximum number of bound variables sqlite3 will accept in a single statement. This, by extension, is reflected in Chroma's max_batch_size: every time you add/update/upsert/delete records, the number of records is checked against it. You correctly point out that you are not supplying ids in your delete(). However, Chroma first turns your where clause into a list of matching records to delete, and in your case that list exceeds max_batch_size, hence the error you see.

I agree this is not a pleasant issue to have; deletes should work as expected regardless of how many entries the where clause matches. Without further investigation I can't say how much work and impact fixing this would involve, but I'll dig deeper.

In the meantime, here's a way to avoid this problem:

import chromadb

client = chromadb.PersistentClient(path="persist_dir")
collection = client.get_or_create_collection("my_collection")  # your collection

# get() returns a dict; the ids of the matching records are under "ids"
ids_to_delete = collection.get(where={"dochash": dochash})["ids"]

# delete in slices no larger than the client's max_batch_size
for batch_index in range(0, len(ids_to_delete), client.max_batch_size):
    collection.delete(ids=ids_to_delete[batch_index : batch_index + client.max_batch_size])
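
(Two small notes on the sketch: the collection name is a placeholder, and since only the ids are needed here, passing include=[] to get() should skip fetching documents and metadata, which it returns by default.)
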
niceblue88 commented 5 months ago

Great reply, thank you for explaining so clearly. I suspected it was exactly this, but the details you provide on SQLite3 are helpful. Digging into SQLite, any version higher than 3.32 is supposed to have a default MAX_VARIABLE_NUMBER of 32,766. I checked the version of sqlite3 in Python on Windows, and it tells me it is 3.43.1. Correct me if I am wrong, but shouldn't that mean it can handle at least 32,000 ids? It seems not to, though. Is that because Chroma is not using the sqlite3 version I see in Python on Windows but some other sqlite3 instance? Or has the 3.43.1 version for some reason still been compiled with a lower MAX_VARIABLE_NUMBER? I presume this SQLite comes with the Python 3.11 install.
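
For anyone else wanting to check, the SQLite library bundled with your Python can be inspected like this:

import sqlite3
print(sqlite3.sqlite_version)  # version of the SQLite library linked into Python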

Either way, I think this is ideally something addressed upfront in the install guide for ChromaDB. I will look into upgrading my SQLite version manually on Windows and see if that fixes the problem. If it does, perhaps that should be the recommendation in the install guide (for Windows), and for other platforms, perhaps guidance on ensuring MAX_VARIABLE_NUMBER is similarly large. In any case, I now understand why others also hit this problem, just at a much higher threshold of over 40,000 ids.

tazarov commented 5 months ago

@niceblue88, the limit depends heavily on how your SQLite3 was compiled, not just on its version. For instance, on my Mac M3 the resulting max_batch_size is up to 83k records. You can see how max_batch_size is calculated here:

https://github.com/chroma-core/chroma/blob/37a030c60270fd44b9c025e5c415cfaefe1be410/chromadb/db/mixins/embeddings_queue.py#L257-L275
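
If you want to check the effective limit on your own system, here's a rough sketch. It assumes Python 3.11+ (for sqlite3.Connection.getlimit()) and that each record consumes 6 bound variables, which is what makes 32,766 variables come out to 5,461 records: 32766 // 6 == 5461.

import sqlite3

conn = sqlite3.connect(":memory:")

# Runtime cap on bound variables per statement (getlimit() needs Python 3.11+)
max_vars = conn.getlimit(sqlite3.SQLITE_LIMIT_VARIABLE_NUMBER)
print("SQLITE_LIMIT_VARIABLE_NUMBER:", max_vars)

# Assuming 6 bound variables per record, as in the linked code:
print("derived max_batch_size:", max_vars // 6)  # e.g. 32766 // 6 == 5461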

It is a good idea to add a note to the install docs, or even the API usage docs, informing users about these system-specific limitations of Chroma and the underlying SQLite3.

niceblue88 commented 5 months ago

I know it was considered, but the chunking could be included by default in Chroma (with negligible overhead when the number of ids does not exceed the max). At most 5 lines are needed on the library side, as opposed to every client needing to implement chunking. Why is this not done?

tazarov commented 5 months ago

> I know it was considered, but the chunking could be included by default in Chroma (with negligible overhead when the number of ids does not exceed the max). At most 5 lines are needed on the library side, as opposed to every client needing to implement chunking. Why is this not done?

You are right that it may only be a few lines, but it comes with significant assumptions. We went through this a while back; have a look here: https://github.com/chroma-core/chroma/pull/1077#pullrequestreview-1609795730
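
For illustration, a naive library-side helper might look like the sketch below (the helper name is hypothetical). The catch is in what it silently assumes: a single delete is one atomic operation, whereas a chunked delete that fails partway through leaves some records deleted and others not.

def delete_in_batches(collection, ids, batch_size):
    # Hypothetical helper: one delete call per chunk of at most batch_size ids.
    # Unlike a single delete, a failure partway through leaves the operation
    # half-applied: earlier chunks are gone, later ones remain.
    for i in range(0, len(ids), batch_size):
        collection.delete(ids=ids[i : i + batch_size])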