atroyn opened 4 months ago
I think the main problem with auto-batching during ingest is that since Chroma doesn't support transactions, there needs to be a good way to report partial failures.
How does this interface feel?
```python
records: (Iterator[ID], Iterator[Document], Iterator[Embedding])

# .batch_add() requires IDs to be specified
for ingested_ids in collection.batch_add(records):
    user_db.documents.update(ingested_ids, ingested=True)
```
This also lends itself to being wrapped with `tqdm`/`rich.track`.
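Since the proposal yields IDs batch by batch, a progress wrapper is a thin layer on top. A minimal sketch, assuming a `collection` object exposing the proposed `batch_add(records)` iterator (the helper name `ingest_with_progress` and the `on_batch` callback are mine, not part of the proposal):

```python
from tqdm import tqdm

def ingest_with_progress(collection, records, total, on_batch=None):
    """Drive the proposed batch_add iterator under a tqdm progress bar."""
    with tqdm(total=total, desc="Ingesting") as bar:
        for ingested_ids in collection.batch_add(records):
            if on_batch is not None:
                on_batch(ingested_ids)  # e.g. mark rows as ingested in user_db
            bar.update(len(ingested_ids))
```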
Could we call `upsert` instead so that the client doesn't need to know about any of its db internals? Maybe we want to change `upsert` to return some indication of what was changed and what wasn't?
> Could we call `upsert` instead so that the client doesn't need to know about any of its db internals? Maybe we want to change `upsert` to return some indication of what was changed and what wasn't?
By that do you mean something like this?

```python
result = collection.upsert(ids=ids, documents=documents)
user_db.documents.update(result["successful_ids"], ingested=True)
user_db.documents.update(result["failed_ids"], ingested=False)
```
I don't like this as much for a few reasons:

- The `user_db.documents.update()` call after `upsert()` can itself fail. With the original API proposed, the user would have been able to save their progress to their database until it became unavailable and resume from that point. With this alternative, they would have to redo the entire operation.
- It changes the semantics of `upsert()`: users currently expect that it raises errors.

No, I mean underneath `batch_add` we call `upsert` ourselves. This will be idempotent for any existing IDs, so a restart won't cause duplicates.
Oh, so something like:

```python
records: (Iterator[ID], Iterator[Document], Iterator[Embedding])

# .batch_upsert() requires IDs to be specified
for ingested_ids in collection.batch_upsert(records):
    user_db.documents.update(ingested_ids, ingested=True)  # optional, can use to avoid repeating work
```

?
that makes sense to me
something like that yeah
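Under the hood, `batch_upsert` could be little more than a chunking loop over the existing `upsert`. A sketch under the assumptions above (the default chunk size and the generator shape are illustrative, not a committed design):

```python
from itertools import islice

def batch_upsert(collection, ids, documents, batch_size=1000):
    """Upsert records in fixed-size chunks, yielding each chunk's IDs
    as soon as it has been persisted. Because upsert is idempotent for
    existing IDs, re-running the whole call after a failure is safe."""
    id_iter, doc_iter = iter(ids), iter(documents)
    while True:
        chunk_ids = list(islice(id_iter, batch_size))
        if not chunk_ids:
            return
        chunk_docs = list(islice(doc_iter, batch_size))
        collection.upsert(ids=chunk_ids, documents=chunk_docs)
        yield chunk_ids
```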
I started implementing the proposed API above, but after playing around with it I think it adds some cognitive overhead for not much benefit (when should I call `batch_upsert()` instead of `upsert()`? Can I batch things myself for more control?).
As a client-only feature I'm not really convinced that this approach is better over improving the existing batch utilities. Both approaches end up with roughly the same number of LoC from the user's perspective, but packaging it as a utility for now makes it easier to understand imo.
I don't think I agree. To use the batching utils a user has to, in addition to the regular code:

```python
from chromadb.utils.batch_utils import create_batches
...
for (ids, embeddings) in create_batches(client, (ids, embeddings), print_progress_description="Adding documents..."):
    collection.add(ids=ids, embeddings=embeddings)
```
which is much more confusing. My ideal API for this feature is completely transparent to the user, i.e. they never have to think about the difference between `add` and adding as a batch. We decide for them when to batch (e.g. if more than some specific number of elements is being added) and display a progress bar behind the scenes when we do.

So they would never call `batch_upsert` or `batch_add`. They would just call `add` or `upsert` and we do the right thing for them behind the scenes, without any additional imports or reasoning necessary. If a user wanted to control batching, we could provide an additional argument on `add` which might control the batch size.
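The transparent version could look something like the sketch below, which splits only when the payload crosses a size threshold. The threshold value and the `_submit` internal are illustrative assumptions, not Chroma's actual internals:

```python
AUTO_BATCH_THRESHOLD = 5_000  # illustrative cutoff for "large enough to batch"
DEFAULT_BATCH_SIZE = 1_000

def add(collection, ids, documents, batch_size=DEFAULT_BATCH_SIZE):
    """Pass small payloads straight through; silently chunk large ones."""
    if len(ids) <= AUTO_BATCH_THRESHOLD:
        collection._submit(ids, documents)  # hypothetical single-request path
        return
    for start in range(0, len(ids), batch_size):
        collection._submit(ids[start:start + batch_size],
                           documents[start:start + batch_size])
```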
Because Chroma doesn't support transactions, automatically batching would raise the same issues I noted here: https://github.com/chroma-core/chroma/issues/2282#issuecomment-2263927583. How do we report a partially succeeded state?
It could work if we ignored `collection.add()` (since it's not idempotent) and changed the interface of `collection.upsert()` to `def upsert(...) -> Iterator[List[ID]]`, but I'm not entirely convinced that's the best option since the return signature is so different from the rest of the methods (and perhaps the user's expectations).
`collection.delete()` also needs to be batched. Would we also change the return signature there?
In the case of a partially failed batch, the simplest thing to do would be to assume the user runs the same operation again. In the case of `.add`, IDs that were already added will fail silently; in the case of `upsert`, they would be updated, but the result for the user is the same. Any IDs which did not succeed in the first run will then be added.

I don't think this requires changing the return signature.
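The retry argument can be made concrete with a toy in-memory model (this is not Chroma's implementation, just an illustration of the semantics): re-running `upsert` with the same inputs converges to the same end state, while `add` skips IDs it has already seen.

```python
class ToyCollection:
    """Toy model of add vs upsert retry semantics (not Chroma's code)."""
    def __init__(self):
        self.store = {}

    def add(self, ids, documents):
        for i, doc in zip(ids, documents):
            if i in self.store:
                continue  # already-added IDs fail silently on a retry
            self.store[i] = doc

    def upsert(self, ids, documents):
        for i, doc in zip(ids, documents):
            self.store[i] = doc  # overwrite: same inputs, same end state

coll = ToyCollection()
coll.upsert(["a", "b"], ["doc-a", "doc-b"])
coll.upsert(["a", "b"], ["doc-a", "doc-b"])  # retry after a partial failure
assert coll.store == {"a": "doc-a", "b": "doc-b"}
```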
> In the case of a partially failed batch, the simplest thing to do would be to assume the user runs the same operation again; in the case of `.add`, IDs that were already added will fail silently, in the case of `upsert` they would be updated but the result for the user is the same, then any IDs which did not succeed in the same batch will be added. I don't think this requires changing the return signature.
I guess maybe the answer is that if users want to get intermediate results/track progress they can manually batch themselves.
> We then display a progress bar if we go into batching for them.
I'm wary of always doing this by default as it's probably not desired if you're running a production app, maybe we only display a progress bar if it's an interactive terminal?
What do you think about doing both automatic batching and providing a batch utility for users who want more control?
The thing we're trying to help / solve for users is that one of the first things they want to do is load a bunch of data into Chroma, and making that experience better (as per the issue description).
I think the feature where users track progress themselves using e.g. an iterator is not as important right now as making it more efficient to upload a bunch of data by not causing hard swapping if / when we can avoid it, and showing users a progress bar.
Could we condition the progress bar on something like log level? Then production users can turn it off if they don't want it.
Providing a batching utility might be nice for users who do want more control themselves, but my aim with this feature is for the default experience to be much better.
> The thing we're trying to help / solve for users is that one of the first things they want to do is load a bunch of data into Chroma, and making that experience better (as per the issue description).
👍
> Could we condition the progress bar on something like log level? Then production users can turn it off if they don't want it.
Yeah, I think going on whether the terminal is interactive or you're in a notebook is probably a good check, unless you see a scenario where that doesn't work. I'm still wary of taking over their terminal by default (I can't think of another popular SDK package that does this), but I agree with the mission of making the ingest experience better.
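One common heuristic for the "interactive terminal or notebook" check (sketched here as an assumption, not Chroma's current behavior) is to test whether stderr is a TTY and fall back to detecting an IPython/Jupyter kernel:

```python
import sys

def should_show_progress() -> bool:
    """Show progress only in interactive sessions: a real terminal or a notebook."""
    # tqdm-style bars write to stderr, so that's the stream to check
    if hasattr(sys.stderr, "isatty") and sys.stderr.isatty():
        return True
    try:
        from IPython import get_ipython  # only importable under IPython/Jupyter
    except ImportError:
        return False
    return get_ipython() is not None
```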
How do you mean 'take over the terminal'? How is this different to regular log output, except that if written to a file it looks weird?

> How do you mean 'take over the terminal'?
E.g. if you print something else during the operation, the output isn't great:
https://github.com/user-attachments/assets/38580a69-1725-474e-ac45-b4499ac3f9bb
(This is with tqdm, which is actually a much better experience than what I'm used to from Node.js where printing to the console during a spinner like ora tends to break everything.)
`rich` handles this better:
https://github.com/user-attachments/assets/ff43644b-05ec-4171-b944-89bb6e7b26b0
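What `rich` does here is route console output through the live display, so a print lands above the bar instead of tearing it. A minimal sketch of that pattern (assuming `rich` is installed; the message and totals are arbitrary):

```python
from rich.progress import Progress

with Progress() as progress:
    task = progress.add_task("Ingesting...", total=100)
    for i in range(100):
        if i == 50:
            # printing via the progress console renders the line
            # above the live bar instead of corrupting it
            progress.console.print("halfway there")
        progress.advance(task)
```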
rich looks OK to me
@HammadB had some similar concerns about printing progress by default and transparently batching since the resulting guarantees may be different.
How does this API feel? (This is basically what I had proposed before; my concern was that having a separately-exported batch util would be simpler, but I've been convinced that's not true.)
```python
for ingested_set in collection.batch_upsert(ids=ids, embeddings=embeddings, print_progress=True):
    user_db.update(ids=ingested_set, ingested=True)
```
Easy to copy & embed, no extra imports, makes the progress printing behavior explicit, and allows users to update external state during the operation if they wish.
Discussed offline:

- `batch_upsert`/`batch_add` for now, punting on what to do with auto-batching for `add`.
- Make `print_progress=True` part of the copy/paste code examples in the docs for beginners to get it by 'default'.
Auto-Ingestion Batching
Chroma users tend to throw a bunch of data at Chroma at once to get their collections set up. On many developer-class machines, and remote notebook environments like Colab, this immediately leads to high memory consumption and hard swapping, making Chroma appear much slower on ingestion than it actually is in the (platonic) world of ideal forms.
Additionally, the user does not know if Chroma has crashed, their system has hung, or how ingestion is going in general.