atroyn commented 4 months ago

Auto-Ingestion Batching

Chroma users tend to throw a bunch of data at Chroma at once to get their collections set up. On many developer-class machines, and remote notebook environments like Colab, this immediately leads to high memory consumption and hard swapping, making Chroma appear much slower on ingestion than it actually is in the (platonic) world of ideal forms.

Additionally, the user does not know if Chroma has crashed, their system has hung, or how ingestion is going in general.

API Design

# New argument on add
collection.add(..., batch_size=128) 

# Display a progress bar
[---------60%           ]

# Impl 1.
# Wrap add classic in a loop with tqdm strapped on it 
def add(self, batch_size=128, **kwargs):
    for ... 
        self._add(**kwargs)

[Complexity] Subtask

[Low] Ship it with a reasonable default
[Low] Investigate the tradeoffs to make sure we ship a reasonable default
[High] Paralellize and pipeline this with EF (and Chunker). Whole thing is parallelizable. Knowing the batch size in advance helps considerably.

Misc

Is there something here we could do with returning query results also ?

codetheweb commented 2 months ago

I think the main problem with auto-batching during ingest is that since Chroma doesn't support transactions, there needs to be a good way to report partial failures.

How does this interface feel?

records: (Iterator[ID], Iterator[Document], Iterator[Embedding])

# .batch_add() requires IDs to be specified
for ingested_ids in collection.batch_add(records):
  user_db.documents.update(ingested_ids, ingested=True)

This also lends itself to being wrapped with tqdm/rich.track.

atroyn commented 2 months ago

could we call upsert instead so that the client doesn't need to know about any of its db internals? Maybe we want to change upsert to return some indication of what was changed and what wasn't?

codetheweb commented 2 months ago

could we call upsert instead so that the client doesn't need to know about any of its db internals? Maybe we want to change upsert to return some indication of what was changed and what wasn't?

by that do you mean something like this?

result = collection.upsert(ids=ids, documents=documents)

user_db.documents.update(result["successful_ids"], ingested=True)
user_db.documents.update(result["failed_ids"], ingested=False)

?

I don't like this as much for a few reasons:

there's no way for a user to know the state of the operation (is it 50% done yet?)
- could be fixed by adding a parameter to print a progress bar, but there's not an elegant way to inform of progress programmatically
similar to above, users can't update external state during the operation
- In other words, imagine a scenario where the operation on Chroma succeeds but their application database goes down sometime during the operation and so the user_db.documents.update() call after upsert() fails. With the original API proposed, they would have been able to save their progress to their database until it became unavailable and resume from that point. With this alternative, they would have to redo the entire operation.
changes the existing semantics of upsert(): users currently expect that it raises errors

atroyn commented 2 months ago

No I mean underneath batch_add we call upsert ourselves. This will be idempotent for any existing IDs, so a restart won't cause duplicates.

codetheweb commented 2 months ago

oh, so something like

records: (Iterator[ID], Iterator[Document], Iterator[Embedding])

# .batch_upsert() requires IDs to be specified
for ingested_ids in collection.batch_upsert(records):
  user_db.documents.update(ingested_ids, ingested=True) # optional, can use to avoid repeating work

?

that makes sense to me

atroyn commented 2 months ago

something like that yeah

codetheweb commented 2 months ago

I started implementing the proposed API above, but after playing around with it I think it adds some cognitive overhead for not much benefit (when should I call batch_upsert() instead of upsert()? can I batch things myself for more control?).

As a client-only feature I'm not really convinced that this approach is better over improving the existing batch utilities. Both approaches end up with roughly the same number of LoC from the user's perspective, but packaging it as a utility for now makes it easier to understand imo.

atroyn commented 2 months ago

I don't think I agree. To use the batching utils a user has to, in addition to the regular code:

    from chromadb.utils.batch_utils import create_batches

...

    for (ids, embeddings) in create_batches(client, (ids, embeddings), print_progress_description="Adding documents..."):
        collection.add(ids=ids, embeddings=embeddings)

which is much more confusing. My ideal API for this feature is that it's completely transparent to the user, i.e. they never have to think of the difference between add and adding as a batch. We decide for them when we should batch and display a progress bar (e.g. if we are over some specific number of elements being added) of the batch progress behind the scenes. We then display a progress bar if we go into batching for them.

So they would never call batch_upsert or batch_add. They would just call add or upsert and we do the right thing for them behind the scenes, without any additional imports or reasoning necessary. If a user wanted to control batching, we could provide an additional argument on add which might control the batch size.

codetheweb commented 2 months ago

Because Chroma doesn't support transactions, automatically batching would raise the same issues I noted here: https://github.com/chroma-core/chroma/issues/2282#issuecomment-2263927583. How do we report a partially succeeded state?

It could work if we ignored collection.add() (since it's not idempotent) and changed the interface of collection.upsert() to be def upsert(...) -> Iterator[List[ID]], but I'm not entirely convinced that's the best option since the return signature is so different from the rest of the methods (and perhaps the user's expectations).

collection.delete() also needs to be batched. Would we also change the return signature there?

atroyn commented 2 months ago

In the case of a partially failed batch, the simplest thing to do would be assume the user would run the same operation again; in the case of .add, IDs that were alread added will fail silently, in case of upsert they would be updated but the result for the user is the same, then any IDs which did not succeed in the same batch will be added.

I don't think this requires changing the return signature.

codetheweb commented 2 months ago

In the case of a partially failed batch, the simplest thing to do would be assume the user would run the same operation again; in the case of .add, IDs that were alread added will fail silently, in case of upsert they would be updated but the result for the user is the same, then any IDs which did not succeed in the same batch will be added.

I don't think this requires changing the return signature.

I guess maybe the answer is that if users want to get intermediate results/track progress they can manually batch themselves.

We then display a progress bar if we go into batching for them.

I'm wary of always doing this by default as it's probably not desired if you're running a production app, maybe we only display a progress bar if it's an interactive terminal?

What do you think about doing both automatic batching and providing a batch utility for users who want more control?

atroyn commented 2 months ago

The thing we're trying to help / solve for users is that one of the first things they want to do is load a bunch of data into Chroma, and making that experience better (as per the issue description).

I think the feature where users track progress themselves using e.g. an iterator is not as important right now as making it more efficient to upload a bunch of data by not causing hard swapping if / when we can avoid it, and showing users a progress bar.

Could we condition the progress bar on something like log level? Then production users can turn it off if they don't want it.

atroyn commented 2 months ago

Providing a batching utility might be nice for users who do want more control themselves, but my aim with this feature is for the default experience to be much better.

codetheweb commented 2 months ago

The thing we're trying to help / solve for users is that one of the first things they want to do is load a bunch of data into Chroma, and making that experience better (as per the issue description).

👍

Could we condition the progress bar on something like log level? Then production users can turn it off if they don't want it.

Yeah, I think going on whether the terminal is interactive or you're in a notebook is probably a good check; unless you see a scenario where that doesn't work. I'm still wary of taking over their terminal by default--I can't think of another popular SDK package that does this--but I agree with the mission of making the ingest experience better.

atroyn commented 2 months ago

How do you mean 'take over the terminal' ? How is this different to regular log output except that if written to a file it looks weird?

codetheweb commented 2 months ago

How do you mean 'take over the terminal' ?

e.x. if you print something else during the operation the output isn't great:

https://github.com/user-attachments/assets/38580a69-1725-474e-ac45-b4499ac3f9bb

(This is with tqdm, which is actually a much better experience than what I'm used to from Node.js where printing to the console during a spinner like ora tends to break everything.)

codetheweb commented 2 months ago

rich handles this better:

https://github.com/user-attachments/assets/ff43644b-05ec-4171-b944-89bb6e7b26b0

atroyn commented 2 months ago

rich looks OK to me

codetheweb commented 2 months ago

@HammadB had some similar concerns about printing progress by default and transparently batching since the resulting guarantees may be different.

How does this API feel? (This is basically what I had proposed before; my concern before was that having a separately-exported batch util was simpler but I've been convinced that's not true.)

for ingested_set in collection.batch_upsert(ids=ids, embeddings=embeddings, print_progress=True):
  user_db.update(ids=ingested_set, ingested=True)

Easy to copy & embed, no extra imports, makes the progress printing behavior explicit, and allows users to update external state during the operation if they wish.

atroyn commented 2 months ago

Discussed offline;

will implement this as batch_upsert/ batch_add for now, punting on what to do with auto-batching for add.
we can hoist this to be the default way to add data to chroma when you add for the first time in our getting started docs.
we can make print_progress=True part of the copy/paste code examples in the docs for beginners to get it by 'default'
we'll look into what a sensible default here is; remembering that the point of this feature is to avoid developers having a bad first experience where they ingest a bunch of data and chroma appears to be much slower than we really are

chroma-core / chroma

[New Feature][Ease of Use] Auto-Ingestion Batching #2282

Auto-Ingestion Batching

API Design

[Complexity] Subtask

Misc