chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

Discussion: in-memory and chroma client-server --> sharing code? #14

Closed jeffchuber closed 2 years ago

jeffchuber commented 2 years ago

chroma-client is mainly responsible for the public API and ferrying data to the backend. chroma-server is mainly responsible for storing data and running computations.

Inside chroma there are 2 wolves:

Wolf B is fairly obvious... there are chroma-client and chroma-server, and they work together.

Wolf A however... does chroma-client have extra functionality to handle the in-memory use case? Or is there separate code that is shared between chroma-server and chroma-client?

I think some code has to be shared... the code for "doing the maths" - and possibly more, things like data formats, etc. How should this be structured?
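
To make the "doing the maths" part concrete, here's a rough sketch of the kind of code that would have to be shared - a brute-force cosine-similarity search. The module name and signature are just illustrative, not a proposal for the actual layout:

```python
# Hypothetical shared module (e.g. chroma_shared/nn.py). Both the in-process
# client (Wolf A) and chroma-server (Wolf B) could import this, so the
# "maths" lives in exactly one place.
import numpy as np


def nearest_neighbors(query: np.ndarray, embeddings: np.ndarray, k: int = 5):
    """Brute-force cosine-similarity search over a matrix of embeddings.

    query:      shape (d,)
    embeddings: shape (n, d)
    Returns the indices of the k most similar rows, best first.
    """
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q
    return np.argsort(-scores)[:k]
```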

jeffchuber commented 2 years ago

@levand @atroyn Opened this issue to discuss. I realized we should talk through this while building https://github.com/chroma-core/chroma/pull/13

jeffchuber commented 2 years ago

Making a list (will be updated inline) of projects that we can perhaps find some inspiration from...

- Logging / APM
- Product Analytics
- Orchestration
- ML monitoring / experiment management

to be continued........

levand commented 2 years ago

This is a great question.

We're talking about an initial MVP, right? My question is... does a "Wolf A" model actually satisfy the "V" in "MVP"... is it viable? It could make a good demo and help generate sales leads, for sure, but as I understand the product, we're almost certainly going to need Wolf B for any kind of production use. For example, as soon as we start persisting data, we're going to need to do it somewhere other than a developer's laptop or CI instance.

So I'm going to go out on a limb and say that I don't think a completely in-process model makes sense even for an MVP. In this hypothesis, we're going to have a frontend and a backend (chroma-client and chroma-server, as hypothesized).

But there's still a decision point to be made here:

  1. Do we have a "thick" backend and "thin" frontend, with most of the logic and algorithmic work performed in the backend and the frontend just serving as a developer interface?
  2. Or do we have a "thick" frontend and a "thin" backend, with all the real logic and work performed in-process in the client, and the backend just being a thin proxy for persistence?
  3. Or technically, you could split it and have some algorithmic work performed on the server and some on the client.

I do tend to disregard #3, just because it could get a lot more complicated for probably not a ton of benefit.
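
To make 1 and 2 concrete, here's a rough sketch of what the client-side query call could look like under each model. All of the class names, endpoints, and response shapes here are made up for illustration:

```python
# Hypothetical client shapes for options 1 and 2; nothing here is the real API.
import numpy as np
import requests


class ThinClient:
    """Option 1: thick backend. The client only serializes the request."""

    def __init__(self, url: str = "http://localhost:8000"):
        self.url = url

    def query(self, collection: str, embedding, k: int = 5):
        # All heavy lifting happens server-side; we only ferry the data.
        r = requests.post(
            f"{self.url}/query",
            json={"collection": collection, "embedding": list(embedding), "k": k},
        )
        r.raise_for_status()
        return r.json()["ids"]  # assumed response shape


class ThickClient:
    """Option 2: thin backend. The client fetches raw vectors and does the maths."""

    def __init__(self, url: str = "http://localhost:8000"):
        self.url = url

    def query(self, collection: str, embedding, k: int = 5):
        # Assumes the backend exposes the stored vectors as a JSON array;
        # every stored vector crosses the wire before we can compute anything.
        raw = requests.get(f"{self.url}/collections/{collection}/embeddings").json()
        matrix = np.asarray(raw)
        q = np.asarray(embedding)
        scores = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
        return np.argsort(-scores)[:k].tolist()
```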

When comparing type 1 and type 2, we need to consider:

Satisfying the computational requirements is probably going to be easier in a type 1 model, since it's easier to establish a requirement that a Chroma Server has $x amount of RAM & CPU/GPU power than it is to require that every possible client does.

The network transport constraints are another question. Ultimately, it's a wash, because the same data has to be brought together at some point to perform the operations we want. If we're considering batch-mode operations, it genuinely doesn't matter because ultimately the same amount of data has to traverse some wire, somewhere, to make it happen. For high-frequency non-batched operations, you have to add 1-3 milliseconds of latency for each request and in that case it could make sense to have the computation local to the request, if it's a particularly performance-intensive scenario.
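
As a back-of-envelope illustration of that last point, using the 1-3 ms figure above (the request volume is an arbitrary example, not a measured workload):

```python
# Rough comparison of cumulative network overhead for batched vs.
# per-item requests, using the 1-3 ms round-trip estimate above.
requests_count = 100_000
per_request_overhead_ms = 2  # midpoint of the 1-3 ms estimate

batched = per_request_overhead_ms * 1               # one round trip for the whole batch
unbatched = per_request_overhead_ms * requests_count

print(f"batched:   ~{batched} ms of network overhead")
print(f"unbatched: ~{unbatched / 1000:.0f} s of cumulative network overhead")
```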

jeffchuber commented 2 years ago

Here is another slightly different perspective - I like how MLFlow handles tracking - https://mlflow.org/docs/latest/tracking.html#how-runs-and-artifacts-are-recorded (ignore the artifact part of the charts, since we don't have heavy files to move around like MLFlow does). In this paradigm, the lightest-weight place things get serialized is a .chroma folder. (MLFlow is 100% Python as well, and Apache 2.0.) I think the way they accomplish this is that all the code is packaged up in the pip project. That means there is not a separate client or server... it's just about which code you are using in various scenarios. I guess the downsides of that are (1) the size of the project in megabytes, and (2) versioning frequency and version management across the frontend and backend without explicit version pinning.
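
For a rough sense of what that lightest-weight path could look like, here's a sketch of serializing to a .chroma folder. The layout and file formats are placeholders for illustration, not the actual on-disk design:

```python
# Hypothetical local persistence to a .chroma folder, in the spirit of
# MLFlow's file-based tracking backend. File names and formats are
# assumptions, not a real proposal.
import json
from pathlib import Path

import numpy as np


class LocalStore:
    def __init__(self, root: str = ".chroma"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def save(self, collection: str, embeddings: np.ndarray, metadata: list):
        # One .npy file for the vectors, one JSON file for the metadata.
        np.save(self.root / f"{collection}.npy", embeddings)
        (self.root / f"{collection}.json").write_text(json.dumps(metadata))

    def load(self, collection: str):
        embeddings = np.load(self.root / f"{collection}.npy")
        metadata = json.loads((self.root / f"{collection}.json").read_text())
        return embeddings, metadata
```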

For (1)/(2)/(3), "where to put the business logic": I 100% agree that (3) is bad. My general bias is towards a thick backend and thin client, especially since most operations will need context from the db in order to complete (they will need to query the NN index, for example).

The discussion is still very open! :) @atroyn join the mix as well!

jeffchuber commented 2 years ago

One additional note... MLFlow is purely a store - it does no processing on the data. That is different from us, where our processing is computationally expensive.

@atroyn should we try to make the in-memory thing work at all? I'm starting to lean towards "not worth it".

levand commented 2 years ago

One other distinction to make: "in memory" doesn't necessarily mean "all in client with no backend."

It could also mean a full backend/frontend split, but with the backend implemented in-memory + very simple persistence (as opposed to a more complex vector database) for the MVP.

In the past when I've advocated to start "in memory", that's what I was referring to... a full backend but with a trivial in-memory implementation. Not trying to cram all the computation into the frontend.
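
In other words, something roughly like this sketch: the frontend/backend interface stays the same, and only the implementation behind it is trivial for now. Names and signatures here are hypothetical:

```python
# Hypothetical backend interface with a trivial in-memory implementation.
# The frontend always talks to "a backend"; only the implementation behind
# the interface changes when a real vector database arrives later.
from abc import ABC, abstractmethod

import numpy as np


class Backend(ABC):
    @abstractmethod
    def add(self, collection: str, embeddings: np.ndarray) -> None: ...

    @abstractmethod
    def query(self, collection: str, embedding: np.ndarray, k: int) -> list: ...


class InMemoryBackend(Backend):
    """MVP backend: everything in RAM, brute-force search, no external services."""

    def __init__(self):
        self._data = {}

    def add(self, collection, embeddings):
        self._data.setdefault(collection, []).append(np.asarray(embeddings))

    def query(self, collection, embedding, k):
        matrix = np.vstack(self._data[collection])
        q = np.asarray(embedding)
        scores = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
        return np.argsort(-scores)[:k].tolist()
```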

MLFlow is an interesting model, but note that they are highly modular and support many different topologies. We could do that too, but my guess is we want to streamline and present "one way" as the default. We can enable other modalities as options if that's where the market pushes us.

atroyn commented 2 years ago

I am inclined towards a thin-client Wolf B for a few reasons:

There are risks around things like breaking client/server versioning in the future, and there is added complexity, but this gut-checks as the right move to me.

Anthropic has something called Garçon, a similar-ish tool for probing remotely running models, with the aim that their scientists can easily examine something running somewhere else; it uses a client-server setup.

atroyn commented 2 years ago

I also read Luke's "in memory" as referring to where the processing is done: flat storage with all computation done in memory rather than in a vector DB. I favor this as well, for development speed and ease of deployment onto the user's machine.

jeffchuber commented 2 years ago

OK, I agree with all of this. It was good to talk through - thanks for the thoughts! Keeping things simple and opinionated is the right way to go (assuming we have the right opinions, of course).

So I believe we all agree we will move forward with:

That means that if a user is using a notebook, they will need to do docker-compose up (or whatever our backend init script is) in the notebook. Docker does work on Google Colab! I am OK with this. Just confirming we are all on the same page here.

There is the additional question of how thin the client is... specifically, whether the backend has the idea of a log, or whether the client simply knows to call the things that log does (e.g. store the data here, trigger this reprocessing). The current open discussion is here: https://github.com/chroma-core/chroma/pull/13#discussion_r1002612123
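
To illustrate the two ends of that spectrum, here's a sketch of the two client call shapes. The names are made up for illustration and are not the API in the linked PR:

```python
# Two hypothetical client shapes for the same user action.

class FatBackendClient:
    """(a) The backend understands "log": one high-level call, and the server
    decides what storing + triggering reprocessing means."""

    def log(self, collection, embeddings, metadata):
        self._post("/log", collection=collection,
                   embeddings=embeddings, metadata=metadata)

    def _post(self, path, **payload):
        print(f"POST {path}: {sorted(payload)}")  # stand-in for the HTTP call


class PrimitiveBackendClient:
    """(b) The backend only exposes primitives: the client knows that "log"
    means store-then-reprocess and orchestrates the steps itself."""

    def log(self, collection, embeddings, metadata):
        self._post("/store", collection=collection,
                   embeddings=embeddings, metadata=metadata)
        self._post("/reprocess", collection=collection)

    def _post(self, path, **payload):
        print(f"POST {path}: {sorted(payload)}")  # stand-in for the HTTP call
```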

jeffchuber commented 2 years ago

Closing this issue as we have agreed on a direction