OpenFn / apollo


Search: Fix third-party integrations #98

Open josephjclark opened 1 week ago

josephjclark commented 1 week ago

Before we can deploy the search service, we need to work out who our integrations should be with.

Right now, we are using a third-party vector database to store embeddings (Zilliz), and a third-party embeddings service to generate embeddings for queries (OpenAI).

Our requirements are:

Self Hosted Database

I am convinced that we should be able to host our own vector database in the container.

The database should be built offline as part of a builder image. We can use whatever dev dependencies are needed in the builder image, and drop them for the final production image.

Once the build is complete, we don't need to write to the database again: we only need read and query capability.
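For illustration, the offline build step could look something like this rough sketch. It assumes Milvus Lite (the embedded, file-backed flavour of Milvus that ships with pymilvus) and a placeholder sentence-transformers model; the sample chunks, paths and model choice are all stand-ins, not decisions.

```python
# Rough sketch of the offline ingestion step, run inside the builder image.
# Assumes Milvus Lite (file-backed, bundled with pymilvus) and a placeholder
# embedding model; real chunks would come from the parsed docsite.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

chunks = [  # placeholder content standing in for docsite sections
    (0, "Jobs are JavaScript expressions that run against a state object."),
    (1, "Adaptors expose operations for talking to specific backends."),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model, 384-dim vectors
client = MilvusClient("/build/docs.db")          # single file copied into the final image

client.create_collection(collection_name="docs", dimension=384)
client.insert(
    collection_name="docs",
    data=[
        {"id": i, "vector": model.encode(text).tolist(), "text": text}
        for i, text in chunks
    ],
)
```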

We can even trigger a new Apollo build every time the docsite is updated to keep things in sync. But the doc site doesn't update THAT often so we don't really need a live sync. A weekly update would be fine.

We could I suppose use the database to cache searches later (but even then, the database might not be the best way to do this).

We should choose an open source database from the options available.

I don't actually know how big the embeddings are, in memory size, for the docsite. But I doubt it's gigabytes?
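For a rough sense of scale (the chunk count here is a guess, not a measurement):

```python
# Back-of-envelope estimate: even with OpenAI-sized 1536-dim float32 vectors
# and a generous chunk count for the docsite, the raw vectors are small.
chunks = 5_000          # assumed chunk count, not measured
dims = 1536             # OpenAI embedding dimensionality
bytes_per_float = 4
print(chunks * dims * bytes_per_float / 1e6, "MB")  # ~30 MB
```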

Note that this means the apollo server needs to actually run queries against the DB. Up until now apollo has really just been a proxy server - from here it'll start doing its own actual work.
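At query time the handler would be doing something like this (again assuming Milvus Lite and a locally bundled embedding model; paths and model name are placeholders):

```python
# Rough sketch of a read-only search at request time, against the pre-built
# Milvus Lite file baked into the image. Paths and model name are placeholders.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

client = MilvusClient("/app/docs.db")
model = SentenceTransformer("all-MiniLM-L6-v2")

def search_docs(query: str, limit: int = 5) -> list[str]:
    hits = client.search(
        collection_name="docs",
        data=[model.encode(query).tolist()],
        limit=limit,
        output_fields=["text"],
    )
    return [hit["entity"]["text"] for hit in hits[0]]
```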

Third-party Database

If we really can't bundle up our own database in the container, we'll need to use a third party.

We're currently using Zilliz, which is the SaaS version of Milvus.

We should choose a partner which is open source, isn't too expensive, and ideally aligns with our values.

Self Hosted Embeddings Model

In a perfect world we would keep the embeddings model in the image too. This would mean that the apollo server needs to be big enough and powerful enough to run an LLM.

Note that the dev dependencies for the model don't need to be in the final production image - we shouldn't need to store torch and all its built-in models.

I suspect that we can build an embeddings model in a builder image, then remove all the dev dependencies, and use a final model that's around 1GB in size.
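The builder step might be as simple as this sketch (model choice is a placeholder, and note this only strips the download/dev tooling; if we also want to drop torch from the runtime image we'd need something like an ONNX export on top):

```python
# Rough sketch of the builder-image step: fetch a small open model once and
# save it to a local folder; the production image ships only that folder and
# loads it by path, with no network access at runtime.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder; roughly 90 MB on disk
model.save("/build/embeddings-model")

# At runtime the server would load from the baked-in path:
# model = SentenceTransformer("/app/embeddings-model")
```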

We would need to be careful about which model we pick to ensure that a) it is ethically trained and b) it generates good-quality embeddings. We can compare against OpenAI's embeddings and the existing Milvus search to get a sense of how good they are.
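One cheap way to do that comparison: run a small set of test queries through each pipeline and score how much their top-k results overlap. The helper names below are hypothetical stand-ins for the two pipelines.

```python
# Rough sketch of a retrieval comparison: same queries through two pipelines,
# then measure the overlap of their top-k results. search_with_candidate()
# and search_with_openai() are hypothetical stand-ins.
def topk_overlap(results_a: list[str], results_b: list[str], k: int = 5) -> float:
    return len(set(results_a[:k]) & set(results_b[:k])) / k

test_queries = ["How do I write a job?", "What does the http adaptor do?"]  # placeholder set
# for q in test_queries:
#     print(q, topk_overlap(search_with_candidate(q), search_with_openai(q)))
```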

The model would be called:

Third-party Embeddings Model

If we can't self host the model, we'll have to stick with a third party. This may well be appropriate, but we'd need a cost-effective solution.

We currently use the OpenAI embeddings service. Anthropic recommends https://www.voyageai.com/, which I'd at least like to take a serious look at.
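For reference, the third-party call is roughly this shape with OpenAI's client (the model name is illustrative, not necessarily what we run in production); a Voyage AI integration would sit behind the same text-in, vector-out interface.

```python
# Roughly what a third-party embeddings call looks like with the OpenAI client.
# The model name is illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_query(text: str) -> list[float]:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding
```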

josephjclark commented 1 week ago

I think the right strategy here, as a first pass, is: