[86]: Adding milvus as a vector database and a search service (RAG)

SatyamMattoo commented 4 months ago

Short description

Integrated a vector database embedded with the OpenFn docs and a search service for RAG.

Related Issues

Fixes: #71 Closes: #72

Implementation details

Added Milvus as the vector database, embedding the OpenFn docs from the repo and adding it to Zilliz Cloud Database. You need to sign in to Zilliz and create a cluster to obtain the token and connection URI.

The database can be seeded with the docs during the docker build using the command: docker build --secret id=_env,src=.env -t apollo .

It can also be done by running the py script using: poetry run python services/search/generate_docs_emebeddings.py tmp/repo/docs/jobs collection_name You will need to change the docs_path by the actual path to the docs stored on your machine i.e. ./tmp/repo/docs.

Add the OpenAI API key, Zilliz database URI, and token to the .env file. This mounts the environment variables to the command and helps protect the secrets as well.

The search service currently takes a search string and a OpenAI API key, embeds the search query, searches the vector database for relevant information, and returns the results to the CLI.

Further work

After the integration and embedding of the doc's repo into the vector database we will need to automate the docker build process, i.e. whenever any change is pushed to the main branch of the doc's repo, we will need to rebuild the docker image of the apollo repository. This might be achieved using GitHub actions and workflows:

Create a repository dispatch event as suggested in this issue.
Listen to the dispatch event in the apollo repo and find a way to handle the env variables required for the docker build. (OpenAI API key, Milvus token and database URI)

Acceptence Critiera

Before we can consider releasing this work, we need to:

(Added by @josephjclark )

[ ] Settle on the embedding database and models. Should we try and build it all into the image or use a third party? If we use a third party, is Zilliz appropriate?
[ ] Ensure the returned chunks are reasonable quality and sufficiently related to the query
[ ] Decide whether to keep the summarization feature
[ ] Work out whether and how to handle adaptor docs chunking

josephjclark commented 4 months ago

Hi @SatyamMattoo, your mentor should be helping you with the review at this stage. We're going to get in touch to make sure the right procedure is being followed and that the mentor is on the right page.

josephjclark commented 4 months ago

My guidance: I'd really like to avoid using a third party service if we can. Can we build a very small model into the image, rather than all of torch? That was my expectation. If we are going to use third party services, we might as well do ALL of it with 3rd party services and build nothing into the model.

If we really do have to build a lot of large models into the image, I need to see the evidence for it.

SatyamMattoo commented 4 months ago

Hello @josephjclark,

I understand your concern regarding the use of third-party services for embedding models. After thoroughly researching various models, I found that most of them rely on torch as a dependency. Hence, building a small model directly into the image seems challenging.

However, here are a few approaches we can consider:

1. Adding Sentence-Transformers as a Dependency:

This approach significantly increases the build time up to 900 seconds.

Screenshot 2024-07-09 182027

The resulting image size is approximately 13GB. Although this was my initial approach, the extended build time and large image size pose significant challenges. Moreover, embedding the actual documents will further increases the build time and image size. Therefore, directly including a model in the Docker image may not be the optimal solution.

2. Using OpenAI Model Provided by Milvus:

Instead of utilizing the Hugging Face API, we can leverage the OpenAI model provided by Milvus for embedding the documents. This approach also requires an API key. Since we are already using OpenAI keys, this might be a better alternative. You can find the documentation for this approach here. Additionally, for search queries, we can use the API key provided by the user to obtain the search embeddings.

3. Mounting the Model as a Volume:

We can mount our model using volumes. While I am uncertain about its viability in a production environment, this approach was mentioned here. This might work well locally, but its effectiveness in production remains unclear.

These are the approaches I have identified after extensive research over the past week. I would appreciate your thoughts on this. Am I missing anything here, or can you direct me to any resources that might offer a better approach?

Thank you!

josephjclark commented 4 months ago

Let's break the problem into two.

First, we want to take a text corupus, and convert it into embeddings. This is an offline process outside the docker image (or in a builder image). Then in our docket file, we add an embeddings database and loading the trained data.

I don't expect the database or the embeddings to need torch. It's just a vector database and a bunch of data.

I don't think we need torch and ml to search the database either.

So we should be able to build this into an image without torch and a relatively small file size.

If either of these assumptions is wrong, I need to know so we can pivot.

Secondly, we need to convert natural language queries into embeddings in the fly. This is the harder bit and this either needs an embeddings service or a torch-enabled image. But let's deal with this bit after we've conquered step one.

SatyamMattoo commented 4 months ago

Thank you for your assistance sir. There are some details that I believe we need to discuss. Could we schedule a meeting to go over this and get a clearer idea?

josephjclark commented 4 months ago

@SatyamMattoo I am actually on vacation this week. Your mentor should be able to help you work out a solution - have you reached out to them?

If you're really blocked on this, we should focus on using third party embeddings with openai. If we're satisfied that the embeddings really give us better results, we'll come back and work out whether we can responsibly generate embeddings within the image itself.

SatyamMattoo commented 4 months ago

@josephjclark Yes sir, I have. I will implement that for now and continue to look for a better approach to do it within the image in parallel.

josephjclark commented 4 months ago

Some more thoughts on direction here:

PR Description

I think the PR description has fallen out of sync with the latest commits: please keep the PR description up to date so that when someone from the team comes to do a review, it's easy to get up to speed. Changes can be logged in comments (or commits) as you go, but the description should always reflect the current state of the work

Hosted Embeddings

I was thinking, if we ARE going to use third party services for embeddings, how and when do those embeddings get updated?

The original design asked for the database to be built from scratch each time. But with a third party, when we build, the docs embeddings may well already exist in the third party database.

In fact, ideally the embeddings would be updated whenever the doc site is updated, and then the live server is always querying against the very latest version of the docs.

Also, when the third party database is updated, what happens to the old embeddings? Are they purged or are we paying to store old data?

I don't necessarily need to see an implementation in this PR to automatically update the embeddings, but I would like to see a proposal for a robust, scalable strategy.

Testing

What sort of testing can we use to ensure that the embeddings are working? How do we know the right content is embedded from the doc site? I don't necessarily mean unit testing. Manual tests or logging during the build process could be sufficient. What I'm looking for is an answer to the question "How do I know this is working?"

SatyamMattoo commented 4 months ago

Sure sir, I will update the PR description shortly.

For the updating the database we can set up GitHub actions and set the workflows inside the apollo and docs repositories to rebuild the docker image each time a change is pushed to the main branch of the docs repository.

Steps:

A change is pushed to the main branch of the doc's repo, a GitHub Action workflow is triggered that rebuilds the docker image in the apollo repository.
- Create a GitHub Actions workflow in the docs repository that triggers a repository dispatch event to the apollo repository whenever there is a push to the main branch.
- Create a GitHub Actions workflow in the apollo repository that listens for the repository dispatch event and rebuilds the Docker image.
Rebuilding Docker Image:
- The workflow in the apollo repo will clone the necessary folders from the docs repo.
- It will then process the markdown files, chunk them appropriately, and embed the text using the embedding model.
Database Update:
- Connect to the vector database. Check if the collection exists. If it does, drop the existing collection to remove old embeddings.
- Insert the new embeddings into the database.
Managing Embeddings:
- By dropping the old collection and inserting new embeddings, we ensure that only the latest data is stored. This approach keeps storage costs efficient and maintains the relevance of the data.

Testing:

For testing manually,

I will add some more detailed logs during the docker build.
We can actually see the text embedded into the vector database, here is a screenshot of the docs embedded into zilliz

Screenshot 2024-07-16 004713

Use the search service to query the database and ensure that it returns accurate results based on the latest embeddings. This can be done by querying specific known sections of the docs and verifying the results.

EDIT: I had a query regarding the docs repo, I don't think we need to embed all the .md files into the database. So, it might be efficient to just pull the required folders into the image. What do you think about this? For now, I have included the docs and the adaptors folder.

josephjclark commented 4 months ago

That's really cool that we can see the embeddings in the hosted database. I'll have to explore that.

I this stage I don't want to curate the docs content too heavily - I'm happy just to spam it all in there and come back and refine it later.

Thank you for the notes on automation. I don't particularly want to automate the docs repo or this repo right now, I just want you to propose a plan. Perhaps you can make a note in the PR under a section called Further Work, and maybe raise an issue here and in the docs repo to cover the work required.

What I DO want to see on this branch is the logic to drop the old embeddings on build

Remember the point of this work is to benefit the job expression service. So very soon you should branch off of this work and start integrating the search service with the job writing service. Then, I hope, we'll start seeing much improved job code generation.

Soon after that, we'll have to work out a demo or test suite or gallery or something that you can show us what the code output looks like. I have a few ideas but I don't know exactly what that will look like: I suggest that you think about an easy way to demo your work and propose something to us. I'm sure your mentor can help there.

SatyamMattoo commented 4 months ago

Sure, sir. I will be integrating the search service to add context to the prompt in the job generation service soon. For that, I believe I will need to merge the dmp/620 branch with this one? I have already added the logic to drop the old embeddings from the hosted database. This is included in the init_milvus.py file:

    # Check if the collection exists
    collection_name = "apollo_sample"
    check_collection = utility.has_collection(collection_name)
    if check_collection:
        drop_result = utility.drop_collection(collection_name)

Additionally, I have implemented another approach for chunking the documents using a text splitter. I have added a recursive text splitter along with the markdown splitter to avoid uneven chunks of data. Evenly sized chunks with overlap help us improve the search results.

Best regards.

josephjclark commented 3 months ago

@SatyamMattoo When I try to run init_milvus.py from my file system (I've changed the path to the docs repo), I get this error:

openai.BadRequestError: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

Any ideas? The corpus looks to be a large array of strings, and the API token does seem to be set

SatyamMattoo commented 3 months ago

Hey @josephjclark, This error is most probably due to the incorrect repo path. I have cloned the repo in the tmp folder:

Screenshot 2024-08-08 193241

The path needs to be set as repo_path = "./tmp/repo". I got the same error when I set the path as repo_path = "/tmp/repo". I guess this will most probably resolve the error. Also the command I used to run the file was poetry run python init_milvus.py

josephjclark commented 3 months ago

@SatyamMattoo I see - that seems to be working! The path needs to point to a repo containing docs and adaptors.

That's a surprise to me: I don't think we need to embed the adaptors repo? That's huge and I've already created a service to pull down the typescript definitions for the adaptor, which I think will suffice

SatyamMattoo commented 3 months ago

@josephjclark, Yes I realized that, and I had removed the adaptors folder from the docker build, I will remove it from this file in the next commit. For the adaptor I have added the describe_adaptor service to add the context.

josephjclark commented 3 months ago

@SatyamMattoo if you want to keep this branch and this PR as the search feature, you'll need to update the PR title and description.

josephjclark commented 3 months ago

Totally random thought

When indexing the API reference, we need to create one embedding per function. Those docs are very very highly structured and we should be able to create really accurate chunks.

Given a question like "how do I create an sobejct in salesforce", the search should return the API reference for the create function (and probably upsert too but you take my point)

SatyamMattoo commented 3 months ago

I guess you are referring to the adaptor docs? Correct me if I'm wrong as I am not sure about what API reference here is.

Yes, they are very well structured, and we can get even chunks just by splitting them based on headers and we might add some metadata related to the adaptor. Also, I have noticed the describe_adaptor information that is being added to the prompt, plays a major role in returning good results. I think it would be beneficial to embed the adaptor docs to the database and then retrieve information. Cause the describe adaptor service makes the input prompt quite large.

PS: I have some issues with my device that is taking a few days to fix, I will be back on this as soon as that gets fixed.

josephjclark commented 3 months ago

Hey @SatyamMattoo, that comment wasn't neccessarily aimed at you - just a note.

But yes, I am talking about the adaptor docs.

I'm not sure if this is a priority for your last few weeks on this, but it's certainly a candidate to explore. On refletion it's a particularly strong use-case for RAG.

josephjclark commented 3 months ago

@SatyamMattoo if you have time to work on DMP this week I would like you to investigate creating one rag chunk per adaptor function. You might find you need different rules for chunking the adaptors docs versus the main docs.

SatyamMattoo commented 3 months ago

@josephjclark Sure, I am on it. Will be pushing the updates in a couple of days.

SatyamMattoo commented 2 months ago

Hello @josephjclark, I have pushed a few changes

Added semantic chunking method which is providing better quality chunks. We can the change the breakpoint_threshold_amount to increase or decrease the size of chunks (something around 80 seems fine).
There are 3 methods which we can use to add paths to embed.
- (used currently) Maintaining a list of file paths to be embedded and appending that to the argument provided
- Adding all the paths as arguments. (Might not be efficient I guess?).
- The one you suggested. It is quite similar to the first method, we just read the paths from a separate .config file.

About the hybrid search, it will not be possible as OpenAI embeddings do not support sparse embeddings.

About the adaptor docs, I will need more time to figure that thing out. The docs repo generates the adaptor functions and they are not present in the md files (they are generated during the build if I am not wrong). Would running this file do the job? Another way might be creating a similar function and retrieving functions from https://[raw.githubusercontent.com/OpenFn/adaptors/docs/docs/docs.json? I will need some assistance on this.

SatyamMattoo commented 2 months ago

Hello @josephjclark, I have added a feature to the search service where we can summarize the retrieved context based on the user query. I think this would really be beneficial and will return relevant and concise context based on the query. I have kept this as an optional field for now.

I have included options which will provide us more customization over the search. Now we can have these options in the input payload

{
    "api_key": "<OpenAI api key>",
    "query": "What are jobs in OpenFn?",
    "limit": 10, // A custom limit for number of chunks retrieved ranging from 1 to 15
    "collection_name": "openfn_docs", // Might be helpful when we embed the adaptor docs to a new collection
    "summarize": true // Summarizes context based on your query
}

About the semantic chunking, for now it does not provide an option to limit max_tokens in a chunk. It will be added later but for now we can create a custom class based on the SemanticChunker. Here is more information about this. Also, if the chunks are too short, we can add them to the previous or next chunk. This will help us get even better chunks.

josephjclark commented 2 months ago

I'm not sure I agree with you on the semantic chunking. I suppose I don't have the old chunks to compare to, but I think the quality of the chunks looks pretty poor.

For one thing they look very uneven: in job-examples.md_sections.md, there are several examples in one big chunk (looks like 5800 characters). Then there's a chunk in the job writing guide that's just a single sentence.

Skimming over the chunks, I also see blocks of text which are clearly (to me) related - like some example code and its caption - which are split over several chunks. I see chunks begin on a sentence which relates back to the previous sentence. I see splits across HTML elements where the content must surely all be closely related.

Here's a random example:

## Section Start: 
------
It cannot be used in place of the `state` variable. It cannot be assigned to, or
be on the left hand side of an assignment, and can only be used inside an
argument to a function

This also means that Lazy State Operator can only be used to READ from state. It
cannot be used to assign to state directly. These examples are all errors:

```js
❌ const url = $.data.url;
get(url);

❌ get(() => $.data.url);

❌ $.data.x = fn();

❌ fn(state => {
  $.data.x = 10;
});

details summary Compilation rules for advanced users summary

How does the Lazy State Operator work?



(I've mangled a bit of HTML there)

If you gave me, as a human, several of these chunks, I wouldn't understand what you were talking about. I don't expect an AI to fare much better.

When chunking by section, I at least know that each chunk has a fair chance of having enough context for it to be useful by itself. I also know that we can re-write docs in such a way that each section is designed to be read standalone (that's just good practice anyway). I'm sure there are some simple heuristics we can use to break up the longer chunks too.

josephjclark commented 2 months ago

I've run a couple of basic tests and while I'm not sold on the utility of the chunks, it does appear at first glance like the service is actually returning relevant chunks. When I ask about lazy state for example, I get ten chunks of lazy state data.

Can we control the relevance threshold at all? For our purposes, we don't need the top ten results - we need all results that are relevant enough. All results that score above 0.8 or whatever. If the search decides there are no docs worth returning, that's just fine (as would be the case when the user asks "What is Madonna's best album?")

josephjclark commented 2 months ago

I am concerned about the summarisation process. This feels like using AI for the sake of AI to me.

Consider the pipeline. When the user asks a question, we must:

Encode the question into embeddings (via a third party service)
Search for relevant documents (via a third party service)
Add the documents to the prompt
Send the prompt to the model (via at third party service)

The final model must then parse the input prompt and break it down into its own embeddings.

Each step has a cost attached (environmental, temporal, financial and technical). It's a lot of computation already.

And then the summarizer adds an extra step, calling out to another third party service. Assuming the summarisation is good and accurate (which I don't believe), what is the value? Users will never see it.

The idea of generating text in one model to pass it as input to another model is frankly a bit scary.

As a general point of feedback: experimental fringe features like this should really be developed on different branches. It is harder for us to release and maintain code when it contains features that haven't been properly specced or tested. If the work was done on another branch, it could be examined, tested, discussed and released in isolation. Now it's tied up with everything else and will slow down the release.

I'm always happy to sound out new ideas and try new things. But I'm less happy when the experiments threaten the release of the main feature. When they're done in isolation, I can explore the idea in a safe space without having to worry about how it'll impact the release.

Don't branch the work off now. I'll have a think about it. Maybe you can convinced me that it's worth keeping the feature.

josephjclark commented 2 months ago

@SatyamMattoo just a quick note that I've added a list of acceptance criteria to the main PR post. These are things we need to work out before we can release this feature. I'll need to decide on the final embedding strategy next week so don't worry about that bit.

josephjclark commented 2 months ago

Regarding adaptor docs, you should be able to just run yarn generate-adaptors from the docs repo and it'll build the docs as markdown to adaptors/packages. You'll only need to read the <adaptor>-docs.md files. You should be able to chunk up by section.

SatyamMattoo commented 2 months ago

Hello @josephjclark,

Yes, it seems that semantic chunking won’t be the best fit for our docs. I’ve reverted to the previous strategy we used, which involves splitting based on headers and recursively splitting larger chunks.

Regarding the search, I’ve implemented a parameter that filters vectors based on a threshold. We can read more about this here. This ensures that irrelevant search queries return no results, and it works as expected.

As for the adaptor docs, I’ve worked out a solution locally. However, during the Docker build, it becomes a bit cumbersome since we need to install nodejs, npm, corepack, and yarn (with nodejs and npm required for installing yarn). This process adds about two minutes to the build time. Afterward, we still need to build the docs, which would further increase the image size and build time.

Alternatively, we could fetch the docs directly from the GitHub source. There’s likely a way to retrieve functions and examples for each adaptor, create a chunk for each function along with its adaptor name, and generate embeddings from them. I had initially started with this approach before your suggestion.

What are your thoughts on proceeding this way?

SatyamMattoo commented 2 months ago

@josephjclark Regarding the summarization process, I believe it might be more beneficial to discuss this after the release, as it may not directly impact the job generation service. The summarization essentially condenses the context based on the query, which we can potentially handle by adding specific instructions to the job prompt.

However, this could be useful if the user wants to use the search service directly for complex queries, as those can be answered based on the retrieved documents. I have removed this part for now, but this is something we can consider further down the line.

Also, I’ll make sure to branch off such ideas separately.

Best regards

josephjclark commented 2 months ago

Thank you for the update @SatyamMattoo

You must take where when submitting work to thoroughly test it (we've talked about this before). And ideally your test results should be shareable, so that I can check your testing. It's not enough to say "this works great". You must show me it works.

If you cannot prove the quality and robustness of your solution, it will not be merged. That's not my personal policy for openfn - that's how software development works worldwide. Different organisations have different standards, admittedly, but no-one wants to merge untrusted code.

I say this because you told me that semantic chunking method is providing better quality chunks. But you did not provide any proof of this, and now you've had to revert the work.

Experimentation is great - I'm glad we tried semantic chunking (and actually I'm sure we could spend more time on it to find a better solution). But you MUST test well and share your results. Super important.

The search threshold sounds great. What value have you chosen and why? How you can prove to me that it's a reasonable value?

Re adaptors: yes, using docs.json from github is a good solution, let's go with that for now.

SatyamMattoo commented 2 months ago

Hi @josephjclark,

Thank you for the valuable advice; I will definitely keep these points in mind.

Regarding the threshold value, I’ve conducted tests with multiple queries to determine an appropriate value that returns only valid results (in most cases). For example, when testing with an unrelated query like "What is Langchain?", setting the threshold slightly higher (e.g. 1.4) still returns some results for optional chaining.

For relevant queries, I’ve observed that the ideal score range is between 0.7 and 1.2 for the chunks related to the query. We can already see these scores in the stdout. Additionally, I’ve pushed a file containing 2 sample queries along with their respective scores for further reference.

SatyamMattoo commented 2 months ago

Hi @josephjclark, I’ve implemented the code to embed the adaptor functions into the database. You can find the chunks of the adaptor documentation in the tmp/split_sections/adaptors.md file. Additionally, I’ve implemented a partition-based embedding and search strategy to improve search efficiency by distinguishing between adaptor data and the regular documentation.

I’ve also worked on improving the overall code quality by reducing redundancy.

Best regards

josephjclark commented 2 months ago

Thank you for your efforts here @SatyamMattoo

I have spun out a couple of issues for further work which we'll be starting in a couple of weeks. In the meantime I'm going to merge and release this to the staging server :tada:

SatyamMattoo commented 2 months ago

Hello @josephjclark, That’s great news! Thank you so much for your guidance and support throughout—it’s been a great learning experience for me.

OpenFn / apollo