david-saeger opened this issue 2 months ago

I'd like to add S3 metadata to my embeddings during the embedding creation process, and I realized I wasn't sure of the best place to do that. I wasn't sure whether forking the project and adding to the file processing would be ideal, or whether there was something I could do by defining a ragLambdaLayer as described here: https://github.com/awslabs/LISA/blob/2c3b03be4ed9010302e412004ad6e3c31f6ca2e7/example_config.yaml#L16. In truth, I think I am just a little uncertain about what these lambda layers do or how to use them. Do they replace the current RAG API or add to it?
Hi David! For the layer zip files, those are just a way to pre-package the layer as we have it defined here and then move the source into a region of your choice, ideally for network-isolated environments where we can't pull dependencies on the fly.
As for what you'd like to do, it sounds like we may need to update the RAG API itself (and we welcome pull requests against the develop branch 🎉)
If you are willing to hack on LISA to add this, my first guess would be around this area: https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py
Specifically, this function is what we call to generate the initial embeddings: https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py#L125-L151
and then similaritySearch is doing the embedding call for the prompt text: https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py#L80-L107
We're using LangChain under the hood, and we've created a LangChain-compatible OpenAI binding for embeddings specifically, over here: https://github.com/awslabs/LISA/blob/develop/lisa-sdk/lisapy/langchain.py#L102-L153 (ignore other things in the file; there are some unused clients that we need to clean up 😬)
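For anyone unfamiliar with that pattern, a LangChain-compatible embeddings binding boils down to implementing the two-method Embeddings interface. A stripped-down sketch (illustrative only, not the actual lisapy code; the class name and endpoint shape here are placeholders):

```python
from typing import List

import requests
from langchain_core.embeddings import Embeddings


class OpenAICompatibleEmbeddings(Embeddings):
    """Minimal LangChain-compatible client for an OpenAI-style /embeddings
    endpoint; see lisapy/langchain.py for the real binding."""

    def __init__(self, endpoint: str, model: str, api_key: str):
        self.endpoint = endpoint
        self.model = model
        self.api_key = api_key

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # POST the batch of texts and unpack the vectors from the response.
        resp = requests.post(
            f"{self.endpoint}/embeddings",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"model": self.model, "input": texts},
            timeout=60,
        )
        resp.raise_for_status()
        return [item["embedding"] for item in resp.json()["data"]]

    def embed_query(self, text: str) -> List[float]:
        # LangChain calls this for the prompt side of similaritySearch.
        return self.embed_documents([text])[0]
```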
So if there's a solution you had in mind or could point us in a direction to help with, I think these would be the best starting points. I'm not sure if this answers your question or helps guide in a direction, so please let me know!
This makes sense and is helpful. I imagine that folks won't want all S3 metadata translated to embeddings. Do you think it would make sense to check for a prefix, à la: if the S3 object metadata is prefixed with lisa (or something), then it is translated to vector metadata? Figured it's worth asking before heading down the wrong path.
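To make that concrete, a minimal sketch of the check I'm picturing; the lisa- prefix and the helper name are placeholders, nothing here is in the codebase yet:

```python
import boto3

LISA_METADATA_PREFIX = "lisa-"  # placeholder; the actual prefix is up for debate


def extract_lisa_metadata(bucket: str, key: str) -> dict:
    """Fetch the S3 object's user metadata and keep only LISA-prefixed keys,
    stripping the prefix so they read cleanly as vector metadata."""
    s3 = boto3.client("s3")
    head = s3.head_object(Bucket=bucket, Key=key)
    user_metadata = head.get("Metadata", {})  # S3 lowercases user metadata keys
    return {
        k[len(LISA_METADATA_PREFIX):]: v
        for k, v in user_metadata.items()
        if k.startswith(LISA_METADATA_PREFIX)
    }
```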
I could see the possibility of adding some fields to the related APIs to add another map to the requests, such that those would contain the additional metadata. The metadata is attached at the Document (https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) level, so we could possibly make it part of the API per document, or, if we assume a list of files already in S3, there's the possibility for us to edit the processing function to add more metadata than just the document location over here: https://github.com/awslabs/LISA/blob/develop/lambda/utilities/file_processing.py#L146
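The shape of that change might be something like the following sketch; the function name and parameters are hypothetical, not the actual code at that line:

```python
from typing import List, Optional

from langchain_core.documents import Document


def generate_chunks_with_metadata(
    chunks: List[str],
    source: str,
    extra_metadata: Optional[dict] = None,
) -> List[Document]:
    """Hypothetical variant of the chunk-processing step: keep the existing
    document-location metadata, then fold in any caller-supplied fields."""
    docs = []
    for chunk in chunks:
        metadata = {"source": source}  # what the current code already records
        metadata.update(extra_metadata or {})  # the proposed addition
        docs.append(Document(page_content=chunk, metadata=metadata))
    return docs
```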
So for your suggestion, would the LISA prefix be related to the metadata already on the S3 object? As in something along the lines of:

- Upload file to S3 with object metadata attached
- Use LISA ingestion to consume / embed files
- Per file, check if there's S3 metadata (optionally: and check if the metadata is prefixed with a LISA-known prefix)
- Add metadata to the metadata dictionary that is processed along with the Document object
- Metadata is now returned with the document text for requested vectors

Is this the workflow you're thinking of?
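Gluing the two hypothetical helpers sketched earlier in this thread together, the per-file step of that workflow might reduce to something like:

```python
# Hypothetical glue: pull LISA-prefixed metadata off the S3 object, then
# attach it to every chunk that gets embedded. Both helper names come from
# the sketches above and do not exist in the codebase.
key = "policies/leave-policy.pdf"
extra = extract_lisa_metadata(bucket="my-rag-bucket", key=key)
docs = generate_chunks_with_metadata(
    chunks,  # chunk strings from the existing text splitter
    source=f"s3://my-rag-bucket/{key}",
    extra_metadata=extra,
)
```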
Yeah, that was my first thought. Not sure if tying the vector metadata to S3 metadata is out of line with the goals of the project for some reason, but unless you're averse, I can put it in a PR.
I had to step away for the past couple of days, but I'm kind of circling back to where I was originally when looking through this and trying to figure out a path to get the RAG functionality I need. I understand that the layer zip files are included in the config optionally, for network isolation. Could they not also serve to replace the RAG functionality, if that is my end goal?
Something I am thinking through is that what I need out of RAG is pretty boutique, and I am doubtful it will be useful to other LISA users (it would likely include custom embedding-generation logic that is specific to the shape of particular documents). So I figure any contribution I make here would end up looking like: place custom functionality somewhere (likely in the form of a lambda) and use it to replace some part or all of the RAG API. I am questioning whether this already exists in plain sight or if I am missing something.
No worries at all!
I've been thinking on this one for a little bit too, and I think the main issue in our way is that our implementation of the RAG feature is fairly limited from the UI. Direct invocation via a curl command or similar isn't really documented, but as I'm staring at it, I can see that it is possible to upload a custom list of keys to the RAG store, so long as they exist in the LISA-provided document bucket (which is also something that we could change to be user-provided). And with that, we could then provide additional metadata as part of the ingest_documents request. There are several routes to go from here; one possible request shape is sketched below.
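For illustration only: the route, auth header, and payload shape below are placeholders rather than the documented API, and the metadata field is the proposed backwards-compatible addition (the current request takes document keys but no extra metadata):

```python
import requests

# Hypothetical direct call to the document-ingest API from outside the Chat UI.
payload = {
    "embeddingModel": {"modelName": "my-embedding-model"},  # assumed field
    "keys": ["documents/leave-policy.pdf"],  # must already exist in the LISA document bucket
    "metadata": {"department": "hr", "docType": "policy"},  # proposed new field
}
resp = requests.post(
    "https://<api-endpoint>/repository/<repositoryId>/ingest",  # placeholder route
    json=payload,
    headers={"Authorization": "Bearer <id-token>"},
    timeout=120,
)
resp.raise_for_status()
```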
And to answer your question: yes, the RAG layer could be used that way, but then it's a lot harder for us to support it or to improve on our existing functionality. I would say, even based on all of this, we would still welcome a pull request with your ideas in it, and we can work to find the best path forward on it. If the goal for now is just to make a utility outside of the Chat UI to ingest documents with metadata, I think that backwards-compatible changes to the repository API would be fine (as long as it doesn't break the current functionality, then I'm good 👍)
Some points of interest for that are the repository API and the processing functions linked above. Just some ideas, and totally not prescriptive by any means!
Great ideas, Peter! I think these are great ways to get to the goal I expressed of adding metadata to vector embeddings. I think I may have convoluted the thread here with a second, related goal, one which I am having a harder time thinking through in terms of how to add it in a way that could be useful to the broader LISA community, and which motivated this comment: https://github.com/awslabs/LISA/issues/75#issuecomment-2346326261
I'll leave it here in case you have thoughts, but I recognize it should be in another ticket, and I think I have the information I was seeking about metadata creation.
Basically, I would like to be able to use boutique embedding-creation logic, so that I could parse a document and include some a priori knowledge about its shape in the embedding creation process, for instance injecting a title and subheading into each chunk generated from a section in a policy document.
Looking through the codebase, I believe that would require replacing the routine here https://github.com/awslabs/LISA/blob/0e824eb42dd779ab93c6976d2bcde94271c20f50/lambda/utilities/file_processing.py#L59 on a one-off basis for a particular document, which is hard for me to think through how to implement in a way that is useful to anybody else. As I am writing this, it strikes me that the answer may just be to generate boutique embeddings locally and send them directly to pgvector. Do you see any issues with that?
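A minimal sketch of that local path, assuming LISA Serve exposes an OpenAI-compatible embeddings endpoint and that the pgvector collection name lines up with what LISA's similaritySearch reads from (the connection string, endpoint, model, token, and collection name are all placeholders):

```python
from langchain_community.vectorstores.pgvector import PGVector
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Placeholder client pointed at an OpenAI-compatible embeddings endpoint.
embeddings = OpenAIEmbeddings(
    base_url="https://<lisa-serve-endpoint>/v2/serve",
    api_key="<token>",
    model="my-embedding-model",
)

store = PGVector(
    connection_string="postgresql+psycopg2://user:pass@<host>:5432/postgres",
    collection_name="my-rag-collection",  # would need to match LISA's naming
    embedding_function=embeddings,
)

# Boutique chunking: inject the section title and subheading into each
# chunk's text before it is embedded, and keep them as metadata as well.
my_custom_chunks = ["Employees accrue parental leave after 90 days..."]
docs = [
    Document(
        page_content="Leave Policy > Parental Leave\n" + chunk_text,
        metadata={
            "source": "s3://my-bucket/policies/leave-policy.pdf",
            "title": "Leave Policy",
            "subheading": "Parental Leave",
        },
    )
    for chunk_text in my_custom_chunks  # produced by document-specific parsing
]
store.add_documents(docs)
```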