pmeier opened 10 months ago
So on another project, we were looking into a more mature RAG pipeline that has been in development for a while, and the approach that was implemented is actually even more fine-grained. I don't have the chart in front of me, but document-to-embedding-to-storage was a multi-step pipeline with configurable components, including a summarization and metadata extraction step to annotate the chunks. To be fair, they were also looking at supporting extracting structure from PowerPoint files, tables, etc.
Overall, I think it makes sense to go in this direction.
Would it enable the "embed once and use multiple LLMs" workflow, or is that orthogonal?
It is orthogonal. The point is to make the embedding model a "first class citizen". Right now, everything is hardcoded to
We could potentially add a flag for it on the `retrieve` and `store` methods so users could set it at runtime through the additional chat parameters:
But that has the downside that all embedding models have to be specified by us. Or, if the user wants to have a different embedding model, they would need to create their own source storage (subclassing will do the heavy lifting, but still).
By adding an `EmbeddingModel` component, it is easier for us to implement new embedding models, because they are no longer tied to a source storage. Plus, users can extend Ragna with their own embedding models the same way they can now for source storages and assistants.
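As a rough illustration of what such a first-class component could look like (all names here are hypothetical and not Ragna's actual API), a minimal sketch:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Embedding:
    # Vector representation of a chunk of text, plus the text it came from.
    values: list[float]
    text: str


class EmbeddingModel(ABC):
    # Hypothetical base class: concrete subclasses would wrap an actual model.
    @abstractmethod
    def embed(self, texts: list[str]) -> list[Embedding]:
        ...


class ToyEmbeddingModel(EmbeddingModel):
    # Toy implementation just to show the shape of the interface:
    # "embeds" each text as a single-element vector of its length.
    def embed(self, texts: list[str]) -> list[Embedding]:
        return [Embedding(values=[float(len(t))], text=t) for t in texts]
```

With this shape, a source storage never needs to know which model produced the vectors, only that it receives `Embedding`s.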
tl;dr Independent configurability for both chunking and embedding behaviour, along with support for hybrid search (semantic + lexical), would hopefully bring 80% of the performance with 20% of the work.
Details:
... the approach that was implemented is actually even more fine grained ... document to embedding to storage was a multi-step pipeline with configurable components including a summarization and metadata extraction step to annotate the chunks.
This has been my experience to date. I suspect it may be necessary to get the level of quality that most users will expect. A useful graphic from OpenAI on their RAG experience was recently posted on Twitter/X (see the image in the tweet if it doesn't appear through the link). The numbers can be taken with a grain of salt, but from a qualitative perspective, configurability around embeddings and chunkings seems very important.
My personal preference (and I also think that this is reasonably in-line with other RAG tools currently available) is to have at least 3 largely-independent pieces: document loading, document embedding and document querying. I think this is quite do-able with the current configuration, and not too far off what @pmeier is suggesting.
document loading - This includes both the data connectors and the data chunking strategies. I would recommend the chunking live here (rather than on the embedding) as users will likely want to vary both independently ie is it the chunking or the embedding that is degrading my performance. It also means a 'document store' can be populated, of documents that have been loaded and chunked but not embedded.
document embedding - @pmeier suggestions seem sensible to me. An additional point to keep in mind is that I suspect a lot of embedding models will be hosted elsewhere ie this will transform the current in-process, compute-bound approach into a network call that is more like I/O work. A queue is not really needed for this I think - perhaps this can be leveraged in some way.
document querying - hybrid search that combines lexical and semantic search is a feature that has long been used in RecSys. Supported storages like `lance` already have partial support for this behaviour. It's probably not something that will be implemented in the short-term in `ragna` (if it is decided to implement it), but having a separate query abstraction makes this straightforward to implement, I think.
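To make the hybrid idea concrete: one common way to combine a lexical and a semantic result list is reciprocal rank fusion (RRF). A minimal sketch, not tied to any particular storage backend:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # rankings: one list of document ids per retriever, best result first.
    # Each document accumulates 1 / (k + rank) across all rankings; documents
    # that rank well in several retrievers float to the top.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


semantic = ["a", "b", "c"]  # e.g. vector-search results
lexical = ["b", "c", "d"]   # e.g. BM25 results
fused = reciprocal_rank_fusion([semantic, lexical])  # "b" ranks first
```

The constant `k = 60` is the value commonly used in the RRF literature; it damps the influence of top ranks.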
Other
In #176, @pmeier thought I might like to chime in here based on my use-case. I think it would be great to have a separate `EmbeddingModel`. That being said, I've already written a plugin that integrates Vectara into Ragna (refer to #203). Vectara, like AWS Kendra, is a service where the user does not have control over the extraction and embedding process. Still, in writing my plugin, I came to realize that `chunk_size` and `num_tokens` were actually still quite relevant while `chunk_overlap` was not (even though there is no way to disable select parameters in the UI). So, I would be very careful about just ripping them out and assuming they don't matter or aren't tuneable for special source storages. Perhaps, with the creation of a few more plugins, more refined abstractions will become apparent.
@nenb https://github.com/Quansight/ragna/issues/191#issuecomment-1807937253
document loading - This includes both the data connectors and the data chunking strategies. I would recommend the chunking live here (rather than on the embedding) as users will likely want to vary both independently ie is it the chunking or the embedding that is degrading my performance. It also means a 'document store' can be populated, of documents that have been loaded and chunked but not embedded.
Makes sense to me. One thing that I don't want to support right now is to have a separate "chunk storage". Meaning, we only store documents and chunking will always be calculated on the fly. This ties into the "chunk / embed" once use case mentioned in https://github.com/Quansight/ragna/issues/191#issuecomment-1806514700 and https://github.com/Quansight/ragna/issues/176#issuecomment-1803446728 for which we don't have a good solution yet.
document embedding - @pmeier suggestions seem sensible to me. An additional point to keep in mind is that I suspect a lot of embedding models will be hosted elsewhere ie this will transform the current in-process, compute-bound approach into a network call that is more like I/O work. A queue is not really needed for this I think - perhaps this can be leveraged in some way.
See #204 for a write-up of potentially removing the task queue from the Python API. For the REST API / UI though it will stay.
document querying - hybrid search that combines lexical and semantic search is a feature that has long been used in RecSys. Supported storages like `lance` already have partial support for this behaviour. It's probably not something that will be implemented in the short-term in `ragna` (if it is decided to implement it), but having a separate query abstraction makes this straightforward to implement I think.
Not sure I understand. Isn't `SourceStorage` the abstraction that you are looking for here? Can't we just simply add a `search_method: str = "semantic"` to the `.retrieve()` method for the source storages that allow this?
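As a sketch of what that flag could look like (the class name is made up for illustration, and the stubbed return values stand in for real search results):

```python
class DualModeStorage:
    # Hypothetical source storage accepting the proposed search_method flag.
    # `documents` is unused in this sketch; a real implementation would
    # search over previously stored content.
    def retrieve(self, documents, prompt, search_method: str = "semantic"):
        if search_method == "semantic":
            return [f"semantic hit for {prompt!r}"]
        if search_method == "lexical":
            return [f"lexical hit for {prompt!r}"]
        raise ValueError(f"unsupported search_method: {search_method!r}")
```

Storages that only support one mode could simply omit the parameter, keeping the flag opt-in per storage.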
Is support for embedding models like Whisper in-scope? The reason I ask is that in my experience so far, a large number of groups just want to use RAG to interrogate long, recorded meetings!
IMO no. If users want that, they can write a `DocumentHandler` that takes audio files and uses any speech-to-text model to extract the content. From there on you can use the regular flow again. Cross ref to #202.
What about support for multiple embedding models in a single chat, a bit too complex to support?
I don't understand the use case. Why would you want to do that and not just start another chat?
@peachkeel https://github.com/Quansight/ragna/issues/191#issuecomment-1811653174
`chunk_overlap` was not (even though there is no way to disable select parameters in the UI)
Yeah, the UI unfortunately makes a lot of hardcoded assumptions right now. This is something that we urgently need to address. However, to make this truly dynamic, we need to annotate the parameters on the component methods with some extra metadata. I have some ideas for that, but no proper design yet.
So, I would be very careful about just ripping them out and assuming they don't matter or aren't tuneable for special source storages.
They will not be ripped out, but just move to a different component. Will answer on the discussion in detail.
Not sure I understand. Isn't SourceStorage the abstraction that you are looking for here? Can't we just simply add a search_method: str = "semantic" to the .retrieve() method for the source storages that allow this?
Yes, poorly worded on my part. I was pointing out that I considered this to be the relevant abstraction; I should also have mentioned that the abstraction already exists! Like you said, it can be configured for further use-cases in the future if this is a path that is considered useful for `ragna`.
I don't understand the use case. Why would you want to do that and not just start another chat?
It's probably a bit contrived, but I was imagining a case where an org has multiple embedding models for distinct sets of data. For example, they have an embedding model for emails and another for slack messages (as both will be written in different styles, lengths etc). If I wanted to explore my emails and slack messages at the same time, then ideally I would be able to have both embedding models in the same chat. On reflection though, it does seem like this is not really a high-priority at the current stage of development. Perhaps best to pretend I never mentioned it. ;)
@pmeier wrote:

Not sure I understand. Isn't `SourceStorage` the abstraction that you are looking for here? Can't we just simply add a `search_method: str = "semantic"` to the `.retrieve()` method for the source storages that allow this?

@nenb wrote:

document querying - hybrid search that combines lexical and semantic search is a feature that has long been used in RecSys. Supported storages like `lance` already have partial support for this behaviour. It's probably not something that will be implemented in the short-term in `ragna` (if it is decided to implement it), but having a separate query abstraction makes this straightforward to implement I think.
Just my $0.02 here, but I think we should be careful about ever-expanding method signatures. I would love to see an abstraction like follows (with appropriate UI support):
```python
class HybridSourceStorage(SourceStorage):
    def __init__(self, config: Config, *args):
        # stuff
        ...


hybrid = HybridSourceStorage(
    config,
    SemanticSourceStorage(…),
    LexicalSourceStorage(…),
)
```
@peachkeel

Although currently marked experimental, `lancedb` already supports both modes through a `query_type` parameter on the `.search()` method.
Meaning, if we went with your abstraction, this would be quite a bit harder, since we now have two different source storages that need to hit the same table.
@pmeier,
My gut tells me you should keep the methods on SourceStorage as clean as possible. That's going to allow you to do composing (i.e., HybridSourceStorage) and other cool stuff so much easier. The implementation details are left as an exercise for the reader :smile:
You could do something as follows, though:
```python
class LanceDB(SourceStorage):
    def __init__(self, config, values: list[Value]):
        value_dict = {value.name: value.value for value in values}
        self.chunk_size = value_dict["chunk_size"]
        self.chunk_overlap = value_dict["chunk_overlap"]
        self.num_tokens = value_dict["num_tokens"]
        self.query_type = value_dict["query_type"]

    def store(self, documents: list[Document]) -> None:
        # Implementation for storing documents
        ...

    def retrieve(self, documents: list[Document], prompt: str) -> list[Source]:
        # Implementation for retrieving sources
        ...

    @staticmethod
    def get_params():
        return [
            Param("chunk_size", int, default=1024, values=range(512, 4096), description="Size of data chunks"),
            Param("chunk_overlap", int, default=100, values=range(50, 500), description="Overlap size between chunks"),
            Param("num_tokens", int, default=500, values=range(100, 1000), description="Number of tokens to process"),
            Param("query_type", str, default="semantic", values=["semantic", "lexical"], description="Type of query: semantic or lexical"),
        ]
```
Then, to create a hybrid:
```python
hybrid_storage = HybridSourceStorage(
    config,
    Chroma(config, [Value("param1", "value1"), ...]),       # Chroma for semantic search
    LanceDB(config, [Value("query_type", "lexical"), ...]),  # LanceDB for lexical search
)
```
Now the question is what if you want to do:
```python
hybrid_storage = HybridSourceStorage(
    config,
    LanceDB(config, [Value("query_type", "semantic"), ...]),  # LanceDB for semantic search
    LanceDB(config, [Value("query_type", "lexical"), ...]),   # LanceDB for lexical search
)
```
That's a good question, and I haven't studied LanceDB to know what it does under the covers. But my opinion is that the code should be structured such that this works (even if the implementation requires redundant tables for now).
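One way such a composition could merge results, assuming each child storage exposes the usual `store`/`retrieve` pair (a sketch, not Ragna's actual API):

```python
class HybridSourceStorage:
    # Illustrative composition: fan store/retrieve out over two child
    # storages and merge retrieval results in order, dropping duplicates.
    def __init__(self, first, second):
        self._children = (first, second)

    def store(self, documents):
        # Both children ingest the same documents, possibly into
        # redundant tables as discussed above.
        for child in self._children:
            child.store(documents)

    def retrieve(self, documents, prompt):
        seen, merged = set(), []
        for child in self._children:
            for source in child.retrieve(documents, prompt):
                if source not in seen:
                    seen.add(source)
                    merged.append(source)
        return merged
```

A rank-aware fusion could replace the simple ordered merge later without changing the outer interface.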
@pmeier I would be interested in working on this as it really, really needs to be done. I propose a component called an `EmbeddingModel`, which would optionally be configured on `SourceStorage` creation if the subclass accepts an embedding model (for Pinecone, Chroma, etc.). If not specified, behavior would be similar to the current one; however, instead of defaulting via a hardcoded embedding model, it would simply default to an equivalent `EmbeddingModel`. Then, I will update the UI so that the advanced tab allows you to edit the embedding model (if available for the source storage; e.g. Vectara would not have this option). I'll fork the repository and submit a pull request for these changes so you can review them. What are your thoughts on this? I don't really see another good way to implement this, and as far as the end-user experience goes it would be functionally identical to right now unless you want to change the embedding model.
This system would be expandable for different chunking methods as well, should you want to edit them.
it really, really needs to be done.
Agreed.
Then, I will update the UI so that advanced tab allows you to edit the embedding model (if available for the source storage, e.g. vectara would not have this option).
Let's start with just the design for the Python API. If we have that down, we can deal with the rest. This also hinges on #313.
What are your thoughts on this?
I think we should not have a default value for the `EmbeddingModel`. Similar to the `SourceStorage` and the `Assistant`, the user would need to select one explicitly. Of course we should provide the one that we currently have as part of Ragna, but it should not be selected by default. This removes a lot of magic, which I think is a good thing.
Assume we now have an `EmbeddingModel` component that takes `list[Document]` and returns `list[Embedding]`. As was pointed out multiple times in this issue, some source storages do not support embeddings. So how can the `SourceStorage` communicate that it wants to receive `list[Document]` rather than `list[Embedding]`? I think we can achieve this with the type information:

When the chat is created and we find `list[Document]`, we keep the current behavior. If we find `list[Embedding]`, we require the user to select an `EmbeddingModel` and process the documents with it before passing them on to the `SourceStorage`.
Based on the same information, we could also communicate whether or not the chat requires an embedding model when using the REST API / web UI.
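A minimal sketch of how that type-information dispatch could work, using `typing.get_type_hints` on the `store` method (class and function names here are illustrative, not Ragna's actual API):

```python
import typing


class Document: ...
class Embedding: ...


class VectorStorage:
    # Annotates that it wants pre-computed embeddings.
    def store(self, data: list[Embedding]) -> None: ...


class FullDocumentStorage:
    # Annotates that it wants raw documents (the Vectara / AWS Kendra case).
    def store(self, data: list[Document]) -> None: ...


def needs_embedding_model(source_storage_cls) -> bool:
    # Inspect the annotation of store's data parameter: if it asks for
    # list[Embedding], the user must select an EmbeddingModel first.
    hints = typing.get_type_hints(source_storage_cls.store)
    (data_hint,) = [hint for name, hint in hints.items() if name != "return"]
    return typing.get_args(data_hint) == (Embedding,)
```

Because the check runs on the class, a REST API or UI could surface "this chat needs an embedding model" before any documents are processed.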
With this proposal, the `SourceStorage` is fully separated from the `EmbeddingModel` and stays declarative by simply annotating what it needs as input.
What do you think?
This system would be expandable for different chunking methods as well, should you want to edit them.
Indeed. See also #297. With my proposal above, we could add another option for the `SourceStorage` to request: `list[Chunk]`.
@pmeier I've read through your proposal. I think I understand. I've been working on this problem today, and what I have ended up doing (which I think is in line with your vision) is completely separating the `ChunkingModel` and `EmbeddingModel` from the `SourceStorage`, and adding them as Ragna components. As suggested, the behavior of store and retrieve is then influenced by the type hint of the "Documents" variable (be they embeddings or documents). I've had to tweak the way chats and components in general are created a bit, because the `SourceStorage` and `EmbeddingModel` are fundamentally linked by the dimension of the embedding, meaning that on creation of the `SourceStorage` you must know the dimension used by the `EmbeddingModel` and pass it in.
I'll continue working on this until I'm happy with it, and then I'll submit a pull request. Let me know if I've misunderstood what your vision for this feature is.
completely separating the `ChunkingModel` and `EmbeddingModel` from the `SourceStorage`, and adding them as ragna components. As suggested, the behavior of store and retrieve is then influenced by the type-hint of the "Documents" variable (be they embeddings, or documents).

I'll continue working on this until I'm happy with it, and then I'll submit a pull request. Let me know if I've misunderstood what your vision for this feature is.
Yes, that sounds correct. If you are unsure about anything, feel free to post a draft PR so we can have an early look.
I've had to tweak the way chats and components in general are created a bit, because the `SourceStorage` and `EmbeddingModel` are fundamentally linked by the dimension of the embed, meaning that on creation of the `SourceStorage` you must know the dimension used by the `EmbeddingModel` and pass it in.
If that is the case, there is no real way of separating the `EmbeddingModel` from the `SourceStorage`, is there? I hoped we could simply make this an attribute of the `Embedding` that is created by the `EmbeddingModel`.
That being said, why do we need to know this upfront? In our builtin source storages we have two different cases:
- `LanceDB`: we create the `schema` upfront in `__init__` and use that in `.store()`. But this is only a tiny optimization. I don't see a reason why we couldn't create the schema inside `.store()`.
- `Chroma`: we never use the `self._embedding_dimensions` attribute. This already hints at the ability for this to be handled at runtime rather than upfront.
Do you have a different use case in mind?
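A sketch of the lazy-schema idea: the dimension is derived from the first batch handed to `.store()` instead of being required in `__init__` (illustrative only, not LanceDB's actual API):

```python
class LazySchemaStorage:
    # Illustrative only: the embedding dimension (and hence the "schema")
    # is derived from the first batch passed to .store(), so nothing about
    # the EmbeddingModel needs to be known at construction time.
    def __init__(self):
        self._dimensions = None
        self._rows = []

    def store(self, vectors: list[list[float]]) -> None:
        if self._dimensions is None:
            # Create the "schema" on first use.
            self._dimensions = len(vectors[0])
        for vector in vectors:
            if len(vector) != self._dimensions:
                raise ValueError("embedding dimension mismatch")
            self._rows.append(vector)
```

With this shape, switching the `EmbeddingModel` only requires an empty storage, not a reconstructed one.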
I've changed LanceDB to use the method you describe, where the schema is created when you store a list of `Embedding`s. I'll post a draft PR for you to take a look. Everything seems to work fine, but the only embedding model at the moment is the MiniLM-L6-v2 model used by Chroma.
Currently, the `SourceStorage` is responsible for extracting the text from the document, embedding it, and finally storing it. In this issue, I want to discuss whether it makes sense for us to factor out the extraction and embedding logic into a separate `EmbeddingModel` component on the same level as `SourceStorage` and `Assistant`. Here is what would change:

- An `EmbeddingModel` class would have an `.embed()` method that takes a list of documents as input and returns a list of `Embedding`s.
- `Embedding` is a dataclass that holds two things, among them the vector (`list[float]`) of the embedded text.
- The `SourceStorage.store()` method would no longer take a list of documents, but rather a list of embeddings as input.
- `chunk_size` and `chunk_overlap` would move from the `SourceStorage` to the `EmbeddingModel`. That means that for all currently implemented source storages, there is no configuration anymore. This is not a bad thing, just pointing it out.

We have not implemented it that way from the start to allow source storages where the user has no control over the extraction and embedding. For example, AWS Kendra needs to take the full document and does the extraction and embedding in the background. Meaning, it will not fit the schema above.

That being said, AWS Kendra will under current policy not land in Ragna, since it requires additional infrastructure. And for custom solutions it is always possible to fall back to a custom `class NoEmbedding(Embedding)` that doesn't do any embedding, but just stores the document so the source storage can do its thing.

So the question here is: do we want to have a separate `EmbeddingModel` component? It would make it easier for users to try new things, but makes it potentially harder on source storages that don't operate like a "regular" vector database.
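To illustrate the `NoEmbedding` fallback mentioned above (a sketch under the assumption that `Embedding` is a plain dataclass; the names are illustrative, not Ragna's actual API):

```python
from dataclasses import dataclass, field


@dataclass
class Embedding:
    # Assumed shape: the embedded text plus its vector.
    text: str
    values: list[float] = field(default_factory=list)


class NoEmbedding(Embedding):
    # Pass-through "embedding" for source storages that embed server-side
    # (the AWS Kendra case above): it carries the raw text and no vector.
    pass


def passthrough_embed(texts: list[str]) -> list[NoEmbedding]:
    return [NoEmbedding(text=t) for t in texts]
```

A Kendra-style storage would then accept `list[NoEmbedding]` and ignore `values`, while regular vector databases receive populated vectors through the same interface.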