pmeier opened 10 months ago
So on another project, we were looking into a more mature RAG pipeline that has been in development for a while, and the approach that was implemented is actually even more fine-grained. I don't have the chart in front of me, but document-to-embedding-to-storage was a multi-step pipeline with configurable components, including a summarization and metadata extraction step to annotate the chunks. To be fair, they were also looking at supporting extracting structure from PowerPoint files, tables, etc.
Overall, I think it makes sense to go in this direction.
Would it enable the "embed once and use multiple LLMs" workflow, or is that orthogonal?
It is orthogonal. The point is to make the embedding model a "first class citizen". Right now, everything is hardcoded to
We could potentially add a flag for it on the `retrieve` and `store` methods so users could set it at runtime through the additional chat parameters:
But that has the downside that all embedding models have to be specified by us. Or, if the user wants to have a different embedding model, they would need to create their own source storage (subclassing will do the heavy lifting, but still).
By adding an `EmbeddingModel` component, it is easier for us to implement new embedding models, because they are no longer tied to a source storage. Plus, users can extend Ragna with their own embedding models the same way they can now for source storages and assistants.
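As a rough illustration of what such a first-class component could look like (all names here are hypothetical and not Ragna's actual API), a minimal sketch:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Embedding:
    # Vector representation of a chunk of text, plus the text it came from.
    values: list[float]
    text: str


class EmbeddingModel(ABC):
    # Hypothetical base class: concrete subclasses would wrap an actual model.
    @abstractmethod
    def embed(self, texts: list[str]) -> list[Embedding]:
        ...


class ToyEmbeddingModel(EmbeddingModel):
    # Toy implementation just to show the shape of the interface:
    # "embeds" each text as a single-element vector of its length.
    def embed(self, texts: list[str]) -> list[Embedding]:
        return [Embedding(values=[float(len(t))], text=t) for t in texts]
```

With this shape, a source storage never needs to know which model produced the vectors, only that it receives `Embedding`s.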
tl;dr Independent configurability for both chunking and embedding behaviour, along with support for hybrid search (semantic + lexical), would hopefully bring 80% of the performance with 20% of the work.
Details:
... the approach that was implemented is actually even more fine grained ... document to embedding to storage was a multi-step pipeline with configurable components including a summarization and metadata extraction step to annotate the chunks.
This has been my experience to date. I suspect it may be necessary to get the level of quality that most users will expect. A useful graphic from OpenAI on their RAG experience was recently posted on Twitter/X (see the image in the tweet if it doesn't appear through the link). The numbers can be taken with a grain of salt, but from a qualitative perspective, configurability around embeddings and chunkings seems very important.
My personal preference (and I also think that this is reasonably in-line with other RAG tools currently available) is to have at least 3 largely-independent pieces: document loading, document embedding and document querying. I think this is quite do-able with the current configuration, and not too far off what @pmeier is suggesting.
document loading - This includes both the data connectors and the data chunking strategies. I would recommend the chunking live here (rather than on the embedding) as users will likely want to vary both independently ie is it the chunking or the embedding that is degrading my performance. It also means a 'document store' can be populated, of documents that have been loaded and chunked but not embedded.
document embedding - @pmeier suggestions seem sensible to me. An additional point to keep in mind is that I suspect a lot of embedding models will be hosted elsewhere ie this will transform the current in-process, compute-bound approach into a network call that is more like I/O work. A queue is not really needed for this I think - perhaps this can be leveraged in some way.
document querying - hybrid search that combines lexical and semantic search is a feature that has long been used in RecSys. Supported storages like `lance` already have partial support for this behaviour. It's probably not something that will be implemented in the short-term in `ragna` (if it is decided to implement it), but having a separate query abstraction makes this straightforward to implement, I think.
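To make the hybrid idea concrete: one common way to combine a lexical and a semantic result list is reciprocal rank fusion (RRF). A minimal sketch, not tied to any particular storage backend:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # rankings: one list of document ids per retriever, best result first.
    # Each document accumulates 1 / (k + rank) across all rankings; documents
    # that rank well in several retrievers float to the top.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


semantic = ["a", "b", "c"]  # e.g. vector-search results
lexical = ["b", "c", "d"]   # e.g. BM25 results
fused = reciprocal_rank_fusion([semantic, lexical])  # "b" ranks first
```

The constant `k = 60` is the value commonly used in the RRF literature; it damps the influence of top ranks.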
Other
In #176, @pmeier thought I might like to chime in here based on my use-case. I think it would be great to have a separate `EmbeddingModel`. That being said, I've already written a plugin that integrates Vectara into Ragna (refer to #203). Vectara, like AWS Kendra, is a service where the user does not have control over the extraction and embedding process. Still, in writing my plugin, I came to realize that `chunk_size` and `num_tokens` were actually still quite relevant while `chunk_overlap` was not (even though there is no way to disable select parameters in the UI). So, I would be very careful about just ripping them out and assuming they don't matter or aren't tuneable for special source storages. Perhaps, with the creation of a few more plugins, more refined abstractions will become apparent.
@nenb https://github.com/Quansight/ragna/issues/191#issuecomment-1807937253
document loading - This includes both the data connectors and the data chunking strategies. I would recommend the chunking live here (rather than on the embedding) as users will likely want to vary both independently ie is it the chunking or the embedding that is degrading my performance. It also means a 'document store' can be populated, of documents that have been loaded and chunked but not embedded.
Makes sense to me. One thing that I don't want to support right now is to have a separate "chunk storage". Meaning, we only store documents and chunking will always be calculated on the fly. This ties into the "chunk / embed" once use case mentioned in https://github.com/Quansight/ragna/issues/191#issuecomment-1806514700 and https://github.com/Quansight/ragna/issues/176#issuecomment-1803446728 for which we don't have a good solution yet.
document embedding - @pmeier suggestions seem sensible to me. An additional point to keep in mind is that I suspect a lot of embedding models will be hosted elsewhere ie this will transform the current in-process, compute-bound approach into a network call that is more like I/O work. A queue is not really needed for this I think - perhaps this can be leveraged in some way.
See #204 for a write-up of potentially removing the task queue from the Python API. For the REST API / UI though it will stay.
document querying - hybrid search that combines lexical and semantic search is a feature that has long been used in RecSys. Supported storages like `lance` already have partial support for this behaviour. It's probably not something that will be implemented in the short-term in `ragna` (if it is decided to implement it), but having a separate query abstraction makes this straightforward to implement I think.
Not sure I understand. Isn't `SourceStorage` the abstraction that you are looking for here? Can't we just simply add a `search_method: str = "semantic"` to the `.retrieve()` method for the source storages that allow this?
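As a sketch of what that flag could look like (the class name is made up for illustration, and the stubbed return values stand in for real search results):

```python
class DualModeStorage:
    # Hypothetical source storage accepting the proposed search_method flag.
    # `documents` is unused in this sketch; a real implementation would
    # search over previously stored content.
    def retrieve(self, documents, prompt, search_method: str = "semantic"):
        if search_method == "semantic":
            return [f"semantic hit for {prompt!r}"]
        if search_method == "lexical":
            return [f"lexical hit for {prompt!r}"]
        raise ValueError(f"unsupported search_method: {search_method!r}")
```

Storages that only support one mode could simply omit the parameter, keeping the flag opt-in per storage.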
Is support for embedding models like Whisper in-scope? The reason I ask is that in my experience so far, a large number of groups just want to use RAG to interrogate long, recorded meetings!
IMO no. If users want that, they can write a `DocumentHandler` that takes audio files and uses any speech-to-text model to extract the content. From there on you can use the regular flow again. Cross ref to #202.
What about support for multiple embedding models in a single chat, a bit too complex to support?
I don't understand the use case. Why would you want to do that and not just start another chat?
@peachkeel https://github.com/Quansight/ragna/issues/191#issuecomment-1811653174
`chunk_overlap` was not (even though there is no way to disable select parameters in the UI)
Yeah, the UI unfortunately makes a lot of hardcoded assumptions right now. This is something that we urgently need to address. However, to make this truly dynamic, we need to annotate the parameters on the component methods with some extra metadata. I have some ideas for that, but no proper design yet.
So, I would be very careful about just ripping them out and assuming they don't matter or aren't tuneable for special source storages.
They will not be ripped out, but just move to a different component. Will answer on the discussion in detail.
Not sure I understand. Isn't SourceStorage the abstraction that you are looking for here? Can't we just simply add a search_method: str = "semantic" to the .retrieve() method for the source storages that allow this?
Yes, poorly worded on my part. I was pointing out that I considered this to be the relevant abstraction; I should also have mentioned that the abstraction already exists! Like you said, it can be configured for further use-cases in the future if this is a path that is considered useful for `ragna`.
I don't understand the use case. Why would you want to do that and not just start another chat?
It's probably a bit contrived, but I was imagining a case where an org has multiple embedding models for distinct sets of data. For example, they have an embedding model for emails and another for slack messages (as both will be written in different styles, lengths etc). If I wanted to explore my emails and slack messages at the same time, then ideally I would be able to have both embedding models in the same chat. On reflection though, it does seem like this is not really a high-priority at the current stage of development. Perhaps best to pretend I never mentioned it. ;)
@pmeier wrote:

Not sure I understand. Isn't `SourceStorage` the abstraction that you are looking for here? Can't we just simply add a `search_method: str = "semantic"` to the `.retrieve()` method for the source storages that allow this?

@nenb wrote:

document querying - hybrid search that combines lexical and semantic search is a feature that has long been used in RecSys. Supported storages like `lance` already have partial support for this behaviour. It's probably not something that will be implemented in the short-term in `ragna` (if it is decided to implement it), but having a separate query abstraction makes this straightforward to implement I think.
Just my $0.02 here, but I think we should be careful about ever-expanding method signatures. I would love to see an abstraction like follows (with appropriate UI support):
```python
class HybridSourceStorage(SourceStorage):
    def __init__(self, config: Config, *args):
        # stuff
        ...


hybrid = HybridSourceStorage(
    config,
    SemanticSourceStorage(…),
    LexicalSourceStorage(…),
)
```
@peachkeel

Although currently marked experimental, `lancedb` already supports both modes through a `query_type` parameter on the `.search()` method.
Meaning, if we went with your abstraction, this would be quite a bit harder, since we now have two different source storages that need to hit the same table.
@pmeier,
My gut tells me you should keep the methods on SourceStorage as clean as possible. That's going to allow you to do composing (i.e., HybridSourceStorage) and other cool stuff so much easier. The implementation details are left as an exercise for the reader :smile:
You could do something as follows, though:
```python
class LanceDB(SourceStorage):
    def __init__(self, config, values: list[Value]):
        value_dict = {value.name: value.value for value in values}
        self.chunk_size = value_dict["chunk_size"]
        self.chunk_overlap = value_dict["chunk_overlap"]
        self.num_tokens = value_dict["num_tokens"]
        self.query_type = value_dict["query_type"]

    def store(self, documents: list[Document]) -> None:
        # Implementation for storing documents
        ...

    def retrieve(self, documents: list[Document], prompt: str) -> list[Source]:
        # Implementation for retrieving sources
        ...

    @staticmethod
    def get_params():
        return [
            Param("chunk_size", int, default=1024, values=range(512, 4096), description="Size of data chunks"),
            Param("chunk_overlap", int, default=100, values=range(50, 500), description="Overlap size between chunks"),
            Param("num_tokens", int, default=500, values=range(100, 1000), description="Number of tokens to process"),
            Param("query_type", str, default="semantic", values=["semantic", "lexical"], description="Type of query: semantic or lexical"),
        ]
```
Then, to create a hybrid:
```python
hybrid_storage = HybridSourceStorage(
    config,
    Chroma(config, [Value("param1", "value1"), ...]),       # Chroma for semantic search
    LanceDB(config, [Value("query_type", "lexical"), ...]),  # LanceDB for lexical search
)
```
Now the question is what if you want to do:
```python
hybrid_storage = HybridSourceStorage(
    config,
    LanceDB(config, [Value("query_type", "semantic"), ...]),  # LanceDB for semantic search
    LanceDB(config, [Value("query_type", "lexical"), ...]),   # LanceDB for lexical search
)
```
That's a good question, and I haven't studied LanceDB to know what it does under the covers. But my opinion is that the code should be structured such that this works (even if the implementation requires redundant tables for now).
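One way such a composition could merge results, assuming each child storage exposes the usual `store`/`retrieve` pair (a sketch, not Ragna's actual API):

```python
class HybridSourceStorage:
    # Illustrative composition: fan store/retrieve out over two child
    # storages and merge retrieval results in order, dropping duplicates.
    def __init__(self, first, second):
        self._children = (first, second)

    def store(self, documents):
        # Both children ingest the same documents, possibly into
        # redundant tables as discussed above.
        for child in self._children:
            child.store(documents)

    def retrieve(self, documents, prompt):
        seen, merged = set(), []
        for child in self._children:
            for source in child.retrieve(documents, prompt):
                if source not in seen:
                    seen.add(source)
                    merged.append(source)
        return merged
```

A rank-aware fusion could replace the simple ordered merge later without changing the outer interface.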
@pmeier I would be interested in working on this as it really, really needs to be done. I propose a component called an `EmbeddingModel`, which would optionally be configured on `SourceStorage` creation if the subclass accepts an embedding model (for Pinecone, Chroma, etc.). If not specified, behavior would be similar to the current one; however, instead of defaulting via a hardcoded embedding model, it would simply default to an equivalent `EmbeddingModel`. Then, I will update the UI so that the advanced tab allows you to edit the embedding model (if available for the source storage; e.g. Vectara would not have this option). I'll fork the repository and submit a pull request for these changes so you can review them. What are your thoughts on this? I don't really see another good way to implement this, and as far as the end-user experience goes it would be functionally identical to right now unless you want to change the embedding model.
This system would be expandable for different chunking methods as well, should you want to edit them.
it really, really needs to be done.
Agreed.
Then, I will update the UI so that advanced tab allows you to edit the embedding model (if available for the source storage, e.g. vectara would not have this option).
Let's start with just the design for the Python API. If we have that down, we can deal with the rest. This also hinges on #313.
What are your thoughts on this?
I think we should not have a default value for the `EmbeddingModel`. Similar to the `SourceStorage` and the `Assistant`, the user would need to select one explicitly. Of course we should provide the one that we currently have as part of Ragna, but it should not be selected by default. This removes a lot of magic, which I think is a good thing.
Assume we now have an `EmbeddingModel` component that takes `list[Document]` and returns `list[Embedding]`. As was pointed out multiple times in this issue, some source storages do not support embeddings. So how can the `SourceStorage` communicate that it wants to receive `list[Document]` rather than `list[Embedding]`? I think we can achieve this with the type information:

When the chat is created and we find `list[Document]`, we keep the current behavior. If we find `list[Embedding]`, we require the user to select an `EmbeddingModel` and process the documents with it before passing them on to the `SourceStorage`.
Based on the same information, we could also communicate whether or not the chat requires an embedding model when using the REST API / web UI.
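A minimal sketch of how that type-information dispatch could work, using `typing.get_type_hints` on the `store` method (class and function names here are illustrative, not Ragna's actual API):

```python
import typing


class Document: ...
class Embedding: ...


class VectorStorage:
    # Annotates that it wants pre-computed embeddings.
    def store(self, data: list[Embedding]) -> None: ...


class FullDocumentStorage:
    # Annotates that it wants raw documents (the Vectara / AWS Kendra case).
    def store(self, data: list[Document]) -> None: ...


def needs_embedding_model(source_storage_cls) -> bool:
    # Inspect the annotation of store's data parameter: if it asks for
    # list[Embedding], the user must select an EmbeddingModel first.
    hints = typing.get_type_hints(source_storage_cls.store)
    (data_hint,) = [hint for name, hint in hints.items() if name != "return"]
    return typing.get_args(data_hint) == (Embedding,)
```

Because the check runs on the class, a REST API or UI could surface "this chat needs an embedding model" before any documents are processed.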
With this proposal, the `SourceStorage` is fully separated from the `EmbeddingModel` and stays declarative by simply annotating what it needs as input.
What do you think?
This system would be expandable for different chunking methods as well, should you want to edit them.
Indeed. See also #297. With my proposal above, we could add another option for the `SourceStorage` to request: `list[Chunk]`.
@pmeier I've read through your proposal. I think I understand. I've been working on this problem today, and what I have ended up doing (which I think is in line with your vision) is completely separating the `ChunkingModel` and `EmbeddingModel` from the `SourceStorage`, and adding them as Ragna components. As suggested, the behavior of store and retrieve is then influenced by the type hint of the "Documents" variable (be they embeddings or documents). I've had to tweak the way chats and components in general are created a bit, because the `SourceStorage` and `EmbeddingModel` are fundamentally linked by the dimension of the embedding, meaning that on creation of the `SourceStorage` you must know the dimension used by the `EmbeddingModel` and pass it in.
I'll continue working on this until I'm happy with it, and then I'll submit a pull request. Let me know if I've misunderstood what your vision for this feature is.
completely separating the `ChunkingModel` and `EmbeddingModel` from the `SourceStorage`, and adding them as ragna components. As suggested, the behavior of store and retrieve is then influenced by the type-hint of the "Documents" variable (be they embeddings, or documents).

I'll continue working on this until I'm happy with it, and then I'll submit a pull request. Let me know if I've misunderstood what your vision for this feature is.
Yes, that sounds correct. If you are unsure about anything, feel free to post a draft PR so we can have an early look.
I've had to tweak the way chats and components in general are created a bit, because the `SourceStorage` and `EmbeddingModel` are fundamentally linked by the dimension of the embed, meaning that on creation of the `SourceStorage` you must know the dimension used by the `EmbeddingModel` and pass it in.
If that is the case, there is no real way of separating the `EmbeddingModel` from the `SourceStorage`, is there? I hoped we could simply make this an attribute of the `Embedding` that is created by the `EmbeddingModel`.
That being said, why do we need to know this upfront? In our builtin source storages we have two different cases:
- `LanceDB`: we create the `schema` upfront in `__init__` and use that in `.store()`. But this is only a tiny optimization. I don't see a reason why we couldn't create the schema inside `.store()`.
- `Chroma`: we never use the `self._embedding_dimensions` attribute. This already hints at the ability for this to be handled at runtime rather than upfront.
Do you have a different use case in mind?
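A sketch of the lazy-schema idea: the dimension is derived from the first batch handed to `.store()` instead of being required in `__init__` (illustrative only, not LanceDB's actual API):

```python
class LazySchemaStorage:
    # Illustrative only: the embedding dimension (and hence the "schema")
    # is derived from the first batch passed to .store(), so nothing about
    # the EmbeddingModel needs to be known at construction time.
    def __init__(self):
        self._dimensions = None
        self._rows = []

    def store(self, vectors: list[list[float]]) -> None:
        if self._dimensions is None:
            # Create the "schema" on first use.
            self._dimensions = len(vectors[0])
        for vector in vectors:
            if len(vector) != self._dimensions:
                raise ValueError("embedding dimension mismatch")
            self._rows.append(vector)
```

With this shape, switching the `EmbeddingModel` only requires an empty storage, not a reconstructed one.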
I've changed LanceDB to use the method you describe, where the schema is created when you store a list of `Embedding`s. I'll post a draft PR for you to take a look. Everything seems to work fine, but the only embedding model at the moment is the MiniLM-L6-v2 model used by Chroma.
Currently, the `SourceStorage` is responsible for extracting the text from the document, embedding it, and finally storing it. In this issue, I want to discuss whether it makes sense for us to factor out the extraction and embedding logic into a separate `EmbeddingModel` component on the same level as `SourceStorage` and `Assistant`. Here is what would change:

- An `EmbeddingModel` class would have an `.embed()` method that takes a list of documents as input and returns a list of `Embedding`s.
- `Embedding` is a dataclass that holds two things, among them the vector (`list[float]`) of the embedded text.
- The `SourceStorage.store()` method would no longer take a list of documents, but rather a list of embeddings as input.
- `chunk_size` and `chunk_overlap` would move from the `SourceStorage` to the `EmbeddingModel`. That means that for all currently implemented source storages, there is no configuration anymore. This is not a bad thing, just pointing it out.

We have not implemented it that way from the start to allow source storages where the user has no control over the extraction and embedding. For example, AWS Kendra needs to take the full document and does the extraction and embedding in the background. Meaning, it will not fit the schema above.

That being said, AWS Kendra will under current policy not land in Ragna, since it requires additional infrastructure. And for custom solutions it is always possible to fall back to a custom `class NoEmbedding(Embedding)` that doesn't do any embedding, but just stores the document so the source storage can do its thing.

So the question here is: do we want to have a separate `EmbeddingModel` component? It would make it easier for users to try new things, but makes it potentially harder on source storages that don't operate like a "regular" vector database.
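To illustrate the `NoEmbedding` fallback mentioned above (a sketch under the assumption that `Embedding` is a plain dataclass; the names are illustrative, not Ragna's actual API):

```python
from dataclasses import dataclass, field


@dataclass
class Embedding:
    # Assumed shape: the embedded text plus its vector.
    text: str
    values: list[float] = field(default_factory=list)


class NoEmbedding(Embedding):
    # Pass-through "embedding" for source storages that embed server-side
    # (the AWS Kendra case above): it carries the raw text and no vector.
    pass


def passthrough_embed(texts: list[str]) -> list[NoEmbedding]:
    return [NoEmbedding(text=t) for t in texts]
```

A Kendra-style storage would then accept `list[NoEmbedding]` and ignore `values`, while regular vector databases receive populated vectors through the same interface.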