Feat(vector store): lancedb

garance-buricatu commented 1 day ago

Adds lanceDB as a vector store in rig-lancedb crate
Modifies VectorStoreIndex trait to include a generic type type SearchParams which can be passed to the vector store query to customize the query.

Questions

I find that creating a new struct LanceDbVectorIndex that implements the VectorStoreIndex trait doesn't add much value, since it is the same as LanceDbVectorStore struct but with a model as well. I would implement the trait VectorStoreIndex on LanceDbVectorStore instead.
I created a new struct SearchParams to be used as part of the VectorStoreIndex. There is already a struct in the lancedb library that includes all of the search params. I wanted to use it instead of creating a new struct but since the type needs to implement deserialize, I cannot.

cvauclair commented 1 day ago

@0xMochan @garance-buricatu would like your opinion on something here. This PR adds an associated type for the VectorStoreIndex trait SearchParams and adds it as an argument to the trait method, e.g.:

// in rig-core/src/vector_store/mod.rs
pub trait VectorStoreIndex: Send + Sync {
    type SearchParams: for<'a> Deserialize<'a> + Send + Sync;    // <-------- This new type

    fn top_n_from_query(
        &self,
        query: &str,
        n: usize,
        search_params: Self::SearchParams,  // <--------- used here
    ) -> impl std::future::Future<Output = Result<Vec<(f64, DocumentEmbeddings)>, VectorStoreError>> + Send;

    // ...
}

However, a downstream effect is that you now need to pass an VectorStoreIndex::SearchParams argument when constructing a RAG agent, e.g.:

// in rig-core/examples/rag.rs
    let rag_agent = openai_client.agent("gpt-4")
        .preamble("
            You are a dictionary assistant here to assist the user in understanding the meaning of words.
            You will find additional non-standard word definitions that could be useful below.
        ")
        .dynamic_context(1, index, "".to_string())  // <---- new empty string argument
        .build();

This feels a little awkward, even though it "makes sense". However, another approach would be to push whatever search/index parameters into the struct that implements the VectorStoreIndex trait. So instead of passing those params as an argument to the search functions, they would be fixed for a given VectorStoreIndex. E.g. you'd have this instead:

pub struct LanceDbVectorStore<M: EmbeddingModel> {
    model: M,
    document_table: lancedb::Table,
    embedding_table: lancedb::Table,
    params: ....   // <------- params would go inside the index object
}

This also has the advantage of making the VectorStoreIndex object safe avoid the need for the serde "hack" to get around the object safety requirements.

0xMochan commented 1 day ago

Interesting, for context, can you provide an example for how the search params can be used?

garance-buricatu commented 1 day ago

@cvauclair

Pro for new approach:

wouldn't force the SearchParams struct to be deserializable/serializable, this has been annoying.
improves readability like in the example you showed where the empty string was confusing

Con for new approach:

can't use different settings of SearchParams for different calls to top_n_from_query or top_n_from_embedding. (unless we can modify the entire LanceDbVectorStore struct)

garance-buricatu commented 1 day ago

@0xMochan

in lanceDB, search params can be used to define the distance type (cosine, l2, dot, ...), search type (approximate nearest neighbor, exact nearest neighbor), filters, and more fine tuning options on the search index.

cvauclair commented 1 day ago

Con for new approach:

- can't use different settings of SearchParams for different calls to `top_n_from_query` or `top_n_from_embedding`. (unless we can modify the entire `LanceDbVectorStore` struct)

I think in the context of vector search, this is fine since these settings will most likely be static once the user has found the configuration that works for their RAG system. Plus, although these methods are public, users would usually interact with VectorStoreIndex types only insofar as they would create one and give it as a parameter to their Agent. Even in the current implementation, the SearchParams of an agent's dynamic context sources are static, so we'd just make it static across the board (if that makes sense).

0xMochan commented 1 day ago

Con for new approach:
- can't use different settings of SearchParams for different calls to `top_n_from_query` or `top_n_from_embedding`. (unless we can modify the entire `LanceDbVectorStore` struct)
I think in the context of vector search, this is fine since these settings will most likely be static once the user has found the configuration that works for their RAG system. Plus, although these methods are public, users would usually interact with VectorStoreIndex types only insofar as they would create one and give it as a parameter to their Agent. Even in the current implementation, the SearchParams of an agent's dynamic context sources are static, so we'd just make it static across the board (if that makes sense).

I do think it might be useful to change it on the fly, or if the user of an application wants to adjust this based on a UI / TUI. But you can always reconstruct an agent so it may not be that big of a deal. There can always be an extra search command to use a specific method of querying that overrides the default (so you choose a default when you create the Agent) but when you query, there's the ability to override that by using a specific function top_n_from_query_using_search_param (bad naming but ya know). This can be an extra function defined by a trait for this feature (theoretically) and can always be added in the future if there's a demand for it (perhaps we leave a comment about it?)

garance-buricatu commented 13 hours ago

@0xMochan @cvauclair thanks guys! I will implement both moving the search params into the vector store struct and also adding an extra override method on the vector search trait

0xPlaygrounds / rig

Feat(vector store): lancedb #33