BoxcarsAI / boxcars

Building applications with composability using Boxcars with LLM's. Inspired by LangChain.
MIT License

meta data with vector store #64

Closed: francis closed this issue 11 months ago

francis commented 1 year ago

Vector search is good at capturing semantically similar texts, but other systems are adding the ability to store metadata with each vector, which can then be used in combination with the vector search to find things.

@jaigouk - since you wrote the vector store, is this concept there already, or can it be added with a little work?

obie commented 1 year ago

Marqo is great for this. You can include a filter expression in the query to scope the search down to certain end users, for instance.

jaigouk commented 1 year ago

update: the file is visible with this PR https://github.com/BoxcarsAI/boxcars/pull/65

@francis something like this? https://github.com/jaigouk/boxcars/blob/feature/add-in-memory-vector-store/lib/boxcars/boxcar/vector_stores/document.rb

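class Document
  attr_accessor :page_content, :metadata

  # fields may contain :page_content (String) and :metadata (Hash);
  # both fall back to empty defaults when omitted.
  def initialize(fields = {})
    @page_content = fields[:page_content] || ""
    @metadata = fields[:metadata] || {}
  end
end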

Here is a spec that uses that structure, although it does not use the metadata directly.

It would be nice to have a use case or examples for that.

I also need to clean up the hnswlib search after in_memory is done.
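As one possible use case for the metadata field, here is a minimal sketch. It redefines a stand-in Document class so that it runs on its own, and the metadata keys (:source, :team) and sample texts are made up for illustration; they are not part of the gem.

```ruby
# Stand-in for the Document class quoted above, repeated so this runs alone.
class Document
  attr_accessor :page_content, :metadata

  def initialize(fields = {})
    @page_content = fields[:page_content] || ""
    @metadata = fields[:metadata] || {}
  end
end

# Hypothetical records: the :source and :team keys are illustrative only.
docs = [
  Document.new(page_content: "Every new hire receives a laptop.",
               metadata: { source: "handbook/equipment.md", team: "it" }),
  Document.new(page_content: "Vacation requests go through your manager.",
               metadata: { source: "handbook/time_off.md", team: "hr" })
]

# Use the metadata to keep only IT documents and report where they came from.
it_sources = docs.select { |d| d.metadata[:team] == "it" }
                 .map { |d| d.metadata[:source] }
```

The metadata hash is free-form here, so the caller decides which keys are worth storing with each record.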

francis commented 1 year ago

Yes, I think this is on the right path @jaigouk

francis commented 1 year ago

@obie - I will see what I can learn about Marqo. Thank you for the pointer.

francis commented 1 year ago

@jaigouk - I pushed a small change to the main branch to move all calls to OpenAI::Client to one shared method in Boxcars::Openai and updated my sample notebook at https://github.com/BoxcarsAI/boxcars/blob/main/notebooks/Embeddings%20Search.ipynb with the new names.

I guess the next step would be to have Boxcars::VectorStores::SimilaritySearch return multiple results with a passed param and metadata about the results.

Something like:

    similarity_search = Boxcars::VectorStores::SimilaritySearch.new params
    results = similarity_search.call query: "Am I provided a laptop?", count: 3

This would return the three closest matches (preferably with additional metadata, such as a reference path/URL and other attributes a user wanted to store with a record, but minimally the search distance).
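To make the shape of such a result concrete, here is a rough sketch of a brute-force top-N search that returns each match with its similarity score and stored metadata. It is not the Boxcars implementation: top_matches, the record layout, and the two-dimensional toy vectors are all assumptions made for illustration, and a real store would use embeddings from a model.

```ruby
# Cosine similarity between two equal-length numeric arrays.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Hypothetical helper: score every record against the query vector and
# return the `count` most similar ones, best match first.
def top_matches(records, query_vector, count: 1)
  scored = records.map do |r|
    { content: r[:content],
      metadata: r[:metadata],
      similarity: cosine_similarity(r[:vector], query_vector) }
  end
  scored.max_by(count) { |m| m[:similarity] }
end

# Toy data: vectors and metadata are invented for this example.
records = [
  { content: "Laptops are issued on day one.", vector: [1.0, 0.0],
    metadata: { source: "equipment.md" } },
  { content: "Monitors can be requested.", vector: [0.8, 0.6],
    metadata: { source: "peripherals.md" } },
  { content: "Holiday schedule.", vector: [0.0, 1.0],
    metadata: { source: "holidays.md" } }
]

results = top_matches(records, [1.0, 0.1], count: 2)
results.each { |m| puts "#{m[:metadata][:source]} (#{m[:similarity].round(3)})" }
```

Returning the score alongside the metadata lets the caller apply a cutoff or show a reference path without a second lookup.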

jaigouk commented 1 year ago

@francis yup.

I will continue with https://github.com/BoxcarsAI/boxcars/issues/60. I also want to clean up the hnswlib-based code.

Question: we just return the metadata as it is, and the user decides whether to use it to filter the results further, right?

francis commented 1 year ago

@jaigouk also, I closed the ticket already, but I renamed VectorStores to VectorStore and moved it to the top level as a peer of Boxcar, Train, and Engine. I hope that doesn't mess up your tree too much.

francis commented 1 year ago

It looks like the VectorSearch PR might cover some or all of this ticket. I think we want to filter results based on metadata, and I can do this after the search. Is it possible to narrow the set of candidates before searching?

jaigouk commented 1 year ago

Vector search returns an array of Document instances ordered by distance. Each Document instance has a metadata method that returns the original hash, if one was provided. I am not injecting much info into the metadata. The array is already ordered by distance or similarity, so I thought it was up to users of this gem to "filter" the results further.
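The difference between filtering after the search and narrowing the candidates before it can be sketched with toy data. Everything here is hypothetical: the records carry a precomputed distance instead of a real vector, and the :user key is invented for the example.

```ruby
# Toy records with a precomputed distance to the query (smaller is closer).
records = [
  { content: "A", distance: 0.10, metadata: { user: "alice" } },
  { content: "B", distance: 0.20, metadata: { user: "bob" } },
  { content: "C", distance: 0.30, metadata: { user: "bob" } },
  { content: "D", distance: 0.40, metadata: { user: "alice" } }
]

# Return the `count` nearest items, closest first.
nearest = ->(items, count) { items.min_by(count) { |r| r[:distance] } }
# Metadata predicate: keep only records stored for alice.
for_alice = ->(r) { r[:metadata][:user] == "alice" }

# Filter after the search: take the 2 nearest, then drop non-matching ones.
post_filtered = nearest.call(records, 2).select(&for_alice)

# Filter before the search: restrict the candidates, then take the 2 nearest.
pre_filtered = nearest.call(records.select(&for_alice), 2)

puts post_filtered.map { |r| r[:content] }.inspect
puts pre_filtered.map { |r| r[:content] }.inspect
```

With these records, filtering afterwards leaves only one of the two requested results, while filtering first still fills both slots. That trade-off is the reason a caller might want the store to accept a filter up front.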

jaigouk commented 1 year ago

I thought about creating a demo app with Hanami v2. If I create a crawler for it, I may want to use the metadata for navigating the website structure. I guess we might need an extra boxcar for crawlers. I might need to investigate combining that with vector storage.