CassioML / cassio

A framework-agnostic Python library to seamlessly integrate Cassandra with ML/LLM/genAI workloads
Apache License 2.0
103 stars 18 forks

Investigate Reranking Algorithm to improve vector search results #48

Closed karlunho-datastax closed 7 months ago

karlunho-datastax commented 1 year ago

When retrieving data via ANN from Cassandra, we need a lightweight re-ranking step to decide which vector search results to pass to the LLM.

karlunho-datastax commented 1 year ago

There are multiple methods.

One method is reranking based on metadata - similar to Pinecone.

Rahul to do a spike based on the core dataset they are working on. Rahul to add more details.

karlunho-datastax commented 1 year ago

I've been talking to Alejandro about what his team did at Shopify. Simple things that can be done with re-ranking include boosting articles that have lots of page views. Let's also check whether there are any CQL query optimization opportunities.
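To make the page-view idea concrete, here is a minimal sketch of a popularity-aware re-rank. Everything in it is an assumption for illustration: the `Hit` type, the `page_views` field, the `popularity_weight` default, and the choice to log-scale views are not taken from cassio or from what the Shopify team did.

```python
import math
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    similarity: float  # ANN similarity score, assumed to lie in [0, 1]
    page_views: int    # hypothetical popularity counter stored as metadata


def rerank_by_popularity(hits, popularity_weight=0.2):
    """Blend vector similarity with log-scaled page views; best score first."""
    if not hits:
        return []
    # Log-scale views so one viral article does not drown out everything else.
    # Fall back to 1.0 when every hit has zero views to avoid dividing by zero.
    max_log_views = max(math.log1p(h.page_views) for h in hits) or 1.0

    def score(hit):
        popularity = math.log1p(hit.page_views) / max_log_views
        return (1 - popularity_weight) * hit.similarity + popularity_weight * popularity

    return sorted(hits, key=score, reverse=True)


hits = [Hit("a", 0.90, 10), Hit("b", 0.88, 50_000), Hit("c", 0.60, 100)]
reranked = rerank_by_popularity(hits)
print([h.doc_id for h in reranked])
```

With these sample numbers the heavily viewed article "b" moves ahead of the slightly more similar "a". Setting `popularity_weight=0` falls back to pure similarity order.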

xingh commented 1 year ago

LlamaIndex has some built-in reranking.

For Cassandra / Cassandra Specific / SAI there are two use cases:

Explicit Declaration of Metadata: In this case the user defines a schema for the source data (from a CQL table or another array of dictionaries/objects), and this structure can be reflected upon to determine the metadata SAI indexes. Alternatively, as an override, they can choose which fields in the dictionary/object should be indexed. In most Lucene-based search indexes, each field can be configured individually when the default settings are not used.

Implicit Extraction of Metadata: In this case the user sends unstructured information, and we can either infer a metadata schema using an LLM or define it explicitly. For example, keywords, entities, topics, etc. can be extracted using an LLM and then used for filtering.
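A rough sketch of the explicit case: reflect on an array of dictionaries to pick indexable fields (or accept an override), then build SAI index statements. The helper names, the "scalar values only" rule, and the index naming scheme are assumptions made for illustration and are not part of cassio. The `CREATE CUSTOM INDEX ... USING 'StorageAttachedIndex'` form is standard CQL for SAI.

```python
import re


def infer_indexable_fields(records, override=None):
    """Return field names to index: the override if given, otherwise every
    key whose values are scalars (str, int, float, bool) in all records."""
    if override is not None:
        return list(override)
    scalar_types = (str, int, float, bool)
    is_scalar = {}
    for record in records:
        for key, value in record.items():
            is_scalar[key] = is_scalar.get(key, True) and isinstance(value, scalar_types)
    return sorted(key for key, ok in is_scalar.items() if ok)


def sai_index_statements(keyspace, table, fields):
    """Build one CQL statement per field; names are validated, not quoted."""
    for name in (keyspace, table, *fields):
        if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return [
        f"CREATE CUSTOM INDEX IF NOT EXISTS {table}_{field}_idx "
        f"ON {keyspace}.{table} ({field}) USING 'StorageAttachedIndex';"
        for field in fields
    ]


records = [
    {"title": "Intro to ANN", "year": 2023, "tags": ["search", "vectors"]},
    {"title": "Reranking 101", "year": 2022},
]
fields = infer_indexable_fields(records)          # reflection on the data
statements = sai_index_statements("demo_ks", "articles", fields)
for statement in statements:
    print(statement)
```

Passing `override=["year"]` skips the reflection step and indexes only the chosen field, which corresponds to the override described above.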

karlunho-datastax commented 1 year ago

@xingh Take a look at Metarank too: https://www.pinecone.io/learn/metarank/ We might just integrate with this product for now.

xingh commented 1 year ago

@karlunho-datastax

Updates on reranking re Cassio

  1. @hemidactylus's last iteration essentially takes care of the first version of this. I added suggestions for how to manage this with columns, but this is good enough for rock and roll.
  2. When the LlamaIndex PR goes in with the data model refactor, we can test these out of the box.

Pinecone's hybrid search is pretty good. Investigating Metarank, this is what I found:

Metarank

ML / Data Engineering

  1. We would train it on some data to get it started. We can look at arXiv as a general dataset. Featurization is done through configuration, for example:
  ltr-bm25-meta-minilm-ce:
    type: lambdamart
    backend:
      type: xgboost
      iterations: 100
    features:
      - query_title_minilm_ft
      - query_title_ce_ft
      - query_title_bm25
      - query_desc_bm25
      - query_bullets_bm25
      - category0
      - category1
      - category2
      - color
      - material
      - price
      - ratings
      - stars
      - template
      - weight
  2. We can continue training it as new data comes in. arXiv has new data coming in all the time.

Retrieval

  1. We can take whatever we are getting from CQL, vector search, Elastic/Lucene, or a combination thereof, send it to Metarank, and it will supposedly give us back better results.
  2. Another approach, currently available only as an OpenSearch plugin integration, is to use Metarank as the LTR (learning to rank) subsystem.
  3. The advanced features, namely real-time reranking based on user events, are the cream.
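The first retrieval option above could be sketched roughly as follows: merge candidates from several sources, deduplicate them, and hand them to a pluggable reranker. The function names and the candidate shape (`id`, `text`) are hypothetical, and the keyword-overlap reranker is only a stand-in so the example runs; it is not Metarank's API, which would sit behind the same callable.

```python
def merge_candidates(*sources):
    """Concatenate result lists, keeping the first occurrence of each id."""
    seen = set()
    merged = []
    for source in sources:
        for candidate in source:
            if candidate["id"] not in seen:
                seen.add(candidate["id"])
                merged.append(candidate)
    return merged


def retrieve_and_rerank(query, sources, reranker, top_k=3):
    """Merge all sources, let the reranker reorder them, keep the top_k."""
    return reranker(query, merge_candidates(*sources))[:top_k]


def keyword_overlap_reranker(query, candidates):
    """Stand-in reranker: order by number of query terms found in the text."""
    terms = set(query.lower().split())
    return sorted(
        candidates,
        key=lambda c: len(terms & set(c["text"].lower().split())),
        reverse=True,
    )


vector_hits = [
    {"id": 1, "text": "cassandra vector search"},
    {"id": 2, "text": "cooking pasta"},
]
keyword_hits = [
    {"id": 2, "text": "cooking pasta"},
    {"id": 3, "text": "vector search reranking in cassandra"},
]
top = retrieve_and_rerank(
    "cassandra vector search reranking",
    [vector_hits, keyword_hits],
    keyword_overlap_reranker,
    top_k=2,
)
print([c["id"] for c in top])
```

Keeping the reranker as a plain callable means a remote ranking service could be swapped in later without touching the merge step.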

Conclusion re Metarank

  1. It's worth trying immediately with hardcoded metadata, without realtime training or realtime event-based reranking.
  2. The next step would be to feed metadata to an API that wraps Metarank + cassio. It would kick off a training job once a good amount of data is ingested, and would then be able to return reranked data.
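As a loose illustration of the second step, the wrapper could buffer ingested metadata and start training once a threshold is reached. The class name, the `min_records` threshold, and the `train_fn` callback are all hypothetical; the real training call would depend on how Metarank is deployed.

```python
class TrainingTrigger:
    """Buffer ingested records and call train_fn once enough have arrived."""

    def __init__(self, train_fn, min_records=1000):
        self.train_fn = train_fn        # hypothetical hook that starts a training job
        self.min_records = min_records  # assumed meaning of "a good amount of data"
        self.trained = False
        self._buffer = []

    def ingest(self, record):
        self._buffer.append(record)
        # Fire exactly once, the first time the threshold is reached.
        if not self.trained and len(self._buffer) >= self.min_records:
            self.train_fn(list(self._buffer))
            self.trained = True


training_calls = []
trigger = TrainingTrigger(train_fn=training_calls.append, min_records=3)
for i in range(5):
    trigger.ingest({"doc_id": i, "category": "cs.IR"})
print(len(training_calls), len(training_calls[0]))
```

Until `trained` is set, the wrapper would serve results in their original order; afterwards it could route them through the trained ranker.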

Breadcrumbs/ Reference