karlunho-datastax commented 1 year ago

When retrieving data via ANN from Cassandra, a lightweight re-ranking step is necessary to determine which vector search results to pass to the LLM.

karlunho-datastax commented 1 year ago

There are multiple methods.

One method is reranking based on metadata - similar to Pinecone.

Rahul to do a spike based on the core dataset they are working on. Rahul to add more details.

karlunho-datastax commented 1 year ago

I've been talking to Alejandro about what his team did at Shopify. Simple things that can be done with re-ranking include favoring articles that have lots of page views. Let's also check whether there are any CQL query optimization opportunities.
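To make the page-view idea concrete, here is a minimal sketch that blends vector similarity with a popularity signal. The field names (`similarity`, `page_views`) and the `alpha` weight are assumptions for illustration; nothing here comes from the Shopify setup or from Cassio.

```python
import math

def rerank_by_popularity(hits, alpha=0.8):
    """Blend vector similarity with a log-scaled page-view signal.

    hits: list of dicts with hypothetical keys 'id', 'similarity' (0..1)
          and 'page_views' (non-negative int).
    alpha: weight on similarity; the remaining (1 - alpha) goes to popularity.
    """
    if not hits:
        return []
    # Log scaling keeps a handful of very popular articles from dominating.
    # Fall back to 1.0 when every article has zero views (avoids dividing by 0).
    max_log = max(math.log1p(h["page_views"]) for h in hits) or 1.0

    def score(h):
        popularity = math.log1p(h["page_views"]) / max_log
        return alpha * h["similarity"] + (1 - alpha) * popularity

    return sorted(hits, key=score, reverse=True)

hits = [
    {"id": "a", "similarity": 0.82, "page_views": 10},
    {"id": "b", "similarity": 0.80, "page_views": 50_000},
    {"id": "c", "similarity": 0.60, "page_views": 100},
]
reranked = rerank_by_popularity(hits)
print([h["id"] for h in reranked])
```

With these sample numbers the slightly less similar but much more viewed article `b` moves ahead of `a`. The right value for `alpha` would have to be tuned on real data.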

xingh commented 1 year ago

LlamaIndex has some built-in reranking:

LLM Reranking: LLM Rerank Demo - Lyft 10k, LLM Rerank Demo - Great Gatsby
It also has Temporal Ranking: Recency and Timeweighted, Before/After
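As a rough illustration of what recency or time-weighted ranking means, the sketch below decays a similarity score by document age. It is not LlamaIndex's implementation; the exponential half-life formula and the `half_life_days` parameter are assumptions chosen for simplicity.

```python
from datetime import datetime, timedelta, timezone

def time_weighted_score(similarity, last_updated, now, half_life_days=30.0):
    """Halve a hit's score for every `half_life_days` of age."""
    age_days = max((now - last_updated).total_seconds() / 86400.0, 0.0)
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay

now = datetime(2024, 1, 31, tzinfo=timezone.utc)
docs = [
    # (id, similarity, last_updated)
    ("old", 0.90, now - timedelta(days=90)),
    ("new", 0.70, now - timedelta(days=3)),
]
ranked = sorted(
    docs,
    key=lambda d: time_weighted_score(d[1], d[2], now),
    reverse=True,
)
print([d[0] for d in ranked])
```

Here the 90-day-old document loses three half-lives, so the fresher but less similar document ranks first.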

For Cassandra-specific reranking with SAI, there are two use cases:

Explicit Declaration of Metadata: In this case the user defines a schema for the source data (from a CQL table or another array of dictionaries/objects), and this structure can be reflected upon to determine which metadata fields get SAI indexes. Alternatively, they can choose which fields in the dictionary/object should be indexed, as an override. In most Lucene-based search indexes, each field can be configured individually when the default settings are not wanted.

Implicit Extraction of Metadata: In this case the user sends unstructured information, and we can either infer a metadata schema using an LLM or define it explicitly. For example, keywords, entities, topics, etc. can be extracted using an LLM and then used for filtering.
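The explicit case could be sketched roughly as follows: given a declared schema, generate one SAI index statement per metadata field, with an optional override list. The helper name, the index naming scheme, and the sample keyspace and table are hypothetical. The `CREATE CUSTOM INDEX ... USING 'StorageAttachedIndex'` form is standard CQL for SAI.

```python
def sai_index_statements(keyspace, table, schema, indexed_fields=None):
    """Build CQL statements that create SAI indexes on metadata columns.

    schema: dict mapping column name -> CQL type (the declared schema).
    indexed_fields: optional override; when None, every schema field is indexed.
    """
    fields = list(schema) if indexed_fields is None else list(indexed_fields)
    unknown = [f for f in fields if f not in schema]
    if unknown:
        raise ValueError(f"fields not in schema: {unknown}")
    return [
        f"CREATE CUSTOM INDEX IF NOT EXISTS {table}_{f}_idx "
        f"ON {keyspace}.{table} ({f}) USING 'StorageAttachedIndex';"
        for f in fields
    ]

# Hypothetical metadata schema for an article table.
schema = {"category": "text", "author": "text", "page_views": "int"}
statements = sai_index_statements(
    "demo_ks", "articles", schema, indexed_fields=["category", "page_views"]
)
for stmt in statements:
    print(stmt)
```

The statements are only built as strings here; running them against a cluster (and validating identifiers) is left out of the sketch.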

karlunho-datastax commented 1 year ago

@xingh Take a look at Metarank too - https://www.pinecone.io/learn/metarank/ We might just integrate with this product for now.

xingh commented 1 year ago

@karlunho-datastax

Updates on reranking re Cassio

@hemidactylus's last iteration essentially takes care of the first version of this. I added suggestions for how to manage this with columns, but this is good enough for rock and roll.
When the LlamaIndex PR goes in with the data model refactor, we can test these out of the box.

Pinecone's hybrid search is pretty good. Investigating Metarank, this is what I found:

Metarank

ML / Data Engineering

We would train it on some data to get it started. We can look at arXiv as a general data set. Featurization is done through configuration, for example:

 ltr-bm25-meta-minilm-ce:
    type: lambdamart
    backend:
      type: xgboost
      iterations: 100
    features:
      - query_title_minilm_ft
      - query_title_ce_ft
      - query_title_bm25
      - query_desc_bm25
      - query_bullets_bm25
      - category0
      - category1
      - category2
      - color
      - material
      - price
      - ratings
      - stars
      - template
      - weight

We can continue training it as new data comes in. arXiv has new data coming in all the time.

Retrieval

We can take whatever we are getting from CQL, Vector Search, Elastic/Lucene, or a combination thereof, send it to Metarank, and it will give us back better results, supposedly.
Another approach, which is currently only integrated with OpenSearch as a plugin, is to use Metarank as the LTR (learning to rank) subsystem.
The most valuable part is the advanced features for real-time, user-event-based reranking.
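The retrieval flow described above could look roughly like this: merge candidates from several retrievers, then hand them to a reranker. The `rerank_fn` callable is a stand-in for whatever service does the reranking (Metarank or otherwise); its shape here is an assumption, not Metarank's actual API.

```python
def merge_candidates(*result_lists):
    """Union hits from several retrievers, keeping the best score per id.

    Each result list holds (doc_id, score) pairs. Scores from different
    retrievers are not really comparable; the merged order is only a
    starting point for the reranker.
    """
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

def retrieve_and_rerank(result_lists, rerank_fn, top_k=3):
    """Merge candidates, then let an external reranker decide the order."""
    candidates = merge_candidates(*result_lists)
    return rerank_fn(candidates)[:top_k]

# Sample hits standing in for vector search and keyword search results.
vector_hits = [("d1", 0.91), ("d2", 0.85)]
keyword_hits = [("d2", 0.95), ("d3", 0.40)]

# Stand-in reranker: boosts documents in a hypothetical "popular" set.
popular = {"d3"}
def fake_reranker(candidates):
    return sorted(
        candidates,
        key=lambda kv: kv[1] + (1.0 if kv[0] in popular else 0.0),
        reverse=True,
    )

top = retrieve_and_rerank([vector_hits, keyword_hits], fake_reranker, top_k=2)
print([doc_id for doc_id, _ in top])
```

Keeping the reranker behind a plain callable means the retrieval code does not change if the reranking backend is swapped later.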

Conclusion re Metarank

It's worth trying immediately with hardcoded metadata, without doing real-time training or real-time event-based reranking.
The next step would be to feed metadata to an API that wraps Metarank + Cassio. It would kick off a training job once a good amount of data has been ingested, and would then be able to return reranked data.

CassioML / cassio

Investigate Reranking Algorithm to improve vector search results #48
