support selection of similarity metrics for diversity and relevancy in MMR

LangStream / langstream

LangStream. Event-Driven Developer Platform for Building and Running LLM AI Apps. Powered by Kubernetes and Kafka.

https://langstream.ai

Apache License 2.0

support selection of similarity metrics for diversity and relevancy in MMR #514

Open acantarero opened 11 months ago

acantarero commented 11 months ago

Background

The literature seems unclear on which similarity metrics perform best for diversity and relevancy. (If anyone has found a good analysis of this, it would be great to see.)
BM25 works better when a lot of text pre-processing is performed (stemming / lemmatization, word normalization, stop word removal, etc.), which is not as common in genAI / embedding workflows. User data may be better suited to a vector search similarity function than to a keyword-type method.

Suggestion

We have already implemented BM25 and cosine similarity. Allow users to select which similarity method they want to use (with reasonable defaults).
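
To make the suggestion concrete, here is a minimal sketch of MMR with the relevance and diversity metrics chosen independently by name. It is not LangStream's implementation or API: `Doc`, `METRICS`, `select_metric`, `mmr`, and the `lambda_` parameter are hypothetical names used only for illustration. To keep the example self-contained, a simple Jaccard token overlap stands in for a keyword-style metric; it is not BM25.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Doc:
    """Hypothetical container: text for keyword metrics, embedding for vector metrics."""
    text: str
    embedding: Sequence[float]


Metric = Callable[[Doc, Doc], float]


def cosine_similarity(a: Doc, b: Doc) -> float:
    """Cosine similarity between the two embeddings (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a.embedding, b.embedding))
    norm_a = math.sqrt(sum(x * x for x in a.embedding))
    norm_b = math.sqrt(sum(y * y for y in b.embedding))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a: Doc, b: Doc) -> float:
    """Token-overlap similarity; a simple keyword-style stand-in, NOT BM25."""
    tokens_a = set(a.text.lower().split())
    tokens_b = set(b.text.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


# Hypothetical registry mapping a configuration value to a metric function.
METRICS: Dict[str, Metric] = {
    "cosine": cosine_similarity,
    "jaccard": jaccard_similarity,
}


def select_metric(name: str) -> Metric:
    """Look up a metric by its configured name, failing loudly on unknown names."""
    try:
        return METRICS[name]
    except KeyError:
        raise ValueError(f"unknown similarity metric: {name!r}") from None


def mmr(query: Doc, candidates: Sequence[Doc], k: int,
        relevance: Metric, diversity: Metric, lambda_: float = 0.5) -> List[Doc]:
    """Greedy MMR: repeatedly pick the candidate maximizing
    lambda_ * relevance(query, doc) - (1 - lambda_) * max diversity(doc, selected)."""
    selected: List[Doc] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(doc: Doc) -> float:
            redundancy = max((diversity(doc, s) for s in selected), default=0.0)
            return lambda_ * relevance(query, doc) - (1.0 - lambda_) * redundancy
        best_index = max(range(len(remaining)), key=lambda i: score(remaining[i]))
        selected.append(remaining.pop(best_index))
    return selected
```

A caller would resolve both metrics from configuration, for example `mmr(query, docs, k=5, relevance=select_metric("cosine"), diversity=select_metric("jaccard"))`. Keeping the two metrics as separate parameters lets the defaults differ for relevance and diversity.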

eolivelli commented 11 months ago

I agree that we must make it configurable to choose which metrics to use.

But I disagree that we could let users use a vector search similarity function (like cosine similarity) for ensuring "diversity" in MMR. The documents have already been retrieved from the vector database as the closest according to that same function, so reusing it won't help reduce redundancy in the set of documents sent to the LLM in the prompt.

eolivelli commented 11 months ago

One of the main benefits of LangStream, thanks to its asynchronous nature, is that it makes it easy to perform preprocessing before storing the text in the vector database (we already have a few agents that help with a good configuration out-of-the-box).

eolivelli commented 11 months ago

@acantarero do you have any proposals for other metrics to use?