OoriData / OgbujiPT

Client-side toolkit for using large language models, including where self-hosted
Apache License 2.0

Using similarity threshold with PGVector #59

Closed uogbuji closed 11 months ago

uogbuji commented 11 months ago

A vector search will typically return a variety of results with different similarity scores. Right now in PGVectorHelper.query we just sort them all and return the top scorers, up to the limit arg. We hard-code cosine similarity from among the options PGVector provides.

Overall I'd like us to think about how we can expose many more of the options through our interface, but for now a glaring omission is the ability to limit responses by a threshold similarity/distance. If there are no good matches, for example, it's more useful to get an empty result set than one or more terrible matches. We want to be able to say "return up to the top N (or all) where similarity score is over S".
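
To make the idea concrete, here's a rough sketch of the call shape I have in mind; the method name, parameter names, and defaults are hypothetical, not settled API:

# Hypothetical interface sketch -- names and defaults are illustrative only
results = await db.search(
    text='Hi there!',
    threshold=0.75,  # drop anything with cosine similarity below 0.75
    limit=4,         # return up to the top 4; no good matches -> empty result set
)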

cc @chimezie @choccccy

uogbuji commented 11 months ago

Working on this & will use the pgvector_bigdata_insert_and_search branch.

chimezie commented 11 months ago

I agree, 100% (re: filtering by similarity threshold). It's what I tend to end up doing with related similarity measures such as Levenshtein distance, for example.
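
The standard library has an analogue of this kind of cutoff filtering (difflib scores with SequenceMatcher's ratio rather than true Levenshtein distance, but the thresholding idea is the same):

from difflib import get_close_matches

# Candidates scoring below cutoff (a similarity ratio in [0.0, 1.0]) are
# dropped entirely; if nothing clears the bar, the result is an empty list
get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'], n=3, cutoff=0.6)
# -> ['apple', 'ape']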

uogbuji commented 11 months ago

I noticed we're returning cosine sim in a way that's the opposite of intuitive: lower scores mean a better match. That makes sense when you think of it as a distance, but we're using the term "similarity". It also means the semantics for PGVector are the opposite of those in Qdrant & Chroma. I've reversed this logic and the tests still pass (as I'd expected).
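
For reference, PGVector's <=> operator computes cosine distance, so flipping to similarity is just a subtraction. A minimal illustration:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction (best match), 0.0 = orthogonal, -1.0 = opposite
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# PGVector's <=> returns cosine *distance*, i.e. 1 - cosine similarity, so a
# perfect match scores 0 under <=> but 1 under the similarity convention we
# now expose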

uogbuji commented 11 months ago
import os
from sentence_transformers import SentenceTransformer
from ogbujipt.embedding.pgvector import DocDB

# Run in an async-capable REPL (e.g. `python -m asyncio`), since we await at top level
DOC_EMBEDDINGS_LLM = 'all-MiniLM-L12-v2'  # produces 384-dimensional embeddings
STATEMENT_TABLE_NAME = 'statements'
embedding_model = SentenceTransformer(DOC_EMBEDDINGS_LLM)
db = await DocDB.from_conn_params(
        embedding_model,
        STATEMENT_TABLE_NAME,
        os.environ['DB_USER'],
        os.environ['DB_PASSWORD'],
        os.environ['DB_NAME'],
        os.environ['DB_HOST'],
        os.environ['DB_PORT'],
        )
await db.create_table()

texts = ['Hello world', 'Hello Dolly', 'Good-Bye to All That']
authors = ['Brian Kernighan', 'Louis Armstrong', 'Robert Graves']
metas = [[f'author={a}'] for a in authors]
count = len(texts)
# Each record pairs a text with an empty title, no page numbers, and an author tag
records = zip(texts, ['']*count, [None]*count, metas)
await db.insert_many(records)

# Embed a query; it comes back as a 384-element list of floats, hence quite long
qe = list(embedding_model.encode('Hi there!'))
qe

I pasted the output of qe into a test query (set as $1 below, since it's so long):

SELECT
    1 - (embedding <=> $1) AS cosine_similarity,
    title,
    content,
    page_numbers,
    tags
FROM
    statements
ORDER BY
    cosine_similarity DESC
;
[screenshot: query results, one row per statement with its cosine_similarity score]

So if I want to set a 0.4 threshold, the naive attempt is:

SELECT
    1 - (embedding <=> $1) AS cosine_similarity,
    title,
    content,
    page_numbers,
    tags
FROM
    statements
WHERE cosine_similarity >= 0.4
ORDER BY
    cosine_similarity DESC
;

And as usual, SQL reminds me it has teeth: you can't reference SELECT-list aliases (or aggregates) in a WHERE clause, since WHERE is evaluated before the SELECT list. Subquery required:

SELECT * FROM
(SELECT
    1 - (embedding <=> $1) AS cosine_similarity,
    title,
    content,
    page_numbers,
    tags
FROM
    statements
) subq
WHERE cosine_similarity >= 0.4
ORDER BY
    cosine_similarity DESC
;
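
A sketch of how the parameterized version might be wired up on the Python side, assuming an asyncpg connection underneath (as in the pgvector Python examples); the function name and defaults here are illustrative only:

from pgvector.asyncpg import register_vector

THRESHOLD_QUERY = '''
SELECT * FROM (
    SELECT
        1 - (embedding <=> $1) AS cosine_similarity,
        title, content, page_numbers, tags
    FROM statements
) subq
WHERE cosine_similarity >= $2
ORDER BY cosine_similarity DESC
LIMIT $3;
'''

async def search_statements(conn, query_embedding, threshold=0.4, limit=4):
    # Threshold & limit are bind parameters rather than pasted into the SQL
    await register_vector(conn)  # register pgvector's codec so $1 can be passed directly
    return await conn.fetch(THRESHOLD_QUERY, query_embedding, threshold, limit)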

Working that update into the code now.

uogbuji commented 11 months ago

pgvector.py is way too unwieldy. I'm going to look at splitting it into a couple of files, with a deprecation path, of course. Also going to look into separating snippet-style DocDBs (useful for SKA) from proper document DBs, e.g. for traditional RAG.
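
One common shape for the deprecation path (a sketch only; the post-split module names are hypothetical):

# pgvector.py reduced to a thin shim after the split; module names hypothetical
import warnings

from ogbujipt.embedding.pgvector_doc import DocDB  # hypothetical new home

warnings.warn(
    'Importing DocDB from ogbujipt.embedding.pgvector is deprecated; '
    'import from its new module instead',
    DeprecationWarning,
    stacklevel=2,
)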

uogbuji commented 11 months ago

OK, I've split it up. Closing this. I think, pending review, it's ready for a 0.7.0 release.