drizzle-team / drizzle-orm-docs

Drizzle ORM documentation website
https://orm.drizzle.team
144 stars 233 forks source link

Vector similarity search Guide isn't using the vector index #436

Open martinloretzzz opened 1 week ago

martinloretzzz commented 1 week ago

In the guide for Drizzle | Vector similarity search with pgvector extension postgres isn't using the vector index, because it is looking for the 1 - cosineDistance, while the index is built for the cosine distance only.

Text from the same issue I opened on https://github.com/vercel/examples/issues/984:


I built a project on top of the postgres-pgvector template, but the database queries were really slow. I had 2.8 million rows in the database with 512 dimensional embedding vectors. Queries took like 12 seconds. (After using an ivfflat index it took 100ms)

Reproduction repo is here: https://github.com/martinloretzzz/nextjs-drizzle-pgvector

After investigating the issue, it turns out the queries don't use the vector index, because we're looking for the 1 - cosineDistance, while the index is built for the cosine distance only.

The fix is quite simple, need to look for the smallest cosineDistance, instead of the largest 1-cosineDistance:

const similarity = sql<number>`${cosineDistance(pokemons.embedding, vectorQuery)}`
const pokemon = await db
    .select({ id: pokemons.id, name: pokemons.name, similarity })
    .from(pokemons)
    .where(lt(similarity, 0.5))
    .orderBy((t) => asc(t.similarity))
    .limit(8)

Note: Use postgres EXPLAIN to see the query plan. For this small dataset of 150 pokemons, for both queries the index aren't used, to test if index would be used on big datasets, so set SET SESSION enable_seqscan=false; Exported sql queries: before, after, use with SET SESSION enable_seqscan=false; EXPLAIN {query} is psql/pgadmin