ankane / neighbor

Nearest neighbor search for Rails
MIT License

simplify nearest_neighbors query when ORDER BY clause matches SELECT alias #20

Closed moracca closed 5 months ago

moracca commented 6 months ago

When the ORDER BY expression exactly matches the expression that the SELECT clause aliases with "AS neighbor_distance", we can order by the neighbor_distance alias instead of repeating the expression.

This doesn't change what the query does, but it cuts the query's length roughly in half, because the vector no longer has to be included twice. That makes things simpler when the query is written to log files and similar places.

For example, it changes a query like this:

SELECT "llm_embeddings"."id", "llm_embeddings"."source_type", "llm_embeddings"."source_id", "llm_embeddings"."created_at", "llm_embeddings"."updated_at", "llm_embeddings"."created_by", "llm_embeddings"."updated_by", "llm_embeddings"."llm_model_id",
  "llm_embeddings"."embedding" <-> '[-0.0017242150271110192,-0.029317252896789353,<.....>,0.024415132566991064]' AS neighbor_distance
FROM "llm_embeddings"
WHERE "llm_embeddings"."source_type" = 'LlmSource' AND "llm_embeddings"."embedding" IS NOT NULL
ORDER BY "llm_embeddings"."embedding" <-> '[-0.0017242150271110192,-0.029317252896789353,<.....>,0.024415132566991064]'
LIMIT 5;

into

SELECT "llm_embeddings"."id", "llm_embeddings"."source_type", "llm_embeddings"."source_id", "llm_embeddings"."created_at", "llm_embeddings"."updated_at", "llm_embeddings"."created_by", "llm_embeddings"."updated_by", "llm_embeddings"."llm_model_id",
  "llm_embeddings"."embedding" <-> '[-0.0017242150271110192,-0.029317252896789353,<.....>,0.024415132566991064]' AS neighbor_distance
FROM "llm_embeddings"
WHERE "llm_embeddings"."source_type" = 'LlmSource' AND "llm_embeddings"."embedding" IS NOT NULL
ORDER BY neighbor_distance
LIMIT 5;

When the vector has many hundreds or thousands of dimensions, this can really help clean up log files.
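The idea can be sketched in a few lines of plain Ruby. This is only an illustration of the rule described above, not the gem's code: the `distance_clauses` helper and the `ALIAS_NAME` constant are hypothetical names, and the three-element vector is a made-up stand-in for a real embedding.

```ruby
# Toy sketch of the proposed simplification; not the neighbor gem's code.
# ALIAS_NAME and distance_clauses are hypothetical names for illustration.
ALIAS_NAME = "neighbor_distance"

# Returns [select_fragment, order_fragment] for a distance query.
def distance_clauses(select_expr, order_expr)
  select_sql = "#{select_expr} AS #{ALIAS_NAME}"
  # Reuse the alias only when both expressions are textually identical;
  # otherwise keep the full ORDER BY expression.
  order_sql = select_expr == order_expr ? ALIAS_NAME : order_expr
  [select_sql, order_sql]
end

vector = "[0.1,0.2,0.3]" # made-up example vector
expr = %("llm_embeddings"."embedding" <-> '#{vector}')

select_sql, order_sql = distance_clauses(expr, expr)
puts %(SELECT #{select_sql} FROM "llm_embeddings" ORDER BY #{order_sql} LIMIT 5)
```

With matching expressions the vector literal appears once in the generated SQL instead of twice. If the two expressions differ in any way, the sketch falls back to the full ORDER BY expression.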

ankane commented 5 months ago

Hi @moracca, thanks for the PR. However, this will cause issues with methods that change the SELECT clause afterwards, like reselect and pluck (see the failing test case).
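To see why replacing the SELECT list afterwards is a problem, here is a minimal model of the interaction. It is a toy relation class written for this illustration, not ActiveRecord and not the gem's implementation; `ToyRelation`, `order_by`, and `dangling_alias?` are hypothetical names. It only assumes that `reselect` (and, similarly, `pluck`) swaps out the SELECT list while leaving ORDER BY untouched.

```ruby
# Toy model of a relation; NOT ActiveRecord. All names are hypothetical.
class ToyRelation
  attr_reader :selects, :order

  def initialize(table, selects = [], order = nil)
    @table = table
    @selects = selects
    @order = order
  end

  # Appends expressions to the SELECT list.
  def select(*exprs)
    ToyRelation.new(@table, @selects + exprs, @order)
  end

  # Replaces the SELECT list but keeps ORDER BY as it was.
  def reselect(*exprs)
    ToyRelation.new(@table, exprs, @order)
  end

  def order_by(expr)
    ToyRelation.new(@table, @selects, expr)
  end

  def to_sql
    "SELECT #{@selects.join(', ')} FROM #{@table} ORDER BY #{@order}"
  end

  # True when ORDER BY names an alias the SELECT list no longer defines.
  def dangling_alias?(alias_name)
    @order == alias_name && @selects.none? { |s| s.end_with?(" AS #{alias_name}") }
  end
end

distance = %("embedding" <-> '[0.1,0.2]') # made-up example vector
base = ToyRelation.new("items").select("id", "#{distance} AS neighbor_distance")

by_alias = base.order_by("neighbor_distance") # the PR's approach
by_expr  = base.order_by(distance)            # repeating the full expression

puts by_alias.reselect("id").to_sql
puts by_expr.reselect("id").to_sql
```

In the first printed query, ORDER BY refers to neighbor_distance even though the SELECT list no longer defines it, which a database would reject as an unknown column. The second query repeats the full expression and stays valid on its own.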