ankane / neighbor

Nearest neighbor search for Rails
MIT License

Neighbor

Nearest neighbor search for Rails

Supports Postgres (cube and pgvector), SQLite (sqlite-vec), MariaDB, and MySQL.

Installation

Add this line to your application’s Gemfile:

gem "neighbor"

For Postgres

Neighbor supports two extensions: cube and pgvector. cube ships with Postgres, while pgvector supports more dimensions and approximate nearest neighbor search.

For cube, run:

rails generate neighbor:cube
rails db:migrate

For pgvector, install the extension and run:

rails generate neighbor:vector
rails db:migrate

For SQLite

Add this line to your application’s Gemfile:

gem "sqlite-vec"

And run:

rails generate neighbor:sqlite

Getting Started

Create a migration

class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
  def change
    # keep only the line that matches your database

    # cube
    add_column :items, :embedding, :cube

    # pgvector and MySQL
    add_column :items, :embedding, :vector, limit: 3 # dimensions

    # sqlite-vec and MariaDB
    add_column :items, :embedding, :binary
  end
end

Add to your model

class Item < ApplicationRecord
  has_neighbors :embedding
end

Update the vectors

item.update(embedding: [1.0, 1.2, 0.5])

Get the nearest neighbors to a record

item.nearest_neighbors(:embedding, distance: "euclidean").first(5)

Get the nearest neighbors to a vector

Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean").first(5)

Records returned from nearest_neighbors will have a neighbor_distance attribute

nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
nearest_item.neighbor_distance
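
For intuition, the euclidean distance is the straight-line distance between two vectors. The sketch below reproduces that calculation in plain Ruby for the two vectors used above. The euclidean_distance helper is hypothetical and not part of Neighbor, which computes distances in the database.

```ruby
# Hypothetical helper for illustration only; not part of Neighbor's API.
def euclidean_distance(a, b)
  raise ArgumentError, "dimensions must match" unless a.size == b.size

  # Square root of the sum of squared element-wise differences
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

stored = [1.0, 1.2, 0.5]
query = [0.9, 1.3, 1.1]
puts euclidean_distance(stored, query).round(4)
```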

See the additional docs below for cube, pgvector, sqlite-vec, MariaDB, and MySQL, or check out some examples.

cube

Distance

Supported values are euclidean, cosine, taxicab, and chebyshev.

For cosine distance with cube, vectors must be normalized before being stored.

class Item < ApplicationRecord
  has_neighbors :embedding, normalize: true
end
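
One way to see why normalization matters here: cube provides Euclidean, taxicab, and Chebyshev operators but no cosine operator, and for unit-length vectors the squared Euclidean distance equals twice the cosine distance, so both give the same ordering. The sketch below checks that identity in plain Ruby. All helper names are hypothetical, and this is not a description of Neighbor's internals.

```ruby
# Hypothetical helpers for illustration only; not part of Neighbor's API.
def normalize(v)
  norm = Math.sqrt(v.sum { |x| x * x })
  v.map { |x| x / norm }
end

def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1 - dot / (norm_a * norm_b)
end

def squared_euclidean(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

a = [1.0, 1.2, 0.5]
b = [0.9, 1.3, 1.1]

# For unit vectors: |a - b|^2 = 2 - 2 * cos(a, b) = 2 * cosine_distance(a, b)
lhs = squared_euclidean(normalize(a), normalize(b))
rhs = 2 * cosine_distance(a, b)
puts((lhs - rhs).abs < 1e-9)
```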

For inner product with cube, see this example.

Dimensions

The cube type can have up to 100 dimensions by default. See the Postgres docs for how to increase this.

For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.

class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3
end

pgvector

Distance

Supported values are euclidean, inner_product, cosine, taxicab, hamming, and jaccard.

Dimensions

The vector type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.

The halfvec type can have up to 16,000 dimensions, and half vectors with up to 4,000 dimensions can be indexed.

The bit type can have up to 83 million dimensions, and bit vectors with up to 64,000 dimensions can be indexed.

The sparsevec type can have up to 16,000 non-zero elements, and sparse vectors with up to 1,000 non-zero elements can be indexed.

Indexing

Add an approximate index to speed up queries. Create a migration with:

class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.0]
  def change
    add_index :items, :embedding, using: :hnsw, opclass: :vector_l2_ops
    # or
    add_index :items, :embedding, using: :ivfflat, opclass: :vector_l2_ops
  end
end

Use :vector_cosine_ops for cosine distance and :vector_ip_ops for inner product.

Set the size of the dynamic candidate list with HNSW

Item.connection.execute("SET hnsw.ef_search = 100")

Or the number of probes with IVFFlat

Item.connection.execute("SET ivfflat.probes = 3")

Half-Precision Vectors

Use the halfvec type to store half-precision vectors

class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
  def change
    add_column :items, :embedding, :halfvec, limit: 3 # dimensions
  end
end

Half-Precision Indexing

Index vectors at half precision for smaller indexes

class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.0]
  def change
    add_index :items, "(embedding::halfvec(3)) vector_l2_ops", using: :hnsw
  end
end

Get the nearest neighbors

Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean", precision: "half").first(5)

Binary Vectors

Use the bit type to store binary vectors

class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
  def change
    add_column :items, :embedding, :bit, limit: 3 # dimensions
  end
end

Get the nearest neighbors by Hamming distance

Item.nearest_neighbors(:embedding, "101", distance: "hamming").first(5)
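
Hamming distance counts the positions at which two bit strings differ. A minimal plain-Ruby sketch of the calculation follows. The hamming_distance helper and the sample values are hypothetical, and the database performs this comparison for bit columns.

```ruby
# Hypothetical helper for illustration only; not part of Neighbor's API.
def hamming_distance(a, b)
  raise ArgumentError, "lengths must match" unless a.length == b.length

  # Count the positions where the two bit strings differ
  a.each_char.zip(b.each_char).count { |x, y| x != y }
end

query = "101"
stored = ["101", "100", "010"]
stored.each { |bits| puts "#{bits}: #{hamming_distance(query, bits)}" }
```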

Binary Quantization

Use expression indexing for binary quantization

class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.0]
  def change
    add_index :items, "(binary_quantize(embedding)::bit(3)) bit_hamming_ops", using: :hnsw
  end
end
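
For intuition, binary quantization in pgvector maps each element greater than zero to 1 and every other element to 0, so the index compares compact bit strings by Hamming distance. The plain-Ruby sketch below mirrors that mapping. It is an illustration, not the extension's implementation, and the sample vector is made up.

```ruby
# Illustrative re-implementation in Ruby; pgvector does this inside Postgres.
def binary_quantize(vector)
  # Elements greater than zero become "1", everything else "0"
  vector.map { |x| x > 0 ? "1" : "0" }.join
end

puts binary_quantize([0.9, -1.3, 1.1])
```

Because quantization discards magnitude, results from such an index are approximate.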

Sparse Vectors

Use the sparsevec type to store sparse vectors

class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
  def change
    add_column :items, :embedding, :sparsevec, limit: 3 # dimensions
  end
end

Get the nearest neighbors

embedding = Neighbor::SparseVector.new({0 => 0.9, 1 => 1.3, 2 => 1.1}, 3)
Item.nearest_neighbors(:embedding, embedding, distance: "euclidean").first(5)
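
A sparse vector stores only its non-zero elements together with the total number of dimensions. The sketch below shows one way to build the index-to-value hash from a dense array, and how those elements would appear in pgvector's text format, which uses indices starting at 1. Both helpers are hypothetical and are not part of Neighbor.

```ruby
# Hypothetical helpers for illustration only; not part of Neighbor's API.

# Keep only the non-zero elements, keyed by zero-based index
def to_sparse(dense)
  dense.each_with_index
       .reject { |value, _| value.zero? }
       .to_h { |value, index| [index, value] }
end

# Render zero-based elements in pgvector's text format, where indices start at 1
def sparsevec_text(elements, dimensions)
  body = elements.map { |index, value| "#{index + 1}:#{value}" }.join(",")
  "{#{body}}/#{dimensions}"
end

dense = [0.0, 1.3, 0.0, 1.1]
elements = to_sparse(dense)
puts elements.inspect
puts sparsevec_text(elements, dense.size)
```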

sqlite-vec

Distance

Supported values are euclidean, cosine, taxicab, and hamming.

Dimensions

For sqlite-vec, it’s a good idea to specify the number of dimensions to ensure all records have the same number.

class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3
end

Virtual Tables

You can also use virtual tables

class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
  def change
    # Rails 8+
    create_virtual_table :items, :vec0, [
      "embedding float[3] distance_metric=L2"
    ]

    # Rails < 8
    execute <<~SQL
      CREATE VIRTUAL TABLE items USING vec0(
        embedding float[3] distance_metric=L2
      )
    SQL
  end
end

Use distance_metric=cosine for cosine distance

You can optionally ignore any shadow tables that are created

ActiveRecord::SchemaDumper.ignore_tables += [
  "items_chunks", "items_rowids", "items_vector_chunks00"
]

Create a model with rowid as the primary key

class Item < ApplicationRecord
  self.primary_key = "rowid"

  has_neighbors :embedding, dimensions: 3
end

Get the k nearest neighbors

Item.where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)

Filter by primary key

Item.where(rowid: [2, 3]).where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)

Int8 Vectors

Use the type option for int8 vectors

class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3, type: :int8
end

Binary Vectors

Use the type option for binary vectors

class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 8, type: :bit
end

Get the nearest neighbors by Hamming distance

Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)
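
The query above is a packed byte string: with 8 dimensions, "\x05" holds the bits 00000101. A short sketch of converting between the two forms with Ruby's pack and unpack1 follows. The helper names are hypothetical.

```ruby
# Hypothetical helpers for illustration only; not part of Neighbor's API.

# Pack a string of "0"/"1" characters into bytes, most significant bit first
def pack_bits(bits)
  [bits].pack("B*")
end

# Expand packed bytes back into a string of "0"/"1" characters
def unpack_bits(bytes)
  bytes.unpack1("B*")
end

puts pack_bits("00000101") == "\x05".b
puts unpack_bits("\x05")
```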

MariaDB

Distance

Supported values are euclidean, cosine, and hamming.

For cosine distance with MariaDB, vectors must be normalized before being stored.

class Item < ApplicationRecord
  has_neighbors :embedding, normalize: true
end

Indexing

Vector columns must use null: false to add a vector index

class CreateItems < ActiveRecord::Migration[8.0]
  def change
    create_table :items do |t|
      t.binary :embedding, null: false
      t.index :embedding, type: :vector
    end
  end
end

Binary Vectors

Use the bigint type to store binary vectors

class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
  def change
    add_column :items, :embedding, :bigint
  end
end

Note: Binary vectors can have up to 64 dimensions

Get the nearest neighbors by Hamming distance

Item.nearest_neighbors(:embedding, 5, distance: "hamming").first(5)
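
Here each binary vector is an integer whose bits are the vector's elements, so 5 stands for the bits 101. The Hamming distance between two such integers is the number of set bits in their XOR. A minimal plain-Ruby sketch, with a hypothetical helper name and assuming non-negative integers:

```ruby
# Hypothetical helper for illustration only; assumes non-negative integers.
def integer_hamming_distance(a, b)
  # XOR leaves a 1 wherever the bits differ; count those 1s
  (a ^ b).to_s(2).count("1")
end

puts integer_hamming_distance(5, 7) # 101 vs 111
puts integer_hamming_distance(5, 2) # 101 vs 010
```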

MySQL

Distance

Supported values are euclidean, cosine, and hamming.

Note: The DISTANCE() function is only available on HeatWave

Binary Vectors

Use the binary type to store binary vectors

class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
  def change
    add_column :items, :embedding, :binary
  end
end

Get the nearest neighbors by Hamming distance

Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)

Examples

OpenAI Embeddings

Generate a model

rails generate model Document content:text embedding:vector{1536}
rails db:migrate

And add has_neighbors

class Document < ApplicationRecord
  has_neighbors :embedding
end

Create a method to call the embeddings API

require "json"
require "net/http"

def fetch_embeddings(input)
  url = "https://api.openai.com/v1/embeddings"
  headers = {
    "Authorization" => "Bearer #{ENV.fetch("OPENAI_API_KEY")}",
    "Content-Type" => "application/json"
  }
  data = {
    input: input,
    model: "text-embedding-3-small"
  }

  response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
  JSON.parse(response.body)["data"].map { |v| v["embedding"] }
end

Pass your input

input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = fetch_embeddings(input)

Store the embeddings

documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)

And get similar documents

document = Document.first
document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)

See the complete code

Cohere Embeddings

Generate a model

rails generate model Document content:text embedding:bit{1024}
rails db:migrate

And add has_neighbors

class Document < ApplicationRecord
  has_neighbors :embedding
end

Create a method to call the embed API

require "json"
require "net/http"

def fetch_embeddings(input, input_type)
  url = "https://api.cohere.com/v1/embed"
  headers = {
    "Authorization" => "Bearer #{ENV.fetch("CO_API_KEY")}",
    "Content-Type" => "application/json"
  }
  data = {
    texts: input,
    model: "embed-english-v3.0",
    input_type: input_type,
    embedding_types: ["ubinary"]
  }

  response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
  JSON.parse(response.body)["embeddings"]["ubinary"].map { |e| e.map { |v| v.chr.unpack1("B*") }.join }
end
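
The last line of fetch_embeddings turns each embedding, which the code treats as an array of byte values, into a string of bits suitable for a bit column. The sketch below isolates that step with made-up sample bytes. The bytes_to_bit_string name is hypothetical.

```ruby
# Hypothetical helper isolating the conversion used in fetch_embeddings.
def bytes_to_bit_string(bytes)
  # Integer#chr gives a one-byte string; unpack1("B*") expands it to 8 bits
  bytes.map { |v| v.chr.unpack1("B*") }.join
end

sample = [5, 255] # made-up byte values, not real API output
puts bytes_to_bit_string(sample)
```

With this conversion, 128 bytes produce the 1024 bits expected by the bit{1024} column.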

Pass your input

input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = fetch_embeddings(input, "search_document")

Store the embeddings

documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)

Embed the search query

query = "forest"
query_embedding = fetch_embeddings([query], "search_query")[0]

And search the documents

Document.nearest_neighbors(:embedding, query_embedding, distance: "hamming").first(5).map(&:content)

See the complete code

Sentence Embeddings

You can generate embeddings locally with Informers.

Generate a model

rails generate model Document content:text embedding:vector{384}
rails db:migrate

And add has_neighbors

class Document < ApplicationRecord
  has_neighbors :embedding
end

Load a model

model = Informers.pipeline("embedding", "sentence-transformers/all-MiniLM-L6-v2")

Pass your input

input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = model.(input)

Store the embeddings

documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)

And get similar documents

document = Document.first
document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)

See the complete code

Hybrid Search

You can use Neighbor for hybrid search with Informers.

Generate a model

rails generate model Document content:text embedding:vector{768}
rails db:migrate

And add has_neighbors and a scope for keyword search

class Document < ApplicationRecord
  has_neighbors :embedding

  scope :search, ->(query) {
    where("to_tsvector(content) @@ plainto_tsquery(?)", query)
      .order(Arel.sql("ts_rank_cd(to_tsvector(content), plainto_tsquery(?)) DESC", query))
  }
end

Create some documents

Document.create!(content: "The dog is barking")
Document.create!(content: "The cat is purring")
Document.create!(content: "The bear is growling")

Generate an embedding for each document

embed = Informers.pipeline("embedding", "Snowflake/snowflake-arctic-embed-m-v1.5")
embed_options = {model_output: "sentence_embedding", pooling: "none"} # specific to embedding model

Document.find_each do |document|
  embedding = embed.(document.content, **embed_options)
  document.update!(embedding: embedding)
end

Perform keyword search

query = "growling bear"
keyword_results = Document.search(query).limit(20).load_async

And semantic search in parallel (the query prefix is specific to the embedding model)

query_prefix = "Represent this sentence for searching relevant passages: "
query_embedding = embed.(query_prefix + query, **embed_options)
semantic_results =
  Document.nearest_neighbors(:embedding, query_embedding, distance: "cosine").limit(20).load_async

To combine the results, use Reciprocal Rank Fusion (RRF)

Neighbor::Reranking.rrf(keyword_results, semantic_results).first(5)
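
For intuition, Reciprocal Rank Fusion scores each item by summing 1 / (k + rank) over every list in which it appears, then sorts by that score. The sketch below is a plain-Ruby illustration of the technique, not Neighbor's implementation. The method name, the sample lists, and the constant k = 60 (a commonly used value) are assumptions.

```ruby
# Illustrative RRF sketch; not Neighbor::Reranking's implementation.
def reciprocal_rank_fusion(*rankings, k: 60)
  scores = Hash.new(0.0)
  rankings.each do |ranking|
    ranking.each_with_index do |item, index|
      # Ranks are 1-based, so the top item contributes 1 / (k + 1)
      scores[item] += 1.0 / (k + index + 1)
    end
  end
  # Highest combined score first
  scores.sort_by { |_, score| -score }.map(&:first)
end

keyword = ["bear", "dog"]
semantic = ["bear", "cat", "dog"]
puts reciprocal_rank_fusion(keyword, semantic).inspect
```

Items that rank well in both lists rise to the top, and no score calibration between the keyword and semantic rankers is needed.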

Or a reranking model

rerank = Informers.pipeline("reranking", "mixedbread-ai/mxbai-rerank-xsmall-v1")
results = (keyword_results + semantic_results).uniq
rerank.(query, results.map(&:content)).first(5).map { |v| results[v[:doc_id]] }

See the complete code

Sparse Search

You can generate sparse embeddings locally with Transformers.rb.

Generate a model

rails generate model Document content:text embedding:sparsevec{30522}
rails db:migrate

And add has_neighbors

class Document < ApplicationRecord
  has_neighbors :embedding
end

Load a model to generate embeddings

class EmbeddingModel
  def initialize(model_id)
    @model = Transformers::AutoModelForMaskedLM.from_pretrained(model_id)
    @tokenizer = Transformers::AutoTokenizer.from_pretrained(model_id)
    @special_token_ids = @tokenizer.special_tokens_map.map { |_, token| @tokenizer.vocab[token] }
  end

  def embed(input)
    feature = @tokenizer.(input, padding: true, truncation: true, return_tensors: "pt", return_token_type_ids: false)
    output = @model.(**feature)[0]
    values = Torch.max(output * feature[:attention_mask].unsqueeze(-1), dim: 1)[0]
    values = Torch.log(1 + Torch.relu(values))
    values[0.., @special_token_ids] = 0
    values.to_a
  end
end

model = EmbeddingModel.new("opensearch-project/opensearch-neural-sparse-encoding-v1")
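
The embed method pools per-token scores into one weight per vocabulary entry: it masks padded tokens, takes the maximum over tokens, and applies log(1 + relu(x)). The sketch below repeats that pooling for a single input with plain Ruby arrays in place of tensors. The sparse_pool name and the tiny sample values are hypothetical, and zeroing of special tokens is omitted.

```ruby
# Hypothetical plain-Ruby version of the pooling step for one input.
# token_logits: one row of vocabulary scores per token
# attention_mask: 1 for real tokens, 0 for padding
def sparse_pool(token_logits, attention_mask)
  # Zero out the rows that belong to padding tokens
  masked = token_logits.zip(attention_mask).map { |row, m| row.map { |x| x * m } }
  # Maximum over tokens for each vocabulary entry
  maxes = masked.transpose.map(&:max)
  # log(1 + relu(x)) keeps weights non-negative and dampens large values
  maxes.map { |x| Math.log(1 + [x, 0.0].max) }
end

logits = [[1.0, -2.0], [0.5, 3.0]]
mask = [1, 0] # second token is padding
puts sparse_pool(logits, mask).inspect
```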

Pass your input

input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = model.embed(input)

Store the embeddings

documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: Neighbor::SparseVector.new(embedding)}
end
Document.insert_all!(documents)

Embed the search query

query = "forest"
query_embedding = model.embed([query])[0]

And search the documents

Document.nearest_neighbors(:embedding, Neighbor::SparseVector.new(query_embedding), distance: "inner_product").first(5).map(&:content)

See the complete code

Disco Recommendations

You can use Neighbor for online item-based recommendations with Disco. We’ll use MovieLens data for this example.

Generate a model

rails generate model Movie name:string factors:cube
rails db:migrate

And add has_neighbors

class Movie < ApplicationRecord
  has_neighbors :factors, dimensions: 20, normalize: true
end

Fit the recommender

data = Disco.load_movielens
recommender = Disco::Recommender.new(factors: 20)
recommender.fit(data)

Store the item factors

movies = []
recommender.item_ids.each do |item_id|
  movies << {name: item_id, factors: recommender.item_factors(item_id)}
end
Movie.create!(movies)

And get similar movies

movie = Movie.find_by(name: "Star Wars (1977)")
movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)

See the complete code for cube and pgvector

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. You can report bugs, fix bugs and submit pull requests, write, clarify, or fix documentation, and suggest or add new features.

To get started with development:

git clone https://github.com/ankane/neighbor.git
cd neighbor
bundle install

# Postgres
createdb neighbor_test
bundle exec rake test:postgresql

# SQLite
bundle exec rake test:sqlite

# MariaDB
docker run -e MARIADB_ALLOW_EMPTY_ROOT_PASSWORD=1 -e MARIADB_DATABASE=neighbor_test -p 3307:3306 quay.io/mariadb-foundation/mariadb-devel:11.6-vector-preview
bundle exec rake test:mariadb

# MySQL
docker run -e MYSQL_ALLOW_EMPTY_PASSWORD=1 -e MYSQL_DATABASE=neighbor_test -p 3306:3306 mysql:9
bundle exec rake test:mysql