datafuselabs / databend

๐——๐—ฎ๐˜๐—ฎ, ๐—”๐—ป๐—ฎ๐—น๐˜†๐˜๐—ถ๐—ฐ๐˜€ & ๐—”๐—œ. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com
Other

feat: Embedding Model for Databend #10689

Open BohuTANG opened 1 year ago

BohuTANG commented 1 year ago

Summary

Tasks

Introduction

An embedding model is designed to map high-dimensional data into a lower-dimensional vector space, which facilitates various applications such as NLP, recommendation systems, and anomaly detection.

Obtaining Embedding Vectors with OpenAI API

To obtain embedding vectors from the OpenAI API, call its embeddings endpoint, which is backed by OpenAI's pre-trained language models. Below is a Python example:

import openai

openai.api_key = "your_openai_api_key"

def get_embedding(text):
    # Call the embeddings endpoint rather than the completions endpoint;
    # completions return generated text, not an embedding vector.
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text,
    )
    # The embedding is a list of floats (1536 dimensions for ada-002).
    return response["data"][0]["embedding"]

text = "Databend warehouse"
embedding = get_embedding(text)
print(embedding)

Storing Embedding Vectors in Databend

To store the embedding vectors returned by the OpenAI API in Databend, create a table with a column of type VECTOR (an alias for ARRAY(FLOAT32), which can carry an IVF PQ index) to hold the vectors. Assuming you have connected to a Databend instance:

CREATE TABLE embeddings (
    id INT,
    text VARCHAR NOT NULL,
    vector VECTOR NOT NULL
);

Computing the Distance Between Vectors in Databend

Databend can compute the distance between a query vector and stored vectors using a built-in function called cosine_distance. This function calculates the distance between two ARRAY(FLOAT32) inputs and can be used directly in SQL queries.
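
To make the semantics concrete, here is a minimal pure-Python sketch of what a cosine-distance function computes, assuming the common definition of 1 minus cosine similarity (the exact Databend implementation may differ in edge-case handling):

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity = 1 - (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Identical vectors have distance 0; orthogonal vectors have distance 1.
print(cosine_distance([0.12, 0.34, -0.56, 0.78], [0.11, 0.33, -0.55, 0.77]))
```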

However, calculating the distance between a query vector and every stored vector becomes computationally expensive and slow with large-scale datasets and high-dimensional vectors. To tackle this issue, we propose two techniques: inverted file (IVF) indexing, which partitions the vectors into clusters so a query only scans the most promising clusters, and product quantization (PQ), which compresses each vector into compact codes.

The IVF PQ index is a combination of these techniques: the database vectors are first quantized using product quantization, and an inverted file is then created to index the quantized vectors. This approach allows a fast, memory-efficient search for approximate nearest neighbors in high-dimensional vector spaces, which is particularly beneficial in large-scale multimedia retrieval systems.
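
To illustrate the inverted-file half of the idea, here is a toy pure-Python sketch (not Databend code). It assumes the centroids have already been computed, e.g. by k-means, and it omits the product-quantization compression step for brevity; the helper names `build_ivf` and `ivf_search` are hypothetical:

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    # Inverted file: each centroid owns the list of vectors closest to it.
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda i: l2(v, centroids[i]))
        lists[nearest].append((vid, v))
    return lists

def ivf_search(query, centroids, lists, nprobe=1):
    # Scan only the nprobe inverted lists whose centroids are closest to
    # the query, instead of comparing against every stored vector.
    order = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    candidates = [item for i in order[:nprobe] for item in lists[i]]
    return min(candidates, key=lambda item: l2(query, item[1]))
```

Increasing `nprobe` trades speed for recall: probing more lists scans more candidates but is less likely to miss the true nearest neighbor.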

Example SQL Queries

CREATE TABLE embeddings (
    id INT,
    text VARCHAR NOT NULL,
    vector VECTOR NOT NULL
);

Insert sample data

INSERT INTO embeddings (id, text, vector) VALUES
(1, 'Databend warehouse', ARRAY[0.12, 0.34, -0.56, 0.78]),
(2, 'Data warehouse', ARRAY[-0.15, 0.37, 0.29, -0.22]);

Query

WITH query_vector AS (
    SELECT ARRAY[0.11, 0.33, -0.55, 0.77] AS vector
)
SELECT id, text, cosine_distance(embeddings.vector, query_vector.vector) AS distance
FROM embeddings, query_vector
ORDER BY distance ASC
LIMIT 1;
mokeyish commented 1 year ago

vector_distance: Similarity Metrics

BohuTANG commented 1 year ago

From openai doc: https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use

We recommend [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). The choice of distance function typically doesn't matter much.

OpenAI embeddings are normalized to length 1, which means that:

Cosine similarity can be computed slightly faster using just a dot product
Cosine similarity and Euclidean distance will result in the identical rankings
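
As a quick sanity check of that claim (a pure-Python sketch, not Databend code): for unit-length vectors, |a - b|² = 2 - 2(a · b), so ranking by dot product and ranking by Euclidean distance must agree.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = normalize([0.2, 0.5, 0.1])
docs = [normalize(v) for v in ([0.1, 0.9, 0.0], [0.7, 0.1, 0.2], [0.3, 0.4, 0.5])]

# Sorting by dot product (descending) gives the same order as sorting by
# Euclidean distance (ascending) when every vector has length 1.
by_dot = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
by_l2 = sorted(range(len(docs)), key=lambda i: euclidean(query, docs[i]))
assert by_dot == by_l2
```
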
thatcort commented 1 year ago

When is the vector index feature expected to be complete?

BohuTANG commented 1 year ago

When is the vector index feature expected to be complete?

Indeed, there is a PR already https://github.com/datafuselabs/databend/pull/11318

But a lot of work still needs to be done. In the Databend user cases we have seen, the data is not large, so we set this ticket to low priority.

thatcort commented 1 year ago

I'm considering Databend for querying over large data sets of text and vectors. Vector indexing would allow replacing the current vector DB and save a lot of money by using object storage. Would be great if you raised the priority of that feature!

BohuTANG commented 12 months ago

Thank you for your explanation. We will raise the priority of this feature, but there is still no definite expected time, as there are many higher-priority tasks that need to be completed.

thatcort commented 8 months ago

Another library worth looking at for vector ann support is USearch: https://unum-cloud.github.io/usearch/