Open BohuTANG opened 1 year ago
From the OpenAI docs: https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use
We recommend [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). The choice of distance function typically doesn't matter much.
OpenAI embeddings are normalized to length 1, which means that:
Cosine similarity can be computed slightly faster using just a dot product
Cosine similarity and Euclidean distance will result in the identical rankings
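The two points above follow from the identity ||a - b||^2 = 2 - 2(a . b) for unit vectors. A minimal sketch that checks it with NumPy; the helper names and the random test vectors are made up for illustration and do not come from the OpenAI docs:

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def rank_by_cosine(query, docs):
    # For unit vectors the dot product equals cosine similarity.
    # Higher similarity first, hence the negation.
    return np.argsort(-(docs @ query), kind="stable")

def rank_by_euclidean(query, docs):
    # Smaller distance first.
    return np.argsort(np.linalg.norm(docs - query, axis=1), kind="stable")

# Hypothetical data: 50 random 16-dimensional unit vectors.
rng = np.random.default_rng(0)
docs = normalize(rng.normal(size=(50, 16)))
query = normalize(rng.normal(size=16))

same_order = np.array_equal(rank_by_cosine(query, docs),
                            rank_by_euclidean(query, docs))
print("rankings identical:", same_order)
```

This only holds when every vector has length 1. For unnormalized vectors the two rankings can differ.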
When is the vector index feature expected to be complete?
Indeed, there is already a PR: https://github.com/datafuselabs/databend/pull/11318
But a lot of work still needs to be done. Judging from our users' cases, their data is not large, so we have set this ticket to low priority.
I'm considering Databend for querying over large data sets of text and vectors. Vector indexing would allow replacing the current vector DB and save a lot of money by using object storage. Would be great if you raised the priority of that feature!
Thank you for your explanation. We will raise the priority of this feature, but there is still no definite expected time, as there are many higher-priority tasks that need to be completed.
Another library worth looking at for vector ANN (approximate nearest neighbor) support is USearch: https://unum-cloud.github.io/usearch/
Summary
Tasks
ai_embedding_vector(<string>) to get data vectors from the OpenAI API
VECTOR data type
Introduction
An embedding model is designed to map high-dimensional data into a lower-dimensional vector space, which facilitates various applications such as NLP, recommendation systems, and anomaly detection.
Obtaining Embedding Vectors with OpenAI API
To extract embedding vectors using the OpenAI API, utilize OpenAI's pre-trained language models. Below is a Python example:
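A minimal sketch of such a call, assuming the `openai` Python package (v1 or later client interface) and the `text-embedding-ada-002` model. The helper name `embed_texts` is made up for illustration, and the snippet from the original post may have looked different:

```python
def embed_texts(client, texts, model="text-embedding-ada-002"):
    """Return one embedding (a list of floats) per input string.

    `client` is assumed to be an `openai.OpenAI` instance, or any object
    exposing the same `embeddings.create(model=..., input=...)` method.
    """
    response = client.embeddings.create(model=model, input=texts)
    # The response carries one item per input; each has an `.embedding` list.
    return [item.embedding for item in response.data]

# Usage (needs `pip install openai` and the OPENAI_API_KEY environment variable):
#
#   from openai import OpenAI
#   client = OpenAI()
#   vectors = embed_texts(client, ["Databend is a cloud data warehouse"])
#   print(len(vectors[0]))  # dimensionality of the embedding
```

Passing the client in as an argument keeps the helper easy to test without network access.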
Storing Embedding Vectors in Databend
To store the embedding vectors returned by the OpenAI API in Databend, create a table with a column of type Vector (an alias of Array(Float32), which can carry an IVF PQ index) for holding the vectors. Assuming you have connected to a Databend instance:
Computing the Distance Between Vectors in Databend
Databend can compute the distance between a query vector and stored vectors using a built-in function called cosine_distance. This function calculates the distance between two ARRAY(FLOAT32) inputs and can be used directly in SQL queries.
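For reference, here is a plain-Python sketch of what such a function computes, assuming cosine_distance is defined as 1 minus cosine similarity. It is not Databend's implementation, and `brute_force_top_k` is a hypothetical helper that mimics an ORDER BY distance LIMIT k scan over every row:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two float32 vectors (assumed definition)."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def brute_force_top_k(query, rows, k):
    """Score every (row_id, vector) pair and keep the k closest.

    The cost grows linearly with the number of rows and with the
    vector dimension, since no index is used.
    """
    scored = [(cosine_distance(query, vec), row_id) for row_id, vec in rows]
    return sorted(scored)[:k]

# Hypothetical stored rows: (id, vector)
rows = [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [0.9, 0.1])]
for dist, row_id in brute_force_top_k([1.0, 0.0], rows, k=2):
    print(row_id, round(dist, 4))
```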
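However, calculating the distance between the query vector and every stored vector becomes computationally expensive and slow with large-scale datasets and high-dimensional vectors. To tackle this issue, we propose two techniques: product quantization (PQ), which compresses each vector into a short code, and an inverted file (IVF), which groups vectors so that a query only has to scan part of the data.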
The IVF PQ index is a combination of these techniques, where the database vectors are first quantized using product quantization, followed by the creation of an inverted file to index the quantized vectors. This approach allows for a fast and memory-efficient search of approximate nearest neighbors in high-dimensional vector spaces, particularly beneficial in large-scale multimedia retrieval systems.
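To make the idea concrete, here is a toy NumPy sketch of an IVF PQ index. It follows the common layout in which a coarse k-means builds the inverted lists and product quantization then encodes the residuals. That ordering, the class name `IvfPq`, and all parameters are assumptions for illustration, not the design of the Databend PR:

```python
import numpy as np

def nearest(x, centroids):
    """Index of the closest centroid (squared L2) for each row of x."""
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = nearest(x, centroids)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # keep the old centroid if a cell is empty
                centroids[j] = members.mean(axis=0)
    return centroids, nearest(x, centroids)

class IvfPq:
    def __init__(self, n_lists=4, n_sub=2, n_codes=8):
        self.n_lists, self.n_sub, self.n_codes = n_lists, n_sub, n_codes

    def fit(self, x):
        # Inverted file: each vector goes to the list of its nearest coarse centroid.
        self.coarse, self.lists = kmeans(x, self.n_lists)
        residual = x - self.coarse[self.lists]
        # Product quantization: split residuals into sub-vectors, one codebook each.
        self.codebooks, codes = [], []
        for sub in np.split(residual, self.n_sub, axis=1):
            codebook, code = kmeans(sub, self.n_codes)
            self.codebooks.append(codebook)
            codes.append(code)
        self.codes = np.stack(codes, axis=1)  # shape (N, n_sub), small integers
        return self

    def search(self, q, k=3, n_probe=2):
        # Visit only the n_probe inverted lists closest to the query.
        probe = np.argsort(((self.coarse - q) ** 2).sum(axis=1))[:n_probe]
        all_ids = [np.empty(0, dtype=np.int64)]
        all_dist = [np.empty(0)]
        for c in probe:
            ids = np.flatnonzero(self.lists == c)
            q_sub = np.split(q - self.coarse[c], self.n_sub)
            dist = np.zeros(len(ids))
            for m, (codebook, qs) in enumerate(zip(self.codebooks, q_sub)):
                # Lookup table: distance from the query part to every code word.
                table = ((codebook - qs) ** 2).sum(axis=1)
                dist += table[self.codes[ids, m]]
            all_ids.append(ids)
            all_dist.append(dist)
        ids, dist = np.concatenate(all_ids), np.concatenate(all_dist)
        order = np.argsort(dist)[:k]
        return ids[order], dist[order]

# Hypothetical data: 200 random 8-dimensional vectors.
data = np.random.default_rng(1).normal(size=(200, 8)).astype(np.float32)
index = IvfPq().fit(data)
ids, dists = index.search(data[0], k=3)
print(ids, dists)
```

The returned distances are approximations computed from the compressed codes, so results can differ from an exact scan. Raising `n_probe` or `n_codes` trades speed and memory for accuracy.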
Example SQL Queries
Insert sample data
Query