Verifiable vector similarity queries over a committed vector database.
This project aims to build a proof-of-concept for a verifiable vector database using zero-knowledge proofs. We make heavy use of the awesome `ZKFixedPointChip`, which enables fixed-point arithmetic with halo2-lib.
You need Rust installed:

```sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
Then, you can clone this repository and use the chips inside it:

```sh
git clone https://github.com/erhant/halo2-vectordb.git
cd halo2-vectordb
```
We implement two chips, one for distance metrics in halo2, and the other for basic vector database operations.
### DistanceChip

`DistanceChip` provides distance metrics that operate on two vectors of equal length. The vector elements are expected to be quantized with the `FixedPointChip`. The following distance metrics are implemented (a plain-Rust reference sketch follows the list):
- `euclidean_distance` computes the Euclidean distance between two vectors.
- `manhattan_distance` computes the Manhattan distance between two vectors.
- `hamming_distance` computes one minus Hamming similarity between two vectors.
- `cosine_distance` computes one minus Cosine similarity between two vectors.
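For reference, the sketch below mirrors these definitions over plain `f64` values. It is illustrative only, not the in-circuit fixed-point implementation, and it assumes Hamming similarity is taken as the fraction of equal positions.

```rust
/// Illustrative f64 reference for the four metrics; the circuit versions
/// operate on FixedPointChip-quantized values instead.
fn euclidean_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

fn manhattan_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

/// One minus the Hamming similarity (fraction of equal positions).
fn hamming_distance(a: &[f64], b: &[f64]) -> f64 {
    let equal = a.iter().zip(b).filter(|(x, y)| x == y).count();
    1.0 - equal as f64 / a.len() as f64
}

/// One minus the cosine similarity.
fn cosine_distance(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f64]| v.iter().map(|x| x * x).sum::<f64>().sqrt();
    1.0 - dot / (norm(a) * norm(b))
}
```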
### VectorDBChip

`VectorDBChip` implements basic vector database functionality over a set of vectors. Similar to `DistanceChip`, it requires a `FixedPointChip` to operate over quantized values. It exposes the following functions (plain-Rust sketches of each follow the list):
- `nearest_vector` takes a set of vectors and a query vector, and finds the vector that is most similar to the query w.r.t. a given distance function. It also returns an indicator (i.e. a one-hot encoded vector that indicates the index of the result vector) which may be used at later steps.
- `merkle_commitment` takes a set of vectors and commits to them using a Merkle tree with Poseidon hashes. If the given set does not have a power-of-two number of elements, zeros are padded into the remaining leaves. In our scenario we only need the entire vector or none at all, so we do not care about committing to individual elements within a vector; as such, we first hash the entire vector and then treat that hash as the leaf node.
- `kmeans` takes a set of vectors, a `K` constant that determines the number of centroids, and an `I` constant that determines the number of iterations. K-means is usually an iterative algorithm that terminates when the centroids are no longer updated; however, such control flow is not possible in a zk-circuit. Therefore, the `I` parameter fixes the number of iterations.
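To make the nearest-vector step concrete, here is a plain-Rust sketch of an exhaustive search that also produces the one-hot indicator. The function name and signature are illustrative, not the chip's in-circuit API.

```rust
/// Illustrative sketch: exhaustive nearest-vector search with a one-hot indicator,
/// mirroring what the circuit computes over quantized values.
fn nearest_vector(
    vectors: &[Vec<f64>],
    query: &[f64],
    distance: impl Fn(&[f64], &[f64]) -> f64,
) -> (Vec<f64>, Vec<u64>) {
    // exhaustive argmin over the given distance function
    let mut best = 0;
    for (i, v) in vectors.iter().enumerate() {
        if distance(v, query) < distance(&vectors[best], query) {
            best = i;
        }
    }
    // one-hot indicator marking the index of the result vector
    let indicator = (0..vectors.len()).map(|i| (i == best) as u64).collect();
    (vectors[best].clone(), indicator)
}
```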
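The commitment step can be sketched as: hash each vector into a leaf, pad the leaves with zeros up to a power of two, and hash pairs upward into a single root. The `hash` helper below is a placeholder (Rust's `DefaultHasher`) standing in for the Poseidon hash used in the circuit, and the vectors are assumed to be already quantized to integers; the sketch conveys the tree shape only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Placeholder hash standing in for Poseidon; illustrative only.
fn hash(values: &[u64]) -> u64 {
    let mut h = DefaultHasher::new();
    values.hash(&mut h);
    h.finish()
}

/// Commit to a set of (quantized) vectors: hash each vector to a leaf,
/// pad the leaves with zeros up to a power of two, then hash pairs upward.
fn merkle_commitment(vectors: &[Vec<u64>]) -> u64 {
    // one leaf per vector: the hash of the entire vector
    let mut layer: Vec<u64> = vectors.iter().map(|v| hash(v)).collect();

    // pad with zero leaves up to the next power of two
    let padded_len = layer.len().next_power_of_two();
    layer.resize(padded_len, 0);

    // hash pairs of nodes until a single root remains
    while layer.len() > 1 {
        layer = layer.chunks(2).map(|pair| hash(&[pair[0], pair[1]])).collect();
    }
    layer[0]
}
```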
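Since a zk-circuit cannot branch on whether the centroids have converged, the number of iterations is fixed by `I`. The sketch below shows that constant-iteration shape in plain Rust; the initialization (first `K` vectors) and the use of squared Euclidean distance for assignment are assumptions made for illustration, not necessarily what the chip does.

```rust
/// Fixed-iteration k-means over f64 vectors: K centroids, exactly I iterations,
/// mirroring the constant control flow required inside a circuit.
fn kmeans<const K: usize, const I: usize>(vectors: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let dim = vectors[0].len();
    // assumption: initialize centroids from the first K vectors
    let mut centroids: Vec<Vec<f64>> = vectors[..K].to_vec();

    for _ in 0..I {
        // assign every vector to its nearest centroid (squared Euclidean distance)
        let mut sums = vec![vec![0.0; dim]; K];
        let mut counts = vec![0usize; K];
        for v in vectors {
            let (best, _) = centroids
                .iter()
                .enumerate()
                .map(|(i, c)| {
                    let d: f64 = c.iter().zip(v).map(|(a, b)| (a - b) * (a - b)).sum();
                    (i, d)
                })
                .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
                .unwrap();
            counts[best] += 1;
            for (s, x) in sums[best].iter_mut().zip(v) {
                *s += x;
            }
        }
        // recompute centroids as cluster means (keep the old centroid if a cluster is empty)
        for k in 0..K {
            if counts[k] > 0 {
                centroids[k] = sums[k].iter().map(|s| s / counts[k] as f64).collect();
            }
        }
    }
    centroids
}
```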
We also have a trait `FixedPointVectorInstructions` and its implementation for the `FixedPointChip`, which provide simple utility functions to quantize and dequantize vectors.
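Conceptually, quantization scales a real value by 2^p for some precision p and rounds to an integer, while dequantization divides the scale back out. The sketch below illustrates this idea in plain Rust; the precision value and the signed-integer representation are assumptions, and the actual `FixedPointChip` encodes quantized values as field elements.

```rust
/// Illustrative fixed-point quantization with PRECISION fractional bits.
/// The real FixedPointChip works over field elements; this sketch only
/// conveys the scaling idea.
const PRECISION: u32 = 32; // assumed number of fractional bits

fn quantize(x: f64) -> i64 {
    (x * (1u64 << PRECISION) as f64).round() as i64
}

fn dequantize(q: i64) -> f64 {
    q as f64 / (1u64 << PRECISION) as f64
}

fn quantize_vector(v: &[f64]) -> Vec<i64> {
    v.iter().map(|&x| quantize(x)).collect()
}

fn dequantize_vector(q: &[i64]) -> Vec<f64> {
    q.iter().map(|&x| dequantize(x)).collect()
}
```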
A demonstrative test suite can be found at `demo_test`, covering:

- `K` centroids & clusters,
- `K` centroids & clusters along with `K+2` merkle roots,
- the `f64` Rust implementation.

Run the examples via one of the following:
```sh
# demonstrate distance computations
LOOKUP_BITS=12 cargo run --example distances -- \
  --name distances -k 13 mock

# example merkle commitment to vectors
LOOKUP_BITS=12 cargo run --example merkle -- \
  --name merkle -k 13 mock

# exhaustively find the similar vector & commit to the database
LOOKUP_BITS=12 cargo run --example query -- \
  --name query -k 13 mock

# compute centroids
LOOKUP_BITS=15 cargo run --example kmeans -- \
  --name kmeans -k 16 mock
```
You can provide a specific input via the `--input <input-name>` option.
To run tests:

```sh
cargo test
```
Some of the tests make use of the `ANN_SIFT_10K` dataset by Jégou et al., which can be downloaded at Corpus-Texmex. This dataset contains 128-dimensional vectors. Within our tests folder, the `common` module exposes two functions to read these vectors (sketched below the list):
- `read_vectors_from_disk` takes a path and a dimension (128 in this case) and reads the `.fvec` files from the dataset, returning a single vector of `f32` values that contains all vector elements concatenated together.
- `select_from_vectors` takes that single vector and a list of indices, and returns the selected vectors only.
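For illustration, the sketch below shows one way such helpers could look, assuming the usual `.fvecs` layout (a 4-byte little-endian dimension header followed by that many `f32` values per vector). The names, signatures, and layout assumption are illustrative; the actual helpers in the `common` module may differ.

```rust
use std::fs;
use std::io::Result;

/// Sketch of reading the flat vector data, assuming the usual .fvecs layout:
/// each vector is a 4-byte little-endian dimension followed by `dim` f32s.
fn read_vectors_from_disk_sketch(path: &str, dim: usize) -> Result<Vec<f32>> {
    let bytes = fs::read(path)?;
    let record = 4 + 4 * dim; // dimension header + dim floats
    let mut flat = Vec::new();
    for chunk in bytes.chunks_exact(record) {
        // skip the 4-byte dimension header, parse the f32 payload
        for value in chunk[4..].chunks_exact(4) {
            flat.push(f32::from_le_bytes([value[0], value[1], value[2], value[3]]));
        }
    }
    Ok(flat)
}

/// Sketch: pick out whole vectors from the flat buffer by index.
fn select_from_vectors_sketch(flat: &[f32], dim: usize, indices: &[usize]) -> Vec<Vec<f32>> {
    indices
        .iter()
        .map(|&i| flat[i * dim..(i + 1) * dim].to_vec())
        .collect()
}
```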