SemaDB

No fuss multi-index hybrid vector database / search engine


SemaDB is a multi-index, multi-vector, document-based vector database / search engine. It is designed to offer a clear and easy-to-use JSON RESTful API. The original components of SemaDB were built for a knowledge-management project at Semafind before it was developed into a standalone project. The goal is to provide a simple, modern, and efficient search engine that can be used in a variety of applications.

Looking for a hosted solution? SemaDB Cloud Beta is available on RapidAPI.

Features ⚡

Getting Started

To get started from source, please follow the instructions to install Go. That is the only dependency required to run SemaDB. We try to keep SemaDB as self-contained as possible and up-to-date with the latest Go releases.

SemaDB reads all of its configuration from a YAML file; some example configurations are included in the config folder. You can run a single server using:

```bash
SEMADB_CONFIG=./config/singleServer.yaml go run ./
```

If you are using VS Code as your editor, there are pre-made tasks that do the same thing and can also launch a local cluster in debug mode.

After you have a server running, you can use the samples file to see example requests that can be made to the server. To make the most of it, install the REST Client extension, which lets you make requests directly in the editor and view the results.
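If you prefer to script requests, the following Go sketch shows the general shape of talking to the server over HTTP. The endpoint path and JSON body are placeholders rather than SemaDB's actual API; consult the samples file for the real routes and request bodies:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Placeholder endpoint and body: the real routes, headers, and request
	// shapes are listed in the samples file and the API documentation.
	body := bytes.NewBufferString(`{"query": "example"}`)
	resp, err := http.Post("http://localhost:8081/some/endpoint", "application/json", body)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Status, string(data))
}
```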

Docker & Podman

You can run the latest version of SemaDB using the container image published by this repository:

```bash
docker run -it --rm -v ./config:/config -e SEMADB_CONFIG=/config/singleServer.yaml -v ./data:/data -p 8081:8081 ghcr.io/semafind/semadb:main
# If using podman
podman run -it --rm -v ./config:/config:Z -e SEMADB_CONFIG=/config/singleServer.yaml -v ./data:/data:Z -p 8081:8081 ghcr.io/semafind/semadb:main
```

which will run the main branch. There are also tagged images for specific releases; see the repository's container registry for stable, production-ready versions.

You can locally build and run the container image using:

```bash
docker build -t semadb ./
docker run -it --rm -v ./config:/config -e SEMADB_CONFIG=/config/singleServer.yaml -v ./data:/data -p 8081:8081 semadb
# If using podman
podman build -t semadb ./
# The :Z option relabels the volumes so the container can access them: see https://github.com/containers/podman/issues/3683
podman run -it --rm -v ./config:/config:Z -e SEMADB_CONFIG=/config/singleServer.yaml -v ./data:/data:Z -p 8081:8081 semadb
```

Data Persistence: SemaDB stores data in a directory on disk, specified in the configuration file as rootDir. By default the data directory is ./data and, since the semadb executable is located at / inside the container, /data is the mount point to use.

Please note that when using Docker, the hostname and IP whitelisting may need to be adjusted depending on Docker's network configuration. Leaving hostname as a blank string and setting the whitelist to '*' opens SemaDB to every connection, as done in the singleServer.yaml configuration.

Contributing

Contributions are welcome! Please read the contributing guide for more information; it also covers the architecture of SemaDB and how to get started with development.

Search Algorithm 🔍

SemaDB's core vector search algorithm is based on the following excellent research papers:

Other indices, such as string or text, follow an inverted index approach. An inverted index is a data structure that maps content, such as words or numbers, to its locations in a database file, a document, or a set of documents. Its purpose is to allow fast full-text search, string prefix lookups, integer range search, and so on.
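As an illustration of the idea (a minimal sketch, not SemaDB's actual implementation), an inverted index in Go can be as simple as a map from terms to document IDs:

```go
package main

import (
	"fmt"
	"strings"
)

// InvertedIndex maps each term to the IDs of the documents containing it.
// A real index is more involved (prefix lookups, range queries, disk
// persistence), so treat this as an illustration only.
type InvertedIndex map[string][]int

// Add tokenises the text and records the document ID under each term.
func (idx InvertedIndex) Add(docID int, text string) {
	for _, term := range strings.Fields(strings.ToLower(text)) {
		idx[term] = append(idx[term], docID)
	}
}

// Search returns the IDs of documents that contain the given term.
func (idx InvertedIndex) Search(term string) []int {
	return idx[strings.ToLower(term)]
}

func main() {
	idx := InvertedIndex{}
	idx.Add(1, "fast vector search")
	idx.Add(2, "hybrid search engine")
	fmt.Println(idx.Search("search")) // [1 2]
}
```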

Performance

SemaDB with default configuration values, running on an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz commodity workstation with 16GB RAM, achieves good recall across standard benchmarks, similar to the results reported in the papers:

| Dataset | v1 Recall | v1 QPS | v2 Recall | v2 QPS | v2-PQ Recall | v2-PQ QPS | v2-BQ Recall | v2-BQ QPS |
|---|---|---|---|---|---|---|---|---|
| glove-100-angular | 0.924 | 973.6 | 0.853 | 773.9 | 0.526 | 628.6 | | |
| dbpedia-openai-100k-angular | 0.990 | 519.9 | 0.920 | 240.8 | 0.766 | 978.6 | | |
| glove-25-angular | 0.999 | 1130.3 | 0.992 | 914.4 | 0.989 | 805.8 | | |
| mnist-784-euclidean | 0.999 | 1898.6 | 0.999 | 1267.4 | 0.928 | 571.6 | 0.667 | 2369.7 |
| nytimes-256-angular | 0.903 | 1020.6 | 0.891 | 786.7 | 0.438 | 983.6 | | |
| sift-128-euclidean | 0.999 | 1537.7 | 0.991 | 1272.9 | 0.696 | 967.4 | | |

The results are obtained using ann-benchmarks. The queries-per-second (QPS) figure uses a full in-memory cache with a single thread, similar to other methods, but is not a good indication of overall performance. The full pipeline would be slower because the end-to-end journey of a request has the overhead of HTTP handling, encoding and decoding of the query, parsing, validation, cluster routing, remote procedure calls, loading data from disk, and so on. This further depends on the hardware, especially SSD vs hard disk. However, the raw performance of the search algorithm within a single Shard should, in theory, be similar to that reported in the research papers.

Version 1 (v1) is the original pure vector search implementation of SemaDB. Version 2 (v2) is the multi-index, hybrid, keyword-search implementation, which has a much higher overhead from decoding, dispatching data to indices, and using quantizers. Version 2 with Product Quantization (v2-PQ) and with Binary Quantization (v2-BQ) use the respective quantization methods to reduce memory usage; we expect recall to be lower because the quantization methods are lossy and the search is approximate.

Limitations 🪧

Cold disk starts can be really slow. At the bottom of the chain sits the disk where all the data is stored. There are two caches in play: the in-memory cache and the operating system file cache. The OS cache is not in our control and gets populated as files are read or written. When a request is made, the index graph is traversed and points are loaded from disk into the operating system cache and decoded into an in-memory set of points. The search operation often performs random reads from disk as it traverses the similarity graph; hence, a cold start can take a long time (1 second, 10 seconds or more) depending on the hardware. Solid-state drives (SSDs) are strongly recommended for this reason because they serve random reads better. For single-application deployments this is not a major concern, because we expect a portion of the data / index to be cached either in memory or by the operating system during operation. An alternative would be a custom graph-oriented storage layout on disk, so that blocks / pages are better aligned with the neighbours of nodes in the similarity graph.
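To make the layering concrete, here is a minimal Go sketch of the two cache levels, assuming a simple map-backed in-memory cache; it illustrates the idea, not SemaDB's storage code:

```go
package main

import (
	"fmt"
	"os"
)

// pointCache sketches the in-memory cache layer: decoded points are kept
// in a map so repeated traversals avoid disk access entirely. The OS page
// cache sits transparently below os.ReadFile, so a cold read hits the
// disk while a warm one is served from RAM by the operating system.
var pointCache = map[string][]byte{}

func loadPoint(path string) ([]byte, error) {
	if p, ok := pointCache[path]; ok {
		return p, nil // in-memory cache hit: no syscall, no disk access
	}
	data, err := os.ReadFile(path) // served by the OS cache or, when cold, the disk
	if err != nil {
		return nil, err
	}
	pointCache[path] = data
	return data, nil
}

func main() {
	p, err := loadPoint("data/point-0001")
	if err != nil {
		fmt.Println("cold read failed:", err)
		return
	}
	fmt.Println("loaded", len(p), "bytes")
}
```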

Automatic horizontal scaling: the number of servers in SemaDB can be adjusted, but it only syncs on startup. The rendezvous hashing used will move 1/n of the data to a new server, or move a removed server's data back to the remaining ones. Since this only happens on startup, it is geared towards scaling deployments up or down in advance rather than under live load. Live automatic scaling is tricky to perform safely while the database is operating due to race conditions across servers. Some pitfalls are: a server lagging behind in configuration may send data to old servers; user requests must be handled while data transfer is happening; any mis-routed data must eventually arrive at the correct server; and the system must recover from a split-brain scenario if the network is partitioned. Many distributed databases incorporate additional machinery, such as versioned keys and vector clocks, that adds significant complexity to handle these cases. At the moment, you can adjust the servers and restart the cluster to redistribute the data.
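For illustration, here is a minimal Go sketch of rendezvous (highest-random-weight) hashing, assuming FNV-1a as the hash function; it is not SemaDB's actual implementation. Each key goes to the server with the highest hash(server, key) score, which is why adding or removing a server moves only about 1/n of the keys:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickServer returns the server with the highest hash score for the key.
// Removing a server leaves every other server's score unchanged, so only
// the keys that mapped to the removed server are reassigned.
func pickServer(servers []string, key string) string {
	var best string
	var bestScore uint64
	for _, s := range servers {
		h := fnv.New64a()
		h.Write([]byte(s + "/" + key))
		if score := h.Sum64(); best == "" || score > bestScore {
			best, bestScore = s, score
		}
	}
	return best
}

func main() {
	servers := []string{"server-a", "server-b", "server-c"}
	fmt.Println(pickServer(servers, "collection/shard-42"))
}
```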

No write high availability: SemaDB is optimised for search-heavy workloads. Collection and point write operations require all involved servers (those to which the data has been distributed) to participate. In the search path, failures can be tolerated because the search is stochastic, and occasional drops in performance due to unavailable shards can be acceptable. We offload maintaining a healthy system across physical server failures to a container orchestration tool such as Kubernetes. We assume that the configured state of SemaDB will be actively maintained, and as a result the design contains no peer discovery or consensus algorithms. This design choice again simplifies the architecture of SemaDB and aids rapid development. Original designs included consensus mechanisms such as Raft and a fully self-contained distributed system with peer discovery, but this was deemed overkill.

Related Projects

There are many open-source vector search and search engine projects out there. It may be helpful to compare SemaDB with some of them to see if one fits your use case better: