Introduction
Vector databases have gained significant importance due to the rise of AI, machine learning, and deep learning applications. These databases store high-dimensional vectors representing data types such as images, text, and audio. One of the critical components in a vector database is the indexing technique, which organizes vectors so that similarity searches can be answered quickly. This survey provides an overview of the main indexing techniques, the vector data type, vector distance metrics, and query languages.
Vector Data Type
A vector in a vector database is an ordered sequence of scalar values, typically floating-point numbers, with no additional internal structure. Most vector databases enforce limits on the number of elements per vector, which can affect performance. For instance, ClickHouse and Zilliz represent vectors as arrays.
The vector data type is primarily used in high-dimensional data representations and is queried using specific distance metrics. These metrics determine how similar two vectors are and form the foundation of retrieval processes in vector databases.
Indexing Techniques in Vector Databases
Approximate Nearest Neighbor (ANN) Search
Most vector database systems employ ANN search instead of exact search to significantly reduce query time. ANN algorithms retrieve vectors that are close to the query vector under a chosen similarity metric with high probability, trading a small loss in accuracy for a large gain in speed.
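To make the contrast concrete, the sketch below shows the exact brute-force search that ANN indexes approximate; the dataset size, dimensionality, and k are arbitrary placeholder values.

```python
import numpy as np

# Exact (brute-force) nearest-neighbor search: the baseline that ANN methods approximate.
# All sizes below are arbitrary placeholders.
rng = np.random.default_rng(42)
vectors = rng.random((10_000, 128)).astype("float32")  # stored vectors
query = rng.random(128).astype("float32")              # query vector

distances = np.linalg.norm(vectors - query, axis=1)    # Euclidean distance to every vector
top_k = np.argsort(distances)[:5]                      # indices of the 5 closest vectors
print(top_k, distances[top_k])
```

An ANN index answers the same question without scanning every stored vector, at the cost of occasionally missing a true nearest neighbor.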
Index Types
Different vector databases employ a variety of indexing methods, each offering different functional and non-functional properties; many systems document the indexes they implement. The list of key methods below is not exhaustive:
Locality-Sensitive Hashing (LSH): LSH hashes vectors so that similar vectors are likely to fall into the same bucket, allowing lookups to scan only a few candidate buckets.
Product Quantization (PQ): PQ divides each vector into sub-vectors and encodes each sub-vector with a small codebook, yielding compact codes for efficient storage and search.
Hierarchical Navigable Small World (HNSW): HNSW builds a multi-layered proximity graph and navigates it greedily from coarse upper layers down to denser lower layers.
Inverted File Index (IVF): IVF clusters vectors around centroids and maintains an inverted list per centroid, so a query only scans the lists of the closest centroids.
Flat Indexing: This method, though simple, has no optimization for search speed; it performs an exact, exhaustive scan of all vectors in the dataset.
Annoy (Approximate Nearest Neighbors Oh Yeah): A tree-based approach that speeds up nearest-neighbor search by partitioning the vector space.
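As a rough illustration of how several of the index types above are constructed, the sketch below uses the FAISS library (only one of many possible implementations; all dataset sizes and parameters are arbitrary).

```python
import faiss
import numpy as np

d = 128
xb = np.random.random((10_000, d)).astype("float32")  # database vectors
xq = np.random.random((5, d)).astype("float32")       # query vectors

flat = faiss.IndexFlatL2(d)                  # exact, exhaustive search
hnsw = faiss.IndexHNSWFlat(d, 32)            # graph-based ANN, 32 links per node
quantizer = faiss.IndexFlatL2(d)             # coarse quantizer for IVF
ivf = faiss.IndexIVFFlat(quantizer, d, 100)  # 100 clusters, each with an inverted list
ivf.train(xb)                                # IVF learns its centroids from the data

for index in (flat, hnsw, ivf):
    index.add(xb)                            # insert the database vectors
    distances, ids = index.search(xq, 5)     # top-5 neighbors for each query
```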
Quantization Techniques
Quantization techniques like Product Quantization (PQ) and Scalar Quantization (SQ) compress vectors into compact, lower-precision representations, allowing for smaller storage footprints and faster retrieval.
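A toy sketch of the idea behind scalar quantization is shown below; real systems use trained, per-dimension ranges or codebooks, and the numbers here are made up.

```python
import numpy as np

vectors = np.random.random((1_000, 128)).astype("float32")
lo, hi = vectors.min(axis=0), vectors.max(axis=0)                    # per-dimension value ranges
codes = np.round((vectors - lo) / (hi - lo) * 255).astype(np.uint8)  # 1 byte per component (4x smaller)
decoded = codes.astype("float32") / 255 * (hi - lo) + lo             # approximate reconstruction for search
```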
Other Indexes
Newer methods like Vamana/DiskANN, as well as libraries such as FAISS (used in Milvus and Pinecone) and ScaNN, continue to emerge, offering additional similarity metrics and optimized search times.
Distance Metrics in Vector Databases
Different distance metrics define how vector similarity is computed. Not all vector databases support all distance metrics, but popular options include:
Euclidean Distance: Measures the straight-line distance between two points in space.
Cosine Similarity: Determines similarity based on the angle between two vectors.
Angular Distance: The angle between two vectors, closely related to cosine similarity but expressed as a distance.
Manhattan Distance: Sums the absolute differences between vector coordinates.
Dot Product/Inner Product: Sums the products of corresponding vector components; often used in machine learning.
Minkowski Distance: A generalized distance metric that includes both Euclidean and Manhattan distances.
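For reference, the following sketch computes each of these metrics on two arbitrary example vectors.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

euclidean = np.linalg.norm(a - b)                        # straight-line distance
manhattan = np.abs(a - b).sum()                          # sum of absolute differences
dot = np.dot(a, b)                                       # inner product
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # cosine similarity
angular = np.arccos(np.clip(cosine, -1.0, 1.0))          # angle between the vectors, in radians
p = 3
minkowski = (np.abs(a - b) ** p).sum() ** (1 / p)        # generalizes Euclidean (p=2) and Manhattan (p=1)
```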
In many databases, users can select a specific metric during query execution, which can impact the retrieval results based on the nature of the dataset.
Query Languages for Vector Databases
Most vector databases support query languages that allow users to search for similar vectors:
SQL: Systems like ClickHouse use SQL to execute vector similarity searches, making integration with traditional relational databases more straightforward.
API-based: Some databases, like Zilliz and Pinecone, expose APIs for vector queries, which might integrate more easily with modern applications and data pipelines.
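The two styles are contrasted below; the table, column, and index names are hypothetical, and the exact distance functions and client signatures vary by system and version.

```python
# SQL style: ClickHouse, for example, exposes distance functions usable in ORDER BY.
# ":query_vector" stands for a bound query vector; the table and column names are made up.
sql = """
SELECT id, cosineDistance(embedding, :query_vector) AS dist
FROM documents
ORDER BY dist ASC
LIMIT 5
"""

# API style: a sketch loosely based on the Pinecone client (signatures differ between versions).
# index = pinecone.Index("documents")
# result = index.query(vector=query_vector, top_k=5, include_metadata=True)
```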
Implementation Approaches
Vector databases are typically implemented in two ways:
Extending Existing Database Systems: Traditional databases like PostgreSQL are extended to support vector operations. This allows the system to leverage existing infrastructure for transactions, backups, and query languages.
Building Specialized Systems: Systems like Milvus, Pinecone, and Marqo are built from the ground up to optimize for vector data, ensuring fast and scalable retrieval of high-dimensional data.
Benchmarks and Comparisons
Several benchmarks and tools are emerging to compare the performance and capabilities of vector databases. These include ANN Benchmarks, VectorDBBench, and MTEB (Massive Text Embedding Benchmark). Comparisons between vector databases focus on factors like query latency, scalability, support for different distance metrics, and various available indexes.
The development of vector databases is ongoing, and various systems continue to explore ways to optimize search, indexing, and distance metrics. Although specialized vector databases offer more optimized performance for vector-specific tasks, traditional databases are rapidly extending their capabilities to include vector search functionalities, offering robust solutions for a wider range of applications.
This survey provides a foundation for understanding vector database indexing and retrieval mechanisms and highlights the importance of selecting appropriate indexing techniques and distance metrics for specific use cases.
Embeddings Overview
Embeddings play a crucial role in many use cases. Because they are produced directly by models, any change in the model can affect the embeddings, with potential consequences for cost and performance. For instance, smaller embeddings (e.g., 1536 dimensions) generated by newer models can be more cost-effective, but they are not necessarily superior on every task.
Model Changes and Embeddings
Model changes can affect embeddings, and newer models may not always outperform older ones in all benchmarks.
Different versions of the same model, or entirely different models, produce embeddings that cannot be mixed in the same similarity search.
Managing embeddings across different models requires partitioning the vector database and maintaining metadata to track which embeddings belong to which model.
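A minimal sketch of this partition-by-model approach is shown below; the in-memory store and model names are stand-ins for whatever mechanism (namespaces, collections, or metadata filters) a given database provides.

```python
import numpy as np

store = []  # list of (doc_id, model_name, vector) records

def add(doc_id: str, model_name: str, vector: np.ndarray) -> None:
    store.append((doc_id, model_name, vector))

def search(query: np.ndarray, model_name: str, k: int = 5):
    # Only compare against embeddings produced by the same model.
    candidates = [(doc_id, vec) for doc_id, m, vec in store if m == model_name]
    candidates.sort(key=lambda item: np.linalg.norm(item[1] - query))
    return candidates[:k]
```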
Key Considerations for Vector Database Systems
Using Different Models: Embeddings from different models cannot be directly compared, and combining results requires managing individual embeddings by model.
Handling Model Versions: When a model is updated, there are two approaches:
Recompute all existing embeddings with the new model version.
Maintain separate embeddings for each version, managing searches across different sets.
The choice between these approaches becomes more consequential as data volume grows, since both recomputation and maintaining parallel embedding sets involve managing large amounts of data.
Data Management Challenges
Storing the original data is essential for recomputing embeddings when models change.
This can become a significant challenge for large datasets, as full recomputation or generating embeddings with multiple models in parallel may be necessary.
Multi-modal Models
Multi-modal models, such as Meta-Transformer, can process different input types within a single model, simplifying data management. The evolution of multimodal models is an ongoing trend.
Testing and Regression
Whenever a model or database system is updated, thorough testing is needed to ensure the similarity search quality remains intact. This testing process is essential to avoid regression and decide on appropriate actions if issues arise.
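One simple form such a check can take (a hedged sketch, with arbitrary query IDs and threshold) is to compare the top-k results after an upgrade against a stored baseline and flag queries whose overlap drops.

```python
def topk_overlap(baseline_ids: list, new_ids: list) -> float:
    # Fraction of the baseline's top-k results that are still returned after the change.
    return len(set(baseline_ids) & set(new_ids)) / max(len(baseline_ids), 1)

baseline = {"q1": ["d3", "d7", "d9"], "q2": ["d1", "d4", "d8"]}       # captured before the change
after_change = {"q1": ["d3", "d9", "d2"], "q2": ["d1", "d4", "d8"]}   # results from the updated system

for qid, expected in baseline.items():
    overlap = topk_overlap(expected, after_change[qid])
    if overlap < 0.8:                                                 # arbitrary threshold
        print(f"Possible regression on {qid}: overlap={overlap:.2f}")
```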
Tools for Generating Embeddings
Many APIs exist for generating embeddings from various data types.
Tools like towhee provide pre-implemented operators for many models in an ETL-style framework, simplifying large-scale embedding generation.
Airbyte also offers connectors for data migration, which can be integrated into workflows involving vector databases.
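As one concrete illustration (using the sentence-transformers library, a common local option that is not among the tools named above), text embeddings can be generated in a few lines; the model name is simply a popular default.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # a widely used 384-dimensional model
texts = ["a question about vector databases", "an unrelated sentence"]
embeddings = model.encode(texts)                  # numpy array of shape (2, 384)
```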
Architectures Overview (RAG — Retrieval Augmented Generation)
Architectures using vector database systems as foundational elements are emerging. Examples include:
Emerging Architectures for LLM Applications
Microsoft, TikTok, and Generative AI with Memory
Understanding the Fundamental Limitations of Vector-Based Retrieval for LLM-Powered Chatbots
These architectures, including the RAG workflow, enable efficient information retrieval. Over time, more architectures with refinements (e.g., different model versions or data management improvements) will appear.
RAG Architectures
Fully implemented RAG (Retrieval Augmented Generation) architectures are now integrated with vector databases, and some are available as SaaS offerings.
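At its core, the RAG workflow embeds the user's question, retrieves the most similar stored chunks, and feeds them to a language model as context. The sketch below is a self-contained toy version: embed() is a stand-in for a real embedding model, and the documents are placeholders.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding function; a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(128).astype("float32")

documents = ["chunk about indexing", "chunk about distance metrics", "chunk about RAG"]
doc_vectors = np.stack([embed(d) for d in documents])

def build_prompt(question: str, k: int = 2) -> str:
    q = embed(question)
    nearest = np.argsort(np.linalg.norm(doc_vectors - q, axis=1))[:k]   # retrieve top-k chunks
    context = "\n\n".join(documents[i] for i in nearest)
    # The assembled prompt would be sent to an LLM; it is returned here for illustration.
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```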
Selecting a Vector Database System
Performance Considerations
For a successful implementation, performance characteristics such as scale, throughput, and latency must be key factors in selecting a vector database system. These systems should offer:
Low latency to support interactive applications.
Scalability for growing datasets and increasing numbers of vectors.
High throughput for concurrent access.
Use benchmarks to evaluate the performance of different systems.
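Beyond published benchmarks, a rough latency check against any candidate system can look like the sketch below; the search callable and query set are placeholders, and a real evaluation should also track recall, not just latency.

```python
import time
import numpy as np

def measure_latency(search_fn, queries, k: int = 10):
    # search_fn(query, k) is whatever client call the candidate system exposes.
    latencies = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q, k)
        latencies.append(time.perf_counter() - start)
    return np.percentile(latencies, [50, 95, 99])  # median and tail latencies in seconds
```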
Functionality Considerations
It's important to have clear requirements for:
Distance metrics
Indexes
Query expressiveness
Model version management
Document these requirements for database system selection, ensuring they align with your use case. If no system meets all needs, be prepared to manage multiple database systems or contribute missing features to an existing one.
Performance vs. Functionality
Achieving the desired performance and functionality may require separate systems for fast retrieval and scalability, as no single system might support all requirements simultaneously.
Convenience vs. Long-Term Needs
Existing database systems may support vector data types but might not meet long-term performance or metric needs. Clearly define requirements and use benchmarks to avoid potential migrations later, which can add effort and complexity.
Conclusion
The vector database space is rapidly evolving and fascinating, especially for those with a background in database systems. This summary emphasizes the importance of performance, functionality, and future-proofing when selecting vector databases.