empirical-soft / empirical-lang

A language for time-series analysis
https://www.empirical-soft.com/

Survey on Vector Database #110

Open isthatdebbiej opened 1 month ago

isthatdebbiej commented 1 month ago

Introduction

Vector databases have gained significant importance due to the rise of AI, machine learning, and deep learning applications. These databases store high-dimensional vectors representing data types such as images, text, and audio. One of the critical components in vector databases is the indexing technique, which allows for fast similarity searches by organizing vectors in ways that expedite querying. This survey provides an overview of the main indexing techniques, the vector data type, vector distance metrics, and query languages.

Vector Data Type

A vector in a vector database is an ordered sequence of scalar values, typically floating-point numbers, with no additional internal structure. Most vector databases enforce a limit on the number of elements in a vector, and the element count can affect performance. For instance, ClickHouse and Zilliz use arrays to represent vectors.

The vector data type is primarily used in high-dimensional data representations and is queried using specific distance metrics. These metrics determine how similar two vectors are and form the foundation of retrieval processes in vector databases.

Indexing Techniques in Vector Databases

  1. Approximate Nearest Neighbor (ANN) Search
    Most vector database systems employ ANN search instead of exact search to significantly reduce the time complexity. This involves algorithms that focus on retrieving vectors that are approximately similar to the query vector based on a chosen similarity metric.

  2. Index Types
    Different vector databases employ a variety of indexing methods. Some of the key methods include:

    • Locality-Sensitive Hashing (LSH): LSH maps vectors into smaller buckets using hashing techniques, allowing for faster lookups.
    • Product Quantization (PQ): PQ divides vectors into sub-vectors and compresses them for efficient search and storage.
    • Hierarchical Navigable Small World (HNSW): HNSW uses graph-based data structures to quickly navigate between vectors in multi-layered structures.
    • Inverted File Index (IVF): IVF clusters vectors into centroids and assigns each centroid an inverted list for more efficient querying.
    • Flat Indexing: This method, though simple, has no optimization for search speed and examines all vectors in the dataset.
    • Annoy (Approximate Nearest Neighbors Oh Yeah): A tree-based approach that speeds up nearest-neighbor search by partitioning the vector space.

  3. Quantization Techniques
    Quantization techniques like Product Quantization (PQ) and Scalar Quantization (SQ) transform vectors into smaller, quantized parts, allowing for more compact storage and faster retrieval.

  4. Other Indexes
    Newer methods like Vamana/DiskANN and libraries such as FAISS (used in Milvus and Pinecone) and ScaNN continue to emerge, providing custom similarity metrics and optimized search times.
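
As a rough illustration of the hashing idea behind LSH, the sketch below uses random hyperplanes: each vector gets one bit per hyperplane, and vectors with identical bit strings share a bucket. This is only one LSH family, chosen here as an assumption; the function names, the number of planes, and the sample data are all hypothetical.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals; the seed is fixed so runs are repeatable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)

def build_buckets(vectors, hyperplanes):
    """Group vector ids by signature."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[lsh_signature(vec, hyperplanes)].append(idx)
    return buckets

def candidates(query, buckets, hyperplanes):
    """Only the vectors in the query's bucket are considered for comparison."""
    return buckets.get(lsh_signature(query, hyperplanes), [])

planes = make_hyperplanes(num_planes=8, dim=3)
vectors = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]]
buckets = build_buckets(vectors, planes)
print(candidates([1.0, 2.0, 3.0], buckets, planes))
```

Vectors pointing in the same direction land in the same bucket, while the opposite-pointing vector falls on the other side of every hyperplane. More planes give smaller buckets and faster lookups, at the cost of missing more true neighbors.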

Distance Metrics in Vector Databases

Different distance metrics define how vector similarity is computed. Not all vector databases support all distance metrics, but popular options include:

The following are some of the prominent indexes used in vector databases, each offering different functional and non-functional properties. Many vector database systems provide explanations for the indexes they implement. The list below is not exhaustive.

Flat Indexing
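
Flat indexing compares the query against every stored vector. A minimal sketch, assuming Euclidean distance as the metric (the name `flat_search` and the toy data are hypothetical):

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flat_search(vectors, query, k):
    """Return the ids of the k vectors closest to the query by scanning all of them."""
    scored = [(euclidean(v, query), i) for i, v in enumerate(vectors)]
    scored.sort()
    return [i for _, i in scored[:k]]

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
print(flat_search(vectors, [1.0, 1.0], k=2))  # prints [1, 3]
```

The result is exact, which makes a flat scan a useful ground truth when judging approximate indexes, but the cost grows linearly with the number of stored vectors.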

IVF (Inverted File Index)

IVFFlat
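
The sketch below shows one way an IVF index with uncompressed ("flat") inverted lists could work: vectors are clustered, each centroid keeps a list of its member ids, and a query scans only the lists of its nearest centroids. The clustering here is a plain k-means seeded with the first k vectors, which is an assumption made for determinism; `nprobe` and all function names are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance; sufficient for ranking."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, vector):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)), key=lambda c: squared_l2(centroids[c], vector))

def train_centroids(vectors, k, iterations=10):
    """Plain Lloyd's k-means, seeded with the first k vectors for determinism."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(centroids, v)].append(v)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_ivf(vectors, centroids):
    """One inverted list of vector ids per centroid."""
    lists = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        lists[nearest(centroids, v)].append(idx)
    return lists

def ivf_search(vectors, centroids, lists, query, k, nprobe=1):
    """Scan only the inverted lists of the nprobe centroids nearest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: squared_l2(centroids[c], query))
    candidate_ids = [i for c in order[:nprobe] for i in lists[c]]
    candidate_ids.sort(key=lambda i: squared_l2(vectors[i], query))
    return candidate_ids[:k]

vectors = [[0.0, 0.0], [10.0, 10.0], [0.5, 0.2], [9.5, 10.5], [0.1, 0.9], [10.2, 9.7]]
centroids = train_centroids(vectors, k=2)
lists = build_ivf(vectors, centroids)
print(ivf_search(vectors, centroids, lists, [9.8, 10.1], k=2))  # prints [1, 3]
```

Raising `nprobe` scans more lists, which improves recall for queries near a cluster boundary and increases search time.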

Annoy (Approximate Nearest Neighbors Oh Yeah)

PQ (Product Quantization)
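
A minimal sketch of product quantization follows, assuming a small k-means per sub-vector position and squared Euclidean distance. Each vector is cut into `m` sub-vectors, and each sub-vector is replaced by the index of its nearest codebook centroid. All names, the seeding strategy, and the sample data are hypothetical.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means, seeded with the first k points for determinism."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            c = min(range(k), key=lambda j: squared_l2(centroids[j], p))
            groups[c].append(p)
        for j, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def split(vector, m):
    """Cut a vector into m equal-length sub-vectors."""
    size = len(vector) // m
    return [vector[i * size:(i + 1) * size] for i in range(m)]

def train_codebooks(vectors, m, k):
    """One codebook of k centroids per sub-vector position."""
    parts = [split(v, m) for v in vectors]
    return [kmeans([p[s] for p in parts], k) for s in range(m)]

def encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest codebook centroid."""
    return [
        min(range(len(book)), key=lambda j: squared_l2(book[j], sub))
        for sub, book in zip(split(vector, len(codebooks)), codebooks)
    ]

def approximate_distance(query, code, codebooks):
    """Compare exact query sub-vectors against the quantized stored sub-vectors."""
    subs = split(query, len(codebooks))
    return sum(squared_l2(sub, book[c]) for sub, book, c in zip(subs, codebooks, code))

vectors = [
    [0.0, 0.0, 0.0, 0.0],
    [10.0, 10.0, 10.0, 10.0],
    [0.0, 1.0, 10.0, 9.0],
    [10.0, 9.0, 0.0, 1.0],
]
codebooks = train_codebooks(vectors, m=2, k=2)
codes = [encode(v, codebooks) for v in vectors]
print(codes)  # prints [[0, 0], [1, 1], [0, 1], [1, 0]]
```

Each four-float vector is stored as two small integers, so storage shrinks while distances can still be estimated from the codebooks alone.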

HNSW (Hierarchical Navigable Small World)

Scalar Quantization (SQ)
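
Scalar quantization can be sketched as mapping each float to a small integer code. The example below assumes a single global value range learned from the data and 256 levels (one byte per element); the function names are hypothetical, and the data must contain at least two distinct values so the range is non-empty.

```python
def train_range(vectors):
    """Learn the global minimum and maximum over all elements."""
    values = [x for v in vectors for x in v]
    return min(values), max(values)

def quantize(vector, lo, hi, levels=256):
    """Map each float in [lo, hi] to an integer code in [0, levels - 1]."""
    scale = (levels - 1) / (hi - lo)
    return [min(levels - 1, max(0, round((x - lo) * scale))) for x in vector]

def dequantize(codes, lo, hi, levels=256):
    """Map integer codes back to approximate float values."""
    step = (hi - lo) / (levels - 1)
    return [lo + c * step for c in codes]

vectors = [[-1.0, 0.0, 1.0], [0.5, -0.5, 0.25]]
lo, hi = train_range(vectors)
codes = quantize(vectors[1], lo, hi)
restored = dequantize(codes, lo, hi)
print(codes)  # prints [191, 64, 159]
```

The restored values differ from the originals by at most half a quantization step, which is the accuracy traded for the smaller representation.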

Vamana/DiskANN

Libraries for Vector Indexing

In many databases, users can select a specific metric during query execution, which can impact the retrieval results based on the nature of the dataset.
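
To illustrate how the metric can change the result, the sketch below ranks the same two candidates under Euclidean distance, cosine similarity, and dot product. These three metrics are assumed here as common examples; the function names and sample vectors are hypothetical.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)

query = [1.0, 0.0]
candidates = {
    "near_but_rotated": [1.0, 1.0],   # close to the query, pointing 45 degrees away
    "far_but_aligned": [10.0, 0.0],   # far from the query, same direction
}

# Smaller is better for a distance; larger is better for a similarity.
closest_by_euclidean = min(candidates, key=lambda n: euclidean_distance(query, candidates[n]))
closest_by_cosine = max(candidates, key=lambda n: cosine_similarity(query, candidates[n]))
closest_by_dot = max(candidates, key=lambda n: dot_product(query, candidates[n]))

print(closest_by_euclidean)  # prints near_but_rotated
print(closest_by_cosine)     # prints far_but_aligned
print(closest_by_dot)        # prints far_but_aligned
```

Euclidean distance rewards proximity, cosine similarity rewards direction regardless of length, and dot product rewards both direction and magnitude, so the "best" match depends on which property the embeddings encode.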

Query Languages for Vector Databases

Most vector databases support query languages that allow users to search for similar vectors:

Implementation Approaches

Vector databases are typically implemented in two ways:

  1. Extending Existing Database Systems: Traditional databases like PostgreSQL are extended to support vector operations. This allows the system to leverage existing infrastructure for transactions, backups, and query languages.
  2. Building Specialized Systems: Systems like Milvus, Pinecone, and Marqo are built from the ground up to optimize for vector data, ensuring fast and scalable retrieval of high-dimensional data.

Benchmarks and Comparisons

Several benchmarks and tools are emerging to compare the performance and capabilities of vector databases. These include ANN Benchmarks, VectorDBBench, and MTEB (Massive Text Embedding Benchmark). Comparisons between vector databases focus on factors like query latency, scalability, support for different distance metrics, and the range of available indexes.
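
Alongside latency, comparisons of approximate search usually need a quality measure. A common choice is recall at k: the fraction of the true k nearest neighbors that the approximate search returns. The sketch below assumes the exact results come from an exhaustive scan; the function names and sample ids are hypothetical.

```python
def recall_at_k(approximate_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    truth = set(exact_ids[:k])
    found = set(approximate_ids[:k])
    return len(truth & found) / len(truth)

def mean_recall(approx_results, exact_results, k):
    """Average recall at k over a batch of queries."""
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)

# One query: the approximate index found 3 of the 4 true neighbors.
print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], k=4))  # prints 0.75
```

Reporting recall together with latency or throughput makes the trade-off visible, since an index can always be made faster by returning worse results.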

The development of vector databases is ongoing, and various systems continue to explore ways to optimize search, indexing, and distance metrics. Although specialized vector databases offer more optimized performance for vector-specific tasks, traditional databases are rapidly extending their capabilities to include vector search functionalities, offering robust solutions for a wider range of applications. This survey provides a foundation for understanding vector database indexing and retrieval mechanisms and highlights the importance of selecting appropriate indexing techniques and distance metrics for specific use cases.

Embeddings Overview

Embeddings play a crucial role in many use cases. Because they are derived directly from models, any change to a model can change its embeddings, with potential consequences for cost and performance. For instance, smaller embeddings (1536 dimensions) generated by newer models can be more cost-effective, but they are not superior in every task.

Model Changes and Embeddings

Key Considerations for Vector Database Systems

The choice between these methods involves managing large-scale data, especially as data volume increases.

Data Management Challenges

Multi-modal Models

Multi-modal models, such as Meta-Transformer, can process different input types within a single model, simplifying data management. The evolution of multi-modal models is an ongoing trend.

Testing and Regression

Whenever a model or database system is updated, thorough testing is needed to ensure the similarity search quality remains intact. This testing process is essential to avoid regression and decide on appropriate actions if issues arise.

Tools for Generating Embeddings

For more detailed examples, refer to this guide on using OpenAI, Pinecone, Airbyte, and Langchain.

Architectures Overview (RAG — Retrieval Augmented Generation)

Architectures using vector database systems as foundational elements are emerging. Examples include:

These architectures, including the RAG workflow, enable efficient information retrieval. Over time, more architectures with refinements (e.g., different model versions or data management improvements) will appear.

RAG Architectures

Fully implemented RAG (Retrieval Augmented Generation) architectures are now integrated with vector databases. Some are available as SaaS, such as:

Selecting a Vector Database System

Performance Considerations

For a successful implementation, performance metrics such as scale, throughput, and latency must be key factors in selecting a vector database system. These systems should offer:

Use benchmarks to evaluate the performance of different systems.

Functionality Considerations

It's important to have clear requirements for:

Document these requirements for database system selection, ensuring they align with your use case. If no system meets all needs, be prepared to manage multiple database systems or contribute missing features to an existing one.

Performance vs. Functionality

Achieving the desired performance and functionality may require separate systems for fast retrieval and for scalability, since a single system may not support all requirements simultaneously.

Convenience vs. Long-Term Needs

Existing database systems may support vector data types but might not meet long-term performance or metric needs. Clearly define requirements and use benchmarks to avoid potential migrations later, which can add effort and complexity.

Conclusion

The vector database space is rapidly evolving and fascinating, especially for those with a background in database systems. This summary emphasizes the importance of performance, functionality, and future-proofing when selecting vector databases.

chrisaycock commented 1 month ago

Thanks for your notes, @isthatdebbiej! There is a lot to unpack here.