cosdata / cosdata

Cosdata: A cutting-edge AI data platform for next-gen search pipelines. Features semantic search, knowledge graphs, hybrid capabilities, real-time scalability, and ML integration. Designed for immutability and version control to enhance AI projects.
https://www.cosdata.io
Apache License 2.0

Core || Indexing || Develop the routines to create, store and query a HNSW index on the vector data #36

Open apurvmanjrekar opened 3 months ago

apurvmanjrekar commented 3 months ago

Description

The system shall automatically build an HNSW index on the vector data, using the user-specified similarity metric, so that queries return accurate results with very low latency.
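As a rough illustration of what a "user-specified similarity metric" could look like at the code level, here is a minimal sketch that maps a metric name to a distance function an index builder could call. The names `Metric`, `DISTANCES` and `distance_for` are hypothetical and are not taken from the Cosdata codebase; the three metrics shown are common choices, not a confirmed list of what the system supports.

```python
import math
from enum import Enum
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


class Metric(Enum):
    """Hypothetical set of metrics a user could pick when creating an index."""
    COSINE = "cosine"
    DOT = "dot"
    EUCLIDEAN = "euclidean"


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Vector, b: Vector) -> float:
    # 1 - cosine similarity; zero vectors are treated as maximally dissimilar.
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot(a, b) / (norm_a * norm_b)


def negative_dot(a: Vector, b: Vector) -> float:
    # Negated so that "smaller is closer" holds for every metric.
    return -dot(a, b)


def euclidean_distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


DISTANCES: Dict[Metric, Callable[[Vector, Vector], float]] = {
    Metric.COSINE: cosine_distance,
    Metric.DOT: negative_dot,
    Metric.EUCLIDEAN: euclidean_distance,
}


def distance_for(metric: Metric) -> Callable[[Vector, Vector], float]:
    """Return the distance function the index should use for this metric."""
    return DISTANCES[metric]
```

Expressing every metric as a distance (smaller means closer) lets the graph construction and search code stay metric-agnostic.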

Acceptance criteria

Context

The primary reason to store vector data is to perform semantic search on unstructured data sets. The semantic search has to be performed accurately (with reasonably high recall) and near-instantly (microsecond to millisecond latency). The HNSW index has proved itself to be both effective and efficient in real-world use. The system will need to build the HNSW index on the vector data automatically. This should be done in parallel with data ingestion. Users shall be able to re-index the vector data as and when required.
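Since the accuracy goal above is stated in terms of recall, one way to check it during testing is to compare the index's answers against an exact brute-force search. The following is a minimal sketch under stated assumptions: Euclidean distance is used for simplicity, and `exact_top_k` and `recall_at_k` are hypothetical helper names, not part of any Cosdata API. The approximate result IDs would come from whatever HNSW query routine is being tested.

```python
import heapq
import math
from typing import List, Sequence


def exact_top_k(query: Sequence[float],
                vectors: Sequence[Sequence[float]],
                k: int) -> List[int]:
    """Brute-force ground truth: IDs of the k vectors closest to the query."""
    return heapq.nsmallest(
        k,
        range(len(vectors)),
        key=lambda i: math.dist(query, vectors[i]),
    )


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the true nearest neighbours that the index returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

For example, if the exact neighbours of a query are IDs `[0, 3]` and the index returns `[0, 1]`, `recall_at_k` yields 0.5. Averaging this value over a set of test queries gives a single recall figure to track alongside query latency.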

Links

Title Link
Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs https://arxiv.org/abs/1603.09320
Similarity Search, Part 4: Hierarchical Navigable Small World (HNSW) https://towardsdatascience.com/similarity-search-part-4-hierarchical-navigable-small-world-hnsw-2aad4fe87d37
Hierarchical Navigable Small Worlds (HNSW) https://www.pinecone.io/learn/series/faiss/hnsw/
HNSW indexing in Vector Databases: Simple explanation and code https://medium.com/@wtaisen/hnsw-indexing-in-vector-databases-simple-explanation-and-code-3ef59d9c1920
ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms https://arxiv.org/abs/1807.05614

Pending work

apurvmanjrekar commented 3 months ago

Checkpoint 28-Aug-2024

The build-out and testing of the HNSW index are done. The two sub-tasks (additional enhancements) have not yet been started.