asg017 / sqlite-vec

A vector search SQLite extension that runs anywhere!
Apache License 2.0

Tracking issue: ANN (Approximate Nearest Neighbors) Index #25

Open asg017 opened 2 months ago

asg017 commented 2 months ago

As of v0.1.0, sqlite-vec will support brute-force search only, which slows down on large datasets (more than 1M vectors with large dimensions). I want to include some form of approximate nearest neighbors (ANN) search before v1, which trades accuracy and resource usage for speed.
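For context, here is a minimal sketch of what "brute-force" means here: every stored vector is compared against the query, so cost grows linearly with both row count and dimensions. This is an illustration in Python using the standard `sqlite3` module, not sqlite-vec's actual implementation or API. The `items` table, the `l2_distance` user-defined function, and the `knn` helper are all assumptions made up for the example.

```python
import math
import sqlite3
import struct


def pack(vec):
    """Serialize a list of floats as a little-endian float32 BLOB."""
    return struct.pack(f"<{len(vec)}f", *vec)


def l2_distance(a, b):
    """Euclidean distance between two float32 BLOBs of equal length."""
    n = len(a) // 4
    va = struct.unpack(f"<{n}f", a)
    vb = struct.unpack(f"<{n}f", b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))


db = sqlite3.connect(":memory:")
# Hypothetical scalar function, registered only for this sketch.
db.create_function("l2_distance", 2, l2_distance)
db.execute("create table items(id integer primary key, embedding blob)")
db.executemany(
    "insert into items values (?, ?)",
    [
        (1, pack([0.0, 0.0])),
        (2, pack([1.0, 1.0])),
        (3, pack([5.0, 5.0])),
        (4, pack([0.9, 1.2])),
    ],
)


def knn(query, k):
    """Full scan: the distance is computed for every row, then sorted."""
    rows = db.execute(
        "select id from items order by l2_distance(embedding, ?) limit ?",
        (pack(query), k),
    ).fetchall()
    return [row[0] for row in rows]


print(knn([1.0, 1.0], 2))
```

An ANN index avoids the full scan by visiting only a subset of rows, at the cost of sometimes missing a true nearest neighbor.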

This issue is a general "tracking issue" for how ANN will be implemented in sqlite-vec. The open questions I have:

Which ANN index should we use?

We want something that fits well with SQLite - meaning storing data in shadow tables, data that fits in pages, low memory usage, etc.

The main options I see:

I'm unsure which one will turn out best and will need to research more. It's possible we add support for all of these options.

How should one "declare" an index?

SQLite doesn't have custom indexes, so I think the best way would be to include index info in the CREATE VIRTUAL TABLE constructor. Like:

create virtual table vec_movies(
  synopsis_embeddings float[768] INDEXED BY diskann(...)
);

or:

create virtual table vec_movies(
  synopsis_embeddings float[768] index=hnsw(...)
);

The syntax heavily depends on which ANN index we pick. Also, how would training work?
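To make the two candidate syntaxes concrete, here is a rough sketch of how a column definition carrying index info could be parsed into a name, a dimension count, an index type, and options. It is written in Python for brevity and is purely illustrative: the `parse_column` helper, the regular expression, and the option names `m` and `ef` are assumptions, and none of this reflects how sqlite-vec parses its constructor arguments.

```python
import re

# Accepts either `INDEXED BY name(...)` or `index=name(...)` after the type.
COLUMN_RE = re.compile(
    r"""^\s*(?P<name>\w+)\s+float\[(?P<dims>\d+)\]
        (?:\s+(?:indexed\s+by\s+|index\s*=\s*)
           (?P<index>\w+)\((?P<opts>[^)]*)\))?\s*$""",
    re.IGNORECASE | re.VERBOSE,
)


def parse_column(definition):
    """Parse one hypothetical vector column definition into a dict."""
    m = COLUMN_RE.match(definition)
    if m is None:
        raise ValueError(f"unrecognized column definition: {definition!r}")
    options = {}
    # Options are treated as comma-separated key=value pairs (an assumption).
    for part in (m["opts"] or "").split(","):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        options[key.strip()] = value.strip()
    return {
        "name": m["name"],
        "dims": int(m["dims"]),
        # None means no index was declared, so brute-force only.
        "index": m["index"].lower() if m["index"] else None,
        "options": options,
    }


print(parse_column("synopsis_embeddings float[768] index=hnsw(m=16, ef=64)"))
print(parse_column("synopsis_embeddings float[768] INDEXED BY diskann()"))
print(parse_column("synopsis_embeddings float[768]"))
```

Either spelling ends up as the same parsed structure, so the choice between them is mostly about readability and about how index-specific parameters (including any training settings) are passed.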

How would they work with metadata filtering?

How do we allow bruteforce + ANN on the same table?

How do we pick between KNN/ANN in a SQL query?

neilkumar commented 2 months ago

Would usearch work?

https://github.com/unum-cloud/usearch

They even have some sqlite stuff already

https://github.com/unum-cloud/usearch/blob/main/sqlite/README.md

asg017 commented 2 months ago

usearch is great! But they don't offer many "hooks" into their storage engine, which would be required for sqlite-vec. We'd want to store the index inside SQLite tables, and balance query time + random lookup times. Also, the usearch SQLite functions are just scalar functions, nothing that accesses the HNSW index.
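As a rough illustration of "store the index inside SQLite tables", the sketch below keeps a tiny hand-built neighbor graph in an ordinary table and walks it greedily, counting how many row lookups a query costs. It is a toy greedy graph walk, not HNSW or DiskANN, and not sqlite-vec's design. The `vec_index_nodes` table, its columns, and the helper functions are hypothetical names chosen for the example.

```python
import math
import sqlite3
import struct

db = sqlite3.connect(":memory:")
# Hypothetical shadow table: one row per node, neighbors packed as int64 ids.
db.execute(
    "create table vec_index_nodes("
    "id integer primary key, vector blob, neighbors blob)"
)

points = {1: [0.0, 0.0], 2: [1.0, 0.0], 3: [2.0, 0.0], 4: [3.0, 0.0]}
edges = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
for node_id, vec in points.items():
    ids = edges[node_id]
    db.execute(
        "insert into vec_index_nodes values (?, ?, ?)",
        (
            node_id,
            struct.pack(f"<{len(vec)}f", *vec),
            struct.pack(f"<{len(ids)}q", *ids),
        ),
    )


def load_node(node_id):
    """One random lookup by primary key: returns (vector, neighbor ids)."""
    vector, neighbors = db.execute(
        "select vector, neighbors from vec_index_nodes where id = ?",
        (node_id,),
    ).fetchone()
    return (
        struct.unpack(f"<{len(vector) // 4}f", vector),
        struct.unpack(f"<{len(neighbors) // 8}q", neighbors),
    )


def greedy_search(query, entry_id):
    """Move to the closest neighbor until no neighbor improves the distance."""
    current = entry_id
    vector, neighbors = load_node(current)
    best = math.dist(vector, query)
    lookups = 1
    improved = True
    while improved:
        improved = False
        for candidate in neighbors:
            cand_vector, cand_neighbors = load_node(candidate)
            lookups += 1
            d = math.dist(cand_vector, query)
            if d < best:
                current, best, next_neighbors = candidate, d, cand_neighbors
                improved = True
        if improved:
            neighbors = next_neighbors
    return current, lookups


print(greedy_search([2.9, 0.0], entry_id=1))
```

Each step of the walk is a separate primary-key lookup, which is why query time has to be balanced against random lookup cost when the graph lives in SQLite pages rather than in memory.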

Also, I want to keep sqlite-vec as lightweight as possible: it has no outside dependencies and is a single .c/.h file. So I don't wanna complicate things with a C++ dependency.

irowberryFS commented 1 week ago

+1 for LM-DiskANN

di-sukharev commented 6 days ago

yoo bros, lets do this index okay :))