asg017 / sqlite-vec

A vector search SQLite extension that runs anywhere!
Apache License 2.0
4.14k stars 132 forks source link

Comparison to other approaches #94

Open davidmezzetti opened 2 months ago

davidmezzetti commented 2 months ago

Hi @asg017 - Thank you for putting this library together. Storing and searching vectors directly in SQLite is a really important use case.

I believe you've come across txtai. I'm looking to add SQLite vector indexing in the next release. I've added a component for sqlite-vec and it works quite well. This enables a unified vector search with a single SQL query (i.e. join the data with the vectors) vs the current approach of having to join external indexes like Faiss with data stored in a database.

sqlite-vec has received good traction, congratulations on that. I also see there are other options out there and I was curious on your thoughts.

I see you're working on having indexing in https://github.com/asg017/sqlite-vec/issues/25, which is great. Do you envision that the indexes will somehow be directly stored in the SQLite database or in an external file? Or is that still TBD?

I appreciate any insights you can provide.

asg017 commented 2 months ago

hey @davidmezzetti thanks for the kind words! Happy to help review any sqlite-vec related PRs in other open source projects, just give me a ping.

Here are my thoughts on other tools:

Vectorlite: I haven't spent much time playing with this, but it seems great! Do note that the hnswlib and other C++ dependencies make it harder to compile and run on some platforms (WASM, smaller devices, etc.), but that probably doesn't matter much in the Python world. They also store vector indexes outside the SQLite database (in separate files), which is good for performance but has other not-so-great side-effects (may not be transaction-safe, doesn't work with SQLite backup and replication tools, deviates from SQLite's "it's a single file" ethos, etc). Also their benchmarks for sqlite-vec can be improved a bit, I'll probably send a PR to that repo sometime in the near future. Specifically insert/query performance can be improved greatly with a few PRAGMA configurations

libSQL's vector support: Also great, models after pgvector, which uses custom indexes that are not available in vanilla SQLite. But libsql folks can change whatever they want in their fork. It does require using a the libsql driver instead of the default sqlite driver, but that's usually a pip install away. In my experience the libsql vector indexes are a bit slow to insert and query from but YMMV. I'd say try both and see what you like the most.

SQLite: I don't think SQLite will ever add vector support, but I'd be happy to see it. they'll never add new custom column types besides the builtin text/blob/int/float types (so no libsql-like support), but they could possibly add an extension like their JSON/FTS5/R-Tree extensions for vector search. That being said, none of their code use SIMD operations, so I doubt it would be as fast as sqlite-vec. And if they do add something, it usually takes years for their custom extensions to make it into the core SQLite amalgamation, so it may take time. That being said, the SQLite team has surprised me in the past, so I hope they try it out.

For ANN indexes: they 100% will be stored inside the SQLite database. That way backups work, litestream replication works, and transactions continue as-is. I could maybe see a future where indexes are stored in separate files for performance reasons (an opt-in feature), but that would be far down the road.

Let me know if there's anything else I can answer, or email me directly if you wanna chat on a call

davidmezzetti commented 2 months ago

I appreciate the quick response. I'm glad to hear that the ANN indexes will be stored inside the SQLite database.

One other thing that came to mind regarding LIMIT and K. Let's say I have another table for document content and filterable fields. If I run a query joining those together and filter on fields from both the document table and vector table, would the query return all results up to LIMIT? Or does the LIMIT OR K clause first filter the vector table?

SELECT d.* FROM documents d 
LEFT JOIN vectors v ON v.indexid = d.indexid
WHERE d.flag=1 AND v.embedding MATCH ?
LIMIT 100
davidmezzetti commented 2 months ago

I added sqlite-vec as an ANN to txtai here - https://github.com/neuml/txtai/issues/780

A future task will integrate this even further into a single SQLite data file with content and vectors.