asg017 / sqlite-vec

A vector search SQLite extension that runs anywhere!
Apache License 2.0

Document architecture #3

Open Adriatic1 opened 7 months ago

Adriatic1 commented 7 months ago

I know it is still early to have the architecture fixed in place, but perhaps this would be a good time to document how things are supposed to work conceptually.

  1. For example, how will the code detect the embedding vector column? A slightly more complex table model failed for me:

    sqlite> create virtual table example1 using vec0(chunk TEXT NOT NULL, name VARCHAR(64), sample_embedding float[4096]);
    TODO: unparseable constructor

  2. Describe the internal model, mainly why the internal shadow tables are created the way they are now:

    CREATE TABLE "vec_examples_chunks"(chunk_id INTEGER PRIMARY KEY AUTOINCREMENT,size INTEGER NOT NULL,validity BLOB NOT NULL,rowids BLOB NOT NULL)
    CREATE TABLE "vec_examples_rowids"(rowid INTEGER PRIMARY KEY AUTOINCREMENT,id,chunk_id INTEGER,chunk_offset INTEGER)
    CREATE TABLE "vec_examples_vector_chunks00"(rowid PRIMARY KEY,vectors BLOB NOT NULL)

     That is: what is the "validity bitmap", how can one force the use of cosine similarity versus Hamming distance, and are there any (practical) limits on vector size or on the number of rows in the table?

  3. Can the original table contain some additional fields? My possible (RAG) use case would be to store multiple chunked documents and have the ability to filter by document ID before going into similarity search over the filtered rows.

  4. An overview of what works and what doesn't work yet, or a roadmap, would be good. For example, pretty-printing vectors does not work now:

    sqlite> select * from vec_examples ;
    1|L
    2|>A
    3|K7?OmL7       >T=C+K?M"T%
    4|5>'?'=p}#9?>}?u

Initially I would have expected a much simpler architecture (just adding support for a VECTOR(type, size) data type and eventually one or more similarity indexes over it), but I guess this was an easier way to add things to SQLite. I find it hard to understand the current architecture and its possible limitations, so any documentation would help me validate whether this would be useful for my project. Also, it's good to have such a document early in the process, so you can perhaps get useful comments and improve things while the architecture has not solidified yet.
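For reference, the filter-then-search use case from point 3 can be sketched in plain Python with the standard `sqlite3` module, without the extension. This is only meant to show the query shape I have in mind, not how sqlite-vec does or should implement it. The `chunks` table, its `doc_id` column, and the helper functions are all hypothetical, and the brute-force cosine distance is a stand-in for whatever index the extension uses.

```python
import math
import sqlite3
import struct


def pack_f32(values):
    """Pack a list of floats as little-endian float32 bytes (assumed format)."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_f32(blob):
    """Inverse of pack_f32."""
    return struct.unpack(f"<{len(blob) // 4}f", blob)


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def search(conn, doc_id, query, k):
    """Filter rows by document ID first, then rank only those rows."""
    rows = conn.execute(
        "SELECT chunk, embedding FROM chunks WHERE doc_id = ?", (doc_id,)
    )
    scored = [(cosine_distance(query, unpack_f32(blob)), chunk) for chunk, blob in rows]
    return sorted(scored)[:k]


# Hypothetical schema: one row per document chunk, plus a document ID.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks(doc_id INTEGER, chunk TEXT NOT NULL, embedding BLOB NOT NULL)"
)
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?)",
    [
        (1, "alpha", pack_f32([1.0, 0.0])),
        (1, "beta", pack_f32([0.0, 1.0])),
        (2, "gamma", pack_f32([1.0, 0.0])),
    ],
)

# Only chunks of document 1 are considered, even though "gamma" matches equally well.
print(search(conn, doc_id=1, query=[1.0, 0.0], k=1))
```

The open question is whether the virtual table can carry a column like `doc_id` and apply the filter before the similarity search, rather than after it.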