Closed: MahmoudFawzyKhalil closed this issue 1 year ago.
Thanks for filing @MahmoudFawzyKhalil !
I'm having trouble reproducing this. Can you try running this Python script and see if you hit the same issue?
import sqlite3
import sqlite_vss
import numpy as np
db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vss.load(db)
print(db.execute("select vss_version()").fetchone()[0])
db.executescript("""
CREATE TABLE IF NOT EXISTS resources (
id INTEGER PRIMARY KEY,
url TEXT,
title TEXT
);
CREATE TABLE IF NOT EXISTS chunks (
id INTEGER PRIMARY KEY,
chunk TEXT,
embedding BLOB,
resource_id INTEGER,
FOREIGN KEY (resource_id) REFERENCES resources (id)
);
CREATE VIRTUAL TABLE vss_chunks USING vss0(
chunk_embedding(768)
);
INSERT INTO resources (url, title)
VALUES ("foo", "bar");
""")
db.execute("""
INSERT INTO chunks (chunk, embedding, resource_id)
VALUES ("foo", ?1, 1);
""", [np.zeros((1, 768), dtype=np.float32)])
db.execute("""
INSERT INTO vss_chunks (rowid, chunk_embedding)
SELECT rowid, embedding
FROM chunks;
""")
db.commit()
results = db.execute("select rowid, vector_debug(chunk_embedding), * from vss_chunks").fetchall()
print(results)
db.close()
If that works, then in your original database, can you run:
SELECT DISTINCT length(embedding)
FROM chunks
And see what it returns? I have a feeling that some of the embedding lengths are not 3072 bytes (768 * 4), which might be causing the segfault.
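That length check can be run with plain sqlite3 and no extension at all. The sketch below is illustrative (the table name and 768-dimension assumption come from the schema in this thread; the 10-float "bad" row is a made-up example of a malformed embedding):

```python
import sqlite3
import struct

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, embedding BLOB)")

good = struct.pack("768f", *([0.0] * 768))  # 3072 bytes: what a vss0(768) column expects
bad = struct.pack("10f", *([0.0] * 10))     # 40 bytes: a hypothetical malformed embedding
db.execute("INSERT INTO chunks (embedding) VALUES (?)", (good,))
db.execute("INSERT INTO chunks (embedding) VALUES (?)", (bad,))

# Any value other than 3072 flags suspect rows
lengths = [row[0] for row in
           db.execute("SELECT DISTINCT length(embedding) FROM chunks")]
print(lengths)
```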
Thank you for the quick support!
Also: in the example you sent me, if I set the vector length when creating the virtual table to something like 10, it just truncates the rest; it does not segfault.
I managed to reproduce the issue in the script you sent:
Using just a single connection in my code solved the issue. Re-running the script on the same database is also fine even though it opens a new connection, as long as it is the only one opened in that run.
import sqlite3
from typing import List, Any
import sqlite_vss
import numpy as np
db = sqlite3.connect("bla.db")
db.enable_load_extension(True)
sqlite_vss.load(db)
print(db.execute("select vss_version()").fetchone()[0])
db.executescript("""
CREATE TABLE IF NOT EXISTS resources (
id INTEGER PRIMARY KEY,
url TEXT,
title TEXT
);
CREATE TABLE IF NOT EXISTS chunks (
id INTEGER PRIMARY KEY,
chunk TEXT,
embedding BLOB,
resource_id INTEGER,
FOREIGN KEY (resource_id) REFERENCES resources (id)
);
CREATE VIRTUAL TABLE vss_chunks USING vss0(
chunk_embedding(10)
);
INSERT INTO resources (url, title)
VALUES ("foo", "bar");
""")
db.execute("""
INSERT INTO chunks (chunk, embedding, resource_id)
VALUES ("foo", ?, 1);
""", [np.zeros((1, 768), dtype=np.float32)])
db.commit()
print(db.execute("SELECT * FROM chunks").fetchall())
# Close connection and create a new one
db.close()
db = sqlite3.connect("bla.db")
db.enable_load_extension(True)
sqlite_vss.load(db)
db.execute("""
INSERT INTO vss_chunks (rowid, chunk_embedding)
SELECT rowid, embedding
FROM chunks;
""")
db.commit()
results = db.execute("select rowid, vector_debug(chunk_embedding), * from vss_chunks").fetchall()
print(results)
db.close()
Thank you @MahmoudFawzyKhalil! I can now reproduce; attempting a fix.
Smallest possible repro:
.open tmp.db
.load dist/debug/vector0
.load dist/debug/vss0
CREATE VIRTUAL TABLE vss_chunks USING vss0(
chunk_embedding(1)
);
.open tmp.db
.load dist/debug/vector0
.load dist/debug/vss0
INSERT INTO vss_chunks (rowid, chunk_embedding)
SELECT 2, json_array(1);
If you create a vss0 table, don't insert any data, close the connection, open a new connection, and then try to insert into the table, it segfaults. This is because we only serialize the Faiss index after write transactions are committed. But if you don't insert into the table when it's first created, there's no write transaction commit, so the index never gets written.
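That failure mode can be modeled in a few lines of plain Python. This is a toy stand-in, not sqlite-vss internals: the `storage` dict plays the role of the database file, and the "index" entry plays the role of the serialized Faiss index.

```python
class FakeVssTable:
    """Toy model of a vss0 table whose index is only persisted on commit."""

    def __init__(self, storage):
        self.storage = storage
        self.pending_writes = False
        # A new connection loads whatever index was serialized earlier.
        self.index = storage.get("index")  # None if never serialized

    def create(self):
        # CREATE VIRTUAL TABLE builds an in-memory index, but relies on a
        # later write-transaction commit to persist it.
        self.index = []

    def insert(self, vector):
        if self.index is None:
            # A later connection found no serialized index on "disk"; the
            # real extension dereferences a missing Faiss index here.
            raise RuntimeError("segfault: index was never serialized")
        self.index.append(vector)
        self.pending_writes = True

    def commit(self):
        # The index is written out only when a write transaction commits.
        if self.pending_writes:
            self.storage["index"] = list(self.index)
            self.pending_writes = False

storage = {}
conn1 = FakeVssTable(storage)
conn1.create()
conn1.commit()                 # nothing inserted, so nothing is serialized
conn1 = None                   # "close" the first connection

conn2 = FakeVssTable(storage)  # open a new connection
try:
    conn2.insert([0.0])
except RuntimeError as e:
    print(e)                   # the "segfault"
```

Inserting even one row before closing the first connection makes `commit()` persist the index, and the second connection then works, matching the single-connection workaround reported above.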
Will add a new test case and a version bump shortly.
Thank you @asg017
Following for the fix as well. I was meaning to submit a bug report this week about the same behavior, which I noticed when I included the table setup as part of migrations in an app. The workaround (and what I did originally, which masked the issue) was to perform the table setup after seeding the database.
Looking forward to being able to remove the temporary workaround I had in place :)
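A sketch of that workaround: do the vss table setup and the initial seed inserts over one connection, so the index gets serialized when that connection commits. This is a hypothetical helper (its name, arguments, and the `vss_chunks` schema are taken from or invented for this thread), aimed at sqlite-vss v0.0.4 and earlier; it skips gracefully where the extension isn't available:

```python
import sqlite3

def setup_and_seed(path, dim, rows):
    """Create vss_chunks and insert the initial vectors on ONE connection.

    `rows` is a list of (rowid, embedding_blob) pairs. Returns False when
    extension loading is unavailable in this environment.
    """
    db = sqlite3.connect(path)
    try:
        db.enable_load_extension(True)
        import sqlite_vss
    except (AttributeError, ImportError):
        db.close()
        return False
    sqlite_vss.load(db)
    db.execute(
        f"CREATE VIRTUAL TABLE IF NOT EXISTS vss_chunks USING vss0(chunk_embedding({dim}))"
    )
    # Seeding in the same connection means the commit below serializes the index.
    db.executemany(
        "INSERT INTO vss_chunks (rowid, chunk_embedding) VALUES (?, ?)", rows
    )
    db.commit()
    db.close()
    return True
```

For example, `setup_and_seed("bla.db", 768, [(1, blob)])` with a 3072-byte `blob` would replace the create-then-reconnect-then-insert sequence that triggers the crash.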
This has now been fixed in v0.0.5. The loadable extensions/Python package/npm package/Deno module have all been updated as well.
Will close, but please file another issue if you find anything else! Thanks for the initial report.
Confirmed working for my use case. Thanks!
Summary: When inserting data into the virtual table "vss_chunks" using SQLite, a segmentation fault occurs in the C++ code of the "vssIndexUpdate" function.
Steps to reproduce:
Logs from IntelliJ when attempting the same query using its SQL console:
Environment:
Operating system: Ubuntu 22.04.2 LTS
Python version: 3.10.6
SQLite version: 3.40.0
sqlite_vss version: 0.0.4 (installed with pip)
sentence_transformers: multi-qa-mpnet-base-cos-v1 model, which generates 768-dimensional embeddings
Schema: