asg017 / sqlite-vss

A SQLite extension for efficient vector search, based on Faiss!
MIT License
1.59k stars 59 forks

Segfault when inserting embeddings into vss_table #18

Closed MahmoudFawzyKhalil closed 1 year ago

MahmoudFawzyKhalil commented 1 year ago

Summary: When inserting data into the virtual table "vss_chunks" using SQLite, a segmentation fault occurs in the C++ code of the "vssIndexUpdate" function.

Steps to reproduce:

  1. Create the virtual table "vss_chunks" using the vss0 module (the full schema, including the CREATE VIRTUAL TABLE statement, is in the "Schema" section at the bottom of this report).
  2. Insert data into the "chunks" table with the following Python code:
def save_resource(resource: Resource):
    conn = create_connection()
    cursor = conn.cursor()
    cursor.execute('''
        INSERT INTO resources (url, title)
        VALUES (?, ?)
    ''', (resource.url, resource.title))

    resource_id = cursor.lastrowid
    for chunk, embedding in zip(resource.chunks, resource.embeddings):
        embedding_bytes = embedding.tobytes()
        cursor.execute('''
            INSERT INTO chunks (chunk, embedding, resource_id)
            VALUES (?, ?, ?)
        ''', (chunk, embedding_bytes, resource_id))

    conn.commit()
    conn.close()

r = salmon.create_article('https://news.ycombinator.com/item?id=34218724')
save_resource(r)
  3. Insert data from the "chunks" table into the "vss_chunks" virtual table with the following Python code:
connection = create_connection()
cursor = connection.cursor()
cursor.execute('''
    INSERT INTO vss_chunks (rowid, chunk_embedding)
    SELECT rowid, embedding
    FROM chunks
''')
connection.commit()
connection.close()

Logs from IntelliJ when attempting the same query using its SQL console:

# Problematic frame:
# C  [vss0.so+0xb7eb9]  vssIndexUpdate(sqlite3_vtab*, int, sqlite3_value**, long long*)+0x459

Environment:

Operating system: Ubuntu 22.04.2 LTS
Python version: 3.10.6
SQLite version: 3.40.0
sqlite_vss version: 0.0.4 (installed with pip)
Embedding model: sentence_transformers multi-qa-mpnet-base-cos-v1, which generates 768-dimensional embeddings

Schema:

def init_db():
    conn = create_connection()
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS resources (
            id INTEGER PRIMARY KEY,
            url TEXT,
            title TEXT
        )
    ''')

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS chunks (
            id INTEGER PRIMARY KEY,
            chunk TEXT,
            embedding BLOB,
            resource_id INTEGER,
            FOREIGN KEY (resource_id) REFERENCES resources (id)
        )
    ''')

    cursor.execute('''
        CREATE VIRTUAL TABLE vss_chunks USING vss0(
        chunk_embedding(768)
    );
    ''')
    conn.commit()
    conn.close()
asg017 commented 1 year ago

Thanks for filing @MahmoudFawzyKhalil !

I'm having trouble reproducing. Can you try running this Python script and see if you hit the same issue?

import sqlite3
import sqlite_vss
import numpy as np

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vss.load(db)

print(db.execute("select vss_version()").fetchone()[0])

db.executescript("""
CREATE TABLE IF NOT EXISTS resources (
    id INTEGER PRIMARY KEY,
    url TEXT,
    title TEXT
);

CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    chunk TEXT,
    embedding BLOB,
    resource_id INTEGER,
    FOREIGN KEY (resource_id) REFERENCES resources (id)
);

CREATE VIRTUAL TABLE vss_chunks USING vss0(
  chunk_embedding(768)
);

INSERT INTO resources (url, title)
  VALUES ("foo", "bar");

""")

db.execute("""
INSERT INTO chunks (chunk, embedding, resource_id)
  VALUES ("foo", ?1, 1);
""", [np.zeros((1, 768), dtype=np.float32)])

db.execute("""
INSERT INTO vss_chunks (rowid, chunk_embedding)
  SELECT rowid, embedding
  FROM chunks;
""")
db.commit()

results = db.execute("select rowid, vector_debug(chunk_embedding), * from vss_chunks").fetchall()

print(results)

db.close()

If that works, then in your original database, can you run:

SELECT DISTINCT length(embedding)
FROM chunks

And see what it returns? I have a feeling that some of the embedding lengths are not 3072 (768 * 4), which might be causing the segfault.
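For context, the expected 3072 figure comes from 768 float32 values at 4 bytes each; a minimal standalone sanity check (variable names here are illustrative, not from the app above):

```python
import numpy as np

# A 768-dimensional float32 embedding serializes to 768 * 4 = 3072 bytes.
embedding = np.zeros(768, dtype=np.float32)
embedding_bytes = embedding.tobytes()
print(len(embedding_bytes))  # prints 3072
```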

MahmoudFawzyKhalil commented 1 year ago

Thank you for the quick support!

  1. Running your script works fine.
  2. Sadly, it returns a single row with 3072 as the length.

Also: in the example you sent, if I set the vector dimension when creating the virtual table to something like 10, it just truncates the rest; it does not segfault.

MahmoudFawzyKhalil commented 1 year ago

I managed to reproduce the issue in the script you sent:

  1. Create a file-backed database with one connection
  2. Execute some statements on that connection
  3. Close it
  4. Create another connection and do the insert into the vss_table

Using just a single connection in my code solved the issue. Also, re-running the script on the same database is fine even though it opens a new connection, as long as it is the only one opened in that run.

import sqlite3
from typing import List, Any

import sqlite_vss
import numpy as np

db = sqlite3.connect("bla.db")
db.enable_load_extension(True)
sqlite_vss.load(db)

print(db.execute("select vss_version()").fetchone()[0])

db.executescript("""
CREATE TABLE IF NOT EXISTS resources (
    id INTEGER PRIMARY KEY,
    url TEXT,
    title TEXT
);

CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    chunk TEXT,
    embedding BLOB,
    resource_id INTEGER,
    FOREIGN KEY (resource_id) REFERENCES resources (id)
);

CREATE VIRTUAL TABLE vss_chunks USING vss0(
  chunk_embedding(10)
);

INSERT INTO resources (url, title)
  VALUES ("foo", "bar");

""")
db.execute("""
INSERT INTO chunks (chunk, embedding, resource_id)
  VALUES ("foo", ?, 1);
""", [np.zeros((1, 768), dtype=np.float32)])
db.commit()

print(db.execute("SELECT * FROM chunks").fetchall())
# Close connection and create a new one
db.close()

db = sqlite3.connect("bla.db")
db.enable_load_extension(True)
sqlite_vss.load(db)

db.execute("""
INSERT INTO vss_chunks (rowid, chunk_embedding)
  SELECT rowid, embedding
  FROM chunks;
""")
db.commit()

results = db.execute("select rowid, vector_debug(chunk_embedding), * from vss_chunks").fetchall()

print(results)

db.close()
asg017 commented 1 year ago

Thank you @MahmoudFawzyKhalil ! I can now reproduce, attempting a fix

asg017 commented 1 year ago

Smallest possible repro:

.open tmp.db
.load dist/debug/vector0
.load dist/debug/vss0

CREATE VIRTUAL TABLE vss_chunks USING vss0(
  chunk_embedding(1)
);

.open tmp.db
.load dist/debug/vector0
.load dist/debug/vss0

INSERT INTO vss_chunks (rowid, chunk_embedding)
  SELECT 2, json_array(1);

If you create a vss0 table, don't insert any data, close the connection, open a new connection, and then try to insert into the table, it segfaults. This is because we only serialize the Faiss index after write transactions are committed. But if you don't insert into the table when it's first created, there's no write transaction commit, so the index never gets written.

Will add a new test case and a new version bump shortly.
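The root-cause description above suggests a workaround until the fix lands: perform the first insert (and commit) in the same session that creates the table, so a write transaction serializes the index. An untested sketch, in the style of the CLI repro above:

```
.open tmp.db
.load dist/debug/vector0
.load dist/debug/vss0

CREATE VIRTUAL TABLE vss_chunks USING vss0(
  chunk_embedding(1)
);

-- Insert and commit in the SAME session that created the table,
-- so the write transaction serializes the Faiss index to disk.
INSERT INTO vss_chunks (rowid, chunk_embedding)
  SELECT 1, json_array(1);
```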

MahmoudFawzyKhalil commented 1 year ago

Thank you @asg017

bkono commented 1 year ago

Following for the fix as well. I was meaning to submit a bug report this week about the same behavior, which I noticed when I included the table setup as part of the migrations in an app. My workaround (and what I did originally, which masked the issue) was to perform the table setup after seeding the database.

Looking forward to being able to remove the temporary workaround I had in place :)

asg017 commented 1 year ago

This has now been fixed in v0.0.5. The loadable extensions/python package/npm package/deno module have all been updated as well.

Will close, but please file another issue if you find anything else! Thanks for the initial report.

bkono commented 1 year ago

Confirmed working for my use case. Thanks!