Nintorac opened this issue 1 year ago
Do you happen to have the code you used to ingest embeddings into sqlite-vss? It shouldn't take 30 minutes to insert 30k vectors. I suspect there are a number of fixes that could be made to make it much faster, including:

- wrapping the inserts in an explicit `BEGIN` and `COMMIT` transaction
- avoiding per-row `execute()` calls and preferring `executemany()` if in Python (see the sketch below)

It also depends on whether you're using a custom factory or not, so any example code would be great!
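For concreteness, here is a minimal sketch of those two suggestions together. It assumes a plain `words(word, vector)` table like the one shown later in this thread, and the sample rows are made up:

```python
import json
import sqlite3

con = sqlite3.connect("de.sqlite")

# Hypothetical pre-parsed (word, vector) pairs standing in for the real data.
parsed = [("hallo", [0.1] * 300), ("welt", [0.2] * 300)]
rows = [(word, json.dumps(vector)) for word, vector in parsed]

# `with con:` wraps the whole batch in a single transaction (one BEGIN/COMMIT),
# and executemany() avoids a Python-level execute() call per row.
with con:
    con.executemany("insert into words(word, vector) values (?, ?)", rows)

con.close()
```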
I have lost the code, sorry. If I remember right, the fix was to create the index only after all the data had been inserted.
I am seeing the same performance problem: populating the virtual table takes forever.
Here is the Python code I'm using. Note that the part where it becomes slow is at the very end, where all the vectors have already been inserted and we proceed to populating the virtual table (`insert into vss_... select ...`). The vectors I'm working with here have 300 dimensions, in case that matters.
If you see any obvious problems with my code, I'd be grateful to hear about them!
```python
import io, shutil, json, sqlite3, sqlite_vss

# create an empty database:
shutil.copyfile("empty.sqlite", "de.sqlite")
con = sqlite3.connect("de.sqlite")

# load sqlite_vss into the database connection:
con.enable_load_extension(True)
sqlite_vss.load(con)
con.enable_load_extension(False)

# insert the words and their vectors:
fname = "/media/michmech/Iomega_HDD/bigstuff/cc.de.300.vec"
line_count = 0
f = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
for line in f:
    line_count += 1
    if line_count > 10000: break
    tokens = line.rstrip().split(' ')
    word = tokens[0]
    vector = list(map(float, tokens[1:]))
    if len(vector) == 300:
        con.execute("insert into words(word, vector) values(?, ?)", (word, json.dumps(vector)))
    print(line_count, word)
con.commit()

# NOTE: up to this point everything runs very quickly; from here on it becomes dead slow
# create and populate an index of the vectors:
con.execute("create virtual table vss_words using vss0( vector(300) )")
con.execute("insert into vss_words(rowid, vector) select rowid, vector from words")
con.commit()
con.close()
```
In case it matters, the `words` table is defined like this:
```sql
CREATE TABLE "words" (
  "word" TEXT,
  "vector" TEXT
);
```
and has an index on the `word` column:
```sql
CREATE INDEX "ix_words" ON "words" (
  "word" ASC
);
```
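Not the author's code, but a hedged sketch of how the slow final phase could at least be made observable: populate the virtual table in rowid chunks, one transaction per chunk, and print progress. The chunk size is an arbitrary assumption, the script assumes the `words` table is non-empty with reasonably dense rowids, and a custom factory that needs training may behave differently than the default index:

```python
import sqlite3
import sqlite_vss

con = sqlite3.connect("de.sqlite")
con.enable_load_extension(True)
sqlite_vss.load(con)
con.enable_load_extension(False)

con.execute("create virtual table vss_words using vss0( vector(300) )")

# Highest rowid in the source table; assumes the table is non-empty.
(max_rowid,) = con.execute("select max(rowid) from words").fetchone()

CHUNK = 1000  # arbitrary chunk size; tune as needed
for lo in range(1, max_rowid + 1, CHUNK):
    hi = lo + CHUNK - 1
    with con:  # one transaction per chunk
        con.execute(
            "insert into vss_words(rowid, vector) "
            "select rowid, vector from words where rowid between ? and ?",
            (lo, hi),
        )
    print(f"indexed rowids {lo}-{hi} of {max_rowid}")

con.close()
```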
Here is a visual of how ingest time scales versus the number of embeddings. If I log both axes, it looks approximately linear.
I also noticed that there only seems to be a single thread running for the entire duration of the ingest.
I am using embeddings with dimension 2560.
I am using Python and have installed sqlite-vss via pip, if that makes a difference.
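For anyone wanting to reproduce the scaling measurement, here is a hedged sketch that times the virtual-table ingest at increasing sizes on random data. The table names, the sizes, and the random vectors are all assumptions for illustration; only the dimension (2560) comes from the comment above:

```python
import json
import random
import time
import sqlite3
import sqlite_vss

DIM = 2560  # the dimension reported above

con = sqlite3.connect(":memory:")
con.enable_load_extension(True)
sqlite_vss.load(con)
con.enable_load_extension(False)

for n in (500, 1000, 2000, 4000):  # arbitrary sizes for the scaling curve
    table = f"vss_t_{n}"  # one fresh table per size
    con.execute(f"create virtual table {table} using vss0( vector({DIM}) )")
    rows = [
        (i, json.dumps([random.random() for _ in range(DIM)]))
        for i in range(1, n + 1)
    ]
    t0 = time.perf_counter()
    with con:  # one transaction for the whole batch
        con.executemany(f"insert into {table}(rowid, vector) values (?, ?)", rows)
    print(n, round(time.perf_counter() - t0, 2), "seconds")

con.close()
```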