asg017 / sqlite-vss

A SQLite extension for efficient vector search, based on Faiss!

Very slow for moderate number of embeddings #106

Open Nintorac opened 1 year ago

Nintorac commented 1 year ago

Here is a visual of how ingest time scales with the number of embeddings. If I log both axes, it looks approximately linear.

I also noticed that there only seems to be a single thread running for the entire duration of the ingest.

I am using embeddings with dimension 2560.

[figure: ingest time vs. number of embeddings]

I am using Python and have installed sqlite-vss via pip, if that makes a difference.

asg017 commented 12 months ago

Do you happen to have the code you used to ingest embeddings into sqlite-vss? It shouldn't take 30 minutes to insert 30k vectors. I suspect there are a number of fixes that could make it much faster.

It also depends on whether you're using a custom factory or not, so any example code would be great!
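
For what it's worth, a minimal timing harness along the following lines can help show how insert time scales. This is only a sketch: the table name vss_demo, the row count, and the random vectors are placeholders, and the vss0 usage simply mirrors the example further down in this thread.

import json, random, sqlite3, time
import sqlite_vss

DIM = 2560     # dimension reported above
N = 30_000     # number of vectors to ingest (placeholder)

# in-memory database with the vss0 extension loaded
con = sqlite3.connect(":memory:")
con.enable_load_extension(True)
sqlite_vss.load(con)
con.enable_load_extension(False)

con.execute(f"create virtual table vss_demo using vss0( vector({DIM}) )")

start = time.time()
with con:  # one transaction for the whole ingest
    for rowid in range(1, N + 1):
        vec = [random.random() for _ in range(DIM)]
        con.execute(
            "insert into vss_demo(rowid, vector) values (?, ?)",
            (rowid, json.dumps(vec)),
        )
        if rowid % 1000 == 0:
            print(f"{rowid} vectors ingested, {time.time() - start:.1f}s elapsed")
print(f"total: {time.time() - start:.1f}s")
con.close()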

Nintorac commented 11 months ago

I have lost the code, sorry. If I remember right, this was to create the index after all the data had been inserted.

michmech commented 1 month ago

I am seeing the same performance problem: populating the virtual table takes forever.

Here is the Python code I'm using. Note that the part where it becomes slow is at the very end, where all the vectors have already been inserted and we proceed to populate the virtual table (insert into vss_... select ...). The vectors I'm working with here have 300 dimensions, in case that matters.

If you see any obvious problems with my code, I'd be grateful to hear about them!

import io, shutil, json, sqlite3, sqlite_vss

# create a fresh database from a template (which already contains the empty "words" table):
shutil.copyfile("empty.sqlite", "de.sqlite")
con = sqlite3.connect("de.sqlite")

# load sqlite_vss into the database connection:
con.enable_load_extension(True)
sqlite_vss.load(con)
con.enable_load_extension(False)

# insert the words and their vectors: 
fname = "/media/michmech/Iomega_HDD/bigstuff/cc.de.300.vec"
line_count = 0
f = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
for line in f:
  line_count += 1
  if line_count > 10000: break
  tokens = line.rstrip().split(' ')
  word = tokens[0]
  vector = list(map(float, tokens[1:]))
  if len(vector)==300:
    con.execute("insert into words(word, vector) values(?, ?)", (word, json.dumps(vector)))
    print(line_count, word)
con.commit()

# NOTE: up to this point everything runs very quickly; from here on it becomes dead slow

# create and populate an index of the vectors:
con.execute("create virtual table vss_words using vss0( vector(300) )")
con.execute("insert into vss_words(rowid, vector) select rowid, vector from words")
con.commit()

con.close()

In case it matters, the words table is defined like this:

CREATE TABLE "words" (
  "word" TEXT,
  "vector" TEXT
);

and has an index on the word column:

CREATE INDEX "ix_words" ON "words" (
  "word" ASC
);
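
In case it helps with debugging, one way to see how that final step scales is to replace the single insert into vss_words ... select with smaller batches and print the elapsed time after each one. This is just a diagnostic sketch that continues the script above; the batch size and the assumption that words has contiguous rowids starting at 1 are my own.

import time

BATCH = 1_000
max_rowid = con.execute("select max(rowid) from words").fetchone()[0]

start = time.time()
for lo in range(1, max_rowid + 1, BATCH):
  hi = lo + BATCH - 1
  # copy one batch of vectors from the plain table into the vss0 index
  con.execute(
    "insert into vss_words(rowid, vector) "
    "select rowid, vector from words where rowid between ? and ?",
    (lo, hi),
  )
  print(f"rows {lo}-{hi}: {time.time() - start:.1f}s elapsed")
con.commit()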