asg017 / sqlite-vss

A SQLite extension for efficient vector search, based on Faiss!
MIT License
1.59k stars 59 forks source link

More improvements #68

Closed polterguy closed 1 year ago

polterguy commented 1 year ago

Much smaller PR this time. The most important parts are the new SqlStatement class that automatically takes care of finalizing the statement and freeing any SQL associated with the SQL statement. It makes the code much more readable, and also secure from a memory point of view, in case somebody forgets to finalize a statement, or free some SQL string.

Feel free to add methods on this as needed to expand upon it. I also got rid of the "alternative" logic, trying to keep all DB access constructs to a single concept, getting rid of the exec parts, SQL string concatenation, etc - To keep the parts of the API we're using from SQLite as small as possible (less learning required for future maintainers, and less confusion).

SqlStatement is a naive and minimalistic class that automatically takes care of memory, with deterministic destruction, etc. The "C++ way" you might argue ...

In addition, the SQLITE_STATIC is kind of important I guess, since as you were loading indexes from shadow tables, you previously used SQLITE_TRANSIENT which would copy the input blob as you were writing the index, which first of all would double memory, and secondly consume lots of CPU resources copying. This is especially important in the write_index_insert function. In a web app with "conservative garbage collection" this should be a big deal to avoid memory fragmentation, etc!

I also got rid of some smaller memory leaks (leaking SQL strings during table scan) ++

Psst, don't look at the commits, they contain the merges I pulled in from your repo as I merged in the merge from my former PR. Look at the "Files changed" section.

One question about the table scan parts; Do we really need to keep the statement open? Can't we just iterate over all rows, and store results in a vector like we do for the other parts? Keeping the statement open creates more complex code, and keeps the statement open longer. Even with an index of 500,000 records, we're talking about 2MB of memory, which in the larger scheme really isn't that much, or ..?

There is still the bug related to "error during initialization: vector0: unable to delete/modify user-function due to active statements" - I suspect this is because of the statement being kept open during table scans. I'll run some more tests tomorrow to check this out. It's kind of an important issue for us because of .Net's "connection pooling" which will reuse the same database connection for multiple HTTP requests.

I am 90% certain of that this is related to that we keep the statement open during table scans. I've added some TODO comments, some of which are questions. Feel free to enlighten me in these parts ... :)

polterguy commented 1 year ago

I changed the logic of table scans, such that it runs through all records during initialisation, stores these as a vector of ints, and traverses these using similar logic as search and range_search does today. The memory usage shouldn't be a problem, since even with 500,000 records, we're talking about 500,000 x 4 (sizeof(float)) becoming 2MB, and only for a small amount of time, as the records are traversed.

I also did a lot of additional cleanups, such as making sure we're creating the SQL functions with similar logic in both CPP files, for clarity purposes, and added more error checking, etc ...

Plus it's now possible to select stuff returning nothing. Previously this would crash actually. Now it doesn't crash, but simply returns null results instead.

polterguy commented 1 year ago

There is something wrong with this PR, it's crashing sometimes. I'll figure out why - Don't merge (yet!)