asg017 / sqlite-vss

A SQLite extension for efficient vector search, based on Faiss!
MIT License
1.59k stars 59 forks source link

Plugging memory leaks #60

Closed polterguy closed 1 year ago

polterguy commented 1 year ago

Sorry but this is a massive pull request. Since the library is in alpha stage, version 0.1, I figured it should be OK. The pull request fixes tons of issues, in particular related to memory management.

Memory management fixes

The most important thing I did was as follows;

  1. Get rid of as many new/delete/malloc/free constructs as possible, preferring stack allocated objects where possible. E.g. changing auto foo = new bar(); to bar foo; where possible. This prevents memory leaks if errors occurs and the code returns early before releasing the associated memory, and is considered more "defensive coding".
  2. Where I could not apply the above, I tried to use unique_ptr as much as possible for heap allocated objects. Same argument as above, only different process
  3. Plugged tons of memory leaks. For instance VssSearchParams was never deleted in the original code due to missing 4th argument to sqlite3_result_pointer when creating it and providing it to SQLite's core. However, there are at least 10 similar added constructs that should significantly reduce the amount or memory leaks.
  4. Got rid of as many sqlite3_malloc constructs as possible, and replaced with new/delete, creating encapsulated structs in the process, making it easier to further eliminate leaks, since the object itself is responsible for deleting heap memory it's allocating.

Code structural changes

The struct called vss_index is new. I realised the original way you had kept record of indexes, to_be_delete IDs, to_be_added IDs/data could be structured significantly better by encapsulating the index itself into its own class - Which is the role of the vss_index type.

If you carefully look at the vss_index_vtab type, you will see that I've removed 4 to 5 vectors from your original code, and replaced with a single vector of type vss_index, who's responsibility it is to keep track of the faiss::Index pointers, and the records to delete and add in the vssIndexSync function.

The code is now using better variable names, such as changing from q and s to query and sql instead. I also ran the thing through clang-formater which significantly improved the code's readability.

I also used better error messages, with correct capitalisation and such, which is why the unit tests had to be changed.

Completely removed the Vector0Global structure from the vector0_api_from_db since it served no purpose, and instead created vector0_api directly, using new/delete.

Created public inheritance in of the following classes, instead of the original "C-style inheritance logic".

Additions and removals

I removed the vss_debug function taking zero parameters, realising it was just remnants from your testing as you had started the library. In addition I added a new function called vss_memory_usage that invokes the existing faissMemoryUsageFunc function. I am not sure how much point the latter is, since it seems to return 0 always anyways. However, I figured it might be useful to further track down memory leaks, and/or profile the library's usage of memory in the future.

Conclusion

This is a massive pull request, for that I am sorry - However, the library's code in its original state was impossible for me to read, because of lack of coding standard, badly formatted code, unintuitive variable names, and redundant and complex structures, and/or functions.

This pull request should fix some roughly 10+ memory leaks in total, and I even copied and pasted every single line of code one function at the time to ChatGPT asking it to look for memory leaks - It couldn't find any.

I've also seen significant drops of memory usage in our own Kubernetes cluster which has Prometheus installed, allowing me to easily track memory usage. For instance, a process of creating 1,536 dimensional vectors (OpenAI) towards a database of 12,000 records, reduced the memory footprint by 50%. Implying in its previous state, the library was for all practical concerns useless in a web app, while now it's at least possible to use, since it leaks much less memory.

The indexes are still consuming huge amount of memory, and I don't know why - I suspect this might be simply because I don't know enough about faiss to completely understand it - But the library code itself should be at least somewhat 98% "tight".

There is still remaining work to be done, to clean up things. I've added some "TODO" comments in the code. In particular the first invocation to sqlite3_create_function_v2 in the "sqlite-vector.cpp" file seems to be "acting funny". I will write up separate issues for these, in addition to issues for additional questions I might have - But hopefully you're happy with my pull request, and willing to merge it in.

It's an amazing piece of library you've created, and I am looking forward to contribute more to it. For me, the library is a crucial component in our systems, and hence my motivation for spending so much time on it. I wouldn't mind continuing improving upon it if you allow me to.

I'd say this pull request easily by itself moves the thing almost into "BETA stage'ish" from its current alpha stage. Let me know if you find something I need to change for you to merge this into your codebase, and I'll have a look at it.

ankon commented 1 year ago

Thanks for your work! Going to spend some time reading today I guess :)

Linking to https://github.com/asg017/sqlite-vss/issues/59 for visibility there.

polterguy commented 1 year ago

Psst ... ^_^

polterguy commented 1 year ago

Psst, there's no way you can make sense out of this by reading the diff files. I'd checkout the entire branch side by side, and just look at the code in two separate IDEs or something to understand it. I don't think there's a single line of code I didn't change ... :)

asg017 commented 1 year ago

Hey @polterguy , thanks again for all your hard work here! Can't imagine how long it would take to understand my crap code...

I'll be reviewing this PR over the weekend, and have full intentions to eventually merge this PR in. Most of the changes sound reasonable and should be quick to review.

A few notes from my quick look:

polterguy commented 1 year ago

Can't imagine how long it would take to understand my crap code

Hehe, the code wasn't that bad. Besides, I know how it is, you've got an amazing idea, you implement it in its first stage as "alpha testing out thesis stuff", and you iterate over it with time, making it better. The thing is an amazing piece of tech though, so I wouldn't say it the way you did :)

I would like to keep vss_debug()

It's still there, it just doesn't have that "output "yo" parts" ... ;)

But if you want to pull it back the way it was, go for it - I just didn't see the purpose of something printing out the static text "yo" every time it's invoked ...

like the weird vector blob format with headers

If you write up tasks for these, I can pick the "low hanging fruits" and help you out. I'm really motivated to help out with this lib, since it's a crucial component for us to have working. On my wish list, there's

However, write up whatever you think needs to be done, and I'll pick all the "low hanging fruits". However, you probably know a lot more about both SQLite plugins, and faiss I presume - So I'll prioritise the "C++ type of issues" that are stuff I know how to handle easily ...

asg017 commented 1 year ago

It's still there, it just doesn't have that "output "yo" parts" ... ;)

LOL my bad, I guess I forgot to update it!

And re: low-hanging fruits, I think for now this PR is sufficient! We can tackle those other problem in future PRs

polterguy commented 1 year ago

Psst, the irony 😁

polterguy commented 1 year ago

Hi mate, how's this thing going? ^_^

Any news on when to merge? I've got other things I want to work with, way smaller things, but I don't want to add too much to this (already monster) PR ...

Psst, I finally figured out how to do dimension reduction. I didn't realise you had to do it in two operations. First train the index, then insert. Thx, I had to read your example code 5 times before I saw it :D

polterguy commented 1 year ago

I removed the size check in vssIndexFilter since if we're using PCA for instance, the input vector will never have the same size as the size of vectors in the index.

How's the merging going?

asg017 commented 1 year ago

@polterguy can you actually revert that last commit? A search on an index with a query vector is always of the same dimensions of the index. If you use a dimension reduction technique like PCA in a factory string, I believe Faiss will handle the dimension reduction for you.

The docs here say that the x input vectors are always of size n * d (and in vssIndexFilter, n is always 1 for now) https://faiss.ai/cpp_api/struct/structfaiss_1_1Index.html#:~:text=virtual%20void%20search(idx_t%20n%2C%20const%20float%20*x%2C%20idx_t%20k%2C%20float%20*distances%2C%20idx_t%20*labels%2C%20const%20SearchParameters%20*params%20%3D%20nullptr)%20const%20%3D%200

polterguy commented 1 year ago

Hi Alex, the way I've understood it the PCA reduces by a divisor, implying using PCA4 on 1,536 results in index size of 384. When you later search with a 1,536 dimension vector, the index size is 384 and the search vector is 1,536, resulting in an error. Did I misunderstand it?

I tried many permutations of this, and the above was the only thing I could get working ... :/

asg017 commented 1 year ago

@polterguy see this comment for details about that specific problem. Also see the faiss docs, which shows that the number immediately following "PCA" in a factory string is the number of dimensions to reduce to, and not a divisor

polterguy commented 1 year ago

Done (reverted), I need to experiment more with this I presume since I still don't "get it" - But I'll figure it out :)

polterguy commented 1 year ago

OK, the whole "training issue" was a little bit embarrassing. It seems you'll need minimum n number of training vectors to train CPA, where n equals the dimension of the index - Which is why it failed for me. Sorry, I have closed the issue.

However, when exceptions occurs vssIndexUpdate it would kill the process instead of returning errors to caller. I've pushed another commit that fixes this, and returns error messages. Another issue it fixes is the reserve parts of the training vectors, and insert/delete vectors, which was being erroneously calculated, and is now correctly calculating the new size of these vectors.

I would appreciate it if you merge in this PR, such that I can start working on another one ASAP, to avoid having this one grow "into infinity and beyond" ...

polterguy commented 1 year ago

I plugged the error related to creating the vector0 function by changing the invocation to the following;

        rc = sqlite3_create_function_v2(db,
                                        "vector0",
                                        1,
                                        SQLITE_UTF8,
                                        api,
                                        vector0,
                                        0,
                                        0,
                                        delete_api);

Your original (failing code) was using sqlite3_create_function. Now when creating the vector0 function, it returns SQLITE_OK instead of an error.

I suspect there might be some breaking changes between the different versions of APIs from SQLite, and that we should only use the same API in all function invocations. I'll sanity check this to make sure we don't mix API versions ...

asg017 commented 1 year ago

Ok finally took the time to review: This is fantastic @polterguy , thank you again for your hard work! A few things I especially loved seeing:

Will merge soon and make a new v0.1.1-alpha.20 release, and probably drop a v0.1.1 release in the next day or so. Although I am seeing a segfault + infinite hang when running make test-loadable, so investigating now

polterguy commented 1 year ago

Thx mate ^_^

I might have added a couple of bugs in the new code that wasn't there - However, I suspect I fixed 10x as many bugs as I added. Thx for the najs words ^_^

Yet again, this thing is an incredible thing, and I deeply appreciate your work on this. I've been looking for something like this since the 23rd of December when I "found" OpenAI and realised I needed to do similarity search on the database itself. It optimises queries for us from (table scan + dot product in C#) 30 seconds on 5,000 records down to 0.02 seconds using your plugin.

For us, this thing is a life saver - And I want to continue helping you out on it as much as possible, also in the future ^_^

polterguy commented 1 year ago

BTW, seg faults typically originates from writing past some malloced array or de-referencing a freed pointer, etc. There are some "dubious" places left in the code, something pointed out by @ankon among other things in one of the comments further up, where I suspect the pointer arithmetic is "interesting" from a seg fault's perspective. However, I saved those for you, realising you probably understand these parts better than me @asg017 :)

However, in "daily usage" (my usage), I can confirm that the thing has ZERO leaks, at least on querying. I'll run through some more tests for index building and training to to verify. Something you can verify on a Mac with e.g. ...

ps -axm -o %mem,rss,comm | grep NAME_OF_PROCESS
asg017 commented 1 year ago

This PR has been merged and released in v0.1.1-alpha.20. Please try it out and let me know if it works! Planning a v0.1.1 release soon.

And @polterguy , f2d5cd18329435745ee086ebadd73becb29627d6 fixed the issue I found: Just a couple out-of-bounds checks, nothing crazy!