asg017 / sqlite-vss

A SQLite extension for efficient vector search, based on Faiss!
MIT License
1.59k stars 59 forks source link

Add Elixir hex package for support within Exqlite. #33

Closed trodrigu closed 1 year ago

trodrigu commented 1 year ago

Allows easy installation of sqlite-vss extension files within mix as well as a helper function for creating search queries.

~I was going to remove the search_query function but I remembered how much trouble it caused me to create an Ecto query using the vss_search function so I left it (still open to remove though).~ (removed now)

I haven't ran mix hex.publish as I want to iron out the details for the name, version etc... To do that we can follow this.

In order to use this locally in my mix.exs I am just linking directly to the path like

      {:sqlite_vss, path: "/Users/thomas/Code/sqlite-vss/elixir/"},

Thanks again for creating this library!

Some Notes

~This does not include the checksum validation yet as I'm looking for proof of concept first. This will include emulating what the RustlerPrecompiled library does.~ Checksum will be generated from the Github action.

~This does not include checking for outdated versions that have been installed since the files being installed don't have a version attached (maybe we can add the configured version to the directory name or something).~ (pulled from package version)

Parts of the documentation needs to be updated and along with the package the hexdocs need to be published too.

trodrigu commented 1 year ago

@asg017 I totally vibe with every comment and intend to refactor it for each one! I don't foresee getting this out until Tuesday as I'll be in Moab for a few days. Thanks for the review!

asg017 commented 1 year ago

Hey just a heads up, I just published v0.1.1-alpha.7 which slightly changes the .tar.gz release aftifact names.

Before, the artifacts were named:

sqlite-vss-v0.1.0-vector0-darwin-aarch64.tar.gz
sqlite-vss-v0.1.0-vector0-darwin-x86_64.tar.gz
sqlite-vss-v0.1.0-vector0-linux-x86_64.tar.gz
sqlite-vss-v0.1.0-vss0-darwin-aarch64.tar.gz
sqlite-vss-v0.1.0-vss0-darwin-x86_64.tar.gz
sqlite-vss-v0.1.0-vss0-linux-x86_64.tar.gz

But they are now named:

sqlite-vss-v0.1.1-alpha.7-loadable-linux-x86_64.tar.gz
sqlite-vss-v0.1.1-alpha.7-loadable-macos-aarch64.tar.gz
sqlite-vss-v0.1.1-alpha.7-loadable-macos-x86_64.tar.gz

With the following changes:

No rush for this PR, feel free to take your time - and I hope you enjoyed Moab!

trodrigu commented 1 year ago

@asg017 We are coming along here! I've been working on an issue with the decompression stage. I'm getting an error {:error, :invalid_tar_checksum}. The funny part is the mac's tar -xf works on them. Specifically it doesn't work on the loadable releases but then it work on the previous vss0/vector0 releases. I'm working on making a node reproducible script but its still in the works.

asg017 commented 1 year ago

real quick, yea that's a bug i have with the upload script, will push a fix shortly. The .tar.gz files are accidentally .zip files. My bad!

On Fri, May 26, 2023, 2:12 PM Thomas Rodriguez @.***> wrote:

@asg017 https://github.com/asg017 We are coming along here! I've been working on an issue with the decompression stage. I'm getting an error {:error, :invalid_tar_checksum} https://github.com/asg017/sqlite-vss/pull/33/commits/c3c8478cfecddda1092858a4d20fdc9c3448aeec. The funny part is the mac's tar -xf works on them. Specifically it doesn't work on the loadable releases but then it work on the previous vss0/vector0 releases. I'm working on making a node reproducible script but its still in the works.

— Reply to this email directly, view it on GitHub https://github.com/asg017/sqlite-vss/pull/33#issuecomment-1564958564, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADTZXV4P2ECECJ742ABPL3DXIEMB7ANCNFSM6AAAAAAYEK3WEI . You are receiving this because you were mentioned.Message ID: @.***>

asg017 commented 1 year ago

Ok @trodrigu v0.1.1-alpha.8 will have the fix (as soon as the github action step finishes, which may take ~15 mins).

The .tar.gz assets from before were accidentally a .zip file, because of this 1-line bug. I also learned that tar -xvzf will happy unpack a .zip file without complaining, which confused me for quite some time....

Sorry about that!

trodrigu commented 1 year ago

Thanks that worked!

trodrigu commented 1 year ago

@asg017 In c5a25d6 I added the ability to produce a checksum file which can be included in the package. After reading through https://github.com/phoenixframework/tailwind/tree/main (an installer for tailwind in Phoenix projects) I noticed that it doesn't deal with any checksum files. I think its because the checksum isn't the responsibility of the "installer" which is what I'm thinking this Elixir package is going to be. In order to keep things simpler I removed it in 4c5f73d. I was curious on what your opinion is?

asg017 commented 1 year ago

Thanks @trodrigu for all your hard work!

re checksums: I can generate the checksums.exs during the release process, should it go under bindings/elixir/test/fixtures/checksum-sqlite-vss.exs ? In the Github Actions workflow I already have some code that generates sha256 checksums for the release asssets (which is used to make a checksums.txt file on releases), so I can adapt it a bit to automatically generate the checksum-sqlite-vss-exs file before publish. Theres already a few .tmpl files scatter across the repo that I can re-use.

Let me know when you think this PR is ready: I'll copy+paste this directory into sqlite-hello and test the publishing there

trodrigu commented 1 year ago

should it go under bindings/elixir/test/fixtures/checksum-sqlite-vss.exs ?

Typically, they are generated for the current working directory so in this case it would be bindings/elixir. They also don't get checked into Git and are generated only when the package is being published.

trodrigu commented 1 year ago

Let me know when you think this PR is ready

PR is ready for review.

Things to note about publishing on Hex include having 1 hour to revise any published packages. I added a reference to the sqlite-vss-checksum.exs in the package of the mix.exs so that it will be included in the publish. Before publishing, mix docs should be ran to generate the docs.

asg017 commented 1 year ago

Ok @trodrigu I ported this over to sqlite-hello and published https://hex.pm/packages/sqlite_hello from https://github.com/asg017/sqlite-hello/tree/main/bindings/elixir , can you let me know if it seems to do what you expect? If so I'll try out sqlite-vss

I do have a template for https://github.com/asg017/sqlite-hello/blob/main/bindings/elixir/sqlite-hello-checksum.exs.tmpl which generates sqlite-hello-checksum.exs , but not too sure if the actual package is using it anywhere

trodrigu commented 1 year ago

I'm out atm but I'll check tonight or early tomorrow morning.

trodrigu commented 1 year ago

Package looks good on first inspection!

trodrigu commented 1 year ago

can you let me know if it seems to do what you expect

I was able to install sqlite_hello in my mix.exs file to an app I have deployed on fly.io. I feel comfortable moving this to _vss.

trodrigu commented 1 year ago

I do have a template for https://github.com/asg017/sqlite-hello/blob/main/bindings/elixir/sqlite-hello-checksum.exs.tmpl which generates sqlite-hello-checksum.exs , but not too sure if the actual package is using it anywhere

The idea behind publishing the checksum is that it can help prevent supply chain attacks. You can see how its used on package publish in Rustler Precompiled for new artifacts.

asg017 commented 1 year ago

Thanks again @trodrigu for all the hard work! Going to merge this in, make a few changes, and see if publishing works here.

Though as far as I can tell, the sqlite-hello-checksum.exs gets published inside the Hex package, but we don't do any integrity checking when we download the .tar.gz files, right?

trodrigu commented 1 year ago

but we don't do any integrity checking when we download the .tar.gz files, right?

Yes, this is correct. We can implement a decompress and verify type function which verifies a cached download or incoming data from the web.