ashvardanian / StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖
https://ashvardanian.com/posts/stringzilla/
Apache License 2.0

Rust Bindings #73

Closed michaelgrigoryan25 closed 6 months ago

michaelgrigoryan25 commented 7 months ago

Continuation of #66.

michaelgrigoryan25 commented 7 months ago

@ashvardanian regarding the "fingerprints" in the table that you've shared in the PR, is it the same as sz_hash?

ashvardanian commented 7 months ago

Not the same, but related. Fingerprints are rolling hashes, which are used to populate a bitset.
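
To make the idea concrete, here is a minimal sketch of a rolling hash feeding a bitset. It is not StringZilla's implementation: the function name `fingerprint`, the multiplier `BASE`, and the 64-bit bitset width are all assumptions chosen for illustration.

```rust
/// Hypothetical helper (not part of StringZilla): slides a window of `window`
/// bytes over `text`, hashes each window with a polynomial rolling hash, and
/// sets the bit selected by each hash in a 64-bit fingerprint.
pub fn fingerprint(text: &[u8], window: usize) -> u64 {
    const BASE: u64 = 257; // arbitrary multiplier, an assumption for this sketch
    if window == 0 || text.len() < window {
        return 0;
    }
    // BASE^(window - 1), needed to remove the outgoing byte's contribution.
    let mut top: u64 = 1;
    for _ in 1..window {
        top = top.wrapping_mul(BASE);
    }
    // Hash of the first window, computed from scratch.
    let mut hash: u64 = 0;
    for &byte in &text[..window] {
        hash = hash.wrapping_mul(BASE).wrapping_add(byte as u64);
    }
    let mut bits: u64 = 1u64 << (hash % 64);
    // Every later window is derived from the previous hash in O(1).
    for i in window..text.len() {
        let outgoing = text[i - window] as u64;
        hash = hash.wrapping_sub(outgoing.wrapping_mul(top));
        hash = hash.wrapping_mul(BASE).wrapping_add(text[i] as u64);
        bits |= 1u64 << (hash % 64);
    }
    bits
}
```

Two inputs that share many windows will share many set bits, so the fingerprints can be compared cheaply before running a more expensive comparison.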

michaelgrigoryan25 commented 7 months ago

In that case, which StringZilla function generates fingerprints?

ashvardanian commented 7 months ago

@michaelgrigoryan25, it's called sz_fingerprint_rolling 🤗

I am not sure what the best Rust interface for it should look like, so let's keep it for the end.

michaelgrigoryan25 commented 7 months ago

These are the most commonly used string types in Rust:

I can write a macro that implements a common trait for all of these types, so that methods like sz_find become available on them directly once the trait is imported with use.

ashvardanian commented 7 months ago

Sure. How about the AsRef<[u8]> I currently use?

michaelgrigoryan25 commented 7 months ago

That would work.
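
For reference, a minimal sketch of how a blanket implementation over AsRef<[u8]> could replace the macro. The trait name `StringZillaExt` and the method signature are hypothetical, and a naive scan stands in for the call into the C library.

```rust
/// Hypothetical extension trait; the name and signature are illustrative only.
pub trait StringZillaExt {
    /// Returns the byte offset of the first occurrence of `needle`, if any.
    fn sz_find<N: AsRef<[u8]>>(&self, needle: N) -> Option<usize>;
}

// One blanket impl covers &str, String, Vec<u8>, [u8], byte arrays, and any
// other type that can be viewed as a byte slice, so no macro is needed.
impl<T: AsRef<[u8]> + ?Sized> StringZillaExt for T {
    fn sz_find<N: AsRef<[u8]>>(&self, needle: N) -> Option<usize> {
        let haystack = self.as_ref();
        let needle = needle.as_ref();
        if needle.is_empty() {
            return Some(0);
        }
        // Placeholder: a naive scan standing in for the accelerated C routine.
        haystack.windows(needle.len()).position(|w| w == needle)
    }
}
```

With the trait in scope, calls such as `"hello world".sz_find("world")` or `vec![1u8, 2, 3].sz_find([2u8, 3])` resolve through the same impl.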

michaelgrigoryan25 commented 7 months ago

@ashvardanian https://github.com/michaelgrigoryan25/StringZilla/commit/4f4ace3e165886636f946252d1100c689fdce80a

ashvardanian commented 7 months ago

@michaelgrigoryan25 this looks good! Want to open a PR or want to add a few more things before that?

michaelgrigoryan25 commented 7 months ago

Sure, let's do it right now.

ashvardanian commented 7 months ago

Thanks a lot, great patches, @michaelgrigoryan25! In C++ I've implemented lazily evaluated convenience functions, like find_all, rfind_all, split_all, rsplit_all, and so on. That took around 400 lines of code. I think it might be a great idea to implement them in Rust as well. What do you think? Would you be interested in adding those, along with the Levenshtein / Needleman-Wunsch alignment scores?
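
In Rust, the lazy part maps naturally onto the Iterator trait. A rough sketch of what a find_all could look like, assuming non-overlapping matches; the `FindAll` type and the free function are hypothetical, and the naive scan is a stand-in for the accelerated search:

```rust
/// Hypothetical lazy iterator over the start offsets of non-overlapping matches.
pub struct FindAll<'a> {
    haystack: &'a [u8],
    needle: &'a [u8],
    offset: usize, // where the next search begins
}

/// Builds the iterator; no searching happens until `next` is called.
pub fn find_all<'a>(haystack: &'a [u8], needle: &'a [u8]) -> FindAll<'a> {
    FindAll { haystack, needle, offset: 0 }
}

impl<'a> Iterator for FindAll<'a> {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        // An empty needle yields nothing in this sketch, which avoids an endless stream.
        if self.needle.is_empty() || self.offset > self.haystack.len() {
            return None;
        }
        let rest = &self.haystack[self.offset..];
        // Placeholder: a naive scan standing in for the accelerated search.
        let found = rest
            .windows(self.needle.len())
            .position(|w| w == self.needle)?;
        let start = self.offset + found;
        // Resume after the match so that matches never overlap.
        self.offset = start + self.needle.len();
        Some(start)
    }
}
```

Because it is an ordinary iterator, callers can stop early with `take`, `find`, or a `break`, and only the matches actually consumed get searched for.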

michaelgrigoryan25 commented 7 months ago

@ashvardanian definitely, let's do it!

ashvardanian commented 7 months ago

As mentioned in #79, I am not sure about the right course of action here. Other operations, like #82 or random string generation, might be more relevant. We should also benchmark against memchr and other native Rust string projects.

ashvardanian commented 6 months ago

Benchmarks are ready.