ashvardanian / StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖
https://ashvardanian.com/posts/stringzilla/
Apache License 2.0
2.05k stars 66 forks source link

Bindings for Rust #66

Closed michaelgrigoryan25 closed 7 months ago

michaelgrigoryan25 commented 8 months ago

Hey @ashvardanian, Hope you are having a great new year so far!

This pull-request includes a partial bindings implementation for StringZilla. Currently there is a single function binding for sz_count_char. Separate tests from the Rust side are currently failing, but I believe that it is due to some silly mistake in the code, we will have to look further.

The way bindings are implemented for Rust is the following:

To properly bind the functions, I had to create separate header and source files and basically wrap the original functions inside of a non inline static function context (see https://github.com/rust-lang/rust-bindgen/discussions/2405), so that bindgen can generate the bindings. Here's an example from rust-bindings.c:

sz_size_t si__count_char(sz_string_start_t const haystack,
                         sz_size_t const haystack_length,
                         sz_string_start_t const needle) {
    return sz_count_char(haystack, haystack_length, needle);
}

Let me know what you think. Best, Mikayel.

ashvardanian commented 8 months ago

Hi, thanks for the PR, @michaelgrigoryan25! I am afraid, a separate header is not a scalable solution. Let's think of a way to patch the primary header to avoid that.

michaelgrigoryan25 commented 8 months ago

@ashvardanian,

I just pushed a fix which now generates the proper definitions for inline statics automatically from the header file (https://github.com/rust-lang/rust-bindgen/discussions/2405). At this point we are good to go and we can start implementing a Rust-friendly interface for the library. There are no linker errors, but we still need to figure out why the following test case is failing

#[test]
fn count_char() {
    let haystack = super::String::new("abba");
    assert_eq!(haystack.count_char("b"), 2);
}

Here is the output:

---- tests::count_char stdout ----
thread 'tests::count_char' panicked at 'assertion failed: `(left == right)`
  left: `1`,
 right: `2`', rust/lib.rs:36:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I think this has something to do with the way that we are currently handling sz_string_view_t creation, though I am not entirely sure.

michaelgrigoryan25 commented 8 months ago

I have also managed to fix the failing test case. It was happening because I was using a pointer to the temporarily allocated CString. The test passes now.

michaelgrigoryan25 commented 8 months ago

For some reason the following test case is passing, is this behavior expected @ashvardanian:

        let haystack = super::String::new("abba");
        assert_eq!(haystack.count("ab"), 2); // 2 == 2 passes, but there is only 1 'ab'

Is this the expected behavior for count?

ashvardanian commented 8 months ago

The version you are taking from the main branch can only count the number of occurrences of one character. The main-dev is vastly different and more functional.

Can we avoid reimplementing the string class in Rust and just override some of it's methods?

michaelgrigoryan25 commented 8 months ago

We can achieve something like that using custom trait the following way in our lib.rs:

trait Stringzilla {
    fn sz_count(&self, ...) -> ...;
}

impl Stringzilla for String {
    fn sz_count(&self, ...) -> ... { ... };
}

and then whenever we want to make these methods available for the String type, we would just do the following in the code:

use stringzilla::Stringzilla;

let x = String::from("x").sz_count(...);
michaelgrigoryan25 commented 8 months ago

Is there any documentation about StringZilla and it's core functions which need to be implemented? I have started working on my fork's main-dev branch to create the Rust bindings from the up-to-date version of the library.

michaelgrigoryan25 commented 8 months ago

I will also try to make sure to include no_std support, so that the crate does not depend on Rust's standard library.

ashvardanian commented 7 months ago

That is a great suggestion, @michaelgrigoryan25! Sorry for a late response, I've been a bit overwhelmed the last weeks.

  1. Regarding the PR, please make sure to use the most recent main-dev variant as the baseline. The upcoming v3 C API is quite different v2.
  2. Regarding the Rust bindings, to make the project as light as possible, I avoid bindgen and only use cc, which pre-compiles the C dynamic library with multiple hardware accelerate backends.
  3. It's not trivial to override existing Rust String and str methods with StringZilla variants due to name collisions. One approach I was considering - is to use a wrapper type. For now I've only implemented a wrapper for sz_find as a starting point. Alternatively we can define a custom class - Str and reimplement all the interfaces for compatibility with the standard, similar to what I did for C++, but this may take over a thousand lines and will need a proper test coverage. What are your thoughts on this?
michaelgrigoryan25 commented 7 months ago

Hey @ashvardanian, no worries, I will make sure to update you on the 1st and 2nd points as we progress through this PR (I might need to create a new one since this one has a few issues with branching and stuff).

Regarding the 3rd point, it is impossible to override existing methods in the standard library and for that matter in any external libraries. The only way to go with this without re-implementing the whole String struct would be to create a trait with all the StringZilla methods and then import them using use statements. Other libraries like rayon also implement custom operations for standard types this way (see https://docs.rs/rayon/latest/rayon/iter/trait.ParallelIterator.html#method.for_each). I think re-implementing the whole String type is pointless unless we find that there are significant performance gains from doing this.

ashvardanian commented 7 months ago

For now, the whole API contains only global functions:

pub fn find<H: AsRef<[u8]>, N: AsRef<[u8]>>(haystack: H, needle: N) -> Option<usize>
pub fn rfind<H: AsRef<[u8]>, N: AsRef<[u8]>>(haystack: H, needle: N) -> Option<usize>
pub fn find_char_from<H: AsRef<[u8]>, N: AsRef<[u8]>>(haystack: H, needles: N) -> Option<usize>
pub fn rfind_char_from<H: AsRef<[u8]>, N: AsRef<[u8]>>(haystack: H, needles: N) -> Option<usize>
pub fn find_char_not_from<H: AsRef<[u8]>, N: AsRef<[u8]>>(haystack: H, needles: N) -> Option<usize>
pub fn rfind_char_not_from<H: AsRef<[u8]>, N: AsRef<[u8]>>(haystack: H, needles: N) -> Option<usize>

Not all of the functionality is implemented yet.

Functionality C 99 C++ 11 Python Swift Rust
Substring Search
Character Set Search
Edit Distance
Small String Class
Sequence Operation
Lazy Ranges
Fingerprints

Rust styling and docs can also be improved.

michaelgrigoryan25 commented 7 months ago

Looks good @ashvardanian! I suggest closing this PR and opening an issue instead, and link this PR to it.