BurntSushi / aho-corasick

A fast implementation of Aho-Corasick in Rust.
The Unlicense
1.03k stars 93 forks source link

Discussion: Location for Python wrapper #71

Closed itamarst closed 3 years ago

itamarst commented 3 years ago

Hi,

I think we briefly talked on Hacker News about preliminary experiment I did wrapping your library with Python. After further testing, it appears meaningfully faster than existing alternatives (pyahocorasick, cyac) so I am going to be turning this into a real Python package.

There are two ways this could work:

In either case I would write the code, so not asking for any of your effort in the first pass.

The benefits of the latter are that things like new releases of Rust package could (probably?) automatically do Python releases with latest Rust version. The downsides to you are that the Python package is now on your critical path to releases etc., so plausibly it will eventually cause extra work for you in the future.

What do you prefer?

BurntSushi commented 3 years ago

Ah thanks for reaching out!

While in an ideal world, it would be conceptually nice to have them as part of the same project, your anticipation is on the money here: I do not have the bandwidth to take on language binding maintenance. I appreciate your offer to do maintenance, but I generally don't merge stuff into repos unless I'm pretty confident that I can shoulder its maintenance should it be required. And then there's the specific problem of it being in my release critical path.

So yeah, I would definitely prefer it lives outside this repo. For what it's worth, that's what I've done before with Go bindings to the regex crate. I could have put them in the regex crate repo itself, but it wasn't clear to me whether they would be maintained long term because I didn't know whether folks would use it or not.

After further testing, it appears meaningfully faster than existing alternatives (pyahocorasick, cyac) so I am going to be turning this into a real Python package.

Ooooo that's cool! I'd love to see the benchmarks if you have them. :-)

itamarst commented 3 years ago

Totally understand, that's what I would have said, just wanted to check just in case :)

I'll try to include some benchmarks in the new repo README.

BurntSushi commented 3 years ago

Maybe one thing that's worth saying is that I might be willing to maintain a C API for aho-corasick. I do it for regex for example. I think the extent to whether this is useful to you probably depends on how you do the ffi from Python.

itamarst commented 3 years ago

Further testing suggests the performance improvement isn't as dramatic as originally thought, I was comparing apples to oranges (overlapping algorithm vs. non-overlapping).

BurntSushi commented 3 years ago

@itamarst Interesting. If there's a way for me to run your benchmarks, I'd be happy to take a look and see if I can explain what's happening.

itamarst commented 3 years ago

Hi, it's up now, with some benchmarks: https://github.com/G-Research/ahocorasick_rs

It might actually undercount how much faster your code is, and Python serialization/deseralization overhead is a confounding factor.

For your purposes, the interesting thing is that pyahocorasick's overlapping algorithm seems faster.

BurntSushi commented 3 years ago

@itamarst That is quite interesting! I have a few questions if you don't mind. If you don't know the answers immediately, then no worries, I'll try to dig into them later. For the others, I suppose they are suggestions for your benchmarks.

The overlapping case, and its difference with the others, is indeed interesting. My first step there would be to investigate latency given the short haystack. One way latency might impact things here is that a overlapping search produces more matches, and each match costs more.

I don't mean to say that latency is unimportant. Quite the contrary, it's rather important. But it might be a clue for where to look. It could be a difference in the bindings, but it could also be a problem in the aho-corasick crate itself. e.g., Do your bindings do any additional work that pyahocorasick isn't doing? Maybe there's a transcoding step or some allocs or something. Not sure, just spitballing.

Apologies for the stream-of-consciousness style response!

itamarst commented 3 years ago

Thanks for looking!

The initial benchmarks run internally by my client on a prototype actually showed better performance for your code, FWIW, than this benchmark. This is me vaguely trying to reproduce my understanding of what they care about, and I don't think they actually care about overlapping, so I don't think I'm going to spend time on it, just thought I'd mention it. And yeah could be artifact of bad benchmarks.

Customer will hopefully test this code in a few days, and I can report what they say.

pyahocorasick returns an iterator, my benchmarks convert it to list. I considered the approach of returning an iterator to Python, but... doing anything on the Python side is sooooo sloooow. So it seemed better to just return a list, because then construction can be done Rust-side.

There is definitely a Rust<->Python serialization cost. Internally Python stores strings in arrays that aren't UTF-8 (they're single byte per character for ASCII, but wider arrays for more-than-ASCII? It's actually optimized enough that sometimes the actual string operation logic is faster than Rust's.) And of course constructing the Python output objects, one of reasons I cache patterns.

Anyway I'll see what customer says but I'm pretty happy with current code as first pass since you didn't spot anything obviously wrong (and for educational purposes I am actually happy having a library that is only faster on some benchmarks—users really do need to benchmark their use case).

BurntSushi commented 3 years ago

@itamarst Thanks for the response! I'd definitely be happy to hear any updates or whatever you can share.

itamarst commented 3 years ago

With a different benchmark, with ~4000 patterns (popular first names) and 700-character haystacks that have no matches 90% of the time, and one match 10% of time, ahocorasick_rs is 5-7× faster than pyahocorasick, for both overlapping and non-overlapping.

BurntSushi commented 3 years ago

@itamarst Ah nice, thanks for sharing! I believe that is consistent with my latency hypothesis.

itamarst commented 3 years ago

With PyO3 0.14 (which removed some overhead from calling into Rust from Python), ahocorasick_rs is now faster than pyahocorasick for both short and long benchmarks.

BurntSushi commented 3 years ago

w00t! Thanks for sharing. :-)