BurntSushi / aho-corasick

A fast implementation of Aho-Corasick in Rust.
The Unlicense

Benchmarks results aren't clear #113

Closed Mart-Bogdan closed 1 year ago

Mart-Bogdan commented 1 year ago

It would be nice to mention benchmark results in the README.

I ran all the benches, but looking at them didn't give me any sense of how fast this library is.

I looked at results like:

same/packed/teddy/threebytes/nomatch                                                    
same/packed/rabinkarp/threebytes/nomatch     

And it blows my mind🤯.

I've looked at the benchmarks; it looks like they're all about measuring this library, perhaps to compare against a previous build.

Are all these benches internal?

I am interested to know how it compares to Rust's String::contains and String::replace if we are using aho-corasick with a single needle (I think it would be slower), two needles, N needles, etc.

Perhaps these are two different issues:

Basically, when is it beneficial to use this lib, and at what number of substrings?

BurntSushi commented 1 year ago

Yes, the benchmarks are internal. Publishing real benchmarks The Right Way is a ton of work. I'm about to do it for regex engines and it has taken me literal months of work. I have no plans of doing that for this crate.

Basically, when is it beneficial to use this lib, and at what number of substrings?

Someone else can't answer this for you. You should define a benchmark that models your work load. Then you should run them and make a decision based on that.

I am interested to know how it compares to Rust's String::contains and String::replace if we are using aho-corasick with a single needle (I think it would be slower), two needles, N needles, etc.

In a default configuration, this crate will use memmem from the memchr crate if you only hand it a single pattern. That routine will blow the pants off of std's substring search. So that means that, even for a single pattern, you can use this crate just fine and it should be faster than what you can do with std.

This crate is used by the regex crate for literal optimizations. It tries to do the best it can with what you give it.

If you give it a "small" number of patterns and you're running on x86_64, then it will automatically try to use a vectorized multi-substring matcher that doesn't use the Aho-Corasick algorithm at all. It can be an order of magnitude faster than Aho-Corasick.

But I have to stress: the only way to answer your question is to define a benchmark that is meaningful to your use case.

Mart-Bogdan commented 1 year ago

In a default configuration, this crate will use memmem from the memchr crate if you only hand it a single pattern. That routine will blow the pants off of std's substring search. So that means that, even for a single pattern, you can use this crate just fine and it should be faster than what you can do with std.

Oh, wow. I thought the std version was highly SIMD optimized.

Do I understand correctly: if a single pattern is provided, your library uses a different implementation than when there are many items, so instead of generating an automaton it uses memmem?

If you give it a "small" number of patterns and you're running on x86_64, then it will automatically try to use a vectorized multi-substring matcher that doesn't use the Aho-Corasick algorithm at all.

Oh I see. That's really cool. Thank you.

I'll try doing some benchmarks myself.

BurntSushi commented 1 year ago

Oh, wow. I thought the std version was highly SIMD optimized.

Yeah, it's not. We would probably like it to be, but substring search is in core and at least right now, we don't have a great way of using ISA extensions in core. I'm not sure anyone has any great ideas on how to make it work either. It's pretty unfortunate.

Do I understand correctly: if a single pattern is provided, your library uses a different implementation than when there are many items, so instead of generating an automaton it uses memmem?

For a single pattern, yes. You should be able to confirm it by looking at a profile. If it isn't happening then it's probably a bug and I'd happily look into it if you can provide a reproduction.

And note that as I said above, even if you provide multiple patterns, an Aho-Corasick automaton still might not be used at all. It just depends. A lot of it is heuristics at this point.

(Note that an actual automaton will always be generated IIRC. It's just not always used for searches.)

I'll try doing some benchmarks myself.

Aye. Good. Let me know how it goes!

Mart-Bogdan commented 1 year ago

By the way, in std, contains has SIMD optimisation, but only at the level of SSE2.

https://github.com/rust-lang/rust/blob/a0daf22b95ff1cd3f7ac55ea9370987535f3134d/library/core/src/str/pattern.rs#L963-L968

I thought the one that searches for the index of an occurrence had it as well, but it turns out it doesn't. https://github.com/rust-lang/rust/blob/84dd17b56a931a631a23dfd5ef2018fd3ef49108/library/core/src/str/pattern.rs#L1062

BurntSushi commented 1 year ago

Ah yeah, forgive the senility. That ends up working because SSE2 isn't an ISA extension, it's part of x86_64.

Mart-Bogdan commented 1 year ago

Hmmmm, that's right. On every 64-bit Intel system, SSE2 is present.

Strange that it checks both conditions: x86_64 AND the SSE2 feature.

BurntSushi commented 1 year ago

Hmmmm, that's right. On every 64-bit Intel system, SSE2 is present.

Strange that it checks both conditions: x86_64 AND the SSE2 feature.

It might be because, technically, the OS can choose not to support SSE2 by not supporting calling conventions that use xmm registers. But that's just a guess. I don't know for sure.

Mart-Bogdan commented 1 year ago

Benchmarks are hard :D

I've made some draft benchmarks, but results are confusing.

But preliminary results are:

str::contains is faster on small haystacks on x86

str::matches is slower; it's good to use aho-corasick even for a single needle

str::replace is also slower, even for a single needle

I think I'll try to create more organized results, probably closer to the end of the week.

P.S. criterion is weird, as it shows results at different scales, mixing ns and µs in the same output. That caused me to believe that creating a new instance of AhoCorasick inside the loop was 2 times faster, when it was actually 500 times slower.

BurntSushi commented 1 year ago

Why not share your code and inputs so that others can run it?

And yes, there is often a latency versus throughput trade off. See: https://docs.rs/memchr/latest/memchr/#why-use-this-crate

BurntSushi commented 1 year ago

An expectation that one implementation is always faster in every case is unrealistic. This is why I said you should benchmark your specific use case. Then you can make a judgement call about whether the differences in, say, latency merit a more complicated implementation on your end, or whether just using aho-corasick is good enough.

Mart-Bogdan commented 1 year ago

Why not share your code and inputs so that others can run it?

it's really dirty now :D