String search kernel optimisations

samuelcolvin commented 4 months ago

The main context for this is well described by https://github.com/BurntSushi/memchr/pull/156.

I think (in rough order of impact) we should:

[ ] switch from str.contains to memchr
[ ] switch from str.starts_with to memchr if possible, otherwise quick_strings::starts_with - there's no "what if the haystack is very long" concern since we're only looking at the start of the string, so the difference between memchr and quick_strings won't be as big, and might even be negative
[ ] switch from using starts_with_ignore_ascii_case to quick_strings::istarts_with
[ ] same for *ends_with
[ ] switch from Regex to quick_strings::icontains (copying the code) for ILIKE - maybe we have to check it's actually faster for large haystacks? - this might have the biggest impact in some scenarios, but we should be careful
[ ] to benefit from those improvements, switch some direct uses of str.contains etc. in like.rs to use Predicate

(I'm not suggesting that we make quick_strings a dependency; it was just a scratch experiment. If we use any of that code we should copy it.)
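
To make the list above concrete, here is a rough sketch of the shape this could take: a pattern is turned into a predicate once and then evaluated against many haystacks, with ASCII-only case-insensitive helpers standing in for the istarts_with / iends_with / icontains ideas. Everything here is an assumption for illustration: the enum, its variant names and the helper functions are hypothetical, not the existing Predicate in arrow-rs and not the quick_strings code. It uses only the standard library and does not handle Unicode case folding or LIKE wildcards.

```rust
// Hypothetical sketch: names are illustrative, not the arrow-rs or quick_strings API.

/// A pattern chosen once, then evaluated against many haystacks.
pub enum Predicate<'a> {
    Contains(&'a str),
    StartsWith(&'a str),
    EndsWith(&'a str),
    // ASCII-only case-insensitive variants (no Unicode case folding).
    IStartsWithAscii(&'a str),
    IEndsWithAscii(&'a str),
    IContainsAscii(&'a str),
}

impl<'a> Predicate<'a> {
    pub fn evaluate(&self, haystack: &str) -> bool {
        match self {
            Predicate::Contains(n) => haystack.contains(*n),
            Predicate::StartsWith(n) => haystack.starts_with(*n),
            Predicate::EndsWith(n) => haystack.ends_with(*n),
            Predicate::IStartsWithAscii(n) => istarts_with(haystack.as_bytes(), n.as_bytes()),
            Predicate::IEndsWithAscii(n) => iends_with(haystack.as_bytes(), n.as_bytes()),
            Predicate::IContainsAscii(n) => icontains(haystack.as_bytes(), n.as_bytes()),
        }
    }
}

/// Compare only the first `needle.len()` bytes, ignoring ASCII case.
fn istarts_with(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.len() >= needle.len()
        && haystack[..needle.len()].eq_ignore_ascii_case(needle)
}

/// Compare only the last `needle.len()` bytes, ignoring ASCII case.
fn iends_with(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.len() >= needle.len()
        && haystack[haystack.len() - needle.len()..].eq_ignore_ascii_case(needle)
}

/// Naive sliding-window search; a real implementation would want something
/// faster for long haystacks, which is exactly what needs benchmarking.
fn icontains(haystack: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() {
        return true; // `windows(0)` would panic, and "" is contained in everything
    }
    haystack
        .windows(needle.len())
        .any(|w| w.eq_ignore_ascii_case(needle))
}
```

With this shape, Predicate::IStartsWithAscii("HEL").evaluate("hello") only ever touches the first three bytes of the haystack, which is why haystack length should matter much less for the prefix and suffix cases than for contains.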

samuelcolvin commented 4 months ago

I'm keen to try and work on this.

alamb commented 4 months ago

Thanks @samuelcolvin

I think in general the basic requirement for performance optimizations in this crate is benchmarks that show performance improvements to justify the additional code complexity / maintenance burden.

I think there are already several cargo bench style benchmarks for string operations -- maybe a good first step would be to review them and add any additional cases you think are not covered that would benefit from the optimizations described above
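
As a rough illustration of the kind of case worth covering, the sketch below times a candidate matcher over a set of long haystacks where the needle sits at the very end (the worst case for a forward scan). It is only an assumption of what such a check could look like: it uses std::time::Instant to stay self-contained, whereas the real benchmarks should go through the existing cargo bench setup, and the function names are hypothetical.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Build a haystack of `len` filler bytes with `needle` appended at the end.
fn make_haystack(len: usize, needle: &str) -> String {
    let mut s = "a".repeat(len);
    s.push_str(needle);
    s
}

/// Run `matcher` over every haystack and return (match count, elapsed time).
/// `black_box` keeps the optimiser from discarding the work being timed.
fn time_matcher<F>(haystacks: &[String], matcher: F) -> (usize, Duration)
where
    F: Fn(&str) -> bool,
{
    let start = Instant::now();
    let mut hits = 0usize;
    for h in haystacks {
        if matcher(black_box(h.as_str())) {
            hits += 1;
        }
    }
    (hits, start.elapsed())
}
```

Calling time_matcher once with |h| h.contains("needle") and once with the candidate replacement, over both short and long haystacks, gives a first indication of whether an optimisation is worth the extra code before investing in a proper benchmark.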

alamb commented 4 months ago

I think @Dandandan and @jhorstmann are especially excited by low level optimizations like this 😁

samuelcolvin commented 3 months ago

While working on this, I found #6145. We should merge that, then rebase and review the other PRs here.

apache / arrow-rs

String search kernel optimisations #6107