crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.42k stars 1.62k forks source link

PCRE2 "fast path" API #13150

Open Blacksmoke16 opened 1 year ago

Blacksmoke16 commented 1 year ago

Feature Request

Follow up to https://github.com/crystal-lang/crystal/issues/3852#issuecomment-943968239, currently regexes are compiled with JIT mode enabled, which provides a good boost in performance. However PCRE(2) also has another fast path API that avoids some sanity checks and can offer an additional 10%+ boost. E.g.

Crystal 1.8.0-dev [91d418738] (2023-03-03)

LLVM: 14.0.6
Default target: x86_64-pc-linux-gnu

$ ./bin/crystal run --release bench.cr
          starts_with?  15.97M ( 62.63ns) (± 3.00%)  0.0B/op   2.53× slower
              matches?  27.47M ( 36.40ns) (±11.41%)  0.0B/op   1.47× slower
fast path starts_with?  37.19M ( 26.89ns) (± 5.33%)  0.0B/op   1.09× slower
    fast path matches?  40.37M ( 24.77ns) (± 3.83%)  0.0B/op        fastest

The catch is not all regexes can be used in this context, or may even benefit from it, so it doesn't make sense to globally enable it like we did JIT mode.

It would be nice if there was some way to force a particular regex pattern to use this "fast path" API for cases where you know that pattern will be used a lot, or within a hot path. The first thought that come to mind is just some extra boolean ivar used to determine which method to call. Alternatively it could be added into the Regex::Options enum to allow regex literals to also be used. However, given this is somewhat related to #11320 in that it's PCRE(2) specific it may not ultimately be worth implementing, tho at least worth discussing.

WDYT?

Blacksmoke16 commented 1 year ago

I looked into what PHP does and they apparently use this by default that is if pcre.jit option is enabled, which it is by default, and the pattern was compiled successfully with a non-zero JIT size:

https://github.com/php/php-src/blob/43c71dfe52c4dda9ef971321995363b7e966b401/ext/pcre/php_pcre.c#L807-L828

I was a bit confused on how they got the PCRE2_ANCHORED option to work given the docs say its not supported at match time. I then noticed they're ONLY passing the NO_UTF_CHECK to jit_match and realized I think it works for them because they don't have to deal with the notion of match time options since you have to pass the whole pattern which would have all the options defined at compile time.

So in short, it may be enough to just check the passed options in match_data and if they do not contain any unsupported options, use the fast path API, otherwise use the normal flow.

Another more extreme option would be to just deprecate the ANCHORED option in favor of an in-pattern approach as mentioned in the API docs:

This effect can also be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.

Tho IMO, prob not worth the effort.

straight-shoota commented 1 year ago

That could be a viable path to explore. However I'm not sure how much such an external match-time check differs from the checks in pcre2_match itself. But maybe there is something to gain. It certainly wouldn't hurt to try. Under the premise that such a change can have subtle intricacies, I'd suggest to defer that until after the initial transition to PCRE2 is carried out (i.e. provisionally targeting the 1.9 or 1.10 release).

Blacksmoke16 commented 1 year ago

However I'm not sure how much such an external match-time check differs from the checks in pcre2_match itself.

In a simple benchmark I put together, quite a lot. Like up to 30% in some cases. The main reasoning is we're already passing the subject size, Crystal strings are already UTF-8, and Crystal strings are \0 terminated when passed to C libs. The latter two are things it checks for, probably among others. In the end its a drop in replacement to pcre2_match, bar needing to support the handful of unsupported match time options.

But yea, waiting a bit sounds good to me :+1:.