PCRE2Project / pcre2

PCRE2 development is now based here.
Other
842 stars 169 forks source link

Alternative branch should match shared prefix only once #411

Open mvorisek opened 1 month ago

mvorisek commented 1 month ago

This is a feature request for engine optimization.

Repro: https://3v4l.org/JKn6D/rfc#vgit.master_jit

Such dummy regexes like (a|b|...|z) are very common for language grammars or routing.

The repro shows massive speedup (from 3.4 ms to 0.82 ms). According regex101.com the speedup is 865 vs. 93 steps only.

The engine should detect shared prefix/subpattern in alternative branch like (ab?c|ab?d) and optimize it into (ab?(?:c|d)).

Why we need this optimization? Consider these patterns:

such usecases need internal optimization support as they cannot be optimized outside the engine.

zherczeg commented 1 month ago

Optimizations are not that easy in general. Pattern to pattern optimizations are usually done by other projects such as https://github.com/zherczeg/repan . The general workflow is optimizing patterns at compile time, and the application uses the optimized variants. This saves time and energy, since the costly steps are done once.

mvorisek commented 1 month ago

Pattern to pattern optimizations are usually done by other projects such as https://github.com/zherczeg/repan .

The repan project implements this feature as "merge alternatives" - https://github.com/zherczeg/repan/blob/25d6e3aabf/tests/opt/test_merge_alternatives_expected.txt.

Such optimization is very powerful with a little extra CPU cycles needed when compiling the pattern and with zero to minimal extra CPU cycles needed when evaluating the compiled pattern. Such optimization should be done by PCRE library itself as many typical usages do not involve compile phase (interpreted languages, user search inputs, ...) and requiring this to be done by intermediate library would require parsing the regex twice and requiring all vendors to maintain extra dependency.

I have updated the description to stress there are cases which can and should be optimized internally but cannot be done by "pattern to pattern optimizations" anyway.

PhilipHazel commented 1 month ago

Now that it's recorded as an issue, somebody might pick this up in the future. It won't be me, because after I manage to get 10.44 released (soon, I hope) I want to start looking for someone to take over the PCRE2 project because I am getting too old. :-(