Speed regression in regex trie

Perl / perl5

🐪 The Perl programming language

https://dev.perl.org/perl5/

Other

1.96k stars 555 forks source link

Speed regression in regex trie #19953

Open pengvado opened 2 years ago

pengvado commented 2 years ago

I have a case-insensitive regex consisting of several string-literals in alternation. This should compile to a single TRIE-EXACTFU, and in perl-5.26 and earlier it did. But in 5.28 through latest blead, some possible string values fail to merge into the TRIE, and end up with BRANCH nodes instead, which is significantly slower.

The commits that introduced this speed-regression are: 141513f22bae5fcd32067ac152c1fcfe5a1b1a5d by @khwilliamson on 2018-02-04, which splits non-letters out of an EXACTFU node and turns them into EXACT instead. 6f0fba9bd70f6f3481c9d325e460600f33289639 by @khwilliamson on 2018-12-08, which turns a length-1 EXACTFU into an ANYOFM.

Those commits are legitimate optimizations under other circumstances, they only become a problem when the nodes in question could have been TRIE entries.

Testcase:

use Benchmark qw(:all);
my $str = "z" x 1000;
timethis(-5, sub { $str =~ m/foo|bar|a_word|_qux/i; }, "perl $^V");

perl v5.26.3: 5 wallclock secs ( 5.24 usr + 0.00 sys = 5.24 CPU) @ 66496.95/s (n=348444) perl v5.28.3: 5 wallclock secs ( 5.33 usr + 0.00 sys = 5.33 CPU) @ 17828.14/s (n=95024) perl v5.30.3: 5 wallclock secs ( 5.28 usr + 0.00 sys = 5.28 CPU) @ 11205.68/s (n=59166) perl v5.36.0: 5 wallclock secs ( 5.33 usr + 0.00 sys = 5.33 CPU) @ 11535.27/s (n=61483)

And the same testcase with use re Debug=>"DUMP":

perl-5.26.3:
Compiling REx "foo|bar|a_word|_qux"
Final program:
   1: TRIE-EXACTFU (14)
      <foo> 
      <bar> 
      <a_word> 
      <_qux> 
  14: END (0)
stclass AHOCORASICK-EXACTFU minlen 3

perl-5.28.3:
Compiling REx "foo|bar|a_word|_qux"
Final program:
   1: BRANCH (14)
   2:   TRIE-EXACTFU (19)
        <foo> (19)
        <bar> (19)
        <a> (10)
  10:     EXACT <_> (12)
  12:     EXACTFU <word> (19)
  14: BRANCH (FAIL)
  15:   EXACT <_> (17)
  17:   EXACTFU <qux> (19)
  19: END (0)
minlen 3

perl-5.30.3 and 5.36.0:
Compiling REx "foo|bar|a_word|_qux"
Final program:
   1: BRANCH (7)
   2:   TRIE-EXACTFU (19)
        <foo> 
        <bar> 
   7: BRANCH (14)
   8:   ANYOFM[Aa] (10)
  10:   EXACT <_> (12)
  12:   EXACTFU <word> (19)
  14: BRANCH (FAIL)
  15:   EXACT <_> (17)
  17:   EXACTFU <qux> (19)
  19: END (0)
minlen 3

Perl configuration

``` Summary of my perl5 (revision 5 version 36 subversion 0) configuration: Commit id: b3c502b607191da0e743a4fa34501a05442305b3 Platform: osname=linux osvers=4.19.0-gentoo archname=x86_64-linux uname='linux localhost 4.19.0-gentoo #3 smp preempt x86_64 intel(r) core(tm) i7-2700k cpu @ 3.50ghz genuineintel gnulinux ' config_args='-de -Dmksymlinks -Dcc=/home/loren/bin/gcc-11.3' hint=previous useposix=true d_sigaction=define useithreads=undef usemultiplicity=undef use64bitint=define use64bitall=define uselongdouble=undef usemymalloc=n default_inc_excludes_dot=define Compiler: cc='/home/loren/bin/gcc-11.3' ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2' optimize='-O2' cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong' ccversion='' gccversion='11.3.0' gccosandvers='' intsize=4 longsize=8 ptrsize=8 doublesize=8 byteorder=12345678 doublekind=3 d_longlong=define longlongsize=8 d_longdbl=define longdblsize=16 longdblkind=3 ivtype='long' ivsize=8 nvtype='double' nvsize=8 Off_t='off_t' lseeksize=8 alignbytes=8 prototype=define Linker and Libraries: ld='gcc-11.3' ldflags =' -fstack-protector-strong' libpth=/home/loren/local/lib /usr/lib64 /usr/local/lib64 /home/loren/local/lib64 /lib64 /home/loren/local/lib32 libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc libc=libc-2.31.so so=so useshrplib=false libperl=libperl.a gnulibc_version='2.31' Dynamic Linking: dlsrc=dl_dlopen.xs dlext=so d_dlsymun=undef ccdlflags='-Wl,-E' cccdlflags='-fPIC' lddlflags='-shared -O2 -fstack-protector-strong' Characteristics of this binary (from libperl): Compile-time options: HAS_TIMES PERLIO_LAYERS PERL_COPY_ON_WRITE PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP PERL_OP_PARENT PERL_PRESERVE_IVUV USE_64_BIT_ALL USE_64_BIT_INT USE_LARGE_FILES USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE USE_LOCALE_NUMERIC USE_LOCALE_TIME USE_PERLIO USE_PERL_ATOF Built under linux Compiled at Jul 11 2022 23:45:02 %ENV: PERL5LIB="/home/loren/local/lib/perl5/" @INC: /home/loren/local/lib/perl5/ /home/loren/src/cc/local/lib/perl5/site_perl/5.36.0/x86_64-linux /home/loren/src/cc/local/lib/perl5/site_perl/5.36.0 /home/loren/src/cc/local/lib/perl5/5.36.0/x86_64-linux /home/loren/src/cc/local/lib/perl5/5.36.0 ```

khwilliamson commented 2 years ago

I have been thinking we need another stage to review some of the optimization decisions in larger context. But that's a bunch of work.

One of the reasons I did the splitting out of non-letters is that we construct some sequences that have to match exactly at runtime. Suppose we have a long target to match against, and we know from the pattern it has to contain a colon. We can use a fast grep to find just the colons in the string. But if the colon were part of an EXACTFU, the regex optimizer did not know enough to notice that. There's a lot of work that could be done on the optimizer, but this was a quick way to get that functionality. I don't know much about how tries are scanned for sequences at runtime

demerphq commented 2 years ago

On Thu, 14 Jul 2022 at 19:49, Karl Williamson @.***> wrote:

I have been thinking we need another stage to review some of the optimization decisions in larger context. But that's a bunch of work.

One of the reasons I did the splitting out of non-letters is that we construct some sequences that have to match exactly at runtime. Suppose we have a long target to match against, and we know from the pattern it has to contain a colon. We can use a fast grep to find just the colons in the string. But if the colon were part of an EXACTFU, the regex optimizer did not know enough to notice that. There's a lot of work that could be done on the optimizer, but this was a quick way to get that functionality. I don't know much about how tries are scanned for sequences at runtime

The trie logic currently in simplified terms looks for sequences of two or more alternation branches where the "first" regop is EXACTish and the same type as each other. I quote "first" and say simplified terms because its a bit more tricky than that, as you can have empty nodes at the start of a branch, consider /(?:)X|Y/ , and because there is the difference betweening trieing an entire alternation, or parts of it, an also because there are "jump tries" and non-jump tries, eg, where the "first" regnode is also the "only" regnode in the alternation.

If I understand you correctly the change you are talking about has changed

/(:bar|baz)/i

so that where it used to produce:

BRANCH EXACTFU BRANCH EXACTFU

it is now producing

BRANCH EXACT EXACTFU BRANCH EXACTFU

during the parse phase. Thus since the first regnode after the BRANCH is not the same it won't make a trie out of them.

If we had an EXACTish type that was "known to be foldcase invariant" then the optimization could use that instead of EXACT and the trie optimization could be taught to do the right thing for that case.

But I notice that there is another optimization firing here that IMO is extremely problematic for the trie code and performance, which is converting single character folded characters into a character class (ANYOF). This is visible here:

perl-5.30.3 and 5.36.0: Compiling REx "foo|bar|a_word|_qux" Final program: 1: BRANCH (7) 2: TRIE-EXACTFU (19)

7: BRANCH (14) 8: ANYOFM[Aa] (10) 10: EXACT <_> (12) 12: EXACTFU (19) 14: BRANCH (FAIL) 15: EXACT <_> (17) 17: EXACTFU (19) 19: END (0) minlen 3 ANYOF nodes currently are not trieable *at all* and defeat all the optimizations built into the trie subsystem, which includes common prefix extraction and built in charclass checks. This doesn't feel right to me at all. Anything trieable essentially *can't* be part of a fixed substring check unless the trie engine identifies it as a common prefix (or hypothetically a common suffix, but we don't handle that currently). Changing something that is trieable to something that is not potentially has quadratic consequences under backtracking. I understand the virtuous intent of this optimization at a high level, but it only makes sense when the pattern is in the "critical path" of matching, and anything inside of an alternation (common prefix/suffixes aside) is not in the critical path by definition. In fact the place where we have the most information to determine if something inside of an alternation is in the critical path is when we are building the trie. I find it a bit ironic actually, I once looked into the reverse of this, converting ANYOF into the EXACTFU they match so the trie optimization could kick in more often, but couldn't find a way to do it fast enough for it to be particularly helpful. Also, if we are doing this *early* during the parse phase, then we simply can't know if we are in alternation and disable it. Note that ANYOF really could and maybe should be replaced by TRIE nodes. Character classes are just efficient ways to write an alternation and are formally equivalent to alternation, and thus the ANYOF logic is/should-be a special case of the TRIE logic. Currently it isnt so, as ANYOF predates TRIE, but that is a detail. I have to admit I am concerned about this, it will negatively affect the performance of many patterns. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"

demerphq commented 2 years ago

On Fri, 15 Jul 2022 at 13:34, demerphq @.***> wrote:

On Thu, 14 Jul 2022 at 19:49, Karl Williamson @.***> wrote:

I have been thinking we need another stage to review some of the optimization decisions in larger context. But that's a bunch of work.

One of the reasons I did the splitting out of non-letters is that we construct some sequences that have to match exactly at runtime. Suppose we have a long target to match against, and we know from the pattern it has to contain a colon. We can use a fast grep to find just the colons in the string. But if the colon were part of an EXACTFU, the regex optimizer did not know enough to notice that. There's a lot of work that could be done on the optimizer, but this was a quick way to get that functionality. I don't know much about how tries are scanned for sequences at runtime

The trie logic currently in simplified terms looks for sequences of two or more alternation branches where the "first" regop is EXACTish and the same type as each other. I quote "first" and say simplified terms because its a bit more tricky than that, as you can have empty nodes at the start of a branch, consider /(?:)X|Y/ , and because there is the difference betweening trieing an entire alternation, or parts of it, an also because there are "jump tries" and non-jump tries, eg, where the "first" regnode is also the "only" regnode in the alternation.

If I understand you correctly the change you are talking about has changed

/(:bar|baz)/i

so that where it used to produce:

BRANCH EXACTFU BRANCH EXACTFU

it is now producing

BRANCH EXACT EXACTFU BRANCH EXACTFU

during the parse phase. Thus since the first regnode after the BRANCH is not the same it won't make a trie out of them.

If we had an EXACTish type that was "known to be foldcase invariant" then the optimization could use that instead of EXACT and the trie optimization could be taught to do the right thing for that case.

But I notice that there is another optimization firing here that IMO is extremely problematic for the trie code and performance, which is converting single character folded characters into a character class (ANYOF). This is visible here:

perl-5.30.3 and 5.36.0: Compiling REx "foo|bar|a_word|_qux" Final program: 1: BRANCH (7) 2: TRIE-EXACTFU (19)
7: BRANCH (14) 8: ANYOFM[Aa] (10) 10: EXACT <_> (12) 12: EXACTFU (19) 14: BRANCH (FAIL) 15: EXACT <_> (17) 17: EXACTFU (19) 19: END (0) minlen 3 ANYOF nodes currently are not trieable *at all* and defeat all the optimizations built into the trie subsystem, which includes common prefix extraction and built in charclass checks. This doesn't feel right to me at all. Anything trieable essentially *can't* be part of a fixed substring check unless the trie engine identifies it as a common prefix (or hypothetically a common suffix, but we don't handle that currently). Changing something that is trieable to something that is not potentially has quadratic consequences under backtracking. I understand the virtuous intent of this optimization at a high level, but it only makes sense when the pattern is in the "critical path" of matching, and anything inside of an alternation (common prefix/suffixes aside) is not in the critical path by definition. In fact the place where we have the most information to determine if something inside of an alternation is in the critical path is when we are building the trie. I find it a bit ironic actually, I once looked into the reverse of this, converting ANYOF into the EXACTFU they match so the trie optimization could kick in more often, but couldn't find a way to do it fast enough for it to be particularly helpful. Also, if we are doing this *early* during the parse phase, then we simply can't know if we are in alternation and disable it. Note that ANYOF really could and maybe should be replaced by TRIE nodes. Character classes are just efficient ways to write an alternation and are formally equivalent to alternation, and thus the ANYOF logic is/should-be a special case of the TRIE logic. Currently it isnt so, as ANYOF predates TRIE, but that is a detail. I have to admit I am concerned about this, it will negatively affect the performance of many patterns. We can see the effect of the optimizations becoming a bigger and bigger pessimization over different releases with: perl -Mre=debug -le'BEGIN{print "perl $]";} /a_foo|a_bar/i'

Going back to 5.26.3 we get this:

perl 5.026003 Compiling REx "a_foo|a_bar" Final program: 1: TRIE-EXACTFU (9)

9: END (0) stclass AHOCORASICK-EXACTFU minlen 5 Freeing REx: "a_foo|a_bar" Which means this pattern will be matched by the aho-corasick algorithm (which does not backtrack) without a jump trie. This is nearly as efficient as a DFA would be and executes essentially one operation per character in the string matched and it executes in a relatively low overhead loop. In 5.28 the "split out invariant characters as EXACT" was added, this means we end up with a jump trie because we only build the trie out of the single character EXACTFU "a" nodes. You see two "" lines because it is a jump trie which means it has untrieable nodes which must be executed to complete the match, It is also ahocorasickable, but because it is jump trie, it has to try the EXACT "_" and then "foo" and then "_" and then "bar" for every 'a' it sees. It only has to look at the 'a' once however. Thus its performance is proportional to the length of the string matched + branches * number of times prefix seen. perl 5.028003 Compiling REx "a_foo|a_bar" Final program: 1: TRIE-EXACTFU (15) (4) 4: EXACT <_> (6) 6: EXACTFU (15) (11) 11: EXACT <_> (13) 13: EXACTFU (15) 15: END (0) stclass AHOCORASICK-EXACTFU minlen 5 Freeing REx: "a_foo|a_bar" In 5.34, things are worse in terms of operations, it will basically execute like the jump trie, but it will test for "a" twice per character on top of testing for "_" twice per "a". Thus its performance is number of branches * length of string. perl 5.034001 Compiling REx "a_foo|a_bar" Final program: 1: BRANCH (8) 2: ANYOFM[Aa] (4) 4: EXACT <_> (6) 6: EXACTFU (15) 8: BRANCH (FAIL) 9: ANYOFM[Aa] (11) 11: EXACT <_> (13) 13: EXACTFU (15) 15: END (0) minlen 5 String shorter than min possible regex match (0 < 5) Freeing REx: "a_foo|a_bar" Also notice that because all of these EXACT's are inside of an alternation they are effectively invisible to the fixed string extraction logic. Also anything in an ANYOF is invisible to the fixed string extraction logic as well. This really feels like the wrong direction to go to me. If the pattern contains any alternations at all this kind of mixed EXACTish node type should be disabled. And during parsing we treat every non-alternation as an alternation with one branch, so that might get a bit difficult, as it would require infinite lookahead. If anything these kinds of operations should not happen during the parse phase, but rather in the optimization phase when we know it would be safe to perform because we would know if we are in the patterns critical path. Alternatively we could teach the fixed substring logic to be smarter. This is an ugly problem. I can totally see why this would be useful for a pattern with no alternation at all, but as soon as alternation is used these optimizations will just be pessimizations, especially the ANYOF logic. cheers, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"

hvds commented 2 years ago

I'd still love to see some way of giving the user control over which optimizations are performed, either lexically or at the regexp level (maybe even at the fractional regexp level). That would make workarounds for such pessimizations possible, and allow us usefully to make new optimizations possible that are disabled by default either because they aren't always useful or because they don't always give correct results.

demerphq commented 2 years ago

On Tue, 19 Jul 2022, 13:33 Hugo van der Sanden, @.***> wrote:

I'd still love to see some way of giving the user control over which optimizations are performed, either lexically or at the regexp level (maybe even at the fractional regexp level). That would make workarounds for such pessimizations possible, and allow us usefully to make new optimizations possible that are disabled by default either because they aren't always useful or because they don't always give correct results.

The tried code can be disabled fwiw.

But really optimizations shouldn't ever be pessimisations, users shouldn't need to know how the internals work.

Yves

khwilliamson commented 2 years ago

On 7/15/22 10:28, Yves Orton wrote:

On Fri, 15 Jul 2022 at 13:34, demerphq @.***> wrote:

On Thu, 14 Jul 2022 at 19:49, Karl Williamson @.***> wrote:

I have been thinking we need another stage to review some of the optimization decisions in larger context. But that's a bunch of work.

One of the reasons I did the splitting out of non-letters is that we construct some sequences that have to match exactly at runtime. Suppose we have a long target to match against, and we know from the pattern it has to contain a colon. We can use a fast grep to find just the colons in the string. But if the colon were part of an EXACTFU, the regex optimizer did not know enough to notice that. There's a lot of work that could be done on the optimizer, but this was a quick way to get that functionality. I don't know much about how tries are scanned for sequences at runtime

The trie logic currently in simplified terms looks for sequences of two or more alternation branches where the "first" regop is EXACTish and the same type as each other. I quote "first" and say simplified terms because its a bit more tricky than that, as you can have empty nodes at the start of a branch, consider /(?:)X|Y/ , and because there is the difference betweening trieing an entire alternation, or parts of it, an also because there are "jump tries" and non-jump tries, eg, where the "first" regnode is also the "only" regnode in the alternation.

If I understand you correctly the change you are talking about has changed

/(:bar|baz)/i

so that where it used to produce:

BRANCH EXACTFU BRANCH EXACTFU

it is now producing

BRANCH EXACT EXACTFU BRANCH EXACTFU

during the parse phase. Thus since the first regnode after the BRANCH is not the same it won't make a trie out of them.

If we had an EXACTish type that was "known to be foldcase invariant" then the optimization could use that instead of EXACT and the trie optimization could be taught to do the right thing for that case.

But I notice that there is another optimization firing here that IMO is extremely problematic for the trie code and performance, which is converting single character folded characters into a character class (ANYOF). This is visible here:

perl-5.30.3 and 5.36.0: Compiling REx "foo|bar|a_word|_qux" Final program: 1: BRANCH (7) 2: TRIE-EXACTFU (19)
7: BRANCH (14) 8: ANYOFM[Aa] (10) 10: EXACT <_> (12) 12: EXACTFU (19) 14: BRANCH (FAIL) 15: EXACT <_> (17) 17: EXACTFU (19) 19: END (0) minlen 3 ANYOF nodes currently are not trieable *at all* and defeat all the optimizations built into the trie subsystem, which includes common prefix extraction and built in charclass checks. This doesn't feel right to me at all. Anything trieable essentially *can't* be part of a fixed substring check unless the trie engine identifies it as a common prefix (or hypothetically a common suffix, but we don't handle that currently). Changing something that is trieable to something that is not potentially has quadratic consequences under backtracking. I understand the virtuous intent of this optimization at a high level, but it only makes sense when the pattern is in the "critical path" of matching, and anything inside of an alternation (common prefix/suffixes aside) is not in the critical path by definition. In fact the place where we have the most information to determine if something inside of an alternation is in the critical path is when we are building the trie. I find it a bit ironic actually, I once looked into the reverse of this, converting ANYOF into the EXACTFU they match so the trie optimization could kick in more often, but couldn't find a way to do it fast enough for it to be particularly helpful. Also, if we are doing this *early* during the parse phase, then we simply can't know if we are in alternation and disable it. Note that ANYOF really could and maybe should be replaced by TRIE nodes. Character classes are just efficient ways to write an alternation and are formally equivalent to alternation, and thus the ANYOF logic is/should-be a special case of the TRIE logic. Currently it isnt so, as ANYOF predates TRIE, but that is a detail. I have to admit I am concerned about this, it will negatively affect the performance of many patterns. We can see the effect of the optimizations becoming a bigger and bigger pessimization over different releases with: perl -Mre=debug -le'BEGIN{print "perl $]";} /a_foo|a_bar/i'

Going back to 5.26.3 we get this:

perl 5.026003 Compiling REx "a_foo|a_bar" Final program: 1: TRIE-EXACTFU (9)
9: END (0) stclass AHOCORASICK-EXACTFU minlen 5 Freeing REx: "a_foo|a_bar" Which means this pattern will be matched by the aho-corasick algorithm (which does not backtrack) without a jump trie. This is nearly as efficient as a DFA would be and executes essentially one operation per character in the string matched and it executes in a relatively low overhead loop. In 5.28 the "split out invariant characters as EXACT" was added, this means we end up with a jump trie because we only build the trie out of the single character EXACTFU "a" nodes. You see two "" lines because it is a jump trie which means it has untrieable nodes which must be executed to complete the match, It is also ahocorasickable, but because it is jump trie, it has to try the EXACT "_" and then "foo" and then "_" and then "bar" for every 'a' it sees. It only has to look at the 'a' once however. Thus its performance is proportional to the length of the string matched + branches * number of times prefix seen. perl 5.028003 Compiling REx "a_foo|a_bar" Final program: 1: TRIE-EXACTFU (15) (4) 4: EXACT <_> (6) 6: EXACTFU (15) (11) 11: EXACT <_> (13) 13: EXACTFU (15) 15: END (0) stclass AHOCORASICK-EXACTFU minlen 5 Freeing REx: "a_foo|a_bar" In 5.34, things are worse in terms of operations, it will basically execute like the jump trie, but it will test for "a" twice per character on top of testing for "_" twice per "a". Thus its performance is number of branches * length of string. perl 5.034001 Compiling REx "a_foo|a_bar" Final program: 1: BRANCH (8) 2: ANYOFM[Aa] (4) 4: EXACT <_> (6) 6: EXACTFU (15) 8: BRANCH (FAIL) 9: ANYOFM[Aa] (11) 11: EXACT <_> (13) 13: EXACTFU (15) 15: END (0) minlen 5 String shorter than min possible regex match (0 < 5) Freeing REx: "a_foo|a_bar" Also notice that because all of these EXACT's are inside of an alternation they are effectively invisible to the fixed string extraction logic. Also anything in an ANYOF is invisible to the fixed string extraction logic as well. This really feels like the wrong direction to go to me. If the pattern contains any alternations at all this kind of mixed EXACTish node type should be disabled. And during parsing we treat every non-alternation as an alternation with one branch, so that might get a bit difficult, as it would require infinite lookahead. If anything these kinds of operations should not happen during the parse phase, but rather in the optimization phase when we know it would be safe to perform because we would know if we are in the patterns critical path. Alternatively we could teach the fixed substring logic to be smarter. This is an ugly problem. I can totally see why this would be useful for a pattern with no alternation at all, but as soon as alternation is used these optimizations will just be pessimizations, especially the ANYOF logic.

Why can't the trie construction code be extended to consider these nodes?

The EXACT nodes for things like a colon could be joined in with adjacent EXACTFU nodes after the determination of the places where a colon must appear.

The ANYOFM optimization is done during parsing only if an EXACTFish node isn't suitable. There is further conversion to ANYOFM in study_chunk(), after trie construction code, but I suppose that due to the recursive nature of study_chunk, it could actually happen before the trie code

If you examine the progress of /i pattern matching, you will see frequent recomputation of folds each time a backtrack occurs. I don't know how to cache the results to prevent that. But it turns out that about half the folds in Unicode are such that they could be matched by using a mask. I envision an EXACTFUM node which has two strings. p and m, such that if target t would match iff (t & m) eq p. This would work for UTF-8 patterns and targets without having to compute the code points at all. For long enough patterns, this could be done per-word, leading to a dramatic speed up, at the cost of doubling the size of the node. An EXACTFUM node would subsume adjacent EXACT and EXACTF-ish ones, with the mask for the EXACT bytes being FF.

(Another optimization of speed wrt time would be to have a node contain two strings if the UTF8ness of the target mattered. One string for byte matching; the other for UTF-8. Then the representations wouldn't have to continually be recomputed at run time backtracking.)

I don't know how this would fit in with tries. I haven't looked at those internals in a long time. But I do notice that there is code to do run-time folding. That made sense to me at one time, but no longer does. I would think that folds would be baked into the trie table at compilation time.

hvds commented 2 years ago

I'd still love to see some way of giving the user control over which optimizations are performed, either lexically or at the regexp level (maybe even at the fractional regexp level). That would make workarounds for such pessimizations possible, and allow us usefully to make new optimizations possible that are disabled by default either because they aren't always useful or because they don't always give correct results. The tried code can be disabled fwiw. But really optimizations shouldn't ever be pessimisations, users shouldn't need to know how the internals work.

I'm not sure if you are saying here "no, we should never give users control over which optimizations are performed", but this response can certainly be read that way.

Users know their own data far better than we ever can. They can discover which strategies work for them by trial and error (ie benchmarking) without needing to know how the internals work. And some users have advanced needs that do justify understanding how the internals work.

Beyond that, there are always going to be bugs in the regexp engine.

Right now, it seems to be semi-common knowledge that you can work around most optimization bugs by inserting an optimization-defeater like '(?=)', and over the years I've often had to recommend such tricks (and resort to them). Knowledge over how to make a non-buggy case even faster is much thinner on the ground - that's reserved to experts like you.

demerphq commented 2 years ago

On Tue, 11 Oct 2022 at 10:05, Karl Williamson @.***> wrote:

On 7/15/22 10:28, Yves Orton wrote:

On Fri, 15 Jul 2022 at 13:34, demerphq @.***> wrote:

On Thu, 14 Jul 2022 at 19:49, Karl Williamson @.***> wrote:

I have been thinking we need another stage to review some of the optimization decisions in larger context. But that's a bunch of work.

One of the reasons I did the splitting out of non-letters is that we construct some sequences that have to match exactly at runtime. Suppose we have a long target to match against, and we know from the pattern it has to contain a colon. We can use a fast grep to find just the colons in the string. But if the colon were part of an EXACTFU, the regex optimizer did not know enough to notice that. There's a lot of work that could be done on the optimizer, but this was a quick way to get that functionality. I don't know much about how tries are scanned for sequences at runtime

The trie logic currently in simplified terms looks for sequences of two or more alternation branches where the "first" regop is EXACTish and the same type as each other. I quote "first" and say simplified terms because its a bit more tricky than that, as you can have empty nodes at the start of a branch, consider /(?:)X|Y/ , and because there is the difference betweening trieing an entire alternation, or parts of it, an also because there are "jump tries" and non-jump tries, eg, where the "first" regnode is also the "only" regnode in the alternation.

If I understand you correctly the change you are talking about has changed

/(:bar|baz)/i

so that where it used to produce:

BRANCH EXACTFU BRANCH EXACTFU

it is now producing

BRANCH EXACT EXACTFU BRANCH EXACTFU

during the parse phase. Thus since the first regnode after the BRANCH is not the same it won't make a trie out of them.

If we had an EXACTish type that was "known to be foldcase invariant" then the optimization could use that instead of EXACT and the trie optimization could be taught to do the right thing for that case.

But I notice that there is another optimization firing here that IMO is extremely problematic for the trie code and performance, which is converting single character folded characters into a character class (ANYOF). This is visible here:

perl-5.30.3 and 5.36.0: Compiling REx "foo|bar|a_word|_qux" Final program: 1: BRANCH (7) 2: TRIE-EXACTFU (19)
7: BRANCH (14) 8: ANYOFM[Aa] (10) 10: EXACT <_> (12) 12: EXACTFU (19) 14: BRANCH (FAIL) 15: EXACT <_> (17) 17: EXACTFU (19) 19: END (0) minlen 3 ANYOF nodes currently are not trieable *at all* and defeat all the optimizations built into the trie subsystem, which includes common prefix extraction and built in charclass checks. This doesn't feel right to me at all. Anything trieable essentially *can't* be part of a fixed substring check unless the trie engine identifies it as a common prefix (or hypothetically a common suffix, but we don't handle that currently). Changing something that is trieable to something that is not potentially has quadratic consequences under backtracking. I understand the virtuous intent of this optimization at a high level, but it only makes sense when the pattern is in the "critical path" of matching, and anything inside of an alternation (common prefix/suffixes aside) is not in the critical path by definition. In fact the place where we have the most information to determine if something inside of an alternation is in the critical path is when we are building the trie. I find it a bit ironic actually, I once looked into the reverse of this, converting ANYOF into the EXACTFU they match so the trie optimization could kick in more often, but couldn't find a way to do it fast enough for it to be particularly helpful. Also, if we are doing this *early* during the parse phase, then we simply can't know if we are in alternation and disable it. Note that ANYOF really could and maybe should be replaced by TRIE nodes. Character classes are just efficient ways to write an alternation and are formally equivalent to alternation, and thus the ANYOF logic is/should-be a special case of the TRIE logic. Currently it isnt so, as ANYOF predates TRIE, but that is a detail. I have to admit I am concerned about this, it will negatively affect the performance of many patterns. We can see the effect of the optimizations becoming a bigger and bigger pessimization over different releases with: perl -Mre=debug -le'BEGIN{print "perl $]";} /a_foo|a_bar/i'

Going back to 5.26.3 we get this:

perl 5.026003 Compiling REx "a_foo|a_bar" Final program: 1: TRIE-EXACTFU (9)
9: END (0) stclass AHOCORASICK-EXACTFU minlen 5 Freeing REx: "a_foo|a_bar" Which means this pattern will be matched by the aho-corasick algorithm (which does not backtrack) without a jump trie. This is nearly as efficient as a DFA would be and executes essentially one operation per character in the string matched and it executes in a relatively low overhead loop. In 5.28 the "split out invariant characters as EXACT" was added, this means we end up with a jump trie because we only build the trie out of the single character EXACTFU "a" nodes. You see two "" lines because it is a jump trie which means it has untrieable nodes which must be executed to complete the match, It is also ahocorasickable, but because it is jump trie, it has to try the EXACT "_" and then "foo" and then "_" and then "bar" for every 'a' it sees. It only has to look at the 'a' once however. Thus its performance is proportional to the length of the string matched + branches * number of times prefix seen. perl 5.028003 Compiling REx "a_foo|a_bar" Final program: 1: TRIE-EXACTFU (15) (4) 4: EXACT <_> (6) 6: EXACTFU (15) (11) 11: EXACT <_> (13) 13: EXACTFU (15) 15: END (0) stclass AHOCORASICK-EXACTFU minlen 5 Freeing REx: "a_foo|a_bar" In 5.34, things are worse in terms of operations, it will basically execute like the jump trie, but it will test for "a" twice per character on top of testing for "_" twice per "a". Thus its performance is number of branches * length of string. perl 5.034001 Compiling REx "a_foo|a_bar" Final program: 1: BRANCH (8) 2: ANYOFM[Aa] (4) 4: EXACT <_> (6) 6: EXACTFU (15) 8: BRANCH (FAIL) 9: ANYOFM[Aa] (11) 11: EXACT <_> (13) 13: EXACTFU (15) 15: END (0) minlen 5 String shorter than min possible regex match (0 < 5) Freeing REx: "a_foo|a_bar" Also notice that because all of these EXACT's are inside of an alternation they are effectively invisible to the fixed string extraction logic. Also anything in an ANYOF is invisible to the fixed string extraction logic as well. This really feels like the wrong direction to go to me. If the pattern contains any alternations at all this kind of mixed EXACTish node type should be disabled. And during parsing we treat every non-alternation as an alternation with one branch, so that might get a bit difficult, as it would require infinite lookahead. If anything these kinds of operations should not happen during the parse phase, but rather in the optimization phase when we know it would be safe to perform because we would know if we are in the patterns critical path. Alternatively we could teach the fixed substring logic to be smarter. This is an ugly problem. I can totally see why this would be useful for a pattern with no alternation at all, but as soon as alternation is used these optimizations will just be pessimizations, especially the ANYOF logic.

Why can't the trie construction code be extended to consider these nodes?

I think it depends on what you mean by this. In a practical sense it is difficult to "decode" a character-class to the set of characters it matches, and if we did we might end up constructing a degenerate tree, especially with unicode in play. Consider the character class [\x{0}-\x{10FFFF}] for instance. When that gets mapped into the trie it ends up becoming something pretty different. In theory it is doable, in practice it is awkward. There is another sense where we could take the trie logic and extend it to be a DFA. That is even more practically demanding.

The EXACT nodes for things like a colon could be joined in with adjacent EXACTFU nodes after the determination of the places where a colon must appear.

Yes, some of these things could be done. But the trie logic predates these new EXACT type nodes, and I guess when they were added the trie code wasnt accounted for (one way or another).

The ANYOFM optimization is done during parsing only if an EXACTFish node isn't suitable. There is further conversion to ANYOFM in study_chunk(), after trie construction code, but I suppose that due to the recursive nature of study_chunk, it could actually happen before the trie code

The logic that converts a trie to a character-class is safe, as a character class is indistinguishable from a trie with a depth of 1.

At one level you could argue that ANYOF and TRIE's should be unified into a single data structure. But there is an underlying tension on how you want to represent them, for instance a TRIE is larger and slower than an ANYOF for many cases, but it handles cases that ANYOF does not.

If you examine the progress of /i pattern matching, you will see frequent recomputation of folds each time a backtrack occurs. I don't know how to cache the results to prevent that. But it turns out that about half the folds in Unicode are such that they could be matched by using a mask. I envision an EXACTFUM node which has two strings. p and m, such that if target t would match iff (t & m) eq p. This would work for UTF-8 patterns and targets without having to compute the code points at all. For long enough patterns, this could be done per-word, leading to a dramatic speed up, at the cost of doubling the size of the node. An EXACTFUM node would subsume adjacent EXACT and EXACTF-ish ones, with the mask for the EXACT bytes being FF.

I am not sure I follow this to be honest. I have been sick the last couple of weeks and frankly my brain is not dev functional yet.

All i know is that unless the trie engine is taught about these special exact types many cases which optimize a simple pattern like /foo/ will pessimize when it becomes /foo|bar/. Idealy we would use one datastructure for all of these things, but that way lies full DFA construction, and we know a bunch of things we do to optimize matching arent compatible with that model.

(Another optimization of speed wrt time would be to have a node contain two strings if the UTF8ness of the target mattered. One string for byte matching; the other for UTF-8. Then the representations wouldn't have to continually be recomputed at run time backtracking.)

I don't know how this would fit in with tries. I haven't looked at those internals in a long time. But I do notice that there is code to do run-time folding. That made sense to me at one time, but no longer does. I would think that folds would be baked into the trie table at compilation time.

I suppose in an ideal world they would be. But consider how many entries in a trie we would have to make to match a string of k characters case insensitively with simple ASCII case folding rules: we would need to make 2^k keys being added to the trie so we would need 6 entries to store a short key like "foo". This would make trie construction extremely expensive even if you assume only a limited set of characters will have case folded equivalents, all you need is a long string of them and you have a very high construction cost. Unicode makes things worse by allowing for mappings that are multiple code-point and I think it has cases where the set is larger than 2, making the cost even higher. Just consider that "\x{DF}" matches at least 5 different strings and you can see the issue. Another factor is that one of the contributors to the memory costs in a trie is the number of distinct codepoints it holds as we need to store a mapping from codepoint to "id" used by the trie, and by using the folded version we reduce the number of codepoints in the result. Again using "\x{DF}", if we fold first then we /never see/ such a codepoint during the match process, and never have to account for it matching a multi-codepoint sequence 'sS' for instance, which would be another sequence we would never see, nor have to account for.

FWIW, the trie doesn't backtrack. It shouldn't fold the same character twice unless preceding logic has backtracked which caused a new invocation of the TRIE. It does support "backtracking" in the sense that it might pile up possibly accepting states and go through them in the correct order, but that shouldn't involve refolding input.

cheers, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

demerphq commented 2 years ago

On Tue, 11 Oct 2022 at 21:13, Hugo van der Sanden @.***> wrote:

I'd still love to see some way of giving the user control over which optimizations are performed, either lexically or at the regexp level (maybe even at the fractional regexp level). That would make workarounds for such pessimizations possible, and allow us usefully to make new optimizations possible that are disabled by default either because they aren't always useful or because they don't always give correct results. The tried code can be disabled fwiw. But really optimizations shouldn't ever be pessimisations, users shouldn't need to know how the internals work.

I'm not sure if you are saying here "no, we should never give users control over which optimizations are performed", but this response can certainly be read that way.

That is not what I am saying.:-) I am saying that disabling optimizations in the regex engine (or enabling them) should be a very rarely needed tool in the toolbox, and that users should not generally need to know that "for that type of pattern/string you need to disable this optimization because it's broken and turns into a pessimization". Sure, having an escape hatch that disables various optimizations would be nice, but mostly for debugging and for when things go horribly wrong in unexpected ways. I would hate it if documentation that "you need disable these optimizations" became prevalent. That just says we are doing things wrong internally.

Users know their own data far better than we ever can. They can discover which strategies work for them by trial and error (ie benchmarking) without needing to know how the internals work. And some users have advanced needs that do justify understanding how the internals work.

I guess my view on this is somewhat contradictory. On one level I totally agree, a user should know enough about the details to know how to get the best results. But at the same time I think a user shouldnt have to know many things to get the best behavior out of the tools we provide, especially when at the theoretical level there neednt be any difference. So for instance in an ideal world the patterns /foo|Foo|bar|Bar/ and /[fF]oo|[bB]ar/ should compile down to the same internal representation and have exactly the same performance. The users choice in how to spell things really shouldnt matter.

There are relatively few cases where I think a user should know about and control the optimizer for the regex engine, or where they should know about how the implementation works. It should just work, and its average case should be "good enough". Every place where spelling a pattern one way instead of a supposedly equivalent different way that results in performance changes is IMO a bug in the engine implementation. An unfortunate situation we have to deal with by necessity, not because we think its ok to require users to do such tuning.

BTW, ironically one of the few cases where I do think that the user should have some control is actually not about disabling optimizations, but rather about enabling them, or perhaps more specifically about allowing a user to override a default disabling that the regex engine itself does. There are a bunch of optimizations that are disabled by (?{...}) and (??{ ... }) blocks. I would love to expose a way to prevent this disablement. This is a special case because we are moving outside of the declarative realm of regex processing into the world of imperative coding with side-effects, which means there can be more than one possible reasonable answer to certain questions, especially in the context of partial matches. Consider /A(?{...})B/ and a string S which matches A but does not match B, should the (?{...}) execute or not? This is especially problematic when introducing new optimizations where if we did not disable them in the face of (?{..}) we would change existing logic so that code that used to execute (?{...}) (because B was not tested early) no long does because the B test was already performed.

Beyond that, there are always going to be bugs in the regexp engine.

Yeah, this is the main place where I think disabling optimizations would have value and is the reason I made it possible to disable the TRIE logic. As long as we all agree that anytime someone /must/ disable an optimization is a bug and thus exceptional then I am fine with us providing ways to disable things.

Right now, it seems to be semi-common knowledge that you can work around most optimization bugs by inserting an optimization-defeater like '(?=)', and over the years I've often had to recommend such tricks (and resort to them). Knowledge over how to make a non-buggy case even faster is much thinner on the ground - that's reserved to experts like you.

Does that defeat optimizations? :-) If I knew I had forgotten. :-) The one that disables the most is (?{ }) I think. It would be interesting to check. :-)

cheers, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"