Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

Ancient Regex Regression #16616

Open p5pRT opened 5 years ago

p5pRT commented 5 years ago

Migrated from rt.perl.org#133352 (status was 'open')

Searchable as RT133352$

deven commented 2 years ago

There were a lot of changes to the regex engine between 5.000alpha9 and 5.000alpha12h, including:

deven commented 2 years ago

Overall, between regexp.h, regcomp.h, regcomp.c and regexec.c, there were 408 lines deleted and 647 lines added.

demerphq commented 2 years ago

I guess this thread got stalled?

Dave Mitchell asked:

"I know a way to make it much faster, but it involves, for every BRANCH/BRANCHJ/TRIE/TRIEC node, knowing the current capture index - i.e. in the same way that OPEN nodes have 'n = ARG(scan);' but I don't know how this can be stored.

Yves, is there a type of node - or a way to extend the current nodes - such that a 32-bit capture index can be stored as part of each of these nodes at compile time? Without breaking everything?"

Not really, unfortunately. In theory you can change the structures that are used to implement BRANCH, BRANCHJ, TRIE, and TRIEC and run make regen. In practice this can break a lot, because a bunch of code in the regex compiler makes unhealthy assumptions about the size of regops. I just tried a simple fix of changing BRANCH to be regnode_1 and BRANCHJ to be regnode_2L, and it broke a ton of stuff. TRIE/TRIEC are likely easier to deal with.

I have put this on my todo list, but my time availability will be limited in the near future.
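To make the size problem concrete, here is a minimal Python sketch (using ctypes, not Perl's actual C source) of why widening a regop can break code that assumes a fixed size. The field layouts are an assumption modelled on the regnode structs in regcomp.h, and hardcoded_skip and sized_skip are hypothetical helpers for illustration, not functions in the regex compiler.

```python
import ctypes

# Assumed common header shared by every regnode (illustrative field names).
BASE_FIELDS = [
    ("flags", ctypes.c_uint8),
    ("type", ctypes.c_uint8),
    ("next_off", ctypes.c_uint16),
]


class Regnode(ctypes.Structure):
    """Model of the smallest regnode: header only."""
    _fields_ = BASE_FIELDS


class Regnode1(ctypes.Structure):
    """Model of regnode_1: header plus one 32-bit argument."""
    _fields_ = BASE_FIELDS + [("arg1", ctypes.c_uint32)]


class Regnode2L(ctypes.Structure):
    """Model of regnode_2L: header plus a 32-bit and a signed 32-bit argument."""
    _fields_ = BASE_FIELDS + [("arg1", ctypes.c_uint32), ("arg2", ctypes.c_int32)]


def size_in_regnodes(struct_type):
    """Size of a node type measured in base-regnode slots."""
    return ctypes.sizeof(struct_type) // ctypes.sizeof(Regnode)


def hardcoded_skip(node_offset):
    """Hypothetical fragile code: assumes a BRANCH always occupies one slot."""
    return node_offset + 1


def sized_skip(node_offset, struct_type):
    """Hypothetical robust code: asks the node type how many slots it uses."""
    return node_offset + size_in_regnodes(struct_type)


if __name__ == "__main__":
    for t in (Regnode, Regnode1, Regnode2L):
        print(t.__name__, size_in_regnodes(t), "slot(s)")
    # A BRANCH at offset 11 stored as a one-argument node:
    print(hardcoded_skip(11), sized_skip(11, Regnode1))
```

Under these assumptions, a BRANCH widened to the one-argument layout takes two slots, so any code that steps over it with a hard-coded "+ 1" lands in the middle of the node instead of on its first operand.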

Yves

demerphq commented 2 years ago

Hi, I wanted to give an update. I have been working on a patch that makes it much easier to resize regnodes, and I am implementing the change requested in this ticket. I will follow up with a PR next week.

demerphq commented 2 years ago

Hi @iabyn

I now have a branch which stores the number of capture buffers that were opened before it, but reading your request again I am not sure I understood properly what you wanted. The branch is yves/regnode_typedefs, which is not an ideal name, but there is a bunch of stuff in the branch and some of it relates to this. The branch is messy as heck and needs some rebase -i reorganization, squashing, etc., but it does work, and it adds support for detecting when changing the size of a regnode would break things.

It changes the type of a BRANCH regop from struct regnode to struct regnode_1, and the BRANCHJ regop from struct regnode_1 to struct regnode_2L. I could not change the size of the TRIE regops, as that would mean we would also have to make all of the EXACT regops larger. Instead I have added a new member to the struct _reg_trie_data, which is stored in the data array and accessible that way. Because of this it doesn't make sense to make TRIEC nodes larger; they can just use the same infra as the TRIE ones would.

In the output below, the new data appears as (buf:2) and similar.

$ ./perl -Ilib -Mre=debug -e'/()()(?:[fF]oo()|[bB]ar()|[bB]az())([xX]|[yY])/'
Compiling REx "()()(?:[fF]oo()|[bB]ar()|[bB]az())([xX]|[yY])"
Final program:
   1: OPEN1 (4)
   3:   NOTHING (4)
   4: CLOSE1 (6)
   6: OPEN2 (9)
   8:   NOTHING (9)
   9: CLOSE2 (11)
  11: BRANCH (buf:2) (22)
  13:   ANYOFM[Ff] (15)
  15:   EXACT <oo> (17)
  17:   OPEN3 (20)
  19:     NOTHING (20)
  20:   CLOSE3 (45)
  22: BRANCH (buf:3) (33)
  24:   ANYOFM[Bb] (26)
  26:   EXACT <ar> (28)
  28:   OPEN4 (31)
  30:     NOTHING (31)
  31:   CLOSE4 (45)
  33: BRANCH (buf:4) (FAIL)
  35:   ANYOFM[Bb] (37)
  37:   EXACT <az> (39)
  39:   OPEN5 (42)
  41:     NOTHING (42)
  42:   CLOSE5 (45)
  44: TAIL (45)
  45: OPEN6 (47)
  47:   BRANCH (buf:6) (51)
  49:     ANYOFM[Xx] (55)
  51:   BRANCH (buf:6) (FAIL)
  53:     ANYOFM[Yy] (55)
  55: CLOSE6 (57)
  57: END (0)
minlen 4 
String shorter than min possible regex match (0 < 4)
Freeing REx: "()()(?:[fF]oo()|[bB]ar()|[bB]az())([xX]|[yY])"

and also here:

$ ./perl -Ilib -Mre=debug -e'/()()(?:foo()|bar()|baz())(x|y)/'
Compiling REx "()()(?:foo()|bar()|baz())(x|y)"
Final program:
   1: OPEN1 (4)
   3:   NOTHING (4)
   4: CLOSE1 (6)
   6: OPEN2 (9)
   8:   NOTHING (9)
   9: CLOSE2 (11)
  11: TRIE-EXACT[bf] (buf:2) (39)
      <foo> (15)
  15:   OPEN3 (18)
  17:     NOTHING (18)
  18:   CLOSE3 (39)
      <bar> (24)
  24:   OPEN4 (27)
  26:     NOTHING (27)
  27:   CLOSE4 (39)
      <baz> (33)
  33:   OPEN5 (36)
  35:     NOTHING (36)
  36:   CLOSE5 (39)
  39: OPEN6 (41)
  41:   TRIE-EXACT[xy] (buf:6) (49)
        <x> 
        <y> 
  49: CLOSE6 (51)
  51: END (0)
minlen 4 
String shorter than min possible regex match (0 < 4)
Freeing REx: "()()(?:foo()|bar()|baz())(x|y)"

As I said, I am not entirely sure I am storing what you expect. For instance, what should /(foo|bar()|baz()()|bop)/ show for the branches? My current implementation shows this:

$ ./perl -Ilib -Mre=debug -e'/([fF]oo|[bB]ar()|[Bb]az()()|[bB]op)/'
Compiling REx "([fF]oo|[bB]ar()|[Bb]az()()|[bB]op)"
Final program:
   1: OPEN1 (3)
   3:   BRANCH (buf:1) (9)
   5:     ANYOFM[Ff] (7)
   7:     EXACT <oo> (42)
   9:   BRANCH (buf:1) (20)
  11:     ANYOFM[Bb] (13)
  13:     EXACT <ar> (15)
  15:     OPEN2 (18)
  17:       NOTHING (18)
  18:     CLOSE2 (42)
  20:   BRANCH (buf:2) (36)
  22:     ANYOFM[Bb] (24)
  24:     EXACT <az> (26)
  26:     OPEN3 (29)
  28:       NOTHING (29)
  29:     CLOSE3 (31)
  31:     OPEN4 (34)
  33:       NOTHING (34)
  34:     CLOSE4 (42)
  36:   BRANCH (buf:4) (FAIL)
  38:     ANYOFM[Bb] (40)
  40:     EXACT <op> (42)
  42: CLOSE1 (44)
  44: END (0)
minlen 3 
String shorter than min possible regex match (0 < 3)
Freeing REx: "([fF]oo|[bB]ar()|[Bb]az()()|[bB]op)"
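The (buf:N) values in the BRANCH dumps above are consistent with "the number of capture groups opened before this branch starts". The following Python sketch is a standalone model of that counting rule, not code from the branch: branch_buf_counts is a hypothetical helper, it only understands plain (...) and (?:...) groups, character classes, and backslash escapes, and it assumes the parentheses are balanced.

```python
def branch_buf_counts(pattern):
    """For each alternation branch, in order of appearance, return how many
    capture groups have been opened before that branch starts."""
    opened = 0                 # capture groups opened so far
    results = []               # (start position, opened count) per branch
    stack = [[(0, 0)]]         # one frame per enclosing group; frame = its branches
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\":         # escaped character: skip it entirely
            i += 2
            continue
        if ch == "[":          # character class: skip to the closing bracket
            i += 1
            if i < n and pattern[i] == "^":
                i += 1
            if i < n and pattern[i] == "]":   # a leading ] is a literal
                i += 1
            while i < n and pattern[i] != "]":
                i += 2 if pattern[i] == "\\" else 1
            i += 1
            continue
        if ch == "(":
            if pattern.startswith("(?:", i):
                i += 3         # non-capturing group: does not bump the count
            else:
                opened += 1    # capturing group
                i += 1
            stack.append([(i, opened)])
            continue
        if ch == "|":
            stack[-1].append((i + 1, opened))
        elif ch == ")":
            frame = stack.pop()
            if len(frame) > 1:  # only groups that really alternate have branches
                results.extend(frame)
        i += 1
    frame = stack.pop()
    if len(frame) > 1:
        results.extend(frame)
    return [count for _, count in sorted(results)]


if __name__ == "__main__":
    print(branch_buf_counts("([fF]oo|[bB]ar()|[Bb]az()()|[bB]op)"))  # [1, 1, 2, 4]
```

For the first pattern in this comment the same function gives 2, 3, 4, 6, 6, matching the five BRANCH nodes. In the TRIE-optimized dump a single TRIE node replaces each set of branches, so only one value per alternation (2, then 6) is shown there.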

FWIW, I am highly tempted to change the debug output so that the "next regop" is shown as -> \d and not (\d) as it would be nice to use the parens for something more useful.
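As a rough preview of what that could look like, here is a small Python sketch that post-processes existing debug lines rather than changing the dumper itself. arrow_style is a hypothetical helper, and rendering (FAIL) as "-> FAIL" is my assumption; nothing here reflects what the final output format would be.

```python
import re

# Matches only a trailing "(digits)" or "(FAIL)", so "(buf:2)" and
# diagnostics such as "(0 < 4)" are left alone.
NEXT_REGOP = re.compile(r"\((\d+|FAIL)\)\s*$")


def arrow_style(line):
    """Rewrite the trailing next-regop marker from "(N)" to "-> N"."""
    return NEXT_REGOP.sub(r"-> \1", line)


if __name__ == "__main__":
    sample = [
        "  11: BRANCH (buf:2) (22)",
        "  13:   ANYOFM[Ff] (15)",
        "  33: BRANCH (buf:4) (FAIL)",
        "String shorter than min possible regex match (0 < 4)",
    ]
    for line in sample:
        print(arrow_style(line))
```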

@khwilliamson and @deven you may find this interesting also.

demerphq commented 2 years ago

BTW, @iabyn, the smoke-me/davem/captures branch was deleted. I have rebased it and will repush it to the repo to save you from doing it yourself.

deven commented 2 years ago

@demerphq For whatever my opinion may be worth, I think that's an excellent idea, to display the "next regop" with -> instead of parens. Apart from freeing up the parens for other purposes, it would be much more intuitive in the first place!

deven commented 2 years ago

I had hoped to take a stab at implementing a production-ready fix for this bug, which I knew my example patch was not, even though it did fix the bug. But after the work you guys have done, which I haven't even absorbed yet, I'm not sure there's much left for me to help with on the implementation now?

demerphq commented 2 years ago

@iabyn I have repushed your branch as smoke-me/davem/captures_rebased; it is rebased on top of blead from a few minutes ago. I will start cleaning up my branch to enable tracking which parens we are in.

demerphq commented 2 years ago

@deven, I mostly wanted you to know that your work on this was not for nothing and that it is back in play. One way or another I plan to get either your patch, Dave's, or Dave's improved patch in for 5.38. We are too close to 5.36 to risk it, but I do want to get this fixed! Also, I figured you might have something to add to the discussion, as you did all that work understanding how things work.

iabyn commented 2 years ago

On Thu, Apr 07, 2022 at 07:15:31AM -0700, Yves Orton wrote:

> I now have a branch which stores the number of capture buffers were opened before it, but reading your request again I am not sure I understood what you wanted properly.

The original ticket was from so long ago that I've forgotten most details, but from what I recall:

1) the issue involves the saving and restoring of capture info on repeats / alternations in a way that is hard to understand, does my head in, and will take a while for me to get my head back around.

2) I had a potential fix for the bug, but it made some things slower, so wasn't ideal.

3) I knew a better way to fix things but it involved needing access to the current capture index in some places where that info wasn't currently held. Hopefully your branch solves that issue, which means that my better method may now work.

4) However, I now have no idea what my better method was. See (1) about taking a while to get my head round it again.

So I'll add this to my list of things to look at, but it won't be any time very soon.

-- Monto Blanco... scorchio!

demerphq commented 1 year ago

I will start looking into this again soon.