andgineer / TRegExpr

Regular expressions (regex), pascal.
https://regex.sorokin.engineer/en/latest/
MIT License
174 stars 63 forks source link

Crash when searching inside huge line (len=130K) #388

Closed Alexey-T closed 5 months ago

Alexey-T commented 6 months ago

Test project in Lazarus. Adjust full path to TRegExpr in

tst-tregexpr-crash-huge-line.zip

I use FPC 3.2.3 and Linux x64. It crashes when I press button on the demo form, somewhere in MatchPrim(). Call stack:

Screenshot from 2024-05-12 18-33-23

Alexey-T commented 6 months ago

Maybe I can ask @User4martin to see it, please?

User4martin commented 6 months ago

Not sure how easy that is to fix.

With ModifierS:= false the .+ generates a non-optimized loop (OP_branch and OP_BACK). If that can be changed to an OP_PLUS then it's fixable. I'll have a look later.

Otherwise, its not going to be a quick fix.


General loop code must be able to backtrack. So for each char that the . advances it must recursively call MatchPrim (so on failure the outer caller can try to continue).

On my PC the stack overflows after about 95,000 chars.


That said, if you do more generic loops (a|b|.)+: That will crash on a string of such length. And to a degree that is an issue with how most regex engines work.

IIRC There is an option to enable recursion checks, and turn it into an error.

Alexey-T commented 6 months ago

Yes, there is DEFINE for stack.

{$DEFINE RegExpWithStackOverflowCheck}

Screenshot from 2024-05-12 21-05-36

I guess no need to fix it then.

User4martin commented 6 months ago

I did look a bit deeper... Something strange.

In FindRepeated is the following comment

        // note - OP_ANY_ML cannot be proceeded in FindRepeated because can skip
        // more than one char at once

But that is (if at all) only true if {$DEFINED UNICODEEX}. And in that case, that is also true for the normal OP_ANY, or am I missing something?

So OP_STAR, OP_PLUS, OP_BRACES => need to check for surrogates when backtracing. Probably other fixes, "IncUnicode2" does not count correctly.

But OP_ANY_ML as far as I can see could be added to FindRepeated, or is there something else.

This needs a (or several) new issues....

Alexey-T commented 6 months ago

But that is (if at all) only true if {$DEFINED UNICODEEX}

It can skip 2 chars: CR with following LF, to skip CR LF.

that is also true for the normal OP_ANY, or am I missing something?

no. Normal OP_ANY don't skip CR LF.

User4martin commented 6 months ago

It can skip 2 chars: CR with following LF, to skip CR LF.

Ah, ok, Then it is broken. It does indeed test for IsPairedBreak, but then only increments the regInput by one.

Also if fixed, it may be possible that backtracking can check if it has to do a paired break. If it actually needs to. Should it also match a single 0a or 0d ? Then the matching is just for the count. And the count needs to be fixed for surrogates already, and could then be extended for this.

User4martin commented 6 months ago

The question is, whould .. on the text #$0D#$0A match the 2 individual chars? Or will it only try the combined?

Because if the text is #$0D#$09 it will match the individual 0d followed by the tab.

If it does match either combined or individual, then it is (probably the only) non-loop that needs backtracking....


As far as I can tell on regex101.com if there is a 0D0A then the . always eats both, and will not match the single 0D.

That is the text 'a'#$0D#$0A'b' will be matched by a.b, but will not be matched by a..b.

In other words, the pattern a.b will match

The pattern a..b will match

Alexey-T commented 6 months ago

Should it also match a single 0a or 0d ?

Yes, if CF is not followed by LF. AFAIR.

Then the matching is just for the count. And the count needs to be fixed for surrogates already, and could then be extended for this.

To be like in other engines, so .{4} matches exactly 4 unicode points? Ok.

Alexey-T commented 6 months ago

In other words, the pattern a.b will match

* `'a'#$0A'b'`

* `'a'#$0D'b'`

* `'a'#$0D#$0A'b'`

The pattern a..b will match

* `'a'#$0A#$0A'b'`

* `'a'#$0D#$0D'b'`

* `'a'#$0A#$0D'b'`  // reverse order

* `'a'#$0D#$0A#$0D#$0A'b'`

* BUT NOT `'a'#$0D#$0A'b'`  (seen always as a single line break)

Our code must match like this too, AFAIR (this needs to be confirmed).

User4martin commented 6 months ago

I don't know if the browser can be forced to specific linebreaks... So testing on online regex pages may be flawed.

I did run a test with perl (Win and Linux)

perl -wle 'use strict; my $a = "a".chr(13).chr(10)."b"; print $a =~/a.b/ ? 1 : 0;  print $a =~/a..b/ ? 1 : 0;  print $a =~/a.b/s ? 1 : 0; print $a =~/a..b/s ? 1 : 0; print'
0
0
0
1

perl -wle 'use strict; my $a = "a".chr(10).chr(11)."b"; print $a =~/a.b/ ? 1 : 0;  print $a =~/a..b/ ? 1 : 0;  print $a =~/a.b/s ? 1 : 0; print $a =~/a..b/s ? 1 : 0; print'
0
0
0
1

perl -wle 'use strict; my $a = "a".chr(13)."b"; print $a =~/a.b/ ? 1 : 0;  print $a =~/a..b/ ? 1 : 0;  print $a =~/a.b/s ? 1 : 0; print $a =~/a..b/s ? 1 : 0; print'
1
0
1
0

perl -wle 'use strict; my $a = "a".chr(10)."b"; print $a =~/a.b/ ? 1 : 0;  print $a =~/a..b/ ? 1 : 0;  print $a =~/a.b/s ? 1 : 0; print $a =~/a..b/s ? 1 : 0; print'
0
0
1
0

It treats 0D0A as 2 individual linebreaks.

A 0D is matched even without /s

But perl is a very particular flavour...

Alexey-T commented 6 months ago

Yes, Perl is not the truth source for us, we must make it like in majority of RE engines. This site shows features of many engines: https://www.regular-expressions.info/

User4martin commented 6 months ago

Node.js

> var s = /a.b/s
undefined
> s.exec("a\nb");
[ 'a\nb', index: 0, input: 'a\nb', groups: undefined ]
> s.exec("a\rb");
[ 'a\rb', index: 0, input: 'a\rb', groups: undefined ]
> s.exec("a\n\rb");
null
> s.exec("a\r\nb");
null
> var s = /a..b/s
undefined
> s.exec("a\n\rb");
[ 'a\n\rb', index: 0, input: 'a\n\rb', groups: undefined ]
> s.exec("a\r\nb");
[ 'a\r\nb', index: 0, input: 'a\r\nb', groups: undefined ]

also only seems to match individual \n or \r but not a sequence of both.

Alexey-T commented 6 months ago

Interesting. Does site https://www.regular-expressions.info/ tell about other flavors , how they handle it?

Alexey-T commented 6 months ago

No, site does not tell this. Ok, feel free to optimize the code so . matches only one EOL char: 0D or 0A (or other allowed Unicode EOL).

Alexey-T commented 6 months ago

I see, we already do as needed! no need to change code.

  Subj:= 'a'#13#10'b';
  Regex:= '(?s)a..b'; 

this matches.

User4martin commented 6 months ago

I see, we already do as needed!

Just in a very unoptimized way.

The code checks for the combined break, and if found consumes only the first of the 2 chars. And if not found checks if the first char occurs on its own...

I do think however that it is correct for . to consume only one char. => /s simply says that linebreak chars are matched by .. But not that . matches a multi char line break as single char.

If someone needs the latter, they can feature request it.


Given that currently only one char is consumed, the original issue can be optimized by moving op_any_ml to be handled by op_plus.

Alexey-T commented 6 months ago

Can you make the PR, please?

User4martin commented 6 months ago

I'll look into it. It may be a while though.

User4martin commented 5 months ago

Ok, I added the changes.

It should actually not make any behavioural difference.

The check for IsPairedBreak was only to fail the match. The match still fails on the #13 of the #13#10. All that changes is the position of the fail. (EDIT: actually, not even that)

That also means that the OP_ANY_ML has always only matched a single char. And could have been simple all the time.

OP_ANY on the other hand (which does match line breaks) has never checked for paired-breaks, and never matched them as one (nor should it).


IsPairedBreak is now only used in OP_BOL_ML / OP_EOL_ML. Those I have not touched, nor reviewed, nor compared in behaviour to other re-engines.

Alexey-T commented 5 months ago

seems it is solved by https://github.com/andgineer/TRegExpr/pull/389 , closing.