regex engine slowdown bug

p5pRT commented 14 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 13 years ago

From @powerman

I've just hit similar issue again. This time I managed to prepare small example. I'm not 100% sure this is same bug\, but it looks similar:

$ time perl -e '$s="\"x10000; 1 while $s =~ /\<X(?-i:>)(.*?)(?=\<a)/gis'

real 0m0.003s user 0m0.001s sys 0m0.001s

$ time perl -e '$s="\"x10000; 1 while $s =~ /\<X[^>]*(?-i:>)(.*?)(? =\<a)/gis'

real 0m0.839s user 0m0.837s sys 0m0.001s

This happens both on 5.8.8 and 5.12.2

p5pRT commented 13 years ago

From @powerman

Срд. Мар. 30 06:22:23 2011\, powerman писал:

I've just hit similar issue again. This time I managed to prepare small example. I'm not 100% sure this is same bug\, but it looks similar:

Sorry\, one of examples was wrong. Here are correct examples:

$ time perl -e '$s="\"x10000; 1 while $s =~ /\<X[^>]*(?-i:>)(.*?)(?=\<a)/gis'

real 0m0.842s user 0m0.836s sys 0m0.002s

$ time perl -e '$s="\"x10000; 1 while $s =~ /\<X[^>]*(?i-:>)(.*?)(?=\<a)/gis'

real 0m0.003s user 0m0.003s sys 0m0.000s

This happens both on 5.8.8 and 5.12.2

p5pRT commented 13 years ago

From @demerphq

On 24 July 2010 17:46\, Dave Mitchell \davem@iabyn\.com wrote:

The original of this bug report didn't make it to the p5p mailing list; presumably because the original test script included 0.5Mb of sample HTML. The following demonstrates the same slowdown while generating its own sample HTMl data:

#!/usr/bin/perl use Time::HiRes qw( time );

my $html = qq{\
\n} x 30_000;

sub try { my ($re) = @_; my $t = time; $html =~ /$re/; warn sprintf "%7.4f sec\, regexp is %s\n"\, time-$t\, $re; }

try(qr{\<div class="marketsort[^>]*(?-i:>)\s*}ms); try(qr{\<div class="marketsort[^>]*(?:>)\s*}ims); try(qr{\<div class="marketsort[^>]*(?-i:>)}ims); try(qr{\<div\sclass="marketsort[^>]*(?-i:>)\s*}ims); try(qr{\<div class="marketsort[^>]*(?-i:>)\s*}ims); warn "FINISHED\n";

which on blead gives:

0.0004 sec\, regexp is (?ms-xi:\<div class="marketsort[^>]*(?-i:>)\s*) 0.0041 sec\, regexp is (?msi-x:\<div class="marketsort[^>]*(?:>)\s*) 0.0048 sec\, regexp is (?msi-x:\<div class="marketsort[^>]*(?-i:>)) 0.0163 sec\, regexp is (?msi-x:\<div\sclass="marketsort[^>]*(?-i:>)\s*) 61.5914 sec\, regexp is (?msi-x:\<div class="marketsort[^>]*(?-i:>)\s*) FINISHED

The time of the last pattern is quadratic on RHS of 'x' in the $html assignment.

I haven't looked into any further than that.

We never fixed the superlinear cache bug that was broken some time around when the engine was derecursivized.

Id bet a dollar this is related.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 13 years ago

From @powerman

Чтв. Мар. 31 01:16:36 2011\, demerphq писал:

We never fixed the superlinear cache bug that was broken some time around when the engine was derecursivized.

Id bet a dollar this is related.

BTW\, your YAPH signature remind me about -Mre=debug\, and I tried to compare these regexps this way. Results are interesting\, but\, sadly\, I don't know perl internals good enough to understand them:

$ perl -Mre=debug -e "/\<X[^>]*>(.*?)(?=\<a)/gis" and $ perl -Mre=debug -e "/\<X[^>]*(?i-:>)(.*?)(?=\<a)/gis" produce same output and work fast\, but this one differs and slow: $ perl -Mre=debug -e "/\<X[^>]*(?-i:>)(.*?)(?=\<a)/gis"

Last line in first two (fast) regex was: stclass EXACTF \<\ minlen 3 but in last (slow) regex it was: floating ">" at 2..2147483647 (checking floating) stclass EXACTF \<\ minlen 3

Right now we've huge amount of parsers working 24x7 with thousands of different regexps\, and it's SLOW. Can anyone give some recommendations about how to work around this issue - like "add /i flag to every regex whenever possible"\, etc.? Because for now I'm even doesn't sure adding this redundant /i doesn't slowdown one regexp while speeding up another…

WBR\, Alex.

P.S. I hope one lucky day we'll find libre2 support in perl\, using some pragma to switch between current perl regexp engine and libre2 - they are mostly compatible\, but libre2 guarantee constant execution time and I'm really tired of regexps which at some moment decide to spend few minutes for execution instead of nanoseconds.

p5pRT commented 13 years ago

From @dgl

On Thu\, Mar 31\, 2011 at 03:05:52AM -0700\, Alex Efros via RT wrote: [..]

P.S. I hope one lucky day we'll find libre2 support in perl\, using some pragma to switch between current perl regexp engine and libre2 - they are mostly compatible\, but libre2 guarantee constant execution time and I'm really tired of regexps which at some moment decide to spend few minutes for execution instead of nanoseconds.

Not wanting to hijack this bug's thread but see http://search.cpan.org/perldoc?re::engine::RE2 (feel free to let me know if you have issues).

p5pRT commented 13 years ago

From @demerphq

On 31 March 2011 12:05\, Alex Efros via RT \perlbug\-followup@perl\.org wrote:

Чтв. Мар. 31 01:16:36 2011\, demerphq писал:

We never fixed the superlinear cache bug that was broken some time around when the engine was derecursivized.

Id bet a dollar this is related.

BTW\, your YAPH signature remind me about -Mre=debug\, and I tried to compare these regexps this way. Results are interesting\, but\, sadly\, I don't know perl internals good enough to understand them:

$ perl -Mre=debug -e "/\<X[^>]*>(.*?)(?=\<a)/gis" and $ perl -Mre=debug -e "/\<X[^>]*(?i-:>)(.*?)(?=\<a)/gis" produce same output and work fast\, but this one differs and slow: $ perl -Mre=debug -e "/\<X[^>]*(?-i:>)(.*?)(?=\<a)/gis"

Last line in first two (fast) regex was: stclass EXACTF \<\ minlen 3 but in last (slow) regex it was: floating ">" at 2..2147483647 (checking floating) stclass EXACTF \<\ minlen 3

Right now we've huge amount of parsers working 24x7 with thousands of different regexps\, and it's SLOW. Can anyone give some recommendations about how to work around this issue - like "add /i flag to every regex whenever possible"\, etc.? Because for now I'm even doesn't sure adding this redundant /i doesn't slowdown one regexp while speeding up another...

You are encountering an optimisation that normally speeds things up by allowing perl to fail fast.

This is called the "longest constant string" optimization. During the optimization phase we scan the pattern and calculate the position\, and value of the longest constant strings which are at a constant and floating offset.

So in your "slow" pattern that would be for constant offset '\<X' and for floating offset ' >'.

For various reasons we prefer the floating string for this check.

My guess is that the performance impact here is that your data is long has a LOT of '>' in it\, which would then result in perl scanning to find the _last_ (iirc). Which is probably not the right thing.

Now for various reasons this optimisation is disabled with a case insensitive match. Which means that perl no longer sees the ">" and no longer considers it a viable optimisation target\, and therfore goes with a different strategy which is to look for the "\<X" and then match from there. Your data probably makes this a much better strategy.

Its hard to say what to do here. *Most* times this optimization is highly useful. In your case clearly not.

There certainly is not a generic solution here. For instance enabling /i might make a few of your patterns faster\, but likely will slow all the rest down.

You will have to do analysis.

Also some of your patterns are potentially super-linear. Which we used to catch at the perl level.

However if done right you will find things like (?>....) will change them to being linear.

For various reasons Perl's engine is not a true regular expression engine in the formal sense\, and as such you sometimes have to manually provide it hints so it doesnt backtrack excessively.

WBR\, Alex.

P.S. I hope one lucky day we'll find libre2 support in perl\, using some pragma to switch between current perl regexp engine and libre2 - they are mostly compatible\, but libre2 guarantee constant execution time and I'm really tired of regexps which at some moment decide to spend few minutes for execution instead of nanoseconds.

While i havent checked IMO libre2 cannot make that commitment and support the full syntax of perl.

Also\, I do not believe that a proper NFA/DFA can provide left-most longest semantics.

However\, we have a regex plug in interface in Perl\, and have done so since Perl 5.10

Perhaps your company should hire somebody to implement a regex plug in that uses libre2.

cheers\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 13 years ago

From @demerphq

On 1 April 2011 10:57\, David Leadbeater \dgl@dgl\.cx wrote:

On Thu\, Mar 31\, 2011 at 03:05:52AM -0700\, Alex Efros via RT wrote: [..]

P.S. I hope one lucky day we'll find libre2 support in perl\, using some pragma to switch between current perl regexp engine and libre2 - they are mostly compatible\, but libre2 guarantee constant execution time and I'm really tired of regexps which at some moment decide to spend few minutes for execution instead of nanoseconds.

Not wanting to hijack this bug's thread but see http://search.cpan.org/perldoc?re::engine::RE2 (feel free to let me know if you have issues).

\ Many alternatives

Matching a string against a regexp with 17\,576 alternatives (aaa .. zzz).

This uses trie matching on Perl (obviously RE2 does similar by default).

$ perl misc/altern.pl Rate re re2 re 52631/s -- -91% re2 554938/s 954% -- \

This one is frustrating. We have to consider a lot of cases that RE2 doesnt.

Nevertheless\, we should be able to speed this up considerably. For various reasons we chose an implementation that was designed to handle an uncommon worse case scenario\, and not the common one\, and pay for it in terms of speed.

Cheers\, yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 13 years ago

From @iabyn

On Thu\, Mar 31\, 2011 at 10:15:55AM +0200\, demerphq wrote:

On 24 July 2010 17:46\, Dave Mitchell \davem@iabyn\.com wrote:

The original of this bug report didn't make it to the p5p mailing list; presumably because the original test script included 0.5Mb of sample HTML. The following demonstrates the same slowdown while generating its own sample HTMl data:

#!/usr/bin/perl use Time::HiRes qw( time );

my $html = qq{\
\n} x 30_000;

sub try { my ($re) = @_; my $t = time; $html =~ /$re/; warn sprintf "%7.4f sec\, regexp is %s\n"\, time-$t\, $re; }

try(qr{\<div class="marketsort[^>]*(?-i:>)\s*}ms); try(qr{\<div class="marketsort[^>]*(?:>)\s*}ims); try(qr{\<div class="marketsort[^>]*(?-i:>)}ims); try(qr{\<div\sclass="marketsort[^>]*(?-i:>)\s*}ims); try(qr{\<div class="marketsort[^>]*(?-i:>)\s*}ims); warn "FINISHED\n";

which on blead gives:

0.0004 sec\, regexp is (?ms-xi:\<div class="marketsort[^>]*(?-i:>)\s*) 0.0041 sec\, regexp is (?msi-x:\<div class="marketsort[^>]*(?:>)\s*) 0.0048 sec\, regexp is (?msi-x:\<div class="marketsort[^>]*(?-i:>)) 0.0163 sec\, regexp is (?msi-x:\<div\sclass="marketsort[^>]*(?-i:>)\s*) 61.5914 sec\, regexp is (?msi-x:\<div class="marketsort[^>]*(?-i:>)\s*) FINISHED

The time of the last pattern is quadratic on RHS of 'x' in the $html assignment.

I haven't looked into any further than that.

We never fixed the superlinear cache bug that was broken some time around when the engine was derecursivized.

Id bet a dollar this is related.

You owe me one dollar ;-)

The superlinear cache is only used with CURLYX/WHILEM op pairs\, which aren't generated with these particular patterns. Perhaps a case could be made for extending it for other ops as well? (Off the top of my head I don't how viable that is.)

The particular bug with the superlinear cache was that\, until the derecursivisation [why is my spell-checker complaining???]\, the cache cached both success and failure positions; afterwards it only cached failure positions. At the time I couldn't think of a case that actually exercised the positive cache; eventually someone found one\, but it was very obscure and was deemed not worth worrying about (it involved negative lookahead assertions IIRC).

-- Any [programming] language that doesn't occasionally surprise the novice will pay for it by continually surprising the expert. -- Larry Wall

p5pRT commented 13 years ago

From @rurban

2011/4/1 demerphq \demerphq@gmail\.com:

On 31 March 2011 12:05\, Alex Efros via RT \perlbug\-followup@perl\.org wrote:

Чтв. Мар. 31 01:16:36 2011\, demerphq писал:

We never fixed the superlinear cache bug that was broken some time around when the engine was derecursivized.

Id bet a dollar this is related.

BTW\, your YAPH signature remind me about -Mre=debug\, and I tried to compare these regexps this way. Results are interesting\, but\, sadly\, I don't know perl internals good enough to understand them:

$ perl -Mre=debug -e "/\<X[^>]*>(.*?)(?=\<a)/gis" and $ perl -Mre=debug -e "/\<X[^>]*(?i-:>)(.*?)(?=\<a)/gis" produce same output and work fast\, but this one differs and slow: $ perl -Mre=debug -e "/\<X[^>]*(?-i:>)(.*?)(?=\<a)/gis"

Last line in first two (fast) regex was: stclass EXACTF \<\ minlen 3 but in last (slow) regex it was: floating ">" at 2..2147483647 (checking floating) stclass EXACTF \<\ minlen 3

Right now we've huge amount of parsers working 24x7 with thousands of different regexps\, and it's SLOW. Can anyone give some recommendations about how to work around this issue - like "add /i flag to every regex whenever possible"\, etc.? Because for now I'm even doesn't sure adding this redundant /i doesn't slowdown one regexp while speeding up another...

You are encountering an optimisation that normally speeds things up by allowing perl to fail fast.

This is called the "longest constant string" optimization. During the optimization phase we scan the pattern and calculate the position\, and value of the longest constant strings which are at a constant and floating offset.

So in your "slow" pattern that would be for constant offset '\<X' and for floating offset ' >'.

For various reasons we prefer the floating string for this check.

My guess is that the performance impact here is that your data is long has a LOT of '>' in it\, which would then result in perl scanning to find the _last_ (iirc). Which is probably not the right thing.

Now for various reasons this optimisation is disabled with a case insensitive match. Which means that perl no longer sees the ">" and no longer considers it a viable optimisation target\, and therfore goes with a different strategy which is to look for the "\<X" and then match from there. Your data probably makes this a much better strategy.

Its hard to say what to do here. *Most* times this optimization is highly useful. In your case clearly not.

There certainly is not a generic solution here. For instance enabling /i might make a few of your patterns faster\, but likely will slow all the rest down.

Just in case someone else wants to add /i to make regex search faster: DO NOT add /i in the general case.

In the general case and esp. for short scripts adding /i makes regex searches slower. It has to demand-load the whole unicore/To/Fold case tables (that's 1.6MB on 32bit)\, and searches for all unicode cases also.

utf8_heavy.pl is demand-loaded\, also Carp is loaded. Carp loads warnings which in turn loads all warnings::register_categories which is about 68kB on 32-bit. -- Reini Urban http://phpwiki.org/ http://murbreak.at/

p5pRT commented 13 years ago

From @khwilliamson

On 04/06/2011 06:31 AM\, Reini Urban wrote:

2011/4/1 demerphq\demerphq@gmail\.com:

On 31 March 2011 12:05\, Alex Efros via RT\perlbug\-followup@perl\.org wrote:

Чтв. Мар. 31 01:16:36 2011\, demerphq писал:

We never fixed the superlinear cache bug that was broken some time around when the engine was derecursivized.

Id bet a dollar this is related.

BTW\, your YAPH signature remind me about -Mre=debug\, and I tried to compare these regexps this way. Results are interesting\, but\, sadly\, I don't know perl internals good enough to understand them:

$ perl -Mre=debug -e "/\<X[^>]*>(.*?)(?=\<a)/gis" and $ perl -Mre=debug -e "/\<X[^>]*(?i-:>)(.*?)(?=\<a)/gis" produce same output and work fast\, but this one differs and slow: $ perl -Mre=debug -e "/\<X[^>]*(?-i:>)(.*?)(?=\<a)/gis"

Last line in first two (fast) regex was: stclass EXACTF\<\ minlen 3 but in last (slow) regex it was: floating ">" at 2..2147483647 (checking floating) stclass EXACTF\<\ minlen 3

Right now we've huge amount of parsers working 24x7 with thousands of different regexps\, and it's SLOW. Can anyone give some recommendations about how to work around this issue - like "add /i flag to every regex whenever possible"\, etc.? Because for now I'm even doesn't sure adding this redundant /i doesn't slowdown one regexp while speeding up another...

You are encountering an optimisation that normally speeds things up by allowing perl to fail fast.

This is called the "longest constant string" optimization. During the optimization phase we scan the pattern and calculate the position\, and value of the longest constant strings which are at a constant and floating offset.

So in your "slow" pattern that would be for constant offset '\<X' and for floating offset '>'.

For various reasons we prefer the floating string for this check.

My guess is that the performance impact here is that your data is long has a LOT of '>' in it\, which would then result in perl scanning to find the _last_ (iirc). Which is probably not the right thing.

Now for various reasons this optimisation is disabled with a case insensitive match. Which means that perl no longer sees the ">" and no longer considers it a viable optimisation target\, and therfore goes with a different strategy which is to look for the "\<X" and then match from there. Your data probably makes this a much better strategy.

Its hard to say what to do here. *Most* times this optimization is highly useful. In your case clearly not.

There certainly is not a generic solution here. For instance enabling /i might make a few of your patterns faster\, but likely will slow all the rest down.

Just in case someone else wants to add /i to make regex search faster: DO NOT add /i in the general case.

In the general case and esp. for short scripts adding /i makes regex searches slower. It has to demand-load the whole unicore/To/Fold case tables (that's 1.6MB on 32bit)\, and searches for all unicode cases also.

utf8_heavy.pl is demand-loaded\, also Carp is loaded. Carp loads warnings which in turn loads all warnings::register_categories which is about 68kB on 32-bit.

I believe this is fixed now in blead.

p5pRT commented 12 years ago

From @jkeenan

On Sat Apr 09 12:22:28 2011\, public@khwilliamson.com wrote:

I believe this is fixed now in blead.

If so\, can we close this ticket?

Thank you very much. Jim Keenan

p5pRT commented 12 years ago

From @khwilliamson

On Sat Apr 21 13:44:21 2012\, jkeenan wrote:

On Sat Apr 09 12:22:28 2011\, public@khwilliamson.com wrote:

I believe this is fixed now in blead.

If so\, can we close this ticket?

Thank you very much. Jim Keenan

I was referring only to the part about /i slowing everything down by demand loading utf8_heavy and the casefold tables. That part was fixed; I have no knowledge about the other parts.

-- Karl Williamson

p5pRT commented 12 years ago

From @iabyn

On Sat\, Apr 21\, 2012 at 01:44:21PM -0700\, James E Keenan via RT wrote:

If so\, can we close this ticket?

With blead\, my reduced test script still exhibits quadratic slowdown in the last case.

-- Fire extinguisher (n) a device for holding open fire doors.

p5pRT commented 12 years ago

From @demerphq

On 22 April 2012 12:15\, Dave Mitchell \davem@iabyn\.com wrote:

On Sat\, Apr 21\, 2012 at 01:44:21PM -0700\, James E Keenan via RT wrote:

If so\, can we close this ticket?

With blead\, my reduced test script still exhibits quadratic slowdown in the last case.

Speaking from the point of view of this:

Fast: perl -Mre=debug -e "/\<X[^>]*(?i-:>)(.*?)(?=\<a)/gis Slow: perl -Mre=debug -e "/\<X[^>]*(?-i:>)(.*?)(?=\<a)/gis

As far as I can tell it comes down to us favouring a floating required string over a fixed required string.

(IIRC) For some reason (flip of a coin? instinct?) long ago a decision was made to

A) check only one of the two required strings B) favour the floating one over the fixed one.

When the ">" is case sensitive[1] we can't use it in the "required substr" logic\, so we end up with no "floating required string" and thus use the fixed required string "\<X" in that optimization\, and presumably fail fast[2]. When it is case-insensitive it ends up being the floating required string\, and we use it to do the search[3]\, which presumably is slow in this case.

There is comment relating to this in regcomp.c:5585[4]\, saying we prefer to search for floating strings\, however it seems to me that we could choose an alternative strategy if we wish\, such as to choose the longer string when there is both\, if only on probability grounds that the longer string is less likely to be randomly in the string\, and only prefer the floating when they are the same length? Or maybe start using both?

cheers Yves

[1] Which opens up an interesting conversation about case-neutral codepoints and optimisations [2] $ perl -Mre=debug -e "/\<X[^>]*(?i-:>)(.*?)(?=\<a)/gis" Compiling REx "\<X[^>]*(?i-:>)(.*?)(?=\<a)" Final program: 1: EXACTF \<\ (3) 3: STAR (15) 4: ANYOF[\0-=?-\377][{unicode_all}] (0) 15: EXACTF \<>> (17) 17: OPEN1 (19) 19: MINMOD (20) 20: STAR (22) 21: SANY (0) 22: CLOSE1 (24) 24: IFMATCH[0] (30) 26: EXACTF \<\ (28) 28: SUCCEED (0) 29: TAIL (30) 30: END (0) stclass EXACTF \<\ minlen 3 Freeing REx: "\<X[^>]*(?i-:>)(.*?)(?=\<a)"

[3] $ perl -Mre=debug -e "/\<X[^>]*(?-i:>)(.*?)(?=\<a)/gis" Compiling REx "\<X[^>]*(?-i:>)(.*?)(?=\<a)" Final program: 1: EXACTF \<\ (3) 3: STAR (15) 4: ANYOF[\0-=?-\377][{unicode_all}] (0) 15: EXACT \<>> (17) 17: OPEN1 (19) 19: MINMOD (20) 20: STAR (22) 21: SANY (0) 22: CLOSE1 (24) 24: IFMATCH[0] (30) 26: EXACTF \<\ (28) 28: SUCCEED (0) 29: TAIL (30) 30: END (0) floating ">" at 2..2147483647 (checking floating) stclass EXACTF \<\ minlen 3 Freeing REx: "\<X[^>]*(?-i:>)(.*?)(?=\<a) [4] regcomp.c /* A temporary algorithm prefers floated substr to fixed one to dig more info. */ if (longest_fixed_length > longest_float_length) { r->check_end_shift = r->anchored_end_shift; r->check_substr = r->anchored_substr; r->check_utf8 = r->anchored_utf8; r->check_offset_min = r->check_offset_max = r->anchored_offset; if (r->extflags & RXf_ANCH_SINGLE) r->extflags |= RXf_NOSCAN; } else { r->check_end_shift = r->float_end_shift; r->check_substr = r->float_substr; r->check_utf8 = r->float_utf8; r->check_offset_min = r->float_min_offset; r->check_offset_max = r->float_max_offset; }

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @demerphq

On 22 April 2012 12:37\, demerphq \demerphq@gmail\.com wrote:

[4] regcomp.c /* A temporary algorithm prefers floated substr to fixed one to dig more info. */ if (longest_fixed_length > longest_float_length) { r->check_end_shift = r->anchored_end_shift; r->check_substr = r->anchored_substr; r->check_utf8 = r->anchored_utf8; r->check_offset_min = r->check_offset_max = r->anchored_offset; if (r->extflags & RXf_ANCH_SINGLE) r->extflags |= RXf_NOSCAN; } else { r->check_end_shift = r->float_end_shift; r->check_substr = r->float_substr; r->check_utf8 = r->float_utf8; r->check_offset_min = r->float_min_offset; r->check_offset_max = r->float_max_offset; }

I should read more carefully. The code and comment don't match. And the reason we dont use "\<X" as the start of the pattern is because it is not a required substring as it is case insensitive\, it is actually a stclass optimization which is separate from the required substring optimization. Sorry for the confusion.

yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @demerphq

On 22 April 2012 13:03\, demerphq \demerphq@gmail\.com wrote:

On 22 April 2012 12:37\, demerphq \demerphq@gmail\.com wrote:

[4] regcomp.c /* A temporary algorithm prefers floated substr to fixed one to dig more info. */ if (longest_fixed_length > longest_float_length) { r->check_end_shift = r->anchored_end_shift; r->check_substr = r->anchored_substr; r->check_utf8 = r->anchored_utf8; r->check_offset_min = r->check_offset_max = r->anchored_offset; if (r->extflags & RXf_ANCH_SINGLE) r->extflags |= RXf_NOSCAN; } else { r->check_end_shift = r->float_end_shift; r->check_substr = r->float_substr; r->check_utf8 = r->float_utf8; r->check_offset_min = r->float_min_offset; r->check_offset_max = r->float_max_offset; }

I should read more carefully. The code and comment don't match. And the reason we dont use "\<X" as the start of the pattern is because it is not a required substring as it is case insensitive\, it is actually a stclass optimization which is separate from the required substring optimization. Sorry for the confusion.

The output looks like this:

Found floating substr ">" at offset 34... start_shift: 21 check_at: 34 s: 0 endpos: 35 This position contradicts STCLASS... Looking for floating substr starting at offset 35... Found floating substr ">" at offset 52... start_shift: 21 check_at: 52 s: 0 endpos: 53 This position contradicts STCLASS... Looking for floating substr starting at offset 53... Found floating substr ">" at offset 70... start_shift: 21 check_at: 70 s: 0 endpos: 71 This position contradicts STCLASS... Looking for floating substr starting at offset 71... Found floating substr ">" at offset 88... start_shift: 21 check_at: 88 s: 0 endpos: 89 This position contradicts STCLASS... Looking for floating substr starting at offset 89... Did not find floating substr ">"...

The quadratic nature comes from the STCLASS optimisation running from 0 each time (s:0). It doesnt remember what it already searched.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @demerphq

On 22 April 2012 13:24\, demerphq \demerphq@gmail\.com wrote:

On 22 April 2012 13:03\, demerphq \demerphq@gmail\.com wrote:

On 22 April 2012 12:37\, demerphq \demerphq@gmail\.com wrote:

[4] regcomp.c /* A temporary algorithm prefers floated substr to fixed one to dig more info. */ if (longest_fixed_length > longest_float_length) { r->check_end_shift = r->anchored_end_shift; r->check_substr = r->anchored_substr; r->check_utf8 = r->anchored_utf8; r->check_offset_min = r->check_offset_max = r->anchored_offset; if (r->extflags & RXf_ANCH_SINGLE) r->extflags |= RXf_NOSCAN; } else { r->check_end_shift = r->float_end_shift; r->check_substr = r->float_substr; r->check_utf8 = r->float_utf8; r->check_offset_min = r->float_min_offset; r->check_offset_max = r->float_max_offset; }

I should read more carefully. The code and comment don't match. And the reason we dont use "\<X" as the start of the pattern is because it is not a required substring as it is case insensitive\, it is actually a stclass optimization which is separate from the required substring optimization. Sorry for the confusion.

The output looks like this:

Found floating substr ">" at offset 34... start_shift: 21 check_at: 34 s: 0 endpos: 35 This position contradicts STCLASS... Looking for floating substr starting at offset 35... Found floating substr ">" at offset 52... start_shift: 21 check_at: 52 s: 0 endpos: 53 This position contradicts STCLASS... Looking for floating substr starting at offset 53... Found floating substr ">" at offset 70... start_shift: 21 check_at: 70 s: 0 endpos: 71 This position contradicts STCLASS... Looking for floating substr starting at offset 71... Found floating substr ">" at offset 88... start_shift: 21 check_at: 88 s: 0 endpos: 89 This position contradicts STCLASS... Looking for floating substr starting at offset 89... Did not find floating substr ">"...

The quadratic nature comes from the STCLASS optimisation running from 0 each time (s:0). It doesnt remember what it already searched.

This patch fixes it. It is a little naive and I cant decide if it is safe to apply at all.

The old code was slow because when it had a floating required string and a stclass item it would basically fall into a loop (via goto) doing

while not found: search for floating string (incrementally) search for stclass via find_byclass() from the start of the string to the position of the floating string

So the search for the stclass would repeatedly check the same ground\, producing the quadratic behavior. My patch changes this so its:

while not found: search for floating string (incrementally) search for stclass via find_byclass() incrementally up to the floating string

however the reason I am suspicious is that I wonder if there is a case where the floating string matches the STCLASS as well\, such that starting the stclass search at the point we found the floating string is unsafe. Which means that really find_byclass should manage this pointer as it is really the only one that knows where it safe to restart the string at.

While I write this I wonder if we could change the strategy so that once we use find_byclass we search the whole string instead of up to the floating item. We would then use the floating substring logic only if we actually could find the stclass... That might be an easier approach.

As a side comment\, whatever we do with all this logic\, it may have weird performance impacts on other stuff. This is all pretty much heuristics\, albeit ones that mostly work.

Yves

app:yorton@yam:yves/re-unicode:\~/git_tree/perl$ git diff

Inline Patch

```diff diff --git a/regexec.c b/regexec.c index 87d29c5..e023897 100644 --- a/regexec.c +++ b/regexec.c @@ -561,6 +561,7 @@ Perl_re_intuit_start(pTHX_ REGEXP * const rx, SV ```

sv, char strpos, I32 ml_anch; register char *other_last = NULL; /* other substr checked before this */ char *check_at = NULL; /* check substr found at this pos */ + char *checked_upto = NULL; /* how far into the string we have already checked */ const I32 multiline = prog->extflags & RXf_PMf_MULTILINE; RXi_GET_DECL(prog\,progi); #ifdef DEBUGGING @@ -1090\,11 +1091\,15 @@ Perl_re_intuit_start(pTHX_ REGEXP * const rx\, SV *sv\, char *strpos\, else endpos= strend;

- DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log\, "start_shift: %"IVdf" check_at: %"IVdf" s: %"IVdf" endpos: %"IVdf"\n"\, - (IV)start_shift\, (IV)(check_at - strbeg)\, (IV)(s - strbeg)\, (IV)(endpos - strbeg))); - + if (!checked_upto) + checked_upto = s; + DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log\, "start_shift: %"IVdf" check_at: %"IVdf" s: %"IVdf" endpos: %"IVdf" checked_upto: %"IVdf"\n"\, + (IV)start_shift\, (IV)(check_at - strbeg)\, (IV)(s - strbeg)\, (IV)(endpos - strbeg)\, (IV)(checked_upto- strbeg))) + t = s; - s = find_byclass(prog\, progi->regstclass\, s\, endpos\, NULL); + s = find_byclass(prog\, progi->regstclass\, checked_upto\, endpos\, NULL); + checked_upto = endpos; + if (!s) { #ifdef DEBUGGING const char *what = NULL;

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @demerphq

On 22 April 2012 14:15\, demerphq \demerphq@gmail\.com wrote:

On 22 April 2012 13:24\, demerphq \demerphq@gmail\.com wrote:

On 22 April 2012 13:03\, demerphq \demerphq@gmail\.com wrote:

On 22 April 2012 12:37\, demerphq \demerphq@gmail\.com wrote:

[4] regcomp.c /* A temporary algorithm prefers floated substr to fixed one to dig more info. */ if (longest_fixed_length > longest_float_length) { r->check_end_shift = r->anchored_end_shift; r->check_substr = r->anchored_substr; r->check_utf8 = r->anchored_utf8; r->check_offset_min = r->check_offset_max = r->anchored_offset; if (r->extflags & RXf_ANCH_SINGLE) r->extflags |= RXf_NOSCAN; } else { r->check_end_shift = r->float_end_shift; r->check_substr = r->float_substr; r->check_utf8 = r->float_utf8; r->check_offset_min = r->float_min_offset; r->check_offset_max = r->float_max_offset; }

I should read more carefully. The code and comment don't match. And the reason we dont use "\<X" as the start of the pattern is because it is not a required substring as it is case insensitive\, it is actually a stclass optimization which is separate from the required substring optimization. Sorry for the confusion.

The output looks like this:

Found floating substr ">" at offset 34... start_shift: 21 check_at: 34 s: 0 endpos: 35 This position contradicts STCLASS... Looking for floating substr starting at offset 35... Found floating substr ">" at offset 52... start_shift: 21 check_at: 52 s: 0 endpos: 53 This position contradicts STCLASS... Looking for floating substr starting at offset 53... Found floating substr ">" at offset 70... start_shift: 21 check_at: 70 s: 0 endpos: 71 This position contradicts STCLASS... Looking for floating substr starting at offset 71... Found floating substr ">" at offset 88... start_shift: 21 check_at: 88 s: 0 endpos: 89 This position contradicts STCLASS... Looking for floating substr starting at offset 89... Did not find floating substr ">"...

The quadratic nature comes from the STCLASS optimisation running from 0 each time (s:0). It doesnt remember what it already searched.

This patch fixes it. It is a little naive and I cant decide if it is safe to apply at all.

The old code was slow because when it had a floating required string and a stclass item it would basically fall into a loop (via goto) doing

while not found: search for floating string (incrementally) search for stclass via find_byclass() from the start of the string to the position of the floating string

So the search for the stclass would repeatedly check the same ground\, producing the quadratic behavior. My patch changes this so its:

while not found: search for floating string (incrementally) search for stclass via find_byclass() incrementally up to the floating string

however the reason I am suspicious is that I wonder if there is a case where the floating string matches the STCLASS as well\, such that starting the stclass search at the point we found the floating string is unsafe. Which means that really find_byclass should manage this pointer as it is really the only one that knows where it safe to restart the string at.

Actually\, find_byclass() failing on s..e means you cant match anywhere in s..e so restarting at e seems fine.

Unless someone can see a counter argument i will apply this sometime.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @rjbs

* demerphq \demerphq@gmail\.com [2012-04-22T08:25:58]

Actually\, find_byclass() failing on s..e means you cant match anywhere in s..e so restarting at e seems fine.

Unless someone can see a counter argument i will apply this sometime.

Like 5.17.0? Sounds awesome! Fast regexp for everyone!

-- rjbs

p5pRT commented 12 years ago

From @demerphq

On 22 April 2012 19:45\, Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* demerphq \demerphq@gmail\.com [2012-04-22T08:25:58]

Actually\, find_byclass() failing on s..e means you cant match anywhere in s..e so restarting at e seems fine.

Unless someone can see a counter argument i will apply this sometime.

Like 5.17.0? Sounds awesome! Fast regexp for everyone!

Yes. But......

Can we just start a branch for 5.16 and let blead carry on please?

Then I dont have to remember what patches I have pending and blead perl can carry on like normal.

This thing of blocking blead while we roll stuff out is just silly in a vcs that manages branches as well as git does.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @rjbs

* demerphq \demerphq@gmail\.com [2012-04-23T10:31:26]

Can we just start a branch for 5.16 and let blead carry on please?

I've suggested doing this before and gotten both "please yes" and "please no." I would like to branch blead sooner than "after release" but not to have to argue about it any time soon. So\, my preference is to do just as you suggest... around 5.17.9\, with lots of notice and whatever argument is needed happening before then.

With the above in mind\, please let's save the likely argument about the argument until a little later? :)

-- rjbs

p5pRT commented 12 years ago

From @demerphq

On 23 April 2012 16:53\, Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* demerphq \demerphq@gmail\.com [2012-04-23T10:31:26]

Can we just start a branch for 5.16 and let blead carry on please?

I've suggested doing this before and gotten both "please yes" and "please no." I would like to branch blead sooner than "after release" but not to have to argue about it any time soon. So\, my preference is to do just as you suggest... around 5.17.9\, with lots of notice and whatever argument is needed happening before then.

With the above in mind\, please let's save the likely argument about the argument until a little later? :)

Sure. Until then maybe we should create a branch namespace for all the stuff waiting.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @cpansprout

On Mon Apr 23 08:12:29 2012\, demerphq wrote:

On 23 April 2012 16:53\, Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* demerphq \demerphq@gmail\.com [2012-04-23T10:31:26]

Can we just start a branch for 5.16 and let blead carry on please?

I've suggested doing this before and gotten both "please yes" and "please no." I would like to branch blead sooner than "after release" but not to have to argue about it any time soon. �So\, my preference is to do just as you suggest... around 5.17.9\, with lots of notice and whatever argument is needed happening before then.

With the above in mind\, please let's save the likely argument about the argument until a little later? :)

Sure. Until then maybe we should create a branch namespace for all the stuff waiting.

I’m doing more or less that with my sprout/misc-post-5.16 branch. If you want to add to it\, go ahead. :-)

--

Father Chrysostomos

p5pRT commented 12 years ago

From @cpansprout

On Sun Apr 22 05:26:30 2012\, demerphq wrote:

On 22 April 2012 14:15\, demerphq \demerphq@gmail\.com wrote:

On 22 April 2012 13:24\, demerphq \demerphq@gmail\.com wrote:

On 22 April 2012 13:03\, demerphq \demerphq@gmail\.com wrote:

On 22 April 2012 12:37\, demerphq \demerphq@gmail\.com wrote:

[4] regcomp.c /* A temporary algorithm prefers floated substr to fixed one to dig more info. */ � � � �if (longest_fixed_length > longest_float_length) { � � � � � �r->check_end_shift = r->anchored_end_shift; � � � � � �r->check_substr = r->anchored_substr; � � � � � �r->check_utf8 = r->anchored_utf8; � � � � � �r->check_offset_min = r->check_offset_max = r- anchored_offset; � � � � � �if (r->extflags & RXf_ANCH_SINGLE) � � � � � � � �r->extflags |= RXf_NOSCAN; � � � �} � � � �else { � � � � � �r->check_end_shift = r->float_end_shift; � � � � � �r->check_substr = r->float_substr; � � � � � �r->check_utf8 = r->float_utf8; � � � � � �r->check_offset_min = r->float_min_offset; � � � � � �r->check_offset_max = r->float_max_offset; � � � �}

I should read more carefully. The code and comment don't match. And the reason we dont use "\<X" as the start of the pattern is because it is not a required substring as it is case insensitive\, it is actually a stclass optimization which is separate from the required substring optimization. Sorry for the confusion.

The output looks like this:

Found floating substr ">" at offset 34... start_shift: 21 check_at: 34 s: 0 endpos: 35 This position contradicts STCLASS... Looking for floating substr starting at offset 35... Found floating substr ">" at offset 52... start_shift: 21 check_at: 52 s: 0 endpos: 53 This position contradicts STCLASS... Looking for floating substr starting at offset 53... Found floating substr ">" at offset 70... start_shift: 21 check_at: 70 s: 0 endpos: 71 This position contradicts STCLASS... Looking for floating substr starting at offset 71... Found floating substr ">" at offset 88... start_shift: 21 check_at: 88 s: 0 endpos: 89 This position contradicts STCLASS... Looking for floating substr starting at offset 89... Did not find floating substr ">"...

The quadratic nature comes from the STCLASS optimisation running from 0 each time (s:0). It doesnt remember what it already searched.

This patch fixes it. It is a little naive and I cant decide if it is safe to apply at all.

The old code was slow because when it had a floating required string and a stclass item it would basically fall into a loop (via goto) doing

while not found: �search for floating string (incrementally) �search for stclass via find_byclass() from the start of the string to the position of the floating string

So the search for the stclass would repeatedly check the same ground\, producing the quadratic behavior. My patch changes this so its:

while not found: �search for floating string (incrementally) �search for stclass via find_byclass() incrementally up to the floating string

however the reason I am suspicious is that I wonder if there is a case where the floating string matches the STCLASS as well\, such that starting the stclass search at the point we found the floating string is unsafe. Which means that really find_byclass should manage this pointer as it is really the only one that knows where it safe to restart the string at.

Actually\, find_byclass() failing on s..e means you cant match anywhere in s..e so restarting at e seems fine.

Unless someone can see a counter argument i will apply this sometime.

Like now? :-)

--

Father Chrysostomos

p5pRT commented 12 years ago

From @cpansprout

Fixed in dee29c6b5be.

p5pRT commented 12 years ago

@cpansprout - Status changed from 'open' to 'resolved'

p5pRT commented 12 years ago

@cpansprout - Status changed from 'resolved' to 'open'

p5pRT commented 12 years ago

From @cpansprout

The original fix was reverted by commit e8b06a558c\, since it caused a regression (#113400).

This bug is now fixed by d8080198.

p5pRT commented 12 years ago

@cpansprout - Status changed from 'open' to 'resolved'

Perl / perl5