Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.93k stars 553 forks source link

perlre(1) and paired double quote regex searches -- #15498

Open p5pRT opened 8 years ago

p5pRT commented 8 years ago

Migrated from rt.perl.org#128864 (status was 'open')

Searchable as RT128864$

p5pRT commented 8 years ago

From aab@purdue.edu

This is a comment and a possible suggestion.

I've been playing around (read a LOT of hacking) with the CPAN "JavaScript​::Beautifier" module to increase its performance and completeness. In one part of its 'get_next_token()' sub\, it extracts quoted strings. I swiped the (first encountered) perlre(1) example

  /"(?​:[^"\\]++|\\.)*+"/

as the basis for the regex pattern to use. Unfortunately\, during my testing\, I found a javascript file\, https://github.com/ternjs/acorn/blob/master/test/jquery-string.js, that causes that dread "Complex regular subexpression recursion limit" error message. I next tried variations of the second example

  /"(?>(?​:(?>[^"\\]+)|\\.)*)"/

with the same result. FWIW - the opening quote in the 'jquery-string.js' file is on the first line\, the closing quote is on last line\, 10314 lines later\, and the file is full of \" .

I ended up using the regex expression

  /\G((?>$peek.*?(?\<!\\)$peek))/s

where $peek is the quote character. It seems to work fine but I'll bet that there are probably a bunch of "gotchas" that I haven't encountered yet.

Suggestion​: update the perlre(1) example(s) ala double quoted strings to something that doesn't go "recursion limit" flaky.

-- Thanks\,

-- Paul Townsend

p5pRT commented 8 years ago

From @cpansprout

On Sat Aug 06 14​:10​:37 2016\, aab@​purdue.edu wrote​:

This is a comment and a possible suggestion.

I've been playing around (read a LOT of hacking) with the CPAN "JavaScript​::Beautifier" module to increase its performance and completeness. In one part of its 'get_next_token()' sub\, it extracts quoted strings. I swiped the (first encountered) perlre(1) example

/"(?​:[^"\\]++|\\.)*+"/

as the basis for the regex pattern to use. Unfortunately\, during my testing\, I found a javascript file\, https://github.com/ternjs/acorn/blob/master/test/jquery-string.js, that causes that dread "Complex regular subexpression recursion limit" error message.

I am surprised. I thought the ++ would avoid that.

I next tried variations of the second example

/"(?>(?​:(?>[^"\\]+)|\\.)*)"/

with the same result. FWIW - the opening quote in the 'jquery- string.js' file is on the first line\, the closing quote is on last line\, 10314 lines later\, and the file is full of \" .

I ended up using the regex expression

/\G((?>$peek.*?(?\<!\\)$peek))/s

where $peek is the quote character. It seems to work fine but I'll bet that there are probably a bunch of "gotchas" that I haven't encountered yet.

Such as this code snippet​:

  alert("\\" + "\\");

In a browser\, that pops up an alert message with two backslashes. If I run your regular expression on it without the \G​:

$peek = '"'; \<\<\ENd =~ /((?>$peek.*?(?\<!\\)$peek))/s; alert("\\" + "\\"); ENd print "$1\n"; __END__

it gives me​:

"\\" + "

FWIW\, JE uses​:

  /\G (?​: '([^'\\]*(?​:\\.[^'\\]*)*)'   |   "([^"\\]*(?​:\\.[^"\\]*)*)" )/xcgs

The basic pattern is​:

  normal* ( special normal* )*

--

Father Chrysostomos

p5pRT commented 8 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 8 years ago

From aab@purdue.edu

Yeah\, I found the "\\" not working. Boo Hiss.

________________________________ From​: Father Chrysostomos via RT \perlbug\-followup@&#8203;perl\.org Sent​: Saturday\, August 6\, 2016 7​:10​:14 PM To​: Paul Townsend Subject​: [perl #128864] perlre(1) and paired double quote regex searches --

On Sat Aug 06 14​:10​:37 2016\, aab@​purdue.edu wrote​:

This is a comment and a possible suggestion.

I've been playing around (read a LOT of hacking) with the CPAN "JavaScript​::Beautifier" module to increase its performance and completeness. In one part of its 'get_next_token()' sub\, it extracts quoted strings. I swiped the (first encountered) perlre(1) example

/"(?​:[^"\\]++|\\.)*+"/

as the basis for the regex pattern to use. Unfortunately\, during my testing\, I found a javascript file\, https://github.com/ternjs/acorn/blob/master/test/jquery-string.js, that causes that dread "Complex regular subexpression recursion limit" error message.

I am surprised. I thought the ++ would avoid that.

I next tried variations of the second example

/"(?>(?​:(?>[^"\\]+)|\\.)*)"/

with the same result. FWIW - the opening quote in the 'jquery- string.js' file is on the first line\, the closing quote is on last line\, 10314 lines later\, and the file is full of \" .

I ended up using the regex expression

/\G((?>$peek.*?(?\<!\\)$peek))/s

where $peek is the quote character. It seems to work fine but I'll bet that there are probably a bunch of "gotchas" that I haven't encountered yet.

Such as this code snippet​:

  alert("\\" + "\\");

In a browser\, that pops up an alert message with two backslashes. If I run your regular expression on it without the \G​:

$peek = '"'; \<\<\ENd =~ /((?>$peek.*?(?\<!\\)$peek))/s; alert("\\" + "\\"); ENd print "$1\n"; __END__

it gives me​:

"\\" + "

FWIW\, JE uses​:

  /\G (?​: '([^'\\]*(?​:\\.[^'\\]*)*)'   |   "([^"\\]*(?​:\\.[^"\\]*)*)" )/xcgs

The basic pattern is​:

  normal* ( special normal* )*

--

Father Chrysostomos

p5pRT commented 8 years ago

From aab@purdue.edu

Further testing indicates that a small change in the first perlre(1) example enables that pattern to work even on the jsquery-string.js file

  old​: /"(?​:[^"\\]++|\\.)*+"/

  new​: /"(?​:[^"\\]++|(\\.)++)*+"/

-- Thanks\,

-- Paul Townsend

________________________________ From​: Father Chrysostomos via RT \perlbug\-followup@&#8203;perl\.org Sent​: Saturday\, August 6\, 2016 7​:10 PM To​: Paul Townsend Subject​: [perl #128864] perlre(1) and paired double quote regex searches --

On Sat Aug 06 14​:10​:37 2016\, aab@​purdue.edu wrote​:

This is a comment and a possible suggestion.

I've been playing around (read a LOT of hacking) with the CPAN "JavaScript​::Beautifier" module to increase its performance and completeness. In one part of its 'get_next_token()' sub\, it extracts quoted strings. I swiped the (first encountered) perlre(1) example

/"(?​:[^"\\]++|\\.)*+"/

as the basis for the regex pattern to use. Unfortunately\, during my testing\, I found a javascript file\, https://github.com/ternjs/acorn/blob/master/test/jquery-string.js, that causes that dread "Complex regular subexpression recursion limit" error message.

I am surprised. I thought the ++ would avoid that.

I next tried variations of the second example

/"(?>(?​:(?>[^"\\]+)|\\.)*)"/

with the same result. FWIW - the opening quote in the 'jquery- string.js' file is on the first line\, the closing quote is on last line\, 10314 lines later\, and the file is full of \" .

I ended up using the regex expression

/\G((?>$peek.*?(?\<!\\)$peek))/s

where $peek is the quote character. It seems to work fine but I'll bet that there are probably a bunch of "gotchas" that I haven't encountered yet.

Such as this code snippet​:

  alert("\\" + "\\");

In a browser\, that pops up an alert message with two backslashes. If I run your regular expression on it without the \G​:

$peek = '"'; \<\<\ENd =~ /((?>$peek.*?(?\<!\\)$peek))/s; alert("\\" + "\\"); ENd print "$1\n"; __END__

it gives me​:

"\\" + "

FWIW\, JE uses​:

  /\G (?​: '([^'\\]*(?​:\\.[^'\\]*)*)'   |   "([^"\\]*(?​:\\.[^"\\]*)*)" )/xcgs

The basic pattern is​:

  normal* ( special normal* )*

--

Father Chrysostomos

p5pRT commented 8 years ago

From @abigail

On Sat\, Aug 06\, 2016 at 04​:10​:14PM -0700\, Father Chrysostomos via RT wrote​:

On Sat Aug 06 14​:10​:37 2016\, aab@​purdue.edu wrote​:

This is a comment and a possible suggestion.

I've been playing around (read a LOT of hacking) with the CPAN "JavaScript​::Beautifier" module to increase its performance and completeness. In one part of its 'get_next_token()' sub\, it extracts quoted strings. I swiped the (first encountered) perlre(1) example

/"(?​:[^"\\]++|\\.)*+"/

as the basis for the regex pattern to use. Unfortunately\, during my testing\, I found a javascript file\, https://github.com/ternjs/acorn/blob/master/test/jquery-string.js, that causes that dread "Complex regular subexpression recursion limit" error message.

I am surprised. I thought the ++ would avoid that.

I'm surprised as well. Mostly by the claim it's the most efficient way of finding such strings. The alternation inside the loop is a red flag.

[ Snip ]

FWIW\, JE uses​:

    /\\G \(?&#8203;: '\(\[^'\\\\\]\*\(?&#8203;:\\\\\.\[^'\\\\\]\*\)\*\)'
              |
            "\(\[^"\\\\\]\*\(?&#8203;:\\\\\.\[^"\\\\\]\*\)\*\)"  \)/xcgs

The basic pattern is​:

normal\* \( special normal\* \)\*

Yes. That's the loop unrolling technique Jeffrey Friedl taught us almost 20 years ago.

A benchmark shows it's faster than the suggestion made by perlre. More surprisingly\, using possessive quantifiers is slightly slower\, both in the perlre suggestion\, and the unrolling technique​:

  #!/opt/perl/bin/perl

  use 5.010;

  use strict;   use warnings;   no warnings 'syntax';

  my $RUNS = 1_000;

  my %patterns = (   unroll_pos => qr /"[^"\\]*+(?​:\\(?s​:.)[^"\\]*+)*+"/\,   unroll_reg => qr /"[^"\\]*(?​:\\(?s​:.)[^"\\]*)*"/\,   perlre_pos => qr /"(?​:[^"\\]++|\\(?s​:.))*+"/\,   perlre_reg => qr /"(?​:[^"\\]+|\\(?s​:.))*"/\,   );

  my $text = `cat jquery-string.js`;

  my %results;

  while (my ($name\, $pattern) = each %patterns) {   my ($u1\, $s1) = times;   foreach (1 .. $RUNS) {   my @​r = $text =~ /$pattern/g;   }   my ($u2\, $s2) = times;   $results {$name} = $u2 + $s2 - $u1 - $s1;   }

  foreach my $name (sort {$results {$a} \<=> $results {$b}} keys %results) {   printf "Avg. run time for '%s'​: %.4f\n" => $name\,   $results {$name} / $RUNS;   }

  __END__   Avg. run time for 'unroll_reg'​: 0.0998   Avg. run time for 'unroll_pos'​: 0.1055   Avg. run time for 'perlre_reg'​: 0.1131   Avg. run time for 'perlre_pos'​: 0.1195

It's tempting to change the pattern in perlre to the more efficient unrolling technique\, but the pattern is there to demonstrate how possessive quantifiers work.

But we may want to scratch the part about it claiming to be the most efficient way.

Abigail

p5pRT commented 8 years ago

From @xsawyerx

On 09/22/2016 03​:25 AM\, Abigail wrote​:

[...] It's tempting to change the pattern in perlre to the more efficient unrolling technique\, but the pattern is there to demonstrate how possessive quantifiers work.

What about adding a line that says\, "This is meant to demonstrate possessive quantifiers. If you're looking for efficiency\, you could write the following​:..." and adding the more efficient example?

p5pRT commented 8 years ago

From @jkeenan

On Sat Aug 06 14​:10​:37 2016\, aab@​purdue.edu wrote​:

This is a comment and a possible suggestion.

To return to the original problem for a moment ...

I've been playing around (read a LOT of hacking) with the CPAN "JavaScript​::Beautifier" module to increase its performance and completeness. In one part of its 'get_next_token()' sub\, it extracts quoted strings. I swiped the (first encountered) perlre(1) example

/"(?​:[^"\\]++|\\.)*+"/

as the basis for the regex pattern to use. Unfortunately\, during my testing\, I found a javascript file\, https://github.com/ternjs/acorn/blob/master/test/jquery-string.js, that causes that dread "Complex regular subexpression recursion limit" error message. I next tried variations of the second example

/"(?>(?​:(?>[^"\\]+)|\\.)*)"/

with the same result. FWIW - the opening quote in the 'jquery- string.js' file is on the first line\, the closing quote is on last line\, 10314 lines later\, and the file is full of \" .

I was not able to reproduce that error message. See attached 128864-test-long-js.pl. Running that program produced​:

##### $ perl 128864-test-long-js.pl No No No No #####

Is there a flaw in my test program or does this error only occur on certain platforms?

Thank you very much.

-- James E Keenan (jkeenan@​cpan.org)

p5pRT commented 8 years ago

From @jkeenan

128864-test-long-js.pl

p5pRT commented 8 years ago

From aab@purdue.edu

I just realized the other day that I specified he wrong URL for 'jquery-string.js'. The original

  https://github.com/ternjs/acorn/blob/master/test/jquery-string.js

is a browserfied HTML version of the raw file. The raw file itself is at

  https​://raw.githubusercontent.com/ternjs/acorn/master/test/jquery-string.js\https://raw.githubusercontent.com/ternjs/acorn

Reading between the lines\, I suspect that the last several commenters have used the HTML file as the source.

I apologize for the confusion. I'm just learning the ins/outs of the github site

FWIW - the current HTML file has 83692 quoted strings. All of the raw files double quotes have been converted to '"'. The raw file has a single quoted string that's 338729 characters long and contains a very large number of '\.' pairs many of which are consecutive.

.

All of Abigail's test patterns fail using the raw file. If you change her "unroll_pos" pattern

  /"[^"\\]*+(?​:\\(?s​:.)[^"\\]*+)*+"/

to

  /"[^"\\]*+(?​:(?s​:\\.)++[^"\\]*+)*+"/

I apologize for the confusion. I'm just learning the ins/outs of the github site

/

it works but is slightly slower in general.

My testing of a modified Abigail script tells me that possesive patterns are faster for the raw file but slower for the HTML file.

-- Paul Townsend

________________________________ From​: James E Keenan via RT \perlbug\-followup@&#8203;perl\.org Sent​: Wednesday\, September 28\, 2016 5​:34 PM To​: Paul Townsend Subject​: [perl #128864] perlre(1) and paired double quote regex searches --

On Sat Aug 06 14​:10​:37 2016\, aab@​purdue.edu wrote​:

This is a comment and a possible suggestion.

To return to the original problem for a moment ...

James E Keenan (jkeenan@​cpan.org)

p5pRT commented 8 years ago

From @jkeenan

On Thu Sep 29 12​:34​:27 2016\, aab@​purdue.edu wrote​:

I just realized the other day that I specified he wrong URL for 'jquery-string.js'. The original

https://github.com/ternjs/acorn/blob/master/test/jquery-string.js

is a browserfied HTML version of the raw file. The raw file itself is at

https://raw.githubusercontent.com/ternjs/acorn/master/test/jquery- string.js\https://raw.githubusercontent.com/ternjs/acorn

Okay\, I have just now used 'wget' to obtain that file and have run it thru my script. I got exactly the same results as previously\, i.e.\, 4 "No\n" with no fatal errors.

Thank you very much.

-- James E Keenan (jkeenan@​cpan.org)

p5pRT commented 8 years ago

From aab@purdue.edu

Okay\, here's my contribution. I took Abigails's script and went a bit overboard hacking it (see attachment).

# Pattern name modifiers​: # pos = uses possessive quantifiers\, i.e.\, *+ instead of * and ++ instead of + # reg = uses non-possessive quantifiers\, i.e.\, * or + or nothing # pr = uses a combination of possessive/non-possessive quantifiers # vb = uses variable backslashdot pattern\, i.e.\, (?​:\\.)* or (?​:\\.)+ # na = does not use alternation(|)

URL = https://raw.githubusercontent.com/ternjs/acorn/master/test/jquery-string.js Average CPU Seconds per regex invocation per pattern​: invocations = 60000 pattern-name CPU/regex state 'qr/.../' compiled-pattern unroll_pr2_vb 0.00572 pass (?^​:"[^"\\]*(?​:(?s​:\\.)++[^"\\]*)*") unroll_pr1_vb 0.00577 pass (?^​:"[^"\\]*(?​:(?s​:\\.)++[^"\\]*)*+") perlre_pos_vbna 0.00597 pass (?^​:"(?​:[^"\\]++(?s​:\\.)*+)*+") perlre_pr2_vbna 0.00600 pass (?^​:"(?​:[^"\\]+(?s​:\\.)*+)*") unroll_pos_vb 0.00602 pass (?^​:"[^"\\]*+(?​:(?s​:\\.)++[^"\\]*+)*+") perlre_pr1_vbna 0.00752 pass (?^​:"(?​:[^"\\]+(?s​:\\.)*)*+") unroll_reg_vb 0.00759 pass (?^​:"[^"\\]*(?​:(?s​:\\.)+[^"\\]*)*") perlre_reg_vbna 0.00762 pass (?^​:"(?​:[^"\\]+(?s​:\\.)*)*") perlre_pos_vb 0.01020 pass (?^​:"(?​:[^"\\]++|(?s​:\\.)*+)*+") perlre_reg_vb 0.01151 pass (?^​:"(?​:[^"\\]+|(?s​:\\.)*)*") perlre_reg 0.13182 fail (?^​:"(?​:[^"\\]+|\\(?s​:.))*") unroll_reg 0.14338 fail (?^​:"[^"\\]*(?​:\\(?s​:.)[^"\\]*)*") unroll_pos 4.10578 fail (?^​:"[^"\\]*+(?​:\\(?s​:.)[^"\\]*+)*+") perlre_pos 6.31433 fail (?^​:"(?​:[^"\\]++|\\(?s​:.))*+")

URL = https://github.com/ternjs/acorn/blob/master/test/jquery-string.js Average CPU Seconds per regex invocation per pattern​: invocations = 60000 pattern-name CPU/regex state 'qr/.../' compiled-pattern unroll_reg 0.11773 pass (?^​:"[^"\\]*(?​:\\(?s​:.)[^"\\]*)*") unroll_reg_vb 0.11941 pass (?^​:"[^"\\]*(?​:(?s​:\\.)+[^"\\]*)*") unroll_pos 0.12037 pass (?^​:"[^"\\]*+(?​:\\(?s​:.)[^"\\]*+)*+") unroll_pr2_vb 0.12207 pass (?^​:"[^"\\]*(?​:(?s​:\\.)++[^"\\]*)*") unroll_pr1_vb 0.12430 pass (?^​:"[^"\\]*(?​:(?s​:\\.)++[^"\\]*)*+") unroll_pos_vb 0.12483 pass (?^​:"[^"\\]*+(?​:(?s​:\\.)++[^"\\]*+)*+") perlre_reg_vbna 0.12925 pass (?^​:"(?​:[^"\\]+(?s​:\\.)*)*") perlre_pr1_vbna 0.12994 pass (?^​:"(?​:[^"\\]+(?s​:\\.)*)*+") perlre_reg 0.13298 pass (?^​:"(?​:[^"\\]+|\\(?s​:.))*") perlre_pr2_vbna 0.13323 pass (?^​:"(?​:[^"\\]+(?s​:\\.)*+)*") perlre_reg_vb 0.13626 pass (?^​:"(?​:[^"\\]+|(?s​:\\.)*)*") perlre_pos_vbna 0.13668 pass (?^​:"(?​:[^"\\]++(?s​:\\.)*+)*+") perlre_pos 0.13795 pass (?^​:"(?​:[^"\\]++|\\(?s​:.))*+") perlre_pos_vb 0.14314 pass (?^​:"(?​:[^"\\]++|(?s​:\\.)*+)*+")

The first set is for the raw 'jquery-string.js' file and the second for its HTMLized version. Note that Abigail's four patterns all fail with the raw file but her 'unroll_reg' pattern is the fastest for the HTMLized file. Actually\, all of the 'unroll_...' patterns are generally faster than any of the 'perlre_...' ones for that file.

The raw file has a single quoted string that is 338729 characters long and contains a large count of (often consecutive) '\.' pairs.

The HTML file contains 83692 quoted strings and all of the raw files double quotes have been converted to '"'. Although I didn't really check\, I don't think any of the quoted strings contains a '\.' pair.

If you KNOW that the files you are scanning do not have a large number of '\.' pairs\, especially consecutive ones\, use the 'unroll_reg' pattern. The possessive pattern\, 'unroll_pos' is not far behind. If you want a pattern that is fast and works for all quoted strings\, use the 'unroll_reg_vb' pattern. If you want a pattern that's fast for files with a lot of '\.' pairs\, use the 'unroll_pr2_vb' pattern.

It should be noted that the script was executed on a Dell Optiplex-760 box with two 3GHz CPUs using a Cygwin 2.6.0 installation that sits on top of a Windows Vista 32-bit OS. Perl's Time​::HiRes​::clock() and times() functions use a clock that has a 1 millisecond resolution so some of the raw file timings are probably variable. I'm not sure what it means but Time​::HiRes​::clock_getres() says approximately 15 milliseconds. Boy\, it would be nice to have clock with a higher resolution.

Mr. Keenan. You can see the perl error messages describing why Abigail's patterns failed if you run the attached script with the '-d' (debug) option.

The script can also be used check the patterns against your favorite quoted string test file by using the '-t file' option.

-- Thanks\, -- Paul Townsend

________________________________ From​: James E Keenan via RT \perlbug\-followup@&#8203;perl\.org Sent​: Thursday\, September 29\, 2016 6​:13 PM To​: Paul Townsend Subject​: [perl #128864] perlre(1) and paired double quote regex searches --

On Thu Sep 29 12​:34​:27 2016\, aab@​purdue.edu wrote​:

I just realized the other day that I specified he wrong URL for 'jquery-string.js'. The original

https://github.com/ternjs/acorn/blob/master/test/jquery-string.js

is a browserfied HTML version of the raw file. The raw file itself is at

https://raw.githubusercontent.com/ternjs/acorn/master/test/jquery- string.js\https://raw.githubusercontent.com/ternjs/acorn

Okay\, I have just now used 'wget' to obtain that file and have run it thru my script. I got exactly the same results as previously\, i.e.\, 4 "No\n" with no fatal errors.

Thank you very much.

-- James E Keenan (jkeenan@​cpan.org)

p5pRT commented 8 years ago

From aab@purdue.edu

perl_quote_patterns