Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.96k stars 555 forks source link

large speed difference in simple regex between pcre and GNU regex #9638

Closed p5pRT closed 15 years ago

p5pRT commented 15 years ago

Migrated from rt.perl.org#62718 (status was 'rejected')

Searchable as RT62718$

p5pRT commented 15 years ago

From kees@outflux.net

Created by kees@outflux.net

This is a bug report for perl from kees@​outflux.net\, generated with the help of perlbug 1.36 running under perl 5.10.0.

----------------------------------------------------------------- In certain situations\, there is a giant slow-down in pcre\, when compared to grep​:

$ perl -e 'print "0" x 600' | time pcregrep '0+a?0*(XY?|Z)' Command exited with non-zero status 1 3.45user 0.00system 0​:03.47elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

$ perl -e 'print "0" x 600' | time egrep '0+a?0*(XY?|Z)' Command exited with non-zero status 1 0.00user 0.00system 0​:00.00elapsed 44%CPU (0avgtext+0avgdata 0maxresident)k

Dropping "a?" makes no difference\, but illustrates the style of regex I was using. All other elements are required to hit the slow-down.

Thanks\,

-Kees

Perl Info ``` Flags: category=core severity=low Site configuration information for perl 5.10.0: Configured by Debian Project at Mon Jan 5 22:36:05 UTC 2009. Summary of my perl5 (revision 5 version 10 subversion 0) configuration: Platform: osname=linux, osvers=2.6.24-16-server, archname=x86_64-linux-gnu-thread-multi uname='linux yellow 2.6.24-16-server #1 smp thu apr 10 13:15:38 utc 2008 x86_64 gnulinux ' config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=define, use64bitall=define, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2 -g', cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include' ccversion='', gccversion='4.3.3 20081217 (prerelease)', gccosandvers='' intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16 ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='cc', ldflags =' -L/usr/local/lib' libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64 libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt perllibs=-ldl -lm -lpthread -lc -lcrypt libc=/lib/libc-2.9.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0 gnulibc_version='2.9' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E' cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib' Locally applied patches: @INC for perl 5.10.0: /etc/perl /usr/local/lib/perl/5.10.0 /usr/local/share/perl/5.10.0 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.10 /usr/share/perl/5.10 /usr/local/lib/site_perl . Environment for perl 5.10.0: HOME=/home/kees LANG=en_US.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games:/usr/local/sbin:/usr/sbin:/sbin:/home/kees/bin PERL_BADLANG (unset) SHELL=/bin/bash ```
p5pRT commented 15 years ago

From @demerphq

2009/1/24 via RT Kees Cook \perlbug\-followup@​perl\.org​:

# New Ticket Created by Kees Cook # Please include the string​: [perl #62718] # in the subject line of all future correspondence about this issue. # \<URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=62718 >

This is a bug report for perl from kees@​outflux.net\, generated with the help of perlbug 1.36 running under perl 5.10.0.

----------------------------------------------------------------- In certain situations\, there is a giant slow-down in pcre\, when compared to grep​:

$ perl -e 'print "0" x 600' | time pcregrep '0+a?0*(XY?|Z)' Command exited with non-zero status 1 3.45user 0.00system 0​:03.47elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

$ perl -e 'print "0" x 600' | time egrep '0+a?0*(XY?|Z)' Command exited with non-zero status 1 0.00user 0.00system 0​:00.00elapsed 44%CPU (0avgtext+0avgdata 0maxresident)k

Dropping "a?" makes no difference\, but illustrates the style of regex I was using. All other elements are required to hit the slow-down.

Judging both by the name and by the description below

  http​://linuxcommand.org/man_pages/pcregrep1.html

pcregrep is a wrapper around the PCRE library which is an independent regular expression engine which supports the Perl regular expression syntax. It does not actually use perl\, or have anything to do with perl. So this ticket should be filed against that project and not here.

And I can happily report that when I do your test using true perl grep I see much better performance than pcregrep​:

$ perl -e 'print "0" x 600' | time perl -ne'print if /0+a?0*(XY?|Z)/' 0.02user 0.00system 0​:00.02elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+457minor)pagefaults 0swaps

Cheers\, yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 15 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 15 years ago

@rgs - Status changed from 'open' to 'rejected'

p5pRT commented 15 years ago

From @chorny

Hello.

1. PCRE is not Perl. PCRE can be used by Perl using a CPAN module\, but Perl uses it's own library for regex. 2. Similar program in Perl my $a="0" x 600; use Benchmark '​:hireswallclock'; my $time_start = Benchmark->new(); $a=~/0+a?0*(XY?|Z)/; my $time_end = Benchmark->new(); print timestr(timediff($time_end\, $time_start))\,"\n";

gives result on Redhat​: 0.027082 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)

It is not as fast as egrep\, but Perl understands much more complex regular expressions than egrep. Also you can use (?>0+) for optimization\, it even works in my PCRE.

perl -e 'print "0" x 600' | time pcregrep '0+a?0*(XY?|Z)' Command exited with non-zero status 1 6.43user 0.00system 0​:06.45elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (1major+142minor)pagefaults 0swaps

pcregrep version 4.2 09-Jan-2006 using PCRE version 6.6 06-Feb-2006

If you will report this to PCRE team\, don't forget to specify your PCRE version. You can do this with

pcregrep --version

Also you should try to upgrade it to latest version first\, it may be already fixed.

2009/1/25 via RT Kees Cook \perlbug\-followup@&#8203;perl\.org​:

----------------------------------------------------------------- In certain situations\, there is a giant slow-down in pcre\, when compared to grep​:

$ perl -e 'print "0" x 600' | time pcregrep '0+a?0*(XY?|Z)' Command exited with non-zero status 1 3.45user 0.00system 0​:03.47elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

$ perl -e 'print "0" x 600' | time egrep '0+a?0*(XY?|Z)' Command exited with non-zero status 1 0.00user 0.00system 0​:00.00elapsed 44%CPU (0avgtext+0avgdata 0maxresident)k

Dropping "a?" makes no difference\, but illustrates the style of regex I was using. All other elements are required to hit the slow-down.

Thanks\,

-Kees

-- Alexandr Ciornii\, http​://chorny.net

p5pRT commented 15 years ago

From kees@outflux.net

On Sun\, Jan 25\, 2009 at 06​:54​:31AM -0800\, yves orton via RT wrote​:

pcregrep is a wrapper around the PCRE library which is an independent regular expression engine which supports the Perl regular expression syntax. It does not actually use perl\, or have anything to do with perl. So this ticket should be filed against that project and not here.

Ah! Sorry\, yes. While trying to reduce my original issue (that was with perl) to a small test-case\, I switched to pcre without thinking about it's origin. Sticking to Perl\, I reduced my RE to something not quite as short\, but it still shows a large speed difference compared to grep.

perl -e 'print "0" x 600' | time perl -ne 'print if /([[​:digit​:]]+[\, ])*[[​:digit​:]]+\.?[[​:digit​:]]*[[​:space​:]]*(XY?|Z)/'

Or\, shorter​:

perl -e 'print "0" x 600' | time perl -ne'print if /(a+[\, ])*0+\.?0*b*(XY?|Z)/' 2.87user 0.00system 0​:02.90elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k

vs

perl -e 'print "0" x 600' | time egrep '(a+[\, ])*0+\.?0*b*(XY?|Z)' 0.00user 0.00system 0​:00.00elapsed 57%CPU (0avgtext+0avgdata 0maxresident)k

(I will go open a pcre report as well. Thanks!)

-Kees

-- Kees Cook @​outflux.net

p5pRT commented 15 years ago

From @davidnicol

On Sat\, Jan 24\, 2009 at 4​:59 PM\, via RT Kees Cook \perlbug\-followup@&#8203;perl\.org wrote​:

----------------------------------------------------------------- In certain situations\, there is a giant slow-down

Yes. Situations causing slowness involve checking all alternative groupings. Someone wrote a paragraph in The Perl Journal a decade ago suggesting running your regexes backwards when you can't rearrange things to avoid going quadratic\, and this approach works with your inefficient case​:

$ perl -e 'print "0" x 600' | time perl -ne'print if /(a+[\, ])*0+\.?0*b*(XY?|Z)/' 6.07user 0.03system 0​:06.12elapsed 99%CPU (0avgtext+0avgdata 547584maxresident)k 0inputs+0outputs (2211major+0minor)pagefaults 0swaps

$ perl -e 'print "0" x 600' | time perl -ne'print if (reverse $_) =~ /(XY?|Z)b*0*\.?0+([\, ]a+)*/' 0.03user 0.06system 0​:00.08elapsed 113%CPU (0avgtext+0avgdata 547840maxresident)k 0inputs+0outputs (2212major+0minor)pagefaults 0swaps

The reason is\, Perl's regex engine always starts from the left\, then backtracks through all possibilities. So one can construct horrible time-sink regexes by giving lots of alternatives on the left end and obviously failing on the right end.

$ perl -e 'print "0" x 600' | time perl -ne'print if /0+0+0+0+x/' 57.45user 0.03system 0​:57.62elapsed 99%CPU (0avgtext+0avgdata 546560maxresident)k 0inputs+0outputs (2207major+0minor)pagefaults 0swaps

$ perl -e 'print "0" x 600' | time perl -ne'print if (reverse $_) =~ /x0+0+0+0+/' 0.03user 0.04system 0​:00.07elapsed 105%CPU (0avgtext+0avgdata 545792maxresident)k 0inputs+0outputs (2204major+0minor)pagefaults 0swaps

Good Perlists are supposed to know about this\, and the occasional attempt to "fix" the situation\, by for instance recognizing must-have bits of regexes and doing a first pass to verify that they are in fact in there somewhere before spending a lot of time trying to find the configuration of deck chairs that will prevent the ship from sinking\, has\, in my recollection\, been rejected at the "should I bother?" stage as something that would slow down the "general" case of regexes tuned by People Who Know What They're Doing.

In the OP's "minimal" test case\, there are two icebergs​: there must be at least one zero\, and the final capture group is not optional. These can be checked for explicitly\, in advance\, to avoid playing towers of Hanoi with the deck chairs​: $ perl -e 'print "0" x 600' | time perl -ne'print if /0.*[XZ]/ && /(a+[\, ])*?0+\.?0*b*(XY?|Z)/' 0.04user 0.04system 0​:00.07elapsed 129%CPU (0avgtext+0avgdata 546048maxresident)k 0inputs+0outputs (2205major+0minor)pagefaults 0swaps

All the benefits of doing your regex backwards\, but far less cute.

$ perl -e 'print "0" x 600' | time perl -ne'print if /x/ && /0+0+0+0+x/' 0.04user 0.03system 0​:00.07elapsed 104%CPU (0avgtext+0avgdata 545280maxresident)k 0inputs+0outputs (2202major+0minor)pagefaults 0swaps

p5pRT commented 15 years ago

From @demerphq

2009/1/26 David Nicol \davidnicol@&#8203;gmail\.com​:

Good Perlists are supposed to know about this\, and the occasional attempt to "fix" the situation\, by for instance recognizing must-have bits of regexes and doing a first pass to verify that they are in fact in there somewhere before spending a lot of time trying to find the configuration of deck chairs that will prevent the ship from sinking\, has\, in my recollection\, been rejected at the "should I bother?" stage as something that would slow down the "general" case of regexes tuned by People Who Know What They're Doing.

This is not correct. Serious efforts have been made to do exactly this.

And in fact it is exactly this optimisation that normally makes perls regex engine fail fast.

It is however unfortunate that the optimisation does not extend to using TRIE regops unless the start the pattern\, and cannot see that [XY] must be in the original pattern.

However it is very true that the perl engine\, and all those like it\, have problems with the class of pattern being discussed\, however judicious use of (?>...) should prevent the diabolical cases from coming into play.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 15 years ago

From kees@outflux.net

On Mon\, Jan 26\, 2009 at 07​:34​:19AM -0800\, yves orton via RT wrote​:

2009/1/26 David Nicol \davidnicol@&#8203;gmail\.com​:

Good Perlists are supposed to know about this\, and the occasional attempt to "fix" the situation\, by for instance recognizing must-have bits of regexes and doing a first pass to verify that they are in fact in there somewhere before spending a lot of time trying to find the configuration of deck chairs that will prevent the ship from sinking\, has\, in my recollection\, been rejected at the "should I bother?" stage as something that would slow down the "general" case of regexes tuned by People Who Know What They're Doing.

This is not correct. Serious efforts have been made to do exactly this.

And in fact it is exactly this optimisation that normally makes perls regex engine fail fast.

It is however unfortunate that the optimisation does not extend to using TRIE regops unless the start the pattern\, and cannot see that [XY] must be in the original pattern.

However it is very true that the perl engine\, and all those like it\, have problems with the class of pattern being discussed\, however judicious use of (?>...) should prevent the diabolical cases from coming into play.

Thanks for the detailed answers!

The situation I was trying to reduce came from SpamAssassin rules\, so I don't have the flexibility to reverse the text before matching. I think I can get away with some (?>) to speed it up. I also reduced one of the "*" to "{0\,2}"\, which was more correct and seemed to short-cut the problem.

Thanks again\,

-Kees

-- Kees Cook @​outflux.net

p5pRT commented 15 years ago

From @davidnicol

On Mon\, Jan 26\, 2009 at 9​:33 AM\, demerphq \demerphq@&#8203;gmail\.com wrote​:

This is not correct. Serious efforts have been made to do exactly this.

And in fact it is exactly this optimisation that normally makes perls regex engine fail fast.

It is however unfortunate that the optimisation does not extend to using TRIE regops unless the start the pattern\, and cannot see that [XY] must be in the original pattern.

However it is very true that the perl engine\, and all those like it\, have problems with the class of pattern being discussed\, however judicious use of (?>...) should prevent the diabolical cases from coming into play.

Yves

I'm glad to be corrected. Still\,

  perl -le '$_="0" x 600; print if /0+0+0+0+x/'

takes a long time\, even on 5.11.0.

How difficult would it be to start from the end and work backwards\, identifying non-optional pieces ("Anchors") and checking for them in a non-greedy way\, without regard to captures\, before bothering to start greedy matching? Or scan through to the end the first time something fails to verify that there aren't later parts that will clearly fail\, before doing any backtracking?

It seems like this is a situation where it would be difficult to get all the corner and edge cases right\, but provisionally checking ahead to see if the things we're trying to backtrack and find are actually going to help if we find them seems like it would help more than it hurts.

p5pRT commented 15 years ago

From @demerphq

2009/1/28 David Nicol \davidnicol@&#8203;gmail\.com​:

On Mon\, Jan 26\, 2009 at 9​:33 AM\, demerphq \demerphq@&#8203;gmail\.com wrote​:

This is not correct. Serious efforts have been made to do exactly this.

And in fact it is exactly this optimisation that normally makes perls regex engine fail fast.

It is however unfortunate that the optimisation does not extend to using TRIE regops unless the start the pattern\, and cannot see that [XY] must be in the original pattern.

However it is very true that the perl engine\, and all those like it\, have problems with the class of pattern being discussed\, however judicious use of (?>...) should prevent the diabolical cases from coming into play.

Yves

I'm glad to be corrected. Still\,

perl -le '$_="0" x 600; print if /0+0+0+0+x/'

Ah\, i see whats happening. Its picking "0000" as the start string\, as its longer than 'x'\, and thus it enters the pattern\, and then goes degenerate.

Whereas if you change the pattern to​:

perl -le '$_="0" x 600; print if /[0]+[0]+[0]+[0]+x/'

then it fails instantly.

Hard to say what the right thing is here beyond DFA construction.

Also notice that these are fairly contrived examples. Most sane patterns have a reasonable required string.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"