Open p5pRT opened 11 years ago
Hello\,
My understanding of how (*THEN) works is that the test below should match. The perlre page says "...this verb always matches\, and when backtracked into on failure\, it causes the regex engine to try the next alternation in the innermost enclosing group (capturing or otherwise) that has alternations." Unless I am going mad\, the examples below (one a normal group\, the other an assertion) fulfil the condition.
$ perl -e 'print (("ac" =~ /^(?=ab|ac)/)? "yes\n":"no\n")' yes $ perl -e 'print (("ac" =~ /^(?=a(*THEN)b|ac)/)? "yes\n":"no\n")' no
$ perl -e 'print (("ac" =~ /^(ab|ac)/)? "yes\n":"no\n")' yes $ perl -e 'print (("ac" =~ /^(a(*THEN)b|ac)/)? "yes\n":"no\n")' no
I discovered this because PCRE does match these patterns.
Regards\, Philip
-- Philip Hazel
Flags: category=core severity=low
Site configuration information for perl 5.16.2:
Configured by nobody at Tue Dec 11 22:56:31 CET 2012.
Summary of my perl5 (revision 5 version 16 subversion 2) configuration:
Platform:
osname=linux\, osvers=3.6.9-1-arch\, archname=i686-linux-thread-multi
uname='linux flo 3.6.9-1-arch #1 smp preempt tue dec 4 08:04:10 cet 2012 i686 gnulinux '
config_args='-des -Dusethreads -Duseshrplib -Doptimize=-march=i686 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4 -D_FORTIFY_SOURCE=2 -Dprefix=/usr -Dinstallprefix=/usr -Dvendorprefix=/usr -Dprivlib=/usr/share/perl5/core_perl -Darchlib=/usr/lib/perl5/core_perl -Dsitelib=/usr/share/perl5/site_perl -Dsitearch=/usr/lib/perl5/site_perl -Dvendorlib=/usr/share/perl5/vendor_perl -Dvendorarch=/usr/lib/perl5/vendor_perl -Dscriptdir=/usr/bin/core_perl -Dsitescript=/usr/bin/site_perl -Dvendorscript=/usr/bin/vendor_perl -Dinc_version_list=none -Dman1ext=1perl -Dman3ext=3perl -Dlddlflags=-shared -Wl\,-O1\,--sort-common\,--as-needed\,-z\,relro -Dldflags=-Wl\,-O1\,--sort-common\,--as-needed\,-z\,relro'
hint=recommended\, useposix=true\, d_sigaction=define
useithreads=define\, usemultiplicity=define
useperlio=define\, d_sfio=undef\, uselargefiles=define\, usesocks=undef
use64bitint=undef\, use64bitall=undef\, uselongdouble=undef
usemymalloc=n\, bincompat5005=undef
Compiler:
cc='cc'\, ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'\,
optimize='-march=i686 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4 -D_FORTIFY_SOURCE=2'\,
cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
ccversion=''\, gccversion='4.7.2'\, gccosandvers=''
intsize=4\, longsize=4\, ptrsize=4\, doublesize=8\, byteorder=1234
d_longlong=define\, longlongsize=8\, d_longdbl=define\, longdblsize=12
ivtype='long'\, ivsize=4\, nvtype='double'\, nvsize=8\, Off_t='off_t'\, lseeksize=8
alignbytes=4\, prototype=define
Linker and Libraries:
ld='cc'\, ldflags ='-Wl\,-O1\,--sort-common\,--as-needed\,-z\,relro -fstack-protector -L/usr/local/lib'
libpth=/usr/local/lib /lib /usr/lib
libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
libc=/lib/libc-2.16.so\, so=so\, useshrplib=true\, libperl=libperl.so
gnulibc_version='2.16'
Dynamic Linking:
dlsrc=dl_dlopen.xs\, dlext=so\, d_dlsymun=undef\, ccdlflags='-Wl\,-E -Wl\,-rpath\,/usr/lib/perl5/core_perl/CORE'
cccdlflags='-fPIC'\, lddlflags='-shared -Wl\,-O1\,--sort-common\,--as-needed\,-z\,relro -L/usr/local/lib -fstack-protector'
Locally applied patches:
@INC for perl 5.16.2: /usr/lib/perl5/site_perl /usr/share/perl5/site_perl /usr/lib/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib/perl5/core_perl /usr/share/perl5/core_perl .
Environment for perl 5.16.2: HOME=/home/ph10 LANG=en_GB LANGUAGE (unset) LC_ALL=C LC_COLLATE=C LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/ph10/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/bin/vendor_perl:/usr/bin/core_perl:. PERL_BADLANG (unset) SHELL=/bin/bash
On Fri\, Jan 25\, 2013 at 09:23:54AM -0800\, Philip Hazel wrote:
My understanding of how (*THEN) works is that the test below should match. The perlre page says "...this verb always matches\, and when backtracked into on failure\, it causes the regex engine to try the next alternation in the innermost enclosing group (capturing or otherwise) that has alternations." Unless I am going mad\, the examples below (one a normal group\, the other an assertion) fulfil the condition.
$ perl -e 'print (("ac" =~ /^(?=ab|ac)/)? "yes\n":"no\n")' yes $ perl -e 'print (("ac" =~ /^(?=a(*THEN)b|ac)/)? "yes\n":"no\n")' no
$ perl -e 'print (("ac" =~ /^(ab|ac)/)? "yes\n":"no\n")' yes $ perl -e 'print (("ac" =~ /^(a(*THEN)b|ac)/)? "yes\n":"no\n")' no
These work in 5.10.1\, but not in 5.14.1.
These are the only tests involving (*THEN) that expect a successful match\, from t/re/pat_advanced.t:
{ #Mindnumbingly simple test of (*THEN) for ("ABC"\,"BAX") { ok /A (*THEN) X | B (*THEN) C/x\, "Simple (*THEN) test"; } }
The key difference seems to be that in your tests\, the two alternations begin with the same character.
Ronald
The RT System itself - Status changed from 'new' to 'open'
On Fri\, Jan 25\, 2013 at 4:17 PM\, Ronald J Kimball \rjk@​tamias\.net wrote:
On Fri\, Jan 25\, 2013 at 09:23:54AM -0800\, Philip Hazel wrote:
My understanding of how (*THEN) works is that the test below should match. The perlre page says "...this verb always matches\, and when backtracked into on failure\, it causes the regex engine to try the next alternation in the innermost enclosing group (capturing or otherwise) that has alternations." Unless I am going mad\, the examples below (one a normal group\, the other an assertion) fulfil the condition.
$ perl -e 'print (("ac" =~ /^(?=ab|ac)/)? "yes\n":"no\n")' yes $ perl -e 'print (("ac" =~ /^(?=a(*THEN)b|ac)/)? "yes\n":"no\n")' no
$ perl -e 'print (("ac" =~ /^(ab|ac)/)? "yes\n":"no\n")' yes $ perl -e 'print (("ac" =~ /^(a(*THEN)b|ac)/)? "yes\n":"no\n")' no
These work in 5.10.1\, but not in 5.14.1.
These are the only tests involving (*THEN) that expect a successful match\, from t/re/pat_advanced.t:
\{ \#Mindnumbingly simple test of \(\*THEN\) for \("ABC"\,"BAX"\) \{ ok /A \(\*THEN\) X | B \(\*THEN\) C/x\, "Simple \(\*THEN\) test"; \} \}
The key difference seems to be that in your tests\, the two alternations begin with the same character.
This appears to be caused by the TRIE optimization (as far as I can tell)
$ perl -Mre=debug -e'print (("ac" =~ /^(a(*THEN)b|ac)/)? "yes\n":"no\n")'
Compiling REx "^(a(*THEN)b|ac)"
Final program:
1: BOL (2)
2: OPEN1 (4)
4: TRIE-EXACT[a] (14)
\ (7)
7: CUTGROUP (9)
9: EXACT \ (14)
\
This fails in the exact same manner:
$ perl -Mre=debug -e'print (("ac" =~ /^((?:a(*THEN)b)|ac)/)? "yes\n":"no\n")'
This succeeds:
$ perl -Mre=debug -e'print (("ac" =~ /^((a(*THEN)b)|ac)/)? "yes\n":"no\n")'
Compiling REx "^((a(*THEN)b)|ac)"
Final program:
1: BOL (2)
2: OPEN1 (4)
4: BRANCH (15)
5: OPEN2 (7)
7: EXACT \ (9)
9: CUTGROUP (11)
11: EXACT \ (13)
13: CLOSE2 (18)
15: BRANCH (FAIL)
16: EXACT \
On 01/25/2013 03:17 PM\, Ronald J Kimball wrote:
On Fri\, Jan 25\, 2013 at 09:23:54AM -0800\, Philip Hazel wrote:
My understanding of how (*THEN) works is that the test below should match. The perlre page says "...this verb always matches\, and when backtracked into on failure\, it causes the regex engine to try the next alternation in the innermost enclosing group (capturing or otherwise) that has alternations." Unless I am going mad\, the examples below (one a normal group\, the other an assertion) fulfil the condition.
$ perl -e 'print (("ac" =~ /^(?=ab|ac)/)? "yes\n":"no\n")' yes $ perl -e 'print (("ac" =~ /^(?=a(*THEN)b|ac)/)? "yes\n":"no\n")' no
$ perl -e 'print (("ac" =~ /^(ab|ac)/)? "yes\n":"no\n")' yes $ perl -e 'print (("ac" =~ /^(a(*THEN)b|ac)/)? "yes\n":"no\n")' no
These work in 5.10.1\, but not in 5.14.1.
These are the only tests involving (*THEN) that expect a successful match\, from t/re/pat_advanced.t:
\{ \#Mindnumbingly simple test of \(\*THEN\) for \("ABC"\,"BAX"\) \{ ok /A \(\*THEN\) X | B \(\*THEN\) C/x\, "Simple \(\*THEN\) test"; \} \}
The key difference seems to be that in your tests\, the two alternations begin with the same character.
Ronald
I bisected this problem. The offending commit is commit 2e64971a6530d2645969bc489f564bfd3ce64993 Author: David Mitchell \davem@​iabyn\.com Date: Mon May 3 13:57:58 2010 +0100
tries: don't allocate memory at runtime
This is an indirect fix for [perl #74484] Regex causing exponential runtime+mem usage
The trie runtime code was doing more SAVETMPS than FREETMPS and was thus growing a large tmps stack on heavy backtracking. Rather than fixing this directly\, I rewrote part of the trie code so that it no longer needs to allocate memory in S_regmatch (it still does in find_byclass()).
The basic issue is that multiple branches in the trie may trigger an accept state; for example:
"abcd" =~ /xyz/abcd.*X|ab.*Y|/
here\, words (branches) 2 and 3 are accept states. The original approach was\, at run time\, to create a list of accepted word numbers and the character positions of the end of each of those words. Then run the rest of the pattern for each word in the list in turn (in word index order). This requires memory for the list to be allocated and freed.
The new approach involves creating extra info at compile time; in particular\, for each word\, a pointer to the previous accepted word (if any) in the state tree. For example for the above pattern\, part of the state tree may be
q b c d 1 -> 2 -> 3 -> 4 -> 5 (#3) (#2)
(e.g. at state 1\, if the next char is 'a'\, we transition to state 2). Here\, state 3 is an accept state with word #3\, and 5 is an accept state with word #2. So we build a table indexed by word number\, which has wordinfo[2] = 3\, wordinfo[3] = 0\, thus building the word chain 2->3->0.
At run time we run the trie to completion\, and remember the word associated with the longest accept state (word #2 above). Then by following back the chain of .prev fields\, we can produce a list of all accepting words. We then iteratively find the smallest-numbered (ie LH-most) word in the chain\, and run with it. On failure and backtrack\, we find the next-smallest and so on.
Since we are no longer recording the end-position of each word in the string\, we have to recalculate this for each backtrack. We initially record the end-position of the shortest accepting word\, and given that we know the length of each word\, we can calculate the new position each time as an offset from that first word. Depending on unicode and folding\, that calculation can be cheap or expensive.
This algorithm is optimised for the typical case where there are a small number (\<= 2) accepting states.
This patch creates a new compile-time array\, trie->wordinfo[]\, indexed by word number\, which contains relevant info about each word. This also supersedes the old trie->newword[] array\, whose function of recording "overspills" of multiple words per accept state\, is now handled as part of the wordinfo[].prev chain.
On Fri\, Jan 25\, 2013 at 04:47:48PM -0700\, Karl Williamson wrote:
I bisected this problem. The offending commit is commit 2e64971a6530d2645969bc489f564bfd3ce64993 Author: David Mitchell \davem@​iabyn\.com Date: Mon May 3 13:57:58 2010 +0100
tries​: don't allocate memory at runtime This is an indirect fix for \[perl \#74484\] Regex causing exponential runtime\+mem usage
Oh joy. I'll add it to my list...
-- In the 70's we wore flares because we didn't know any better. What possible excuse does the current generation have?
Migrated from rt.perl.org#116537 (status was 'open')
Searchable as RT116537$