Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.96k stars 555 forks source link

Simple regular expression is wrongly optimised #7158

Closed p5pRT closed 13 years ago

p5pRT commented 20 years ago

Migrated from rt.perl.org#27515 (status was 'resolved')

Searchable as RT27515$

p5pRT commented 20 years ago

From @jlokier

Created by @jlokier

Try this​:

  $_ = "1x"; /^(.*)(?=x)x/ ? print "match :$1​:\n" : print "no match\n"'

It should print "match :1​:" but it actually prints "no match". Now this​:

  $_ = "1x"; /(.*)(?=x)x/ ? print "match :$1​:\n" : print "no match\n"'

It should print "match :1​:" but it actually prints "match :​:".

What's going wrong? You can see from the following "use re 'debug'" output​: The "floating substr" optimisation is broken​:

  Guessing start of match\, REx `^(.*)(?=x)x' against `1x'...   Found floating substr `x' at offset 1...   By STCLASS​: moving 0 --> 1   Guessed​: match at offset 1   Matching REx `^(.*)(?=x)x' against `x'   Setting an EVAL scope\, savestack=3   1 \<1> \ | 1​: BOL   failed...   Match failed   no match

It is equally broken with regexes of the form /^(.*(?=x))x/\, /^(.*)(?=[x])x/ and /^(.*)(?=[a-z])x/. /^(.*)(?=[^\n])x/\, /^(.*)(?=[1x])x/ and /^(.*)(?=x)./ are fine.

Have a nice day\, Thanks\, -- Jamie

Perl Info ``` Flags: category=core severity=high Site configuration information for perl v5.8.0: Configured by bhcompile' cf_email='bhcompile at Wed Aug 13 11:45:59 EDT 2003. Summary of my rderl (revision 5.0 version 8 subversion 0) configuration: Platform: osname=linux, osvers=2.4.21-1.1931.2.382.entsmp, archname=i386-linux-thread-multi uname='linux str' config_args='-des -Doptimize=-O2 -g -pipe -march=i386 -mcpu=i686 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Dotherlibdirs=/usr/lib/perl5/5.8.0 -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr' hint=recommended, useposix=true, d_sigaction=define usethreads=define use5005threads=undef' useithreads=define usemultiplicity= useperlio= d_sfio=undef uselargefiles=define usesocks=undef use64bitint=undef use64bitall=un uselongdouble= usemymalloc=, bincompat5005=undef Compiler: cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm', optimize='', cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -I/usr/include/gdbm' ccversion='', gccversion='3.2.2 20030222 (Red Hat Linux 3.2.2-5)', gccosandvers='' gccversion='3.2.2 200302' intsize=r, longsize=r, ptrsize=5, doublesize=8, byteorder=1234 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long' k', ivsize=4' ivtype='l, nvtype='double' o_nonbl', nvsize=, Off_t='', lseeksize=8 alignbytes=4, prototype=define Linker and Libraries: ld='gcc' l', ldflags =' -L/u' libpth=/usr/local/lib /lib /usr/lib libs=-lnsl -lgdbm -ldb -ldl -lm -lpthread -lc -lcrypt -lutil perllibs= libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libper gnulibc_version='2.3.2' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so', d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE' cccdlflags='-fPIC' ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5', lddlflags='s Unicode/Normalize XS/A' Locally applied patches: MAINT18379 @INC for perl v5.8.0: /usr/lib/perl5/5.8.0/i386-linux-thread-multi /usr/lib/perl5/5.8.0 /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.0 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.0 /usr/lib/perl5/vendor_perl /usr/lib/perl5/5.8.0/i386-linux-thread-multi /usr/lib/perl5/5.8.0 . Environment for perl v5.8.0: HOME=/home/jamie LANG=en_GB.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/jamie/bin PERL_BADLANG (unset) SHELL=/bin/bash dlflags='-share (unset) ```
p5pRT commented 20 years ago

From @hvds

Jamie Lokier (via RT) \perlbug\-followup@&#8203;perl\.org wrote​: :Try this​: : : $_ = "1x"; /^(.*)(?=x)x/ ? print "match :$1​:\n" : print "no match\n"' : :It should print "match :1​:" but it actually prints "no match". :Now this​: : : $_ = "1x"; /(.*)(?=x)x/ ? print "match :$1​:\n" : print "no match\n"' : :It should print "match :1​:" but it actually prints "match :​:".

Thanks for the report\, I can confirm this bug still exists with the latest development version of perl. It was actually reported before\, and there are example tests (marked as 'expected to fail') in the test suite for exactly this type of case.

The original discussion is at​:   http​://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2002-01/msg00650.html

:What's going wrong? You can see from the following "use re 'debug'" :output​: The "floating substr" optimisation is broken​:

It isn't the floating substr but the stclass optimisation that's broken; quoting from another part of the -Dr output​:   floating `x' at 0..2147483647 (checking floating) stclass `ANYOF[x]' minlen 1

That's saying​: - we must find a string "x" somewhere\, at any position - the matched text must start with something matching character class [x] - the minimum length we will match is 1

Now\, the stclass logic for /.*x/ should be​:   must start with ( [^\n] or [x] ) .. and for /.*(?=x)x/​:   must start with ( [^\n] or ([x] and [x]) )

I believe what happens is that regcomp.c​:S_study_chunk() uses flags SCF_DO_STCLASS_AND and SCF_DO_STCLASS_OR to signal what we should do\, but because it is working forward through the compiled regexp\, it goes something like​:   (start) [any] AND   .* [^\n] OR   (?=x) ([^\n] or [x]) AND   x (([^\n] or [x]) and [x]) .. but study_chunk() is quite\, er\, challenging to read through\, and I think I got stuck last time I tried to track this down.

I'll give it another go. :)

Hugo

p5pRT commented 20 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 20 years ago

From @hvds

hv@​crypt.org wrote​: :I believe what happens is that regcomp.c​:S_study_chunk() uses flags :SCF_DO_STCLASS_AND and SCF_DO_STCLASS_OR to signal what we should do\, :but because it is working forward through the compiled regexp\, it :goes something like​: : (start) [any] AND : .* [^\n] OR : (?=x) ([^\n] or [x]) AND : x (([^\n] or [x]) and [x]) :.. but study_chunk() is quite\, er\, challenging to read through\, and :I think I got stuck last time I tried to track this down.

So\, if we have an expression like​:   a?(?=b)c?(?=d)e (where a..e represent possibly overlapping character classes)\, we need to be able to construct a start class like​:   a | (b & (c | (d & e)))

I think we can achieve this by passing two classes around\, both an 'or' class and an 'and' class. Initialise to [none] and [any] respectively\, and then​: - for each subpattern\, combine its two classes with OR - if the subpattern class should be ANDed\, AND it with our 'and' class - if the subpattern should be ORed\, OR (it AND our 'and' class) with   our 'or' class

For the above example that would give​:   where style or-class and-class   ------------------------------------------   (start) and none any   a? or a any   (?=b) and a b   c? or a | (b & c) b   (?=d) and a | (b & c) b & d   e and a | (b & c) b & d & e .. giving a final class of​:   (a | (b & c)) | (b & d & e) .. which I think is correctly equivalent to the original​:   a | (b & (c | (d & e)))

If anyone can follow this\, and can either see flaws or confirm its general correctness that would be very useful.

Hugo

p5pRT commented 20 years ago

From nick.ing-simmons@elixent.com

\hv@&#8203;crypt\.org writes​:

hv@​crypt.org wrote​: ​:I believe what happens is that regcomp.c​:S_study_chunk() uses flags ​:SCF_DO_STCLASS_AND and SCF_DO_STCLASS_OR to signal what we should do\, ​:but because it is working forward through the compiled regexp\, it ​:goes something like​: ​: (start) [any] AND ​: .* [^\n] OR ​: (?=x) ([^\n] or [x]) AND ​: x (([^\n] or [x]) and [x]) ​:.. but study_chunk() is quite\, er\, challenging to read through\, and ​:I think I got stuck last time I tried to track this down.

So\, if we have an expression like​: a?(?=b)c?(?=d)e (where a..e represent possibly overlapping character classes)\, we need to be able to construct a start class like​: a | (b & (c | (d & e)))

I think we can achieve this by passing two classes around\, both an 'or' class and an 'and' class. Initialise to [none] and [any] respectively\, and then​: - for each subpattern\, combine its two classes with OR - if the subpattern class should be ANDed\, AND it with our 'and' class - if the subpattern should be ORed\, OR (it AND our 'and' class) with our 'or' class

For the above example that would give​: where style or-class and-class ------------------------------------------ (start) and none any a? or a any (?=b) and a b c? or a | (b & c) b (?=d) and a | (b & c) b & d e and a | (b & c) b & d & e .. giving a final class of​: (a | (b & c)) | (b & d & e) .. which I think is correctly equivalent to the original​: a | (b & (c | (d & e)))

If anyone can follow this\,

Not at 2nd reading\, but it smells of De Morgan

  A&B = ~(~A|~B)

and can either see flaws or confirm its general correctness that would be very useful.

Hugo

p5pRT commented 16 years ago

From p5p@spam.wizbit.be

On Mon Mar 08 09​:11​:14 2004\, jamie wrote​:

This is a bug report for perl from jamie@​shareable.org\, generated with the help of perlbug 1.34 running under perl v5.8.0.

----------------------------------------------------------------- [Please enter your report here]

Try this​:

$\_ = "1x"; /^\(\.\*\)\(?=x\)x/ ? print "match :$1&#8203;:\\n" : print "no

match\n"'

It should print "match :1​:" but it actually prints "no match".

perl-5.8.8​: no-match perl-5.8.9-tobe​: no-match perl-5.10.0​: no-match perl-blead​: no-match

Now this​:

$\_ = "1x"; /\(\.\*\)\(?=x\)x/ ? print "match :$1&#8203;:\\n" : print "no

match\n"'

It should print "match :1​:" but it actually prints "match :​:".

perl-5.8.8​: match :​: perl-5.8.9-tobe​: match :1​: perl-5.10.0​: match :1​: perl-blead​: match :1​:

What's going wrong? You can see from the following "use re 'debug'" output​: The "floating substr" optimisation is broken​:

Guessing start of match\, REx \`^\(\.\*\)\(?=x\)x' against \`1x'\.\.\.
Found floating substr \`x' at offset 1\.\.\.
By STCLASS&#8203;: moving 0 \-\-> 1
Guessed&#8203;: match at offset 1
Matching REx \`^\(\.\*\)\(?=x\)x' against \`x'
  Setting an EVAL scope\, savestack=3
   1 \<1> \<x>              |  1&#8203;:  BOL
            failed\.\.\.
Match failed
no match

It is equally broken with regexes of the form /^(.*(?=x))x/\, /^(.*)(?=[x])x/ and /^(.*)(?=[a-z])x/. /^(.*)(?=[^\n])x/\, /^(.*)(?=[1x])x/ and /^(.*)(?=x)./ are fine.

p5pRT commented 13 years ago

From @khwilliamson

This now works as of 5.13.11. --Karl Williamson

p5pRT commented 13 years ago

@khwilliamson - Status changed from 'open' to 'resolved'