Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

Perl Regexps sometimes need mystery [^\S\s]* to work right #2128

Closed p5pRT closed 20 years ago

p5pRT commented 24 years ago

Migrated from rt.perl.org#3421 (status was 'resolved')

Searchable as RT3421$

p5pRT commented 24 years ago

From flan@desktop.com

Created by flan@desktop.com

This is a bug report for perl from flan@desktop.com, generated with the help of perlbug 1.28 running under perl v5.6.0.

-----------------------------------------------------------------

Here's what we're trying to do:

    'Yo Momma <a@aa.aa>' =~ m{
      ([-\w]+\ ){2}   # First and last name, space-separated
      \<              # Email delimiter
      [-\w.%_]+       # Email username
      \@              # Email separator
      ([-\w]+\.)+     # Email domain
      \w\w+           # Email TLD
      \>              # Email delimiter
      }x;

This should match, but it doesn't:

    perlsh> show scalar 'Yo Momma <a@aa.aa>' =~ m{ (([-\w]+\ ){2} \< [-\w.%_]+ \@ ([-\w]+\.)+ \w\w+ \>) }x
    @data = (
      ""
    );

However, adding a mystery [^\S\s]* fixes the problem:

    perlsh> show scalar 'Yo Momma <a@aa.aa>' =~ m{ (([-\w]+\ ){2} ([^\S\s]*) \< [-\w.%_]+ \@ ([-\w]+\.)+ \w\w+ \>) }x
    @data = (
      1
    );

Note that the [^\S\s] can be anything (5, p, q, ., etc.). Adding one or more characters anywhere in the email address (on the left hand side) fixes it, too. Removing the first and last name subexpression fixes it. Changing the {2} to a + fixes it. Making it {2,} fixes it. Making it {2,3}, {1,2}, etc. fixes it.
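The variations listed above can be checked side by side with a small script. This is only a sketch: the target string `Yo Momma <a@aa.aa>` is our reading of the report, and the names `$email`, `@variants` and `%matched` are made up for illustration. On a perl affected by the bug only the plain `{2}` form should fail; on a perl that has the fix every variant should report a match.

```perl
use strict;
use warnings;

# Target string as read from the report (an assumption on our part).
my $target = 'Yo Momma <a@aa.aa>';

# Shared tail of the pattern: the delimited email address.
my $email = qr{ \< [-\w.%_]+ \@ ([-\w]+\.)+ \w\w+ \> }x;

# The failing form first, then the variants the report says make it match.
my @variants = (
    [ '{2}'            => qr{ ([-\w]+\ ){2}              $email }x ],
    [ '{2} + [^\S\s]*' => qr{ ([-\w]+\ ){2} ([^\S\s]*)   $email }x ],
    [ '+'              => qr{ ([-\w]+\ )+                $email }x ],
    [ '{2,}'           => qr{ ([-\w]+\ ){2,}             $email }x ],
    [ '{2,3}'          => qr{ ([-\w]+\ ){2,3}            $email }x ],
    [ '{1,2}'          => qr{ ([-\w]+\ ){1,2}            $email }x ],
);

# Record and print the result of each variant against the same target.
my %matched;
for my $v (@variants) {
    my ($label, $re) = @$v;
    $matched{$label} = ($target =~ $re) ? 1 : 0;
    printf "%-16s %s\n", $label, $matched{$label} ? 'match' : 'NO MATCH';
}
```

Keeping the email part in one `qr//` object means the only thing that differs between runs is the quantifier on the name group, which is the part the report identifies as the trigger.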

Hope this sheds some light.

Ian

flan@​desktop.com

Perl Info

```
Flags:
    category=core
    severity=medium

Site configuration information for perl v5.6.0:

Configured by bbuchanan at Mon Jun 12 20:33:27 PDT 2000.

Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.2.5-15, archname=i686-linux
    uname='linux penseur.jumpdata.com 2.2.5-15 #1 mon apr 19 23:00:46 edt 1999 i686 unknown '
    config_args='-des -O -Dprefix=/dt/vendor -Dlocincpth=/dt/vendor/include -Dloclibpth=/dt/vendor/lib -Dvendorprefix=/dt/server/perl5 -Dvendorbin=/dt/server/bin -Dvendorscript=/dt/server/bin -Dvendorlib=/dt/server/lib/perl5/5.6.0 -Dvendorman1=/dt/server/man/man1 -Dvendorman3=/dt/server/man/man3 -Dvendor=desktop.com -Dappllib=/dt/islands/lib/perl5 -Dbincompat5005 -Accflags=-DPERL_POLLUTE'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define
    use64bitint=undef use64bitall=undef uselongdouble=undef usesocks=undef
  Compiler:
    cc='cc', optimize='-O2', gccversion=egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)
    cppflags='-DPERL_POLLUTE -fno-strict-aliasing -I/dt/vendor/include'
    ccflags ='-DPERL_POLLUTE -fno-strict-aliasing -I/dt/vendor/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    stdchar='char', d_stdstdio=define, usevfork=false
    intsize=4, longsize=4, ptrsize=4, doublesize=8
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/dt/vendor/lib'
    libpth=/dt/vendor/lib /lib /usr/lib /usr/local/lib
    libs=-lnsl -lndbm -lgdbm -ldbm -ldb -ldl -lm -lc -lposix -lcrypt
    libc=/lib/libc-2.1.1.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/dt/vendor/lib'

Locally applied patches:

@INC for perl v5.6.0:
    /dt/islands/lib/perl5
    /dt/vendor/lib/perl5/5.6.0/i686-linux
    /dt/vendor/lib/perl5/5.6.0
    /dt/vendor/lib/perl5/site_perl/5.6.0/i686-linux
    /dt/vendor/lib/perl5/site_perl/5.6.0
    /dt/vendor/lib/perl5/site_perl
    /dt/server/lib/perl5/5.6.0/i686-linux
    /dt/server/lib/perl5/5.6.0
    /dt/server/lib/perl5
    .

Environment for perl v5.6.0:
    HOME=/u/flan
    LANG=en
    LANGUAGE (unset)
    LC_ALL=en_US
    LD_LIBRARY_PATH=/dt/server/lib:/dt/vendor/lib
    LOGDIR (unset)
    PATH=/usr/bin:/bin:/usr/X11R6/bin:/usr/local/bin:/opt/bin:/usr/X11R6/bin:/u/flan/bin:/usr/X11R6/bin:/u/flan/bin
    PERL_BADLANG (unset)
    SHELL=/bin/tcsh
```

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Ian Flanigan wrote:

> This is a bug report for perl from flan@desktop.com, generated with the help of perlbug 1.28 running under perl v5.6.0.
>
> -----------------------------------------------------------------
>
> Here's what we're trying to do:

I will refrain from saying the usual, "you can't validate email addresses that way". ;-)

    use re 'debug';

    'Yo Momma <a@aa.aa>' =~ m{
      ([-\w]+\ ){2}   # First and last name, space-separated
      \<              # Email delimiter
      [-\w.%_]+       # Email username
      \@              # Email separator
      ([-\w]+\.)+     # Email domain
      \w\w+           # Email TLD
      \>              # Email delimiter
      }x;

    Compiling REx `
      ([-\w]+\ ){2}   # First and last name, space-separated
      \<              # Email delimiter
      [-\w.%_]+       # Email username
      \@              # Email separator
      ([-\w]+\.)+     # Email domain
      \w\w+           # Email TLD
      \>              # Email delimiter
      '
    size 63 first at 6
       1: CURLYX {2,2}(21)
       3: OPEN1(5)
       5: PLUS(16)
       6: ANYOF[\-0-9A-Z_a-z](0)
      15: END(0)
    floating ` <' at 1..2147483647 (checking floating) stclass `ANYOF[\-0-9A-Z_a-z]' plus minlen 12
    Guessing start of match, REx `   ([-\w]+\ ){2} # First and last name, space-separated   \<...' against `Yo Momma <a@aa.aa>'...
    Did not find floating substr ` <'...
    Match rejected by optimizer
    Freeing REx: `   ([-\w]+\ ){2} # First and last name, space-separated   \<...'

I think this bug has been reported before in different forms. It is unfixed AFAIK. But where are all the regexp nodes in the debugging output?

Simplifying:

    use re 'debug';
    'Yo' =~ m/[\w]o/;


    Compiling REx `[\w]o'
    size 13 first at 1
       1: ANYOF[0-9A-Z_a-z](11)
      10: END(0)      <------ Huh?
    anchored `o' at 1 (checking anchored) stclass `ANYOF[0-9A-Z_a-z]' minlen 2
    Guessing start of match, REx `[\w]o' against `Yo'...
    Found anchored substr `o' at offset 1...
    Does not contradict STCLASS...
    Guessed: match at offset 0
    Matching REx `[\w]o' against `Yo'
      Setting an EVAL scope, savestack=3
       0 <> <Yo>    |  1: ANYOF[0-9A-Z_a-z]
       1 <Y> <o>    | 11: EXACT <o>      <---- NOT ABOVE!!
       2 <Yo> <>    | 13: END
    Match successful!
    Freeing REx: `[\w]o'

and the equivalent:

    use re 'debug';
    'Yo' =~ m/[0-9A-Z_a-z]o/;


    Compiling REx `[0-9A-Z_a-z]o'
    size 12 first at 1
       1: ANYOF[0-9A-Z_a-z](10)
      10: EXACT <o>(12)
      12: END(0)
    anchored `o' at 1 (checking anchored) stclass `ANYOF[0-9A-Z_a-z]' minlen 2
    Guessing start of match, REx `[0-9A-Z_a-z]o' against `Yo'...
    Found anchored substr `o' at offset 1...
    Does not contradict STCLASS...
    Guessed: match at offset 0
    Matching REx `[0-9A-Z_a-z]o' against `Yo'
      Setting an EVAL scope, savestack=3
       0 <> <Yo>    |  1: ANYOF[0-9A-Z_a-z]
       1 <Y> <o>    | 10: EXACT <o>
       2 <Yo> <>    | 12: END
    Match successful!
    Freeing REx: `[0-9A-Z_a-z]o'

Any of the predefined character sets (\d, \S, etc.) used in character classes have the same effect on debugging output.

p5pRT commented 24 years ago

From @vanstyn

:This is a bug report for perl from flan@desktop.com,
:generated with the help of perlbug 1.28 running under perl v5.6.0.
:
:
:-----------------------------------------------------------------
:Here's what we're trying to do:
:
:'Yo Momma <a@aa.aa>' =~ m{
:  ([-\w]+\ ){2}   # First and last name, space-separated
:  \<              # Email delimiter
:  [-\w.%_]+       # Email username
:  \@              # Email separator
:  ([-\w]+\.)+     # Email domain
:  \w\w+           # Email TLD
:  \>              # Email delimiter
:  }x;
:
:This should match, but it doesn't [...]

The patch below fixes this problem as well.

Hugo

--- forwarded message

"MC" <mc@backwoords.org> wrote:

:The code below matches when run on version 5.005_03 but fails when run on
:version 5.06. If you remove the 's' indicated on line 8 it will match
:successfully in both versions.
:
:I can conceive no explanation for this and several others from
:comp.lang.perl.misc threads 'Mystery Regex' and 'Mystery Regex [long]' agree that
:this is indeed a bug.

A couple of shorter testcases:

    perl -wle 'print "ok" if "a,b,c" =~ /^(?:.,){2}c/'
    perl -wle 'print "ok" if "a,b,c" =~ /^(?:[^,]*,){2}c/'

The regexp engine decided that these regexps were complicated enough to look for fixed substrings (which may in itself be a bug, not sure), and spotted that ',c' was the longest fixed substring; however, it failed to take into account the {2} multiplier when determining where in the target string it should expect to find that substring, so it searched starting from offset 1 rather than offset 3.
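The arithmetic behind "offset 1 rather than offset 3" can be illustrated with a few lines of Perl. This is a simplified model of the explanation above, not code from regcomp.c; `min_start_offset` and its parameter names are hypothetical, loosely borrowed from the variable names in the patch.

```perl
use strict;
use warnings;

# Hypothetical helper: the earliest offset at which a fixed substring can
# start when it follows a group that must repeat at least $mincount times.
#   $start_in_copy : start of the substring relative to the final copy
#   $minnext       : minimum length of one copy of the group
#   $mincount      : minimum repeat count of the group
sub min_start_offset {
    my ($start_in_copy, $minnext, $mincount) = @_;
    return $start_in_copy + $minnext * ($mincount - 1);
}

# For /^(?:.,){2}c/ one copy ".," is 2 characters long, and the fixed
# substring ",c" begins 1 character into the last copy.
my $buggy   = min_start_offset(1, 2, 1);   # multiplier ignored, as in the bug
my $correct = min_start_offset(1, 2, 2);   # multiplier taken into account

# Where ",c" really sits in the subject string.
my $actual = index('a,b,c', ',c');

print "buggy=$buggy correct=$correct actual=$actual\n";
```

With the multiplier ignored, the optimizer looks at offset 1, where the subject holds `,b` instead of `,c`, and rejects a string that the full pattern would have matched.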

Attached patch passes all tests here\, including the new ones.

Hugo

Inline Patch

```diff
--- regcomp.c.old	Tue Mar 14 22:19:44 2000
+++ regcomp.c	Thu Jul 13 19:06:34 2000
@@ -901,6 +901,9 @@
             sv_catsv(data->last_found, last_str);
             data->last_end += l * (mincount - 1);
         }
+    } else {
+        /* start offset must point into the last copy */
+        data->last_start_min += minnext * (mincount - 1);
     }
 }
 /* It is counted once already... */
--- t/op/re_tests.old	Tue Jul 11 12:30:34 2000
+++ t/op/re_tests	Thu Jul 13 18:59:44 2000
@@ -751,3 +751,7 @@
 '^\S\s+aa$'m	\nx aa	y	-	-
 (^|a)b	ab	y	-	-
 ^([ab]*?)(b)?(c)$	abac	y	-$2-	--
+^(?:.,){2}c	a,b,c	y	-	-
+^(.,){2}c	a,b,c	y	$1	b,
+^(?:[^,]*,){2}c	a,b,c	y	-	-
+^([^,]*,){2}c	a,b,c	y	$1	b,
```
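The four test rows added to t/op/re_tests can also be exercised outside the core test harness. The script below is a rough standalone equivalent, assuming each row means "pattern, subject, must match, and `$1` must equal the last column when one is given"; the `@cases` layout and the `$failures` counter are our own and not part of the patch.

```perl
use strict;
use warnings;

# Each row: [ pattern, subject, expected $1 (undef when nothing is captured) ].
# The rows mirror the four lines the patch adds to t/op/re_tests.
my @cases = (
    [ qr/^(?:.,){2}c/,     'a,b,c', undef ],
    [ qr/^(.,){2}c/,       'a,b,c', 'b,'  ],
    [ qr/^(?:[^,]*,){2}c/, 'a,b,c', undef ],
    [ qr/^([^,]*,){2}c/,   'a,b,c', 'b,'  ],
);

my $failures = 0;
for my $case (@cases) {
    my ($re, $subject, $want) = @$case;

    # The pattern must match; when a capture is expected, $1 must agree.
    my $ok = $subject =~ $re;
    $ok &&= ($1 eq $want) if defined $want;

    $failures++ unless $ok;
    print(($ok ? 'ok' : 'not ok'), " - $re against '$subject'\n");
}

print "$failures failure(s)\n";
```

On a perl without the fix the script should report failures for these patterns; with the fix applied all four rows are expected to pass.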