Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

regexes: . different from [^\n] #11913

Closed p5pRT closed 12 years ago

p5pRT commented 12 years ago

Migrated from rt.perl.org#109206 (status was 'resolved')

Searchable as RT109206$

p5pRT commented 12 years ago

From @mauke

Created by @mauke

% perl -wle '$_ = "\n"; print $+[0] while /[^\n]*/g'
0
1

% perl -wle '$_ = "\n"; print $+[0] while /.*/g'
0

I think this is a bug because in the absence of /s '.' should match any character except newline, i.e. be equivalent to '[^\n]'. The two programs should produce identical output.

I also think the first result is correct because there are two zero-length matches in "\n", one at the beginning of the string and one at the end. In conclusion: it looks like /.*/g is broken.
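
The comparison in this report can be reproduced with a short script. This is a minimal sketch and not part of the original ticket: `match_ends` is a hypothetical helper that collects `$+[0]` for every successive `/g` match. On a perl that still has the bug, the `.*` row is expected to show only `0`, as reported above; the `[^\n]*` row should show `0 1`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the ticket): return the end offset ($+[0])
# of every successive /g match of $re against $str.
sub match_ends {
    my ($re, $str) = @_;
    my @ends;
    push @ends, $+[0] while $str =~ /$re/g;
    return @ends;
}

my $target = "\n";
for my $re (qr/[^\n]*/, qr/.*/) {
    # A qr// object stringifies to its source, e.g. (?^:.*)
    printf "%-14s => %s\n", $re, join(' ', match_ends($re, $target));
}
```

If the two rows differ, the perl in use most likely predates the fix described at the end of this thread.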

Perl Info
```
Flags:
    category=core
    severity=medium

This perlbug was built using Perl 5.12.1 - Thu Jun 3 20:09:15 CEST 2010
It is being executed now by Perl 5.14.2 - Wed Sep 28 23:40:09 CEST 2011.

Site configuration information for perl 5.14.2:

Configured by mauke at Wed Sep 28 23:40:09 CEST 2011.

Summary of my perl5 (revision 5 version 14 subversion 2) configuration:

  Platform:
    osname=linux, osvers=2.6.38-gentoo-r6, archname=i686-linux
    uname='linux nora 2.6.38-gentoo-r6 #1 preempt sat aug 6 03:05:34 cest 2011 i686 amd athlon(tm) 64 processor 3200+ authenticamd gnulinux '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -march=native -flto',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.6.1', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags ='-fstack-protector -L/usr/local/lib -O2 -flto'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.12.2.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.12.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -march=native -flto -L/usr/local/lib -fstack-protector'

Locally applied patches:
    SAVEARGV0 - disable magic open in

@INC for perl 5.14.2:
    /home/mauke/usr/local/lib/perl5/site_perl/5.14.2/i686-linux
    /home/mauke/usr/local/lib/perl5/site_perl/5.14.2
    /home/mauke/usr/local/lib/perl5/5.14.2/i686-linux
    /home/mauke/usr/local/lib/perl5/5.14.2
    /home/mauke/usr/local/lib/perl5/site_perl/5.14.1/i686-linux
    /home/mauke/usr/local/lib/perl5/site_perl/5.14.1
    /home/mauke/usr/local/lib/perl5/site_perl/5.14.0/i686-linux
    /home/mauke/usr/local/lib/perl5/site_perl/5.14.0
    /home/mauke/usr/local/lib/perl5/site_perl
    .

Environment for perl 5.14.2:
    HOME=/home/mauke
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LC_COLLATE=POSIX
    LD_LIBRARY_PATH=/home/mauke/usr/local/lib
    LOGDIR (unset)
    PATH=/home/mauke/usr/perlbrew/bin:/home/mauke/usr/local/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/4.4.5:/opt/sun-jdk-1.4.2.13/bin:/opt/sun-jdk-1.4.2.13/jre/bin:/opt/sun-jdk-1.4.2.13/jre/javaws:/opt/dmd/bin:/usr/games/bin
    PERLBREW_HOME=/home/mauke/.perlbrew
    PERLBREW_PATH=/home/mauke/usr/perlbrew/bin
    PERLBREW_ROOT=/home/mauke/usr/perlbrew
    PERLBREW_VERSION=0.27
    PERL_BADLANG (unset)
    PERL_UNICODE=SAL
    SHELL=/bin/bash
```

p5pRT commented 12 years ago

From @abigail

On Fri, Jan 27, 2012 at 06:40:09AM -0800, l.mai@web.de wrote:

# New Ticket Created by l.mai@web.de
# Please include the string: [perl #109206]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=109206 >

This is a bug report for perl from l.mai@web.de, generated with the help of perlbug 1.39 running under perl 5.14.2.

-----------------------------------------------------------------
[Please describe your issue here]

% perl -wle '$_ = "\n"; print $+[0] while /[^\n]*/g'
0
1

% perl -wle '$_ = "\n"; print $+[0] while /.*/g'
0

I think this is a bug because in the absence of /s '.' should match any character except newline, i.e. be equivalent to '[^\n]'. The two programs should produce identical output.

I also think the first result is correct because there are two zero-length matches in "\n", one at the beginning of the string and one at the end. In conclusion: it looks like /.*/g is broken.

I agree. Note that if one makes the * possessive, it does give the same answer as when using [^\n]:

  $ perl -wE '$_ = "\n"; say scalar (() = /.*/g)'
  1
  $ perl -wE '$_ = "\n"; say scalar (() = /.*+/g)'
  2
  $
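
The same counting idiom can be put into a script that checks all three forms side by side. This is only a sketch under the assumption that list-context `/g` is an acceptable stand-in for the one-liners above; the variable names are illustrative. On perls that predate the fix reported later in this thread, the greedy count is expected to be 1, while the other two should be 2.

```perl
use strict;
use warnings;

my $str = "\n";

# "() = LIST" in scalar context yields the number of elements in LIST,
# i.e. the number of /g matches.
my $greedy     = () = $str =~ /.*/g;       # plain greedy star
my $possessive = () = $str =~ /.*+/g;      # possessive star
my $class      = () = $str =~ /[^\n]*/g;   # explicit "not newline" class

print "greedy=$greedy possessive=$possessive class=$class\n";
```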

Abigail

p5pRT commented 12 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 12 years ago

From @demerphq

On 27 January 2012 15:40, l.mai@web.de <perlbug-followup@perl.org> wrote:


% perl -wle '$_ = "\n"; print $+[0] while /[^\n]*/g'
0
1

% perl -wle '$_ = "\n"; print $+[0] while /.*/g'
0

I think this is a bug because in the absence of /s '.' should match any character except newline, i.e. be equivalent to '[^\n]'. The two programs should produce identical output.

I also think the first result is correct because there are two zero-length matches in "\n", one at the beginning of the string and one at the end. In conclusion: it looks like /.*/g is broken.

This problem is caused by a broken optimisation: the ANCH_MBOL optimisation. Notice the principal difference in these two outputs:

$ perl -Mre=Debug,DUMP -wle '$_ = "\n"; print pos($_),":",$+[0] while /.*/g'
Compiling REx ".*"
Final program:
   1: STAR (3)
   2:   REG_ANY (0)
   3: END (0)
anchored(MBOL) implicit minlen 0
0:0
Freeing REx: ".*"

$ perl -Mre=Debug,DUMP -wle '$_ = "\n"; print pos($_),":",$+[0] while /[^\n]*/g'
Compiling REx "[^\n]*"
Final program:
   1: STAR (13)
   2:   ANYOF[\0-\11\13-\377][{unicode_all}] (0)
  13: END (0)
minlen 0
0:0
1:1
Freeing REx: "[^\n]*"

It is enabled by this block of code in regcomp. Notice the comment:

/* turn .* into ^.* with an implied $*=1 */

I have to admit I have not checked to see what the heck $*=1 means.

    else if ((!sawopen || !RExC_sawback) &&
        (OP(first) == STAR &&
            PL_regkind[OP(NEXTOPER(first))] == REG_ANY) &&
        !(r->extflags & RXf_ANCH) && !(RExC_seen & REG_SEEN_EVAL))
    {
        /* turn .* into ^.* with an implied $*=1 */
        const int type =
            (OP(NEXTOPER(first)) == REG_ANY)
                ? RXf_ANCH_MBOL
                : RXf_ANCH_SBOL;
        r->extflags |= type;
        r->intflags |= PREGf_IMPLICIT;
        first = NEXTOPER(first);
        goto again;
    }

The following patch disables the optimization:

Inline Patch
```diff
diff --git a/regcomp.c b/regcomp.c
index 668f8f7..12d0ac0 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -5235,7 +5235,7 @@ reStudy:
         first = NEXTOPER(first);
         goto again;
     }
-    else if ((!sawopen || !RExC_sawback) &&
+    else if (0 && (!sawopen || !RExC_sawback) &&
         (OP(first) == STAR &&
             PL_regkind[OP(NEXTOPER(first))] == REG_ANY) &&
         !(r->extflags & RXf_ANCH) && !(RExC_seen & REG_SEEN_EVAL))
```

Producing this output:

$ ./perl -Ilib -Mre=Debug,DUMP -wle '$_ = "\n"; print pos($_),":",$+[0] while /.*/g'
Compiling REx ".*"
Final program:
   1: STAR (3)
   2:   REG_ANY (0)
   3: END (0)
minlen 0
0:0
1:1
Freeing REx: ".*"

I have not committed this patch as I don't know what effects it might have; however, as it is a "conversion optimization" I would assume it can be safely disabled until the underlying logic is fixed. However, I will note that fixing it might be tricky: the relevant code is spread out over pp_hot.c and CALLREG_INTUIT_START(), and is particularly hairy anyway. It always makes me kinda cringe when I look at pp_match.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @demerphq

On 27 January 2012 16:33, demerphq <demerphq@gmail.com> wrote:

I have not committed this patch as I don't know what effects it might have,

I decided to try out the smoke-me thing, and pushed it as

smoke-me/disable_anch_mbol

Let's see what they say.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @cpansprout

On Fri Jan 27 07:33:32 2012, demerphq wrote:

/* turn .* into ^.* with an implied $*=1 */

I have to admit I have not checked to see what the heck $*=1 means.

$* doesn’t do anything anymore, unless you are using Classic::Perl.

$* = 1 puts /m on every match in 5.8, bugs aside.

What the comment means exactly by implied $*=1 I don’t know. Is it referring to /^/ meaning /^/m in split? But that couldn’t be right.

--

Father Chrysostomos

p5pRT commented 12 years ago

From @mauke

On 2012-01-27 Father Chrysostomos via RT wrote:

On Fri Jan 27 07:33:32 2012, demerphq wrote:

/* turn .* into ^.* with an implied $*=1 */

I have to admit I have not checked to see what the heck $*=1 means.

$* doesn’t do anything anymore, unless you are using Classic::Perl.

$* = 1 puts /m on every match in 5.8, bugs aside.

What the comment means exactly by implied $*=1 I don’t know. Is it referring to /^/ meaning /^/m in split? But that couldn’t be right.

It means that a regexp that starts with .* is implicitly anchored because if it doesn't match at offset 0, it won't match at offsets 1, 2, 3 ... either. /m is implied because (since .* won't cross newlines) there can be multiple possible match locations if the string contains \n. Which means you have to check every embedded \n for a match.

(Conversely, if /s is active, leading .* should generate an implicit ^ with /m off (a.k.a. \A).)

AFAICS this optimization is valid except when the target string ends with a newline. In that case .* could (and should) match, but /^/m won't. That is, "\n" =~ /^/mg only matches once.

So ... I guess the regex code should behave differently if the /^/m is implicit and \n is the last character in the target string?

(And maybe there's a missed optimization opportunity here because I don't see why this special case shouldn't trigger for [^\n]* at the beginning of a pattern.)

p5pRT commented 12 years ago

From @demerphq

On 27 January 2012 21:51, Lukas Mai <l.mai@web.de> wrote:

On 2012-01-27 Father Chrysostomos via RT wrote:

On Fri Jan 27 07:33:32 2012, demerphq wrote:

/* turn .* into ^.* with an implied $*=1 */

I have to admit I have not checked to see what the heck $*=1 means.

$* doesn’t do anything anymore, unless you are using Classic::Perl.

$* = 1 puts /m on every match in 5.8, bugs aside.

Ah, thanks. Pity the comment doesn't say "with an implied /m" instead.

What the comment means exactly by implied $*=1 I don’t know. Is it referring to /^/ meaning /^/m in split? But that couldn’t be right.

It means that a regexp that starts with .* is implicitly anchored because if it doesn't match at offset 0, it won't match at offsets 1, 2, 3 ... either. /m is implied because (since .* won't cross newlines) there can be multiple possible match locations if the string contains \n. Which means you have to check every embedded \n for a match.

Yes, right.

(Conversely, if /s is active, leading .* should generate an implicit ^ with /m off (a.k.a. \A).)

AFAICS this optimization is valid except when the target string ends with a newline. In that case .* could (and should) match, but /^/m won't. That is, "\n" =~ /^/mg only matches once.

One might argue this is the bug. It probably should match before and after as well.

So ... I guess the regex code should behave differently if the /^/m is implicit and \n is the last character in the target string?

Thing is the optimization is enabled before we ever see the string at all. It cannot depend on the contents of the string.

So we either have to figure out how to make it match properly or simply disable it.

(And maybe there's a missed optimization opportunity here because I don't see why this special case shouldn't trigger for [^\n]* at the beginning of a pattern.)

Because it isn't easy to introspect the contents of a charclass.

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @mauke

On 2012-01-28 demerphq wrote:

You say that you consider "\n" to contain only one line.

But what about "\nfoo". Does it contain one or two lines? Do you expect ^ to match after the \n in "\nfoo"? If you do then do you not agree there is an inconsistency about it not matching after the \n in "\n"?

"\nfoo" contains 1.5 lines, i.e. one complete (but empty) line and one incomplete (unterminated) line. "\nfoo" =~ /^foo/m should match, yes. I don't think there's an inconsistency because \n is only the beginning of a line if more text follows. That is, my model of /^/m is /(?:\A|(?<=\n)(?!\z))/.

Because to me that is the exact same thing as expecting /.*/ to match at the end of the string in "\n".

So really the bug here is in ^ not in .*

That doesn't match my intuitive understanding of "beginning of line".
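
The model of /^/m given in this comment can be checked against the built-in anchor with a small script. This is a sketch only: `match_starts` is a hypothetical helper, and the sample strings are chosen for illustration rather than taken from the ticket.

```perl
use strict;
use warnings;

# Hypothetical helper: start offset ($-[0]) of every /g match of $re in $str.
sub match_starts {
    my ($re, $str) = @_;
    my @starts;
    push @starts, $-[0] while $str =~ /$re/g;
    return @starts;
}

my $builtin = qr/^/m;
my $model   = qr/(?:\A|(?<=\n)(?!\z))/;   # the model quoted in this comment

for my $str ("", "\n", "\nfoo", "a\nb\n") {
    my @b = match_starts($builtin, $str);
    my @m = match_starts($model,   $str);
    (my $shown = $str) =~ s/\n/\\n/g;     # make newlines visible in the output
    printf "%-10s /^/m: [%s]  model: [%s]\n", "\"$shown\"", "@b", "@m";
}
```

If the two columns agree for every string, the model and the documented behaviour of /^/m coincide on these inputs, including the case where a trailing newline is not treated as the start of a new line.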

p5pRT commented 12 years ago

From @demerphq

On 28 January 2012 19:30, Lukas Mai <l.mai@web.de> wrote:

On 2012-01-28 demerphq wrote:

You say that you consider "\n" to contain only one line.

But what about "\nfoo". Does it contain one or two lines? Do you expect ^ to match after the \n in "\nfoo"? If you do then do you not agree there is an inconsistency about it not matching after the \n in "\n"?

"\nfoo" contains 1.5 lines, i.e. one complete (but empty) line and one incomplete (unterminated) line. "\nfoo" =~ /^foo/m should match, yes. I don't think there's an inconsistency because \n is only the beginning of a line if more text follows. That is, my model of /^/m is /(?:\A|(?<=\n)(?!\z))/.

And the docs agree with you, perlre says this:

  You may, however, wish to treat a string as a multi-line buffer, such that
  the "^" will match after any newline within the string (except if the newline
  is the last character in the string), and "$" will match before any newline.

Though I do wonder if the "except if the newline is the last character of the string" was a special case added later.

Because to me that is the exact same thing as expecting /.*/ to match at the end of the string in "\n".

So really the bug here is in ^ not in .*

That doesn't match my intuitive understanding of "beginning of line".

It is sort of a lawyer's point I guess. To me the definition (start of string or immediately after a newline) would match up with expecting /.*/ to match twice against "\n".

Anyway, it sounds like the ANCH_MBOL optimization is buggy, so do we turn it off or try to fix it somehow...

cheers, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @cpansprout

On Sat Jan 28 13:27:55 2012, demerphq wrote:

On 28 January 2012 19:30, Lukas Mai <l.mai@web.de> wrote:

On 2012-01-28 demerphq wrote:

You say that you consider "\n" to contain only one line.

But what about "\nfoo". Does it contain one or two lines? Do you expect ^ to match after the \n in "\nfoo"? If you do then do you not agree there is an inconsistency about it not matching after the \n in "\n"?

"\nfoo" contains 1.5 lines, i.e. one complete (but empty) line and one incomplete (unterminated) line. "\nfoo" =~ /^foo/m should match, yes. I don't think there's an inconsistency because \n is only the beginning of a line if more text follows. That is, my model of /^/m is /(?:\A|(?<=\n)(?!\z))/.

And the docs agree with you, perlre says this:

You may, however, wish to treat a string as a multi-line buffer, such that the "^" will match after any newline within the string (except if the newline is the last character in the string), and "$" will match before any newline.

Though I do wonder if the "except if the newline is the last character of the string" was a special case added later.

Which is, interestingly (but irrelevantly), the way JavaScript does it:

$ perl -MJE -le 'print new JE->eval(q|/\n^/m.test("\n")|)'
true

(Or enter javascript:alert(/\n^/m.test("\n")) in a web browser.)

In JavaScript, /^/m is equivalent to Perl’s /\A|(?<=[\cm\cj\x{2028}\x{2029}])/.
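
The difference between the two definitions of /^/m can be seen without a JavaScript engine by using the equivalent pattern quoted in this comment. This is a hedged sketch: `$js_bol` is an illustrative name, and the pattern is copied from the comment rather than derived from the ECMAScript specification.

```perl
use strict;
use warnings;

# Pattern given in this thread as the Perl equivalent of JavaScript's /^/m.
my $js_bol = qr/(?:\A|(?<=[\cm\cj\x{2028}\x{2029}]))/;

my $subject = "\n";

# Perl's own /^/m does not match after a newline that ends the string.
my $perl_caret = $subject =~ /\n^/m      ? 'true' : 'false';

# The JavaScript-style definition matches after any line terminator.
my $js_caret   = $subject =~ /\n$js_bol/ ? 'true' : 'false';

print "Perl /\\n^/m:          $perl_caret\n";
print "JavaScript-style /^/m: $js_caret\n";
```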

--

Father Chrysostomos

p5pRT commented 12 years ago

From @demerphq

Thanks for your report. I have fixed this in bleadperl with:

commit 21eede782bed11b0263f9bff02b9ca7b7dfcd6eb
Author: Yves Orton <demerphq@gmail.com>
Date: Sun Jan 29 00:06:23 2012 +0100

    Fix bug #109206: ANCH_MBOL with while /.*/g

    We had a fencepost error when ANCH_MBOL was enabled that meant we
    did not "see" matches at the end of string. This fixes the problem
    and adds tests.

Cheers, yves

p5pRT commented 12 years ago

@demerphq - Status changed from 'open' to 'resolved'