Closed p5pRT closed 12 years ago
    % perl -wle '$_ = "\n"; print $+[0] while /[^\n]*/g'
    0
    1
    % perl -wle '$_ = "\n"; print $+[0] while /.*/g'
    0
I think this is a bug because in the absence of /s '.' should match any character except newline, i.e. be equivalent to '[^\n]'. The two programs should produce identical output.
I also think the first result is correct because there are two zero-length matches in "\n", one at the beginning of the string and one at the end. In conclusion: it looks like /.*/g is broken.
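A small script makes the two one-liners easier to compare side by side. This is only a sketch: the helper name `match_ends` is made up for illustration, and what it prints for `/.*/` depends on the perl version (the report above shows 5.14.2 printing only 0). For `[^\n]*` the end offsets are 0 and 1.

```perl
use strict;
use warnings;

# Collect the end offset ($+[0]) of every successful /g match,
# mirroring the one-liners in the report.
# match_ends is a hypothetical helper, not part of any module.
sub match_ends {
    my ($re, $str) = @_;
    my @ends;
    push @ends, $+[0] while $str =~ /$re/g;
    return @ends;
}

for my $re (qr/[^\n]*/, qr/.*/) {
    my @ends = match_ends($re, "\n");
    print "pattern $re ends at: @ends\n";
}
```

If both lines show the same offsets, the perl in use treats the two patterns identically, which is what the report argues should happen.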
On Fri, Jan 27, 2012 at 06:40:09AM -0800, l.mai@web.de wrote:
> # New Ticket Created by l.mai@web.de
> # Please include the string: [perl #109206]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=109206 >
> This is a bug report for perl from l.mai@web.de, generated with the help of perlbug 1.39 running under perl 5.14.2.
I agree. Note that if one makes the * possessive, it does give the same answer as when using [^\n]:
    $ perl -wE '$_ = "\n"; say scalar (() = /.*/g)'
    1
    $ perl -wE '$_ = "\n"; say scalar (() = /.*+/g)'
    2
Abigail
The RT System itself - Status changed from 'new' to 'open'
On 27 January 2012 15:40, l.mai@web.de <perlbug-followup@perl.org> wrote:
> I also think the first result is correct because there are two zero-length matches in "\n", one at the beginning of the string and one at the end. In conclusion: it looks like /.*/g is broken.
This problem is caused by a broken optimisation: the ANCH_MBOL optimisation. Notice the principal difference between these two outputs:
    $ perl -Mre=Debug,DUMP -wle '$_ = "\n"; print pos($_),":",$+[0] while /.*/g'
    Compiling REx ".*"
    Final program:
       1: STAR (3)
       2:   REG_ANY (0)
       3: END (0)
    anchored(MBOL) implicit minlen 0
    0:0
    Freeing REx: ".*"

    $ perl -Mre=Debug,DUMP -wle '$_ = "\n"; print pos($_),":",$+[0] while /[^\n]*/g'
    Compiling REx "[^\n]*"
    Final program:
       1: STAR (13)
       2:   ANYOF[\0-\11\13-\377][{unicode_all}] (0)
      13: END (0)
    minlen 0
    0:0
    1:1
    Freeing REx: "[^\n]*"
It is enabled by this block of code in regcomp. Notice the comment:
/* turn .* into ^.* with an implied $*=1 */
I have to admit I have not checked to see what the heck $*=1 means.
    else if ((!sawopen || !RExC_sawback) &&
        (OP(first) == STAR &&
         PL_regkind[OP(NEXTOPER(first))] == REG_ANY) &&
        !(r->extflags & RXf_ANCH) && !(RExC_seen & REG_SEEN_EVAL))
    {
        /* turn .* into ^.* with an implied $*=1 */
        const int type =
            (OP(NEXTOPER(first)) == REG_ANY)
                ? RXf_ANCH_MBOL
                : RXf_ANCH_SBOL;
        r->extflags |= type;
        r->intflags |= PREGf_IMPLICIT;
        first = NEXTOPER(first);
        goto again;
    }
The following patch disables the optimization:
Producing this output:
    $ ./perl -Ilib -Mre=Debug,DUMP -wle '$_ = "\n"; print pos($_),":",$+[0] while /.*/g'
    Compiling REx ".*"
    Final program:
       1: STAR (3)
       2:   REG_ANY (0)
       3: END (0)
    minlen 0
    0:0
    1:1
    Freeing REx: ".*"
I have not committed this patch as I don't know what effects it might have; however, as it is a "conversion optimization" I would assume it can be safely disabled until the underlying logic is fixed. I will note that fixing it might be tricky: the relevant code is spread out over pp_hot.c and CALLREG_INTUIT_START(), and is particularly hairy anyway. It always makes me kinda cringe when I look at pp_match.
Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
On 27 January 2012 16:33, demerphq <demerphq@gmail.com> wrote:
> I have not committed this patch as I don't know what effects it might have,
I decided to try out the smoke-me thing, and pushed it as
smoke-me/disable_anch_mbol
Let's see what they say.
Yves
On Fri Jan 27 07:33:32 2012, demerphq wrote:
> /* turn .* into ^.* with an implied $*=1 */
> I have to admit I have not checked to see what the heck $*=1 means.
$* doesn't do anything anymore, unless you are using Classic::Perl.
$* = 1 puts /m on every match in 5.8, bugs aside.
What the comment means exactly by implied $*=1 I don't know. Is it referring to /^/ meaning /^/m in split? But that couldn't be right.
--
Father Chrysostomos
On 2012-01-27 Father Chrysostomos via RT wrote:
> What the comment means exactly by implied $*=1 I don't know. Is it referring to /^/ meaning /^/m in split? But that couldn't be right.
It means that a regexp that starts with .* is implicitly anchored because if it doesn't match at offset 0, it won't match at offsets 1, 2, 3 ... either. /m is implied because (since .* won't cross newlines) there can be multiple possible match locations if the string contains \n. Which means you have to check every embedded \n for a match.
(Conversely, if /s is active, leading .* should generate an implicit ^ with /m off (a.k.a. \A).)
AFAICS this optimization is valid except when the target string ends with a newline. In that case .* could (and should) match, but /^/m won't. That is, "\n" =~ /^/mg only matches once.
So ... I guess the regex code should behave differently if the /^/m is implicit and \n is the last character in the target string?
(And maybe there's a missed optimization opportunity here because I don't see why this special case shouldn't trigger for [^\n]* at the beginning of a pattern.)
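The reasoning behind the implicit anchor can be checked from Perl itself. The following is a rough sketch, not the regex engine's logic: `first_match_start` and `at_line_start` are hypothetical helpers, and `/.*b/` is just an arbitrary pattern with a leading `.*`. It shows that the leftmost match always begins at offset 0 or right after a newline, and that an explicit `/^/m` finds only one position in `"\n"`.

```perl
use strict;
use warnings;

# Where does the leftmost match of a pattern with a leading .* start?
# (hypothetical helper; /.*b/ is an arbitrary example pattern)
sub first_match_start {
    my ($str) = @_;
    return $str =~ /.*b/ ? $-[0] : undef;
}

# True if offset $pos is the start of the string or directly follows "\n".
sub at_line_start {
    my ($str, $pos) = @_;
    return $pos == 0 || substr($str, $pos - 1, 1) eq "\n";
}

for my $str ("aab", "aaa\nxxb", "\n\nb", "a\nc\n") {
    my $start = first_match_start($str);
    (my $shown = $str) =~ s/\n/\\n/g;    # make newlines visible
    if (defined $start) {
        printf "%-10s starts at %d, line start: %s\n", $shown, $start,
            at_line_start($str, $start) ? "yes" : "no";
    }
    else {
        printf "%-10s no match\n", $shown;
    }
}

# The corner case: an explicit /^/m matches only once in "\n".
my $count = () = "\n" =~ /^/mg;
print "number of /^/mg matches in a lone newline: $count\n";
```

Every match that is found starts at a line start, which is why the engine may skip all other offsets. The last line prints 1, the mismatch with `.*` that the ticket is about.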
On 27 January 2012 21:51, Lukas Mai <l.mai@web.de> wrote:
> On 2012-01-27 Father Chrysostomos via RT wrote:
>> On Fri Jan 27 07:33:32 2012, demerphq wrote:
>>> /* turn .* into ^.* with an implied $*=1 */
>>> I have to admit I have not checked to see what the heck $*=1 means.
>> $* doesn't do anything anymore, unless you are using Classic::Perl.
>> $* = 1 puts /m on every match in 5.8, bugs aside.
Ah, thanks. Pity the comment doesn't say "with an implied /m" instead.
>> What the comment means exactly by implied $*=1 I don't know. Is it referring to /^/ meaning /^/m in split? But that couldn't be right.
> It means that a regexp that starts with .* is implicitly anchored because if it doesn't match at offset 0, it won't match at offsets 1, 2, 3 ... either. /m is implied because (since .* won't cross newlines) there can be multiple possible match locations if the string contains \n. Which means you have to check every embedded \n for a match.
Yes, right.
> (Conversely, if /s is active, leading .* should generate an implicit ^ with /m off (a.k.a. \A).)
> AFAICS this optimization is valid except when the target string ends with a newline. In that case .* could (and should) match, but /^/m won't. That is, "\n" =~ /^/mg only matches once.
One might argue this is the bug. It probably should match before and after as well.
> So ... I guess the regex code should behave differently if the /^/m is implicit and \n is the last character in the target string?
Thing is, the optimization is enabled before we ever see the string at all. It cannot depend on the contents of the string.
So we either have to figure out how to make it match properly or simply disable it.
> (And maybe there's a missed optimization opportunity here because I don't see why this special case shouldn't trigger for [^\n]* at the beginning of a pattern.)
Because it isn't easy to introspect the contents of a charclass.
-- perl -Mre=debug -e "/just|another|perl|hacker/"
On 2012-01-28 demerphq wrote:
> You say that you consider "\n" to contain only one line.
> But what about "\nfoo". Does it contain one or two lines? Do you expect ^ to match after the \n in "\nfoo"? If you do then do you not agree there is an inconsistency about it not matching after the \n in "\n"?
"\nfoo" contains 1.5 lines, i.e. one complete (but empty) line and one incomplete (unterminated) line. "\nfoo" =~ /^foo/m should match, yes. I don't think there's an inconsistency because \n is only the beginning of a line if more text follows. That is, my model of /^/m is /(?:\A|(?<=\n)(?!\z))/.
> Because to me that is the exact same thing as expecting /.*/ to match at the end of the string in "\n".
> So really the bug here is in ^ not in .*
That doesn't match my intuitive understanding of "beginning of line".
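The model of /^/m given above can be compared against the built-in anchor directly. This is a sketch only: `match_offsets` is a hypothetical helper, and the sample strings are arbitrary. It lists the offsets at which each zero-width pattern matches.

```perl
use strict;
use warnings;

# Offsets at which a zero-width pattern matches, found with /g.
# match_offsets is a hypothetical helper used only for this comparison.
sub match_offsets {
    my ($re, $str) = @_;
    my @offsets;
    push @offsets, $-[0] while $str =~ /$re/g;
    return "@offsets";
}

my $builtin = qr/^/m;                      # Perl's own multi-line anchor
my $model   = qr/(?:\A|(?<=\n)(?!\z))/;    # the model quoted above

for my $str ("\n", "\nfoo", "foo\n", "a\n\nb\n") {
    (my $shown = $str) =~ s/\n/\\n/g;      # make newlines visible
    printf "%-10s builtin=[%s] model=[%s]\n", $shown,
        match_offsets($builtin, $str), match_offsets($model, $str);
}
```

For these strings the two columns agree. In particular, neither pattern matches after a newline that is the last character of the string.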
On 28 January 2012 19:30, Lukas Mai <l.mai@web.de> wrote:
> On 2012-01-28 demerphq wrote:
>> You say that you consider "\n" to contain only one line.
>> But what about "\nfoo". Does it contain one or two lines? Do you expect ^ to match after the \n in "\nfoo"? If you do then do you not agree there is an inconsistency about it not matching after the \n in "\n"?
> "\nfoo" contains 1.5 lines, i.e. one complete (but empty) line and one incomplete (unterminated) line. "\nfoo" =~ /^foo/m should match, yes. I don't think there's an inconsistency because \n is only the beginning of a line if more text follows. That is, my model of /^/m is /(?:\A|(?<=\n)(?!\z))/.
And the docs agree with you, perlre says this:
You may, however, wish to treat a string as a multi-line buffer, such that the "^" will match after any newline within the string (except if the newline is the last character in the string), and "$" will match before any newline.
Though I do wonder if the "except if the newline is the last character of the string" was a special case added later.
>> Because to me that is the exact same thing as expecting /.*/ to match at the end of the string in "\n".
>> So really the bug here is in ^ not in .*
> That doesn't match my intuitive understanding of "beginning of line".
It is sort of a lawyer's point I guess. To me the definition (start of string or immediately after a newline) would match up with expecting /.*/ to match twice against "\n".
Anyway, it sounds like the ANCH_MBOL optimization is buggy, so do we turn it off or try to fix it somehow...
cheers, Yves
On Sat Jan 28 13:27:55 2012, demerphq wrote:
> Though I do wonder if the "except if the newline is the last character of the string" was a special case added later.
Which is, interestingly (but irrelevantly), the way JavaScript does it:
$ perl -MJE -le 'print new JE->eval(q|/\n^/m.test("\n")|)'true true
(Or enter javascript:alert(/\n^/m.test("\n")) in a web browser.)
In JavaScript, /^/m is equivalent to Perl's /\A|(?<=[\cm\cj\x{2028}\x{2029}])/.
--
Father Chrysostomos
Thanks for your report. I have fixed this in bleadperl with:
commit 21eede782bed11b0263f9bff02b9ca7b7dfcd6eb
Author: Yves Orton <demerphq@gmail.com>
Date: Sun Jan 29 00:06:23 2012 +0100
Fix bug #109206: ANCH_MBOL with while /.*/g
We had a fencepost error when ANCH_MBOL was enabled that meant we
did not "see" matches at the end of string. This fixes the problem
and adds tests.
Cheers, yves
@demerphq - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#109206 (status was 'resolved')
Searchable as RT109206$