regex optimiser wrongly rejects certain matches involving embedded comments

Perl / perl5

🐪 The Perl programming language

https://dev.perl.org/perl5/

Other

1.98k stars 559 forks source link

regex optimiser wrongly rejects certain matches involving embedded comments #12755

Closed p5pRT closed 8 years ago

p5pRT commented 11 years ago

Migrated from rt.perl.org#116639 (status was 'resolved')

Searchable as RT116639$

p5pRT commented 11 years ago

From @nwc10

Created by @nwc10

It looks like the regex optimiser is wrongly concluding "can never match" for certain patterns involving embedded comments.

Pathological patterns that put a embedded comment between an atom and its quantifier :-)

Not a regression - has been present since embedded comments were added in 5.001:

$ ~/Sandpit/5001/bin/perl -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"' Y $ ~/Sandpit/5001/bin/perl -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' n

Still present in blead. The smoking gun is the debugging output "String shorter than min possible regex match"

Compare the pattern /.b(?#Comment){2}c/ which enters the engine and matches:

$ ./perl -Dr -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"' Compiling REx ".b(?#Comment){2}c" rarest char b at 0 Final program: 1: REG_ANY (2) 2: CURLY {2\,2} (6) 4: EXACT \ (0) 6: EXACT \ (8) 8: END (0) anchored "bbc" at 1 (checking anchored) minlen 4 Enabling $` $& $' support (0x7).

EXECUTING...

Guessing start of match in sv for REx ".b(?#Comment){2}c" against "abbc" Found anchored substr "bbc" at offset 1... Guessed: match at offset 0 Matching REx ".b(?#Comment){2}c" against "abbc" 0 \<> \ | 1:REG_ANY(2) 1 \ \ | 2:CURLY {2\,2}(6) EXACT \ can match 2 times out of 2... 3 \ \ | 6: EXACT \(8) 4 \ \<> | 8: END(0) Match successful! Y Freeing REx: ".b(?#Comment){2}c"

with /ab(?#Comment){2}c/ which is (wrongly) rejected by the optimiser\, and hence never enters the engine:

$ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' Compiling REx "ab(?#Comment){2}c" rarest char b at 1 Final program: 1: CURLYM[0] {2\,2} (7) 3: EXACT \ (5) 5: SUCCEED (0) 6: NOTHING (7) 7: EXACT \ (9) 9: END (0) anchored "ababc" at 0 (checking anchored isall) minlen 5 Enabling $` $& $' support (0x7).

EXECUTING...

String shorter than min possible regex match n Freeing REx: "ab(?#Comment){2}c"

Note\, multiline comments\, using /x\, are not affected by this bug:

$ ./perl -le 'print "abbc" =~ /ab#Comment' -e '{2}c/x ? "Y" : "n"' Y

Nicholas Clark

Perl Info
``` Flags: category=core severity=low Site configuration information for perl 5.16.2: Configured by nick at Fri Dec 21 07:57:09 GMT 2012. Summary of my perl5 (revision 5 version 16 subversion 2) configuration: Commit id: db7f29e888f2e6d75a806a11ebc6caa6acd84577 Platform: osname=freebsd, osvers=7.0-stable, archname=i386-freebsd uname='freebsd plum.flirble.org 7.0-stable freebsd 7.0-stable #7: sat jul 26 20:39:26 bst 2008 root@plum.flirble.org:usrobjusrsrcsysplum i386 ' config_args='-des -Dprefix=/home/nick/Sandpit/5162' hint=recommended, useposix=true, d_sigaction=define useithreads=undef, usemultiplicity=undef useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=undef, use64bitall=undef, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include', optimize='-O', cppflags='-DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include' ccversion='', gccversion='4.2.1 20070719 [FreeBSD]', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=4, prototype=define Linker and Libraries: ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib' libpth=/usr/lib /usr/local/lib libs=-lgdbm -lm -lcrypt -lutil -lc perllibs=-lm -lcrypt -lutil -lc libc=, so=so, useshrplib=false, libperl=libperl.a gnulibc_version='' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' ' cccdlflags='-DPIC -fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector' Locally applied patches: @INC for perl 5.16.2: /home/nick/Sandpit/5162/lib/perl5/site_perl/5.16.2/i386-freebsd /home/nick/Sandpit/5162/lib/perl5/site_perl/5.16.2 /home/nick/Sandpit/5162/lib/perl5/5.16.2/i386-freebsd /home/nick/Sandpit/5162/lib/perl5/5.16.2 . Environment for perl 5.16.2: HOME=/home/nick LANG (unset) LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/nick/bin:/opt/local/bin:/opt/local/sbin:/usr/flirble/admin/bin:/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin:/home/nick/bin:/usr/local/sbin:/sbin:/usr/sbin PERL_BADLANG (unset) SHELL=/usr/local/bin/bash ```

p5pRT commented 11 years ago

From @nwc10

On Tue\, Feb 05\, 2013 at 01:26:30AM -0800\, Nicholas Clark wrote:

It looks like the regex optimiser is wrongly concluding "can never match" for certain patterns involving embedded comments.

String shorter than min possible regex match n Freeing REx: "ab(?#Comment){2}c"

OK\, seems that it's more ugly\, and affects more things. Or maybe this is a different bug:

$ ./perl -Dr -le 'print "abbc" =~ /a(?{"b"}){2}c/ ? "Y" : "n"' Compiling REx "a(?{%"b%"}){2}c" rarest char c at 1 Final program: 1: EXACT \ (3) 3: CURLYX[0] {2\,2} (8) 5: EVAL (7) 7: WHILEM (0) 8: NOTHING (9) 9: EXACT \ (11) 11: END (0) anchored "ac" at 0 (checking anchored) minlen 2 with eval Enabling $` $& $' support (0x7).

EXECUTING...

Guessing start of match in sv for REx "a(?{%"b%"}){2}c" against "abbc" Did not find anchored substr "ac"... Match rejected by optimizer n Freeing REx: "a(?{%"b%"}){2}c"

There is no fixed string "ac" in the pattern. Bad optimiser\, no cookie.

Nicholas Clark

p5pRT commented 11 years ago

From @demerphq

On 5 February 2013 10:59\, Nicholas Clark \nick@ccl4\.org wrote:

On Tue\, Feb 05\, 2013 at 01:26:30AM -0800\, Nicholas Clark wrote:

It looks like the regex optimiser is wrongly concluding "can never match" for certain patterns involving embedded comments.

String shorter than min possible regex match n Freeing REx: "ab(?#Comment){2}c"

OK\, seems that it's more ugly\, and affects more things. Or maybe this is a different bug:

$ ./perl -Dr -le 'print "abbc" =~ /a(?{"b"}){2}c/ ? "Y" : "n"' Compiling REx "a(?{%"b%"}){2}c" rarest char c at 1 Final program: 1: EXACT \ (3) 3: CURLYX[0] {2\,2} (8) 5: EVAL (7) 7: WHILEM (0) 8: NOTHING (9) 9: EXACT \ (11) 11: END (0) anchored "ac" at 0 (checking anchored) minlen 2 with eval Enabling $` $& $' support (0x7).

EXECUTING...

Guessing start of match in sv for REx "a(?{%"b%"}){2}c" against "abbc" Did not find anchored substr "ac"... Match rejected by optimizer n Freeing REx: "a(?{%"b%"}){2}c"

There is no fixed string "ac" in the pattern. Bad optimiser\, no cookie.

Bad programmer misread manual\, no cookie. :-)

The (?{...}) is a zero width pattern. It always matches\, and the return is discarded.

So the pattern is identical to:

/ac/

And is correctly rejected by the optimizer.

Perhaps you meant (??{ ... }) which is quite different.

Also\, you must be testing on an old perl no? I dont recall seeing that rarest char output in a long time....

Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 11 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 11 years ago

From @demerphq

On 5 February 2013 10:26\, Nicholas Clark \perlbug\-followup@perl\.org wrote:

# New Ticket Created by Nicholas Clark # Please include the string: [perl #116639] # in the subject line of all future correspondence about this issue. # \<URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=116639 >

This is a bug report for perl from nick@ccl4.org\, generated with the help of perlbug 1.39 running under perl 5.16.2.

----------------------------------------------------------------- [Please describe your issue here]

It looks like the regex optimiser is wrongly concluding "can never match" for certain patterns involving embedded comments.

Pathological patterns that put a embedded comment between an atom and its quantifier :-)

Not a regression - has been present since embedded comments were added in 5.001:

$ ~/Sandpit/5001/bin/perl -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"' Y $ ~/Sandpit/5001/bin/perl -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' n

Still present in blead. The smoking gun is the debugging output "String shorter than min possible regex match"

Compare the pattern /.b(?#Comment){2}c/ which enters the engine and matches:

$ ./perl -Dr -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"' Compiling REx ".b(?#Comment){2}c" rarest char b at 0 Final program: 1: REG_ANY (2) 2: CURLY {2\,2} (6) 4: EXACT \ (0) 6: EXACT \ (8) 8: END (0) anchored "bbc" at 1 (checking anchored) minlen 4 Enabling $` $& $' support (0x7).

EXECUTING...

Guessing start of match in sv for REx ".b(?#Comment){2}c" against "abbc" Found anchored substr "bbc" at offset 1... Guessed: match at offset 0 Matching REx ".b(?#Comment){2}c" against "abbc" 0 \<> \ | 1:REG_ANY(2) 1 \ \ | 2:CURLY {2\,2}(6) EXACT \ can match 2 times out of 2... 3 \ \ | 6: EXACT \(8) 4 \ \<> | 8: END(0) Match successful! Y Freeing REx: ".b(?#Comment){2}c"

with /ab(?#Comment){2}c/ which is (wrongly) rejected by the optimiser\, and hence never enters the engine:

$ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' Compiling REx "ab(?#Comment){2}c" rarest char b at 1 Final program: 1: CURLYM[0] {2\,2} (7) 3: EXACT \ (5) 5: SUCCEED (0) 6: NOTHING (7) 7: EXACT \ (9) 9: END (0) anchored "ababc" at 0 (checking anchored isall) minlen 5 Enabling $` $& $' support (0x7).

EXECUTING...

String shorter than min possible regex match n Freeing REx: "ab(?#Comment){2}c"

Note\, multiline comments\, using /x\, are not affected by this bug:

$ ./perl -le 'print "abbc" =~ /ab#Comment' -e '{2}c/x ? "Y" : "n"' Y

I consider this a documentation bug and not a feature bug. You simply cannot put comments *anywhere*\, and afaik\, never could have.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 11 years ago

From @nwc10

On Tue\, Feb 05\, 2013 at 02:16:52PM +0100\, demerphq wrote:

On 5 February 2013 10:26\, Nicholas Clark \perlbug\-followup@perl\.org wrote:

$ ./perl -Dr -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"'

Y

$ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"'

n

I consider this a documentation bug and not a feature bug. You simply cannot put comments *anywhere*\, and afaik\, never could have.

I disagree. In that it should not matter whether the first symbol of my pattern is /./ or /a/. Either both patterns should match\, or both should not. It shouldn't be the case that one is rejected by the optimiser\, and one is accepted by the engine. That's the optimiser changing behaviour\, not just run time needed.

Also\, as best I can tell from the *implementation* of comments\, the intent is that you *can* put comments anywhere. The implementation of S_nextchar() is intended to skip over comments when getting the next character of the pattern\, and as best I can eyeball from the rest of the code in regcomp.c\, the intent is to use it always as the way to get more characters.

Nicholas Clark

p5pRT commented 11 years ago

From @arc

Nicholas Clark \perlbug\-followup@perl\.org wrote:

$ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' Compiling REx "ab(?#Comment){2}c" rarest char b at 1 Final program: 1: CURLYM[0] {2\,2} (7) 3: EXACT \ (5) 5: SUCCEED (0) 6: NOTHING (7) 7: EXACT \ (9) 9: END (0) anchored "ababc" at 0 (checking anchored isall) minlen 5

This seems very wrong — the pattern's been compiled as if the initial "ab" is inside (?:). The same problem can be triggered with other quantifiers:

$ ./miniperl -Dr -le 'print "abbc" =~ /ab(?#rem)*c/ ? "Y" : "n"' Compiling REx "ab(?#rem)*c" rarest char c at 0 synthetic stclass "ANYOF_SYNTHETIC[ac][]". Final program: 1: CURLYM[0] {0\,32767} (7) 3: EXACT \ (5) 5: SUCCEED (0) 6: NOTHING (7) 7: EXACT \ (9) 9: END (0)

-- Aaron Crane ** http://aaroncrane.co.uk/

p5pRT commented 11 years ago

From @demerphq

On 5 February 2013 14:25\, Nicholas Clark \nick@ccl4\.org wrote:u

On Tue\, Feb 05\, 2013 at 02:16:52PM +0100\, demerphq wrote:

On 5 February 2013 10:26\, Nicholas Clark \perlbug\-followup@perl\.org wrote:

$ ./perl -Dr -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"'

Y

$ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"'

n

I consider this a documentation bug and not a feature bug. You simply cannot put comments *anywhere*\, and afaik\, never could have.

I disagree. In that it should not matter whether the first symbol of my pattern is /./ or /a/. Either both patterns should match\, or both should not. It shouldn't be the case that one is rejected by the optimiser\, and one is accepted by the engine. That's the optimiser changing behaviour\, not just run time needed.

I have just looked at the output. The cause of this behavior is not the optimiser\, it is the comment parsing feature behaving in ways that you dont expect. See below comments on poorly thought through quick hacks to the regex engine.

What happens is that the (?#....) is parsed out\, the {2} then applies to whatever was parsed out before. As a side effect of how the parsing behavior works this looks to the regex engine like you had done: /(?:ab)(?#Comment){2}c/.

Which if you remove the (?#...) looks like:

/(?:ab){2}c/

At which point the optimizer quite correctly expects this pattern to be the same as /ababc/.

In the case with the leading dot\, the parsing step which merges 'a' and 'b' into (?:ab) does not occur\, and the {2} applies to the b alone.

Anyway\, none of this is an optimizer bug\, its either a doc bug\, or a bug in how quantifiers are applied the (?#...) sequences\, or a bug in how comments are parsed generally. But not an optimizer bug. The optimizer doesn't even see the original pattern\, it operates on opcodes\, and comments shouldnt modify the opcodes.

Also notice that you happen to see the optimiser kick in\, which is natural\, its the first thing that might go wrong\, however even if it didn't you would still see these patterns behave differently due to the way they are constructed. Unfortunately there is no way to turn off all optimisations\, or you would see the problem. But if you inspect the opcode generated you will see it.

Take a look at the debug output:

$ ./perl -Ilib -Mre=Debug\,ALL -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' Compiling REx "ab(?#Comment){2}c" Starting first pass (sizing)

ab(?#Comme... | 1| reg | | brnc | | piec | | atom c\< | 4| inst - CURLYX | 7| piec | | atom Required size 9 nodes Starting second pass (creation) ab(?#Comme... | 1| reg | | brnc | | piec | | atom c\< | 4| tail~ EXACT \ (1) -> WHILEM | | inst - CURLYX | 7| tail~ CURLYX[2] {0\,0} (1) -> NOTHING | | piec | | atom \< | 9| tail~ CURLYX[0] {2\,2} (1) | | ~ NOTHING (6) -> EXACT | 10| lsbr~ tying lastbr CURLYX[0] {2\,2} (1) to ender END (9) offset 8 | | tail~ CURLYX[0] {2\,2} (1) | | ~ NOTHING (6) | | ~ EXACT \ (7) -> END first:> 3: EXACT \ (5) first at 3 Peep:Pos:0/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' 0:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Peep> 1: CURLYX[0] {2\,2} (6) Peep:Pos:0/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' 0:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Peep> 3: EXACT \ (5) join> 3: EXACT \ (5) Peep:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ab' 2:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Peep> 5: WHILEM (0) pre-fin:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ab' 2:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 post-fin:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ab' 2:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Peep:Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'abab' 4:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Peep> 6: NOTHING (7) Peep:Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'abab' 4:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Peep> 7: EXACT \ (9) join> 7: EXACT \ (9) pre-fin:Pos:5/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ababc' 5:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 post-fin:Pos:5/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ababc' 5:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 commit: Pos:5/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ababc' -1:0/0 *Fixed:'ababc' @ 0 Float: '' @ 0/0 minlen: 5 r->minlen:0 Final program: 1: CURLYM[0] {2\,2} (7) 3: EXACT \ (5) 5: SUCCEED (0) 6: NOTHING (7) 7: EXACT \ (9) 9: END (0) anchored "ababc" at 0 (checking anchored isall) minlen 5 r->extflags: CHECK_ALL USE_INTUIT_NOML USE_INTUIT_ML n Freeing REx: "ab(?#Comment){2}c"

$ ./perl -Ilib -Mre=Debug\,ALL -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"' Compiling REx ".b(?#Comment){2}c" Starting first pass (sizing)

.b(?#Comme... | 1| reg | | brnc | | piec | | atom b(?#Commen... | 2| piec | | atom c\< | 4| inst - CURLY | 6| piec | | atom Required size 8 nodes Starting second pass (creation) .b(?#Comme... | 1| reg | | brnc | | piec | | atom b(?#Commen... | 2| piec | | atom c\< | 4| inst - CURLY | 6| tail~ REG_ANY (1) -> CURLY | | piec | | atom \< | 8| tail~ CURLY {2\,2} (2) -> EXACT | 9| lsbr~ tying lastbr REG_ANY (1) to ender END (8) offset 7 | | tail~ REG_ANY (1) | | ~ CURLY {2\,2} (2) | | ~ EXACT \ (6) -> END first:> 1: REG_ANY (2) first at 1 Peep:Pos:0/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' 0:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Peep> 1: REG_ANY (2) commit: Pos:0/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' -1:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Peep:Pos:1/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' -1:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Peep> 2: CURLY {2\,2} (6) Peep:Pos:1/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' -1:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Peep> 4: EXACT \ (0) join> 4: EXACT \ (0) pre-fin:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'b' 2:1/1 *Fixed:'' @ 0 Float: '' @ 0/0 post-fin:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'b' 2:1/1 *Fixed:'' @ 0 Float: '' @ 0/0 Peep:Pos:3/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'bb' 3:1/1 *Fixed:'' @ 0 Float: '' @ 0/0 Peep> 6: EXACT \ (8) join> 6: EXACT \ (8) pre-fin:Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'bbc' 4:1/1 *Fixed:'' @ 0 Float: '' @ 0/0 post-fin:Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'bbc' 4:1/1 *Fixed:'' @ 0 Float: '' @ 0/0 commit: Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'bbc' -1:1/1 *Fixed:'bbc' @ 1 Float: '' @ 0/0 minlen: 4 r->minlen:0 Final program: 1: REG_ANY (2) 2: CURLY {2\,2} (6) 4: EXACT \ (0) 6: EXACT \ (8) 8: END (0) anchored "bbc" at 1 (checking anchored) minlen 4 r->extflags: USE_INTUIT_NOML USE_INTUIT_ML Guessing start of match in sv for REx ".b(?#Comment){2}c" against "abbc" Check offset min: 1 Start shift: 1 End shift 0 Real End Shift: 0 fbm_instr len=3 str=\ Found anchored substr "bbc" at offset 1... Check offset min:1 max:1 S:1 t:0 D:-1 end:4 Guessed: match at offset 0 Matching REx ".b(?#Comment){2}c" against "abbc" Setting an EVAL scope\, savestack=2 regmatch start 0 \<> \ | 1:REG_ANY(2) 1 \ \ | 2:CURLY {2\,2}(6) EXACT \ can match 2 times out of 2... Setting an EVAL scope\, savestack=11 #0 CURLY_B_max 3 \ \ | 6: EXACT \(8) 4 \ \<> | 8: END(0) Match successful! Y Freeing REx: ".b(?#Comment){2}c"

Also\, as best I can tell from the *implementation* of comments\, the intent is that you *can* put comments anywhere.

You didn't look hard enough then. Sorry.

The implementation of S_nextchar() is intended to skip over comments when getting the next character of the pattern\, and as best I can eyeball from the rest of the code in regcomp.c\, the intent is to use it always as the way to get more characters.

If so its been broken for a long time. Afaik S_nextchar is only called at the beginning or end of a parsable sequence.

This is a farily typical piece of code IMO:

case '\<': /* (?\<...) */ if (*RExC_parse == '!') paren = '\,'; else if (*RExC_parse != '=') named_capture: { /* (?\<...>) */ char *name_start; SV *svname; paren= '>';

No "nextchar" there.

And IMO\, what /some/ of the nextchars suggest as being legal\, such as a comment in between the '(' and '?' and in (?:foo) is IMO borderline to completely insane\, and not something we should use as a precedent.

We know that regex comments were a quick hack added by Larry when Jeffrey Friedl commented how they would be useful. I also know that quick hacks in the regex engine\, while sometimes quite possible\, leave nasty edge cases that are not so fun to deal with afterwards\, and probably\, had people been aware of them at the time\, blocked the idea. (Yay\, I can now say that Larry and I have something in common: adding a feature to the regex engine without thinking it through properly and without accounting for the edge cases!)

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 11 years ago

From @demerphq

On 5 February 2013 14:57\, Aaron Crane \arc@cpan\.org wrote:

Nicholas Clark \perlbug\-followup@perl\.org wrote:

$ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' Compiling REx "ab(?#Comment){2}c" rarest char b at 1 Final program: 1: CURLYM[0] {2\,2} (7) 3: EXACT \ (5) 5: SUCCEED (0) 6: NOTHING (7) 7: EXACT \ (9) 9: END (0) anchored "ababc" at 0 (checking anchored isall) minlen 5

This seems very wrong — the pattern's been compiled as if the initial "ab" is inside (?:). The same problem can be triggered with other quantifiers:

$ ./miniperl -Dr -le 'print "abbc" =~ /ab(?#rem)*c/ ? "Y" : "n"' Compiling REx "ab(?#rem)*c" rarest char c at 0 synthetic stclass "ANYOF_SYNTHETIC[ac][]". Final program: 1: CURLYM[0] {0\,32767} (7) 3: EXACT \ (5) 5: SUCCEED (0) 6: NOTHING (7) 7: EXACT \ (9) 9: END (0)

This is byproduct of how we parse strings into EXACT nodes. When we are parsing a char\, we look ahead and check if there is a quantifier next. If not then we merge them into a single opcode. (?#...) seems to not absorb the quantifier\, and as such it applies to whatever _opcode_ is before it\, in this case the EXACT \.

About the only thing I can see here likely to change is that we should absorb quantifiers on (?#...) the rest is IMO a doc patch waiting to happen\, along with maybe the removal of some nextchars() that probably cause trouble.

The docpatch should probably say "You cannot put a comment in between an atom and its quantifier."

cheers\, yves -- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 11 years ago

From @demerphq

On 5 February 2013 15:02\, demerphq \demerphq@gmail\.com wrote:

On 5 February 2013 14:57\, Aaron Crane \arc@cpan\.org wrote:

Nicholas Clark \perlbug\-followup@perl\.org wrote:

$ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' Compiling REx "ab(?#Comment){2}c" rarest char b at 1 Final program: 1: CURLYM[0] {2\,2} (7) 3: EXACT \ (5) 5: SUCCEED (0) 6: NOTHING (7) 7: EXACT \ (9) 9: END (0) anchored "ababc" at 0 (checking anchored isall) minlen 5

This seems very wrong — the pattern's been compiled as if the initial "ab" is inside (?:). The same problem can be triggered with other quantifiers:

$ ./miniperl -Dr -le 'print "abbc" =~ /ab(?#rem)*c/ ? "Y" : "n"' Compiling REx "ab(?#rem)*c" rarest char c at 0 synthetic stclass "ANYOF_SYNTHETIC[ac][]". Final program: 1: CURLYM[0] {0\,32767} (7) 3: EXACT \ (5) 5: SUCCEED (0) 6: NOTHING (7) 7: EXACT \ (9) 9: END (0)

This is byproduct of how we parse strings into EXACT nodes. When we are parsing a char\, we look ahead and check if there is a quantifier next. If not then we merge them into a single opcode. (?#...) seems to not absorb the quantifier\, and as such it applies to whatever _opcode_ is before it\, in this case the EXACT \.

About the only thing I can see here likely to change is that we should absorb quantifiers on (?#...) the rest is IMO a doc patch waiting to happen\, along with maybe the removal of some nextchars() that probably cause trouble.

The docpatch should probably say "You cannot put a comment in between an atom and its quantifier."

Also\, I'd like to point out that fixing this without modifying the docs is very likely to result in a non trivial performance hit for everyone else.

IMO comments in insane places in regexes is not worth the price.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 11 years ago

From @khwilliamson

On 02/05/2013 07:10 AM\, demerphq wrote:

IMO comments in insane places in regexes is not worth the price.

Yves

+1

p5pRT commented 11 years ago

From @rjbs

* demerphq \demerphq@gmail\.com [2013-02-05T08:58:19]

I have just looked at the output. The cause of this behavior is not the optimiser\, it is the comment parsing feature behaving in ways that you dont expect. See below comments on poorly thought through quick hacks to the regex engine.

Thanks\, this was a really interesting message.

-- rjbs

p5pRT commented 11 years ago

From @rjbs

* demerphq \demerphq@gmail\.com [2013-02-05T09:10:42]

On 5 February 2013 15:02\, demerphq \demerphq@gmail\.com wrote:

The docpatch should probably say "You cannot put a comment in between an atom and its quantifier."

Also\, I'd like to point out that fixing this without modifying the docs is very likely to result in a non trivial performance hit for everyone else.

IMO comments in insane places in regexes is not worth the price.

I think it's fine to document that this won't work. I wonder whether we can issue a warning\, too? On one hand\, I'd like to imagine that anybody doing something so crazy would know to consider that it might not work and that he or she should consult the friendly manual.

On the other hand\, the way in which this fails is pretty lame to allow to happen silently. If a warning could be issued\, I think it would be warranted. (Do we have a "too-clever" category yet? :) )

-- rjbs

p5pRT commented 11 years ago

From @demerphq

On 6 February 2013 01:04\, Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* demerphq \demerphq@gmail\.com [2013-02-05T09:10:42]

On 5 February 2013 15:02\, demerphq \demerphq@gmail\.com wrote:

The docpatch should probably say "You cannot put a comment in between an atom and its quantifier."

Also\, I'd like to point out that fixing this without modifying the docs is very likely to result in a non trivial performance hit for everyone else.

IMO comments in insane places in regexes is not worth the price.

I think it's fine to document that this won't work. I wonder whether we can issue a warning\, too?

My inclination is that we should just allow quantifiers on a (?#...) block.

Then we could warn something like "quantifier on comment\, maybe you meant something else"

On one hand\, I'd like to imagine that anybody doing something so crazy would know to consider that it might not work and that he or she should consult the friendly manual.

Unfortunately tho the existing docs on the subject are pretty sparse. So there isn't that much to consult.

On the other hand\, the way in which this fails is pretty lame to allow to happen silently. If a warning could be issued\, I think it would be warranted. (Do we have a "too-clever" category yet? :) )

Heh. :-)

I haven't checked the code yet to see what is involved\, but I imagine that implementing the warning mentioned above should not be too difficult.

Cheers\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 11 years ago

From @hvds

demerphq \demerphq@gmail\.com wrote: :On 6 February 2013 01:04\, Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote: :> I think it's fine to document that this won't work. I wonder whether we can :> issue a warning\, too? : :My inclination is that we should just allow quantifiers on a (?#...) block. : :Then we could warn something like "quantifier on comment\, maybe you :meant something else"

It is not a useful thing to use in any context; I'd prefer just to make it an error.

Allowing it with a warning only makes sense to me if we want to be able to define the effect in terms of something else\, like '(?#comment) is the same as (?=(?:|comment))'; but it isn't\, and we don't. :)

Hugo

p5pRT commented 11 years ago

From @demerphq

On 6 February 2013 01:01\, Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* demerphq \demerphq@gmail\.com [2013-02-05T08:58:19]

I have just looked at the output. The cause of this behavior is not the optimiser\, it is the comment parsing feature behaving in ways that you dont expect. See below comments on poorly thought through quick hacks to the regex engine.

Thanks\, this was a really interesting message.

You are welcome. Very happy to be of service. :-)

BTW\, I have started investigating a fix. The offending code is as follows:

case '#': /* (?#...) */ while (*RExC_parse && *RExC_parse != ')') RExC_parse++; if (*RExC_parse != ')') FAIL("Sequence (?#... not terminated"); nextchar(pRExC_state); *flagp = TRYAGAIN; return NULL;

The documentation for this construct is:

"(?#text)" A comment. The text is ignored. If the "/x" modifier enables whitespace formatting\, a simple "#" will suffice. Note that Perl closes the comment as soon as it sees a ")"\, so there is no way to put a literal ")" in the comment.

The docs do not document that you cannot put a quantifier on such a construct\, yet we know this to be the case.

Actually the behavior of this construct is quite weird. I would argue that the following is a bug:

$ ./perl -Ilib -Mre=debug -le'my $x=")"; /(?#foo${x}bar)/' Compiling REx "(?#foo${x}bar)" Final program: 1: NOTHING (2) 2: END (0) minlen 0 Matching REx "(?#foo${x}bar)" against "" 0 \<> \<> | 1:NOTHING(2) 0 \<> \<> | 2:END(0) Match successful! Freeing REx: "(?#foo${x}bar)"

I would also argue that this is a bug:

$ ./perl -Ilib -Mre=debug -le'my $x="(?#foo\0bar)"; /$x/' Compiling REx "(?#foo%0bar)" Final program: 1: NOTHING (2) 2: END (0) minlen 0 Matching REx "(?#foo%0bar)" against "" 0 \<> \<> | 1:NOTHING(2) 0 \<> \<> | 2:END(0) Match successful! Freeing REx: "(?#foo%0bar)"

I also think that this is weird:

$ ./perl -Ilib -Mre=debug -le'my $x="(?#foo\0bar)"; /$x/' Compiling REx "(?#foo%0bar)" Final program: 1: NOTHING (2) 2: END (0) minlen 0 Matching REx "(?#foo%0bar)" against "" 0 \<> \<> | 1:NOTHING(2) 0 \<> \<> | 2:END(0) Match successful! Freeing REx: "(?#foo%0bar)"

Anyway\, this construct has all the fingerprints of "poorly thought through regex change". The regex engine is incredibly complex\, interacts with the rest of the code in complicated ways so it is really easy to miss edge case like I would argue this does.

cheers\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 11 years ago

From @demerphq

On 6 February 2013 08:44\, \hv@crypt\.org wrote:

demerphq \demerphq@gmail\.com wrote: :On 6 February 2013 01:04\, Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote: :> I think it's fine to document that this won't work. I wonder whether we can :> issue a warning\, too? : :My inclination is that we should just allow quantifiers on a (?#...) block. : :Then we could warn something like "quantifier on comment\, maybe you :meant something else"

It is not a useful thing to use in any context; I'd prefer just to make it an error.

Allowing it with a warning only makes sense to me if we want to be able to define the effect in terms of something else\, like '(?#comment) is the same as (?=(?:|comment))'; but it isn't\, and we don't. :)

Thanks for the analysis\, I agree. If I can figure it out at all I will make it die instead of warn.

cheers\, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 11 years ago

From @demerphq

On 6 February 2013 09:53\, demerphq \demerphq@gmail\.com wrote:

On 6 February 2013 01:01\, Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* demerphq \demerphq@gmail\.com [2013-02-05T08:58:19]

I have just looked at the output. The cause of this behavior is not the optimiser\, it is the comment parsing feature behaving in ways that you dont expect. See below comments on poorly thought through quick hacks to the regex engine.

Thanks\, this was a really interesting message.

You are welcome. Very happy to be of service. :-)

BTW\, I have started investigating a fix. The offending code is as follows:

case '\#': /\* $?\#\.\.\.$ \*/ while $\*RExC\_parse && \*RExC\_parse \!= '$'\) RExC\_parse\+\+; if $\*RExC\_parse \!= '$'\) FAIL$"Sequence \(?\#\.\.\. not terminated"$; nextchar$pRExC\_state$; \*flagp = TRYAGAIN; return NULL;

The documentation for this construct is:

"$?\#text$" A comment\. The text is ignored\. If the "/x"

modifier enables whitespace formatting\, a simple "#" will suffice. Note that Perl closes the comment as soon as it sees a ")"\, so there is no way to put a literal ")" in the comment.

The tests for this feature are:

t/re/re_tests:^a(?#xxx){3}c aaac y $& aaac t/re/re_tests:'^a (?#xxx) (?#yyy) {3}c'x aaac y $& aaac

Which means someone thought that putting a comment in between the quantifier and the item it applies to is a good idea. I personally think they are wrong\, undertested their patch\, and never notice how horribly broken various other things would be by this patch.

Btw\, according to blame these tests are from Ilya and not Larry.

c277df42 t/op/re_tests (Ilya Zakharevich 1997-11-15 19:29:39 -0500 547) ^a(?#xxx){3}c aaac y $& aaac c277df42 t/op/re_tests (Ilya Zakharevich 1997-11-15 19:29:39 -0500 548) '^a (?#xxx) (?#yyy) {3}c'x aaac y $& aaac

I checked\, and the feature was added by Larry as part of the 5.000 "mega-patch" (a0d0e21e)

The docs from that commit are:

+=item (?#text) + +A comment. The text is ignored.

As far as I can tell there were no tests for the feature at that time.

Anyway\, I do not think it is a feasible or reasonable to fix this to be legal. And will proceed under the assumption that the tests are wrong.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 11 years ago

From @arc

Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* demerphq \demerphq@gmail\.com [2013-02-05T09:10:42]

On 5 February 2013 15:02\, demerphq \demerphq@gmail\.com wrote:

The docpatch should probably say "You cannot put a comment in between an atom and its quantifier."

IMO comments in insane places in regexes is not worth the price.

I think it's fine to document that this won't work.

I'm not entirely convinced. We already allow a comment between an atom and its quantifier in /x mode:

$ perl -Mre=debug -wE '"" =~ /ab# +c/x' Compiling REx "ab#%n+c" Final program: 1: EXACT \ (3) 3: PLUS (6) 4: EXACT \ (0) 6: EXACT \ (8) 8: END (0) anchored "ab" at 0 floating "bc" at 1..2147483647 (checking floating) minlen 3 Freeing REx: "ab#%n+c"

I think that means we have three options\, in the abstract:

1. We preserve this inconsistency between /x comments and (?#) comments

2. We start forbidding (whitespace and?) comments between an atom and a quantifier in /x mode

3. We start permitting (?#) comments between an atom and its quantifier

I believe option 2 risks breaking existing code.

I'm happy to look at doing option 3\, but I probably won't have chance till the weekend.

I wonder whether we can issue a warning\, too?

If we can detect the situation well enough to issue a warning\, wouldn't that imply we could instead skip past an embedded (?#) comment when parsing? At least\, that's my impression to date from reading regcomp.c.

On one hand\, I'd like to imagine that anybody doing something so crazy would know to consider that it might not work and that he or she should consult the friendly manual.

My initial response is to agree that it's crazy\, but on reflection\, I think at least the /x case is actually useful. I've seen code formatted like this before:

$line =~ m{ ^ # beginning of line (?: \s* \w+)+ # keywords $ # end of line }x

at which point it's not much of a stretch for someone to write that like this:

$line =~ m{ ^ # beginning of line (?: \s* \w+) # keyword + # ... one or more times $ # end of line }x

which has whitespace and a comment before the "one or more times" quantifier.

-- Aaron Crane ** http://aaroncrane.co.uk/

p5pRT commented 11 years ago

From Eirik-Berg.Hanssen@allverden.no

On Wed\, Feb 6\, 2013 at 10:08 AM\, demerphq \demerphq@gmail\.com wrote:

Anyway\, I do not think it is a feasible or reasonable to fix this to be legal. And will proceed under the assumption that the tests are wrong.

The way you describe it\, and given my ignorance of the implementation\, the odd thing is it seems to work with braceless comments (and /x):

eirik@bluebird[10:38:44]~$ perl use 5.014; use strict; use warnings;

say "/x:\t"\, "food" =~ / fo # then {2} # twice d /x ? "match" : "miss";

say "(?#):\t"\, "food" =~ /fo(?#then){2}(?#twice)d/ ? "match" : "miss";

say "both:\t"\, "food" =~ / fo (?# then) {2} (?# twice) d /x ? "match" : "miss"; __END__ /x: match (?#): miss both: miss eirik@bluebird[10:38:54]~$

Couldn't these be expected – and made? – to parse identically?

Eirik

p5pRT commented 11 years ago

From @demerphq

On 6 February 2013 10:43\, Eirik Berg Hanssen \Eirik\-Berg\.Hanssen@allverden\.no wrote:

On Wed\, Feb 6\, 2013 at 10:08 AM\, demerphq \demerphq@gmail\.com wrote:

Anyway\, I do not think it is a feasible or reasonable to fix this to be legal. And will proceed under the assumption that the tests are wrong.

The way you describe it\, and given my ignorance of the implementation\, the odd thing is it seems to work with braceless comments (and /x):

eirik@bluebird[10:38:44]~$ perl use 5.014; use strict; use warnings;

say "/x:\t"\, "food" =~ / fo # then {2} # twice d /x ? "match" : "miss";

say "(?#):\t"\, "food" =~ /fo(?#then){2}(?#twice)d/ ? "match" : "miss";

say "both:\t"\, "food" =~ / fo (?# then) {2} (?# twice) d /x ? "match" : "miss"; __END__ /x: match (?#): miss both: miss eirik@bluebird[10:38:54]~$

Couldn't these be expected – and made? – to parse identically?

IMO not very easily no.

You have to understand how the parser works\, but basically the code we would strip off the # style comments is the same code that handles EXACT so it is easy to handle there.

However the code that handles (?#...) is the same code that handles (?:...) and any other such compound structure. Each of those expressions is treated an independent sub expression. The code makes many assumptions that when it enters the code that parses an independent subexpression that anything before it is\, well\, independent. :-) And this whole thing with quantifiers breaks that assumption\, with predictable results.

cheers\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 11 years ago

From @demerphq

On 6 February 2013 10:31\, Aaron Crane \arc@cpan\.org wrote:

Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* demerphq \demerphq@gmail\.com [2013-02-05T09:10:42]

On 5 February 2013 15:02\, demerphq \demerphq@gmail\.com wrote:

The docpatch should probably say "You cannot put a comment in between an atom and its quantifier."

IMO comments in insane places in regexes is not worth the price.

I think it's fine to document that this won't work.

I'm not entirely convinced. We already allow a comment between an atom and its quantifier in /x mode:

Yep. But that is handled by the same code that handles EXACT.

See the mail I just sent to Eirik for more details but also take a look at the commentary at the bottom of this mail.

$ perl -Mre=debug -wE '"" =~ /ab# +c/x' Compiling REx "ab#%n+c" Final program: 1: EXACT \ (3) 3: PLUS (6) 4: EXACT \ (0) 6: EXACT \ (8) 8: END (0) anchored "ab" at 0 floating "bc" at 1..2147483647 (checking floating) minlen 3 Freeing REx: "ab#%n+c"

I think that means we have three options\, in the abstract:

1. We preserve this inconsistency between /x comments and (?#) comments

2. We start forbidding (whitespace and?) comments between an atom and a quantifier in /x mode

3. We start permitting (?#) comments between an atom and its quantifier

I believe option 2 risks breaking existing code.

I'm happy to look at doing option 3\, but I probably won't have chance till the weekend.

I dont believe that 3 is a good idea\, 2 is unnecessary. I think that 1 is a good idea when combined with the following:

Basically the fix I want to make is: treat a quantifier following a (?#...) as applying to that construct\, and DIE when we encounter it.

IMO the only thing this might break is broken code already\, and we will break it in a way that prevents data loss or mismatches.

I wonder whether we can issue a warning\, too?

If we can detect the situation well enough to issue a warning\, wouldn't that imply we could instead skip past an embedded (?#) comment when parsing? At least\, that's my impression to date from reading regcomp.c.

Nope I dont think so. And I think the code to do so would be horrible\, or inefficient\, or both. :-)

On one hand\, I'd like to imagine that anybody doing something so crazy would know to consider that it might not work and that he or she should consult the friendly manual.

My initial response is to agree that it's crazy\, but on reflection\, I think at least the /x case is actually useful. I've seen code formatted like this before:

$line =~ m{ ^ # beginning of line (?: \s* \w+)+ # keywords $ # end of line }x

Yeah\, this is no problem

at which point it's not much of a stretch for someone to write that like this:

$line =~ m{ ^ # beginning of line (?: \s* \w+) # keyword + # ... one or more times $ # end of line }x

which has whitespace and a comment before the "one or more times" quantifier.

This is also not a problem\, and I have no intention to change it.

The problem is strictly the use of (?#...) in between an atom and its quantifier. I have outside of the code in core never seen anybody use (?#...) at all\, ever\, in code. No doubt there is someone who does so\, but I bet they are very few.

On the other hand /x with # style escapes is widely used\, is not a problem and IMO should not change.

Commentary: To illustrate the core problem here take a look at the output of this command (trimmed for brevity). You might want to use a fixed width font to view it. Each of the terms "reg"\, "brnc"\, "piec"\, "atom" refers to different routine in the parsing of the pattern. Notice the routine that handles the (?#...) is called "atom". The point here is that once you enter "atom" you have terminated any previous "atom"\, and once you terminate an atom any quantifier encountered is assumed to apply to the previous *atom*. The only way I see fixing this is NOT merging characters into an EXACT atom\, however that will definitely have performance and memory consequences - we stopped doing this because it was a significant performance hit\, and massively complicates matching unicode. I also see it probably breaking (?:foo)QUANTIFIER.

$ ./perl -Ilib -Mre=Debug\,ALL -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"' Compiling REx ".b(?#Comment){2}c" Starting first pass (sizing)

.b(?#Comme... | 1| reg | | brnc | | piec | | atom b(?#Commen... | 2| piec | | atom c\< | 4| inst - CURLY | 6| piec | | atom Required size 8 nodes Starting second pass (creation) .b(?#Comme... | 1| reg | | brnc | | piec | | atom b(?#Commen... | 2| piec | | atom c\< | 4| inst - CURLY | 6| tail~ REG_ANY (1) -> CURLY | | piec | | atom \< | 8| tail~ CURLY {2\,2} (2) -> EXACT | 9| lsbr~ tying lastbr REG_ANY (1) to ender END (8) offset 7 | | tail~ REG_ANY (1) | | ~ CURLY {2\,2} (2) | | ~ EXACT \ (6) -> END [....] Final program: 1: REG_ANY (2) 2: CURLY {2\,2} (6) 4: EXACT \ (0) 6: EXACT \ (8) 8: END (0) anchored "bbc" at 1 (checking anchored) minlen 4

Anyway\, if you can see an easy fix then of course go for it. But I will continue to work on the simple fix that I know will not have nasty side-effects.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 11 years ago

From @nwc10

On Wed\, Feb 06\, 2013 at 11:34:50AM +0100\, demerphq wrote:

On 6 February 2013 10:31\, Aaron Crane \arc@cpan\.org wrote:

Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* demerphq \demerphq@gmail\.com [2013-02-05T09:10:42]

On 5 February 2013 15:02\, demerphq \demerphq@gmail\.com wrote:

The docpatch should probably say "You cannot put a comment in between an atom and its quantifier."

IMO comments in insane places in regexes is not worth the price.

Specifically\, embedded comments. In that comments in /x mode don't seem to have the price tag.

I think it's fine to document that this won't work.

I'm not entirely convinced. We already allow a comment between an atom and its quantifier in /x mode:

Yep. But that is handled by the same code that handles EXACT.

I think that means we have three options\, in the abstract:

1. We preserve this inconsistency between /x comments and (?#) comments

2. We start forbidding (whitespace and?) comments between an atom and a quantifier in /x mode

3. We start permitting (?#) comments between an atom and its quantifier

I believe option 2 risks breaking existing code.

I'm happy to look at doing option 3\, but I probably won't have chance till the weekend.

I dont believe that 3 is a good idea\, 2 is unnecessary. I think that 1 is a good idea when combined with the following:

Basically the fix I want to make is: treat a quantifier following a (?#...) as applying to that construct\, and DIE when we encounter it.

That might be as simple as treating embedded comments the same way as embedded pattern match modifiers:

$ perl -le '/ab(?i)+c/' Quantifier follows nothing in regex; marked by \<-- HERE in m/ab(?i)+ \<-- HERE c/ at -e line 1.

(Might be a bit more tricky to get an error that says "comment" for comments and "nothing" for embedded pattern match modifiers\, but possibly it makes just as much sense to keep the word "nothing" for comments)

IMO the only thing this might break is broken code already\, and we will break it in a way that prevents data loss or mismatches.

Yes. Right now\, I feel that the most bonkers thing is that the (mis)parsing differs depending on whether the preceding thing is multiple literals\, or a single literal. Consistently dieing on /.b(?#comment)*/ and /ab(?#comment)*/ (etc) would be an improvement.

I wonder whether we can issue a warning\, too?

If we can detect the situation well enough to issue a warning\, wouldn't that imply we could instead skip past an embedded (?#) comment when parsing? At least\, that's my impression to date from reading regcomp.c.

Nope I dont think so. And I think the code to do so would be horrible\, or inefficient\, or both. :-)

It wouldn't surprise me that it's much easier to detect the problem (after it's too late to parse it correctly)\, than to detect it early enough to parse correctly\, or to fix up the parse after the event.

The problem is strictly the use of (?#...) in between an atom and its quantifier. I have outside of the code in core never seen anybody use (?#...) at all\, ever\, in code. No doubt there is someone who does so\, but I bet they are very few.

On the other hand /x with # style escapes is widely used\, is not a problem and IMO should not change.

Agree. In that I feel that it's perfectly reasonable to document that (?#) style has limitations on where it can be used.

Anyway\, if you can see an easy fix then of course go for it. But I will continue to work on the simple fix that I know will not have nasty side-effects.

Yes\, I'm not going to object to Aaron having fun grovelling in the code either. :-)

Nicholas Clark

p5pRT commented 11 years ago

From @demerphq

On 6 February 2013 12:10\, Nicholas Clark \nick@ccl4\.org wrote:

On Wed\, Feb 06\, 2013 at 11:34:50AM +0100\, demerphq wrote:

On 6 February 2013 10:31\, Aaron Crane \arc@cpan\.org wrote:

Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* demerphq \demerphq@gmail\.com [2013-02-05T09:10:42]

On 5 February 2013 15:02\, demerphq \demerphq@gmail\.com wrote:

The docpatch should probably say "You cannot put a comment in between an atom and its quantifier."

IMO comments in insane places in regexes is not worth the price.

Specifically\, embedded comments. In that comments in /x mode don't seem to have the price tag.

Exactly.

I think it's fine to document that this won't work.

I'm not entirely convinced. We already allow a comment between an atom and its quantifier in /x mode:

Yep. But that is handled by the same code that handles EXACT.

I think that means we have three options\, in the abstract:

1. We preserve this inconsistency between /x comments and (?#) comments

2. We start forbidding (whitespace and?) comments between an atom and a quantifier in /x mode

3. We start permitting (?#) comments between an atom and its quantifier

I believe option 2 risks breaking existing code.

I'm happy to look at doing option 3\, but I probably won't have chance till the weekend.

I dont believe that 3 is a good idea\, 2 is unnecessary. I think that 1 is a good idea when combined with the following:

Basically the fix I want to make is: treat a quantifier following a (?#...) as applying to that construct\, and DIE when we encounter it.

That might be as simple as treating embedded comments the same way as embedded pattern match modifiers:

$ perl -le '/ab(?i)+c/' Quantifier follows nothing in regex; marked by \<-- HERE in m/ab(?i)+ \<-- HERE c/ at -e line 1.

(Might be a bit more tricky to get an error that says "comment" for comments and "nothing" for embedded pattern match modifiers\, but possibly it makes just as much sense to keep the word "nothing" for comments)

Ah good catch. I guess that is also there for things like /(|)*/

IMO the only thing this might break is broken code already\, and we will break it in a way that prevents data loss or mismatches.

Yes. Right now\, I feel that the most bonkers thing is that the (mis)parsing differs depending on whether the preceding thing is multiple literals\, or a single literal. Consistently dieing on /.b(?#comment)*/ and /ab(?#comment)*/ (etc) would be an improvement.

Indeed. I also have a suspicion that an exhaustive search would find other "interesting" oddities related to this.

I suspect there is deeper fu involved here that has not been remarked on in this thread\, as the whole thing with

/(?#$foo)/

where $foo is not expanded is very strange\, and cant be accounted for as far as I know in the regex engine alone.

I wonder whether we can issue a warning\, too?

If we can detect the situation well enough to issue a warning\, wouldn't that imply we could instead skip past an embedded (?#) comment when parsing? At least\, that's my impression to date from reading regcomp.c.

Nope I dont think so. And I think the code to do so would be horrible\, or inefficient\, or both. :-)

It wouldn't surprise me that it's much easier to detect the problem (after it's too late to parse it correctly)\, than to detect it early enough to parse correctly\, or to fix up the parse after the event.

Thanks. That is a much more succinct description of what I was trying to say than what I managed to type out. :-)

The problem is strictly the use of (?#...) in between an atom and its quantifier. I have outside of the code in core never seen anybody use (?#...) at all\, ever\, in code. No doubt there is someone who does so\, but I bet they are very few.

On the other hand /x with # style escapes is widely used\, is not a problem and IMO should not change.

Agree. In that I feel that it's perfectly reasonable to document that (?#) style has limitations on where it can be used.

Anyway\, if you can see an easy fix then of course go for it. But I will continue to work on the simple fix that I know will not have nasty side-effects.

Yes\, I'm not going to object to Aaron having fun grovelling in the code either. :-)

Especially if he proves me wrong. :-) (secretly this thread is all part of our evil master plan to get Aaron hacking on the regex engine more... Oh wait did I type that...)

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 11 years ago

From @rjbs

* demerphq \demerphq@gmail\.com [2013-02-06T03:54:55]

On 6 February 2013 08:44\, \hv@crypt\.org wrote:

Allowing it with a warning only makes sense to me if we want to be able to define the effect in terms of something else\, like '(?#comment) is the same as (?=(?:|comment))'; but it isn't\, and we don't. :)

Thanks for the analysis\, I agree. If I can figure it out at all I will make it die instead of warn.

Agreed\, too\, thanks Hugo.

-- rjbs

p5pRT commented 11 years ago

From @demerphq

On 7 February 2013 00:58\, Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* demerphq \demerphq@gmail\.com [2013-02-06T03:54:55]

On 6 February 2013 08:44\, \hv@crypt\.org wrote:

Allowing it with a warning only makes sense to me if we want to be able to define the effect in terms of something else\, like '(?#comment) is the same as (?=(?:|comment))'; but it isn't\, and we don't. :)

Thanks for the analysis\, I agree. If I can figure it out at all I will make it die instead of warn.

Agreed\, too\, thanks Hugo.

A patch to implement this has been pushed as: smoke-me/regex_comment_quantifier_fix

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 11 years ago

From @tonycoz

On Thu Feb 07 22:11:16 2013\, demerphq wrote:

On 7 February 2013 00:58\, Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* demerphq \demerphq@gmail\.com [2013-02-06T03:54:55]

On 6 February 2013 08:44\, \hv@crypt\.org wrote:

Allowing it with a warning only makes sense to me if we want to be able to define the effect in terms of something else\, like '(?#comment) is the same as (?=(?:|comment))'; but it isn't\, and we don't. :)

Thanks for the analysis\, I agree. If I can figure it out at all I will make it die instead of warn.

Agreed\, too\, thanks Hugo.

A patch to implement this has been pushed as: smoke-me/regex_comment_quantifier_fix

Did this change smoke cleanly for you?

I don't see anything at

http://perl.develop-help.com/?b=smoke-me%2Fregex_comment_quantifier_fix

that looks like it was caused by this patch.

This branch no longer rebases cleanly on blead.

Tony

p5pRT commented 11 years ago

From @tonycoz

On Wed Feb 06 15:59:37 2013\, perl.p5p@rjbs.manxome.org wrote:

* demerphq \demerphq@gmail\.com [2013-02-06T03:54:55]

On 6 February 2013 08:44\, \hv@crypt\.org wrote:

Allowing it with a warning only makes sense to me if we want to be able to define the effect in terms of something else\, like '(?#comment) is the same as (?=(?:|comment))'; but it isn't\, and we don't. :)

Thanks for the analysis\, I agree. If I can figure it out at all I will make it die instead of warn.

Agreed\, too\, thanks Hugo.

Karl deprecated at least some of this behaviour in 000947ada5b027f394ee63b5166df8c06b64a74e.

Is the intent to ban it entirely in 5.20\, or wait longer?

Tony

p5pRT commented 11 years ago

From @khwilliamson

On 11/14/2013 04:04 PM\, Tony Cook via RT wrote:

On Wed Feb 06 15:59:37 2013\, perl.p5p@rjbs.manxome.org wrote:

* demerphq \demerphq@gmail\.com [2013-02-06T03:54:55]

On 6 February 2013 08:44\, \hv@crypt\.org wrote:

Allowing it with a warning only makes sense to me if we want to be able to define the effect in terms of something else\, like '(?#comment) is the same as (?=(?:|comment))'; but it isn't\, and we don't. :)

Thanks for the analysis\, I agree. If I can figure it out at all I will make it die instead of warn.

Agreed\, too\, thanks Hugo.

Karl deprecated at least some of this behaviour in 000947ada5b027f394ee63b5166df8c06b64a74e.

Is the intent to ban it entirely in 5.20\, or wait longer?

Some releases ago we agreed to having all deprecations be at least 2 cycles\, because our release dates may not mesh with bundlers like debian.

p5pRT commented 11 years ago

From @demerphq

Should I investigate a new patch?

On 10 July 2013 08:43\, Tony Cook via RT \perlbug\-followup@perl\.org wrote:

On Thu Feb 07 22:11:16 2013\, demerphq wrote:

On 7 February 2013 00:58\, Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* demerphq \demerphq@gmail\.com [2013-02-06T03:54:55]

On 6 February 2013 08:44\, \hv@crypt\.org wrote:

Allowing it with a warning only makes sense to me if we want to be able to define the effect in terms of something else\, like '(?#comment) is the same as (?=(?:|comment))'; but it isn't\, and we don't. :)

Thanks for the analysis\, I agree. If I can figure it out at all I will make it die instead of warn.

Agreed\, too\, thanks Hugo.

A patch to implement this has been pushed as: smoke-me/regex_comment_quantifier_fix

Did this change smoke cleanly for you?

I don't see anything at

http://perl.develop-help.com/?b=smoke-me%2Fregex_comment_quantifier_fix

that looks like it was caused by this patch.

This branch no longer rebases cleanly on blead.

Tony

--- via perlbug: queue: perl5 status: open https://rt-archive.perl.org/perl5/Ticket/Display.html?id=116639

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 10 years ago

From @tonycoz

On Fri Nov 15 01:21:14 2013\, demerphq wrote:

Should I investigate a new patch?

It can wait\, since it won't be in until 5.22.

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=116639#txn-1268303

Tony

p5pRT commented 9 years ago

From @jkeenan

On Sun Nov 24 14:36:08 2013\, tonyc wrote:

On Fri Nov 15 01:21:14 2013\, demerphq wrote:

Should I investigate a new patch?

It can wait\, since it won't be in until 5.22.

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=116639#txn-1268303

Tony

Would it be possible to get an update on the status of this ticket (as it's marked as a blocker for 5.22)?

Thank you very much.

-- James E Keenan (jkeenan@cpan.org)

p5pRT commented 9 years ago

From @demerphq

On 7 Mar 2015 03:26\, "James E Keenan via RT" \perlbug\-followup@perl\.org wrote:

On Sun Nov 24 14:36:08 2013\, tonyc wrote:

On Fri Nov 15 01:21:14 2013\, demerphq wrote:

Should I investigate a new patch?

It can wait\, since it won't be in until 5.22.

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=116639#txn-1268303

Tony

Would it be possible to get an update on the status of this ticket (as it's marked as a blocker for 5.22)?

Thank you very much.

As I recall I wrote my patch right before 5.20 was released and we were in a code freeze.

I've complained in the past about how code freeze isnt necessary and causes patches to be lost and this is an example\, a patch that got forgotten because of it.

We are now in another code freeze\, again done suboptimally\, and this change is user visible so I guess it will have to wait till 5.26 for the fix.

We should ditch the code freeze\, and simply create a new prerelease branch instead and let people merge to blead while the prerelease branch is finalized. Then after release we would merge the prerelease branch back to blead. That would prevent things like this from happening\, and allow people to continue to hack blead during the code freeze.

Anyway\, if ricardo wants to make an exception here its his call. I assume he will say wait till 5.26.

p5pRT commented 9 years ago

From perl5-porters@perl.org

Yves Orton wrote:

As I recall I wrote my patch right before 5.20 was released and we were in a code freeze.

I've complained in the past about how code freeze isnt necessary and causes patches to be lost and this is an example\, a patch that got forgotten because of it.

We are now in another code freeze\, again done suboptimally\, and this change is user visible so I guess it will have to wait till 5.26 for the fix.

We should ditch the code freeze\, and simply create a new prerelease branch instead and let people merge to blead while the prerelease branch is finalized. Then after release we would merge the prerelease branch back to blead. That would prevent things like this from happening\, and allow people to continue to hack blead during the code freeze.

All it takes is for someone to create a post-5.22 branch and rebase and merge it to blead after 5.22. I did that a few years ago\, but the last couple of years I was offline during code freeze.

p5pRT commented 9 years ago

From @demerphq

On 11 March 2015 at 00:08\, Father Chrysostomos \sprout@cpan\.org wrote:

Yves Orton wrote:

As I recall I wrote my patch right before 5.20 was released and we were in a code freeze.

I've complained in the past about how code freeze isnt necessary and causes patches to be lost and this is an example\, a patch that got forgotten because of it.

We are now in another code freeze\, again done suboptimally\, and this change is user visible so I guess it will have to wait till 5.26 for the fix.

We should ditch the code freeze\, and simply create a new prerelease branch instead and let people merge to blead while the prerelease branch is finalized. Then after release we would merge the prerelease branch back to blead. That would prevent things like this from happening\, and allow people to continue to hack blead during the code freeze.

All it takes is for someone to create a post-5.22 branch and rebase and merge it to blead after 5.22. I did that a few years ago\, but the last couple of years I was offline during code freeze.

IMO That is the wrong way around.

When we create a release candidate we should create the release candidate branch.

At that point that branch is "owned" by the release manager\, and only they can commit to it.

Anything that is required for release can be applied to either blead or the release branch\, if necessary it can be cherry picked from one to the other.

Freezing blead just ignores the capabilities of git to merge and cherry pick\, and slows down core development for no good reason.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 9 years ago

From @khwilliamson

Fixed by 45a2623346a3b7028c030eb1dc738723b216d2dc -- Karl Williamson

p5pRT commented 9 years ago

@khwilliamson - Status changed from 'open' to 'pending release'

p5pRT commented 9 years ago

From @khwilliamson

I may have closed this prematurely. I had not read the extensive commentary on this when I closed it\, only the original report. So I had forgotten the controversy over what should happen.

To recap what has happened in blead: It turns out that no one (including me) thought about nextchr()'s behavior when the pattern is UTF-8 encoded. It did a simple ++ of the parse position\, which is the wrong thing to do when the character is a multi-byte character. It would point to the 2nd byte of that\, hence the tests it did after the increment for white space under /x would fail for white space that was multi-byte. When I tried to write tests after fixing that\, I discovered that nothing I came up with would reliably fail. And valgrind showed that there reads outside the buffer of garbage data. That led to me fixing a bunch of nextchr calls\, and that led to making all such stuff uniform. And that led to this bug being fixed.

But do we really want a (?#...) comment between a character and its quantifier? I can see both sides of the issue\, so am now bringing it up to discussion again. blead is now in a state where it would be easy to add the ability to choose which places allow (?#...) and which forbid it\, but allow white space and regular # comments\, both only under /x. We could allow (?#...) only under /x in such cases if we choose. It's easy to change it to do any of this\, and I'm willing to do the work\, once a decision has been made as to what to do.

My only stance on this is that I think (but am convince-able the other way) that under /x\, anywhere there is a # comment\, should also allow a (?#...) comment -- Karl Williamson

p5pRT commented 9 years ago

From @demerphq

On 14 October 2015 at 00:42\, Karl Williamson via RT \perlbug\-followup@perl\.org wrote:

I may have closed this prematurely. I had not read the extensive commentary on this when I closed it\, only the original report. So I had forgotten the controversy over what should happen.

To recap what has happened in blead: It turns out that no one (including me) thought about nextchr()'s behavior when the pattern is UTF-8 encoded. It did a simple ++ of the parse position\, which is the wrong thing to do when the character is a multi-byte character. It would point to the 2nd byte of that\, hence the tests it did after the increment for white space under /x would fail for white space that was multi-byte. When I tried to write tests after fixing that\, I discovered that nothing I came up with would reliably fail. And valgrind showed that there reads outside the buffer of garbage data. That led to me fixing a bunch of nextchr calls\, and that led to making all such stuff uniform. And that led to this bug being fixed.

But do we really want a (?#...) comment between a character and its quantifier?

IMO no.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 9 years ago

From @abigail

On Tue\, Oct 13\, 2015 at 03:42:29PM -0700\, Karl Williamson via RT wrote:

I may have closed this prematurely. I had not read the extensive commentary on this when I closed it\, only the original report. So I had forgotten the controversy over what should happen.

To recap what has happened in blead: It turns out that no one (including me) thought about nextchr()'s behavior when the pattern is UTF-8 encoded. It did a simple ++ of the parse position\, which is the wrong thing to do when the character is a multi-byte character. It would point to the 2nd byte of that\, hence the tests it did after the increment for white space under /x would fail for white space that was multi-byte. When I tried to write tests after fixing that\, I discovered that nothing I came up with would reliably fail. And valgrind showed that there reads outside the buffer of garbage data. That led to me fixing a bunch of nextchr calls\, and that led to making all such stuff uniform. And that led to this bug being fixed.

But do we really want a (?#...) comment between a character and its quantifier?

I'd vote yes. For consistency. See below.

quantifier? I can see both sides of the issue\, so am now bringing it up to discussion again. blead is now in a state where it would be easy to add the ability to choose which places allow (?#...) and which forbid it\, but allow white space and regular # comments\, both only under /x. We could allow (?#...) only under /x in such cases if we choose. It's easy to change it to do any of this\, and I'm willing to do the work\, once a decision has been made as to what to do.

My only stance on this is that I think (but am convince-able the other way) that under /x\, anywhere there is a # comment\, should also allow a (?#...) comment

I agree. And I'd throw whitespace in it as well: anywhere where we ignore whitespace under /x\, we should allow a # comment\, and hence\, should allow a (?#...) comment.

Ignorable whitespace between a character and its quantifier(s) is allowed:

$ perl -wE 'say "aa" =~ /^a + +$/x' 1 $

A comment there is also allowed:

$ perl -wE 'say "aa" =~ /^a # Foo + # Bar +$/x' 1 $

If we have different rules for whitespace and (?#)\, it won't be easy to document properly\, and it won't be easy to learn.

Abigail

p5pRT commented 9 years ago

From @ikegami

On Wed\, Oct 14\, 2015 at 4:05 AM\, demerphq \demerphq@gmail\.com wrote:

On 14 October 2015 at 00:42\, Karl Williamson via RT \perlbug\-followup@perl\.org wrote:

I may have closed this prematurely. I had not read the extensive commentary on this when I closed it\, only the original report. So I had forgotten the controversy over what should happen.

To recap what has happened in blead: It turns out that no one (including me) thought about nextchr()'s behavior when the pattern is UTF-8 encoded. It did a simple ++ of the parse position\, which is the wrong thing to do when the character is a multi-byte character. It would point to the 2nd byte of that\, hence the tests it did after the increment for white space under /x would fail for white space that was multi-byte. When I tried to write tests after fixing that\, I discovered that nothing I came up with would reliably fail. And valgrind showed that there reads outside the buffer of garbage data. That led to me fixing a bunch of nextchr calls\, and that led to making all such stuff uniform. And that led to this bug being fixed.

But do we really want a (?#...) comment between a character and its quantifier?

IMO no.

Yves

So the follow question would be: What should happen when someone does that?

p5pRT commented 9 years ago

From @demerphq

On 15 October 2015 at 02:29\, Eric Brine \ikegami@adaelis\.com wrote:

On Wed\, Oct 14\, 2015 at 4:05 AM\, demerphq \demerphq@gmail\.com wrote:

On 14 October 2015 at 00:42\, Karl Williamson via RT \perlbug\-followup@perl\.org wrote:

I may have closed this prematurely. I had not read the extensive commentary on this when I closed it\, only the original report. So I had forgotten the controversy over what should happen.

To recap what has happened in blead: It turns out that no one (including me) thought about nextchr()'s behavior when the pattern is UTF-8 encoded. It did a simple ++ of the parse position\, which is the wrong thing to do when the character is a multi-byte character. It would point to the 2nd byte of that\, hence the tests it did after the increment for white space under /x would fail for white space that was multi-byte. When I tried to write tests after fixing that\, I discovered that nothing I came up with would reliably fail. And valgrind showed that there reads outside the buffer of garbage data. That led to me fixing a bunch of nextchr calls\, and that led to making all such stuff uniform. And that led to this bug being fixed.

But do we really want a (?#...) comment between a character and its quantifier?

IMO no.

Yves

So the follow question would be: What should happen when someone does that?

Throw an error\, modifier illegal on zero width match.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 9 years ago

From @rjbs

* Abigail \abigail@abigail\.be [2015-10-14T07:43:05]

On Tue\, Oct 13\, 2015 at 03:42:29PM -0700\, Karl Williamson via RT wrote:

My only stance on this is that I think (but am convince-able the other way) that under /x\, anywhere there is a # comment\, should also allow a (?#...) comment

I agree. And I'd throw whitespace in it as well: anywhere where we ignore whitespace under /x\, we should allow a # comment\, and hence\, should allow a (?#...) comment.

I wish I'd read your testing out of this before going through and doing it myself. :-)

Anyway\, I agree. While I'd rather nobody actually write this\, I think that's a matter of style. Anyway: making it a matter of grammar would require either weird inconsistency or breakage.

-- rjbs

p5pRT commented 9 years ago

From @khwilliamson

On 10/17/2015 07:53 PM\, Ricardo Signes wrote:

* Abigail \abigail@abigail\.be [2015-10-14T07:43:05]

On Tue\, Oct 13\, 2015 at 03:42:29PM -0700\, Karl Williamson via RT wrote:

My only stance on this is that I think (but am convince-able the other way) that under /x\, anywhere there is a # comment\, should also allow a (?#...) comment

I agree. And I'd throw whitespace in it as well: anywhere where we ignore whitespace under /x\, we should allow a # comment\, and hence\, should allow a (?#...) comment.

I wish I'd read your testing out of this before going through and doing it myself. :-)

Anyway\, I agree. While I'd rather nobody actually write this\, I think that's a matter of style. Anyway: making it a matter of grammar would require either weird inconsistency or breakage.

Just to make sure everyone understands.

Currently (?#...) comments are allowed even when there is no /x. We probably have to support that in the places where it's been that way all along\, but we could decide to not support them in the places that I just added\, when not under /x. Thus\, we could say that you can't split a quantifier from its atom except under /x.

I don't have an opinion on this.

p5pRT commented 9 years ago

From @rjbs

* Karl Williamson \public@khwilliamson\.com [2015-10-18T00:16:44]

Just to make sure everyone understands.

Currently (?#...) comments are allowed even when there is no /x. We probably have to support that in the places where it's been that way all along\, but we could decide to not support them in the places that I just added\, when not under /x. Thus\, we could say that you can't split a quantifier from its atom except under /x.

Thanks\, I was confused.

I don't have an opinion on this.

I'm not strongly opinionated on this\, but: I think that I would find it useful to say:

If you want to put comments into a regular expression\, you have two options. You use /x and then insert space and any kind of comments between tokens\, or you can skip /x and use (?#...) between tokens.

That is: always allow (?#...) in those places where space and comments become allowed under /x.

-- rjbs

p5pRT commented 9 years ago

From @demerphq

On 19 October 2015 at 04:50\, Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* Karl Williamson \public@khwilliamson\.com [2015-10-18T00:16:44]

Just to make sure everyone understands.

Currently (?#...) comments are allowed even when there is no /x. We probably have to support that in the places where it's been that way all along\, but we could decide to not support them in the places that I just added\, when not under /x. Thus\, we could say that you can't split a quantifier from its atom except under /x.

Thanks\, I was confused.

I don't have an opinion on this.

I'm not strongly opinionated on this\, but: I think that I would find it useful to say:

If you want to put comments into a regular expression\, you have two options. You use /x and then insert space and any kind of comments between tokens\, or you can skip /x and use (?#...) between tokens.

That is: always allow (?#...) in those places where space and comments become allowed under /x.

I really dont like this. A) it complicates the regex engine\, and B) it makes a mockery of what an expert would consider to be one token.

so for instance to *me*: a{1\,10} is a single token.

a(?#whatever){1\,10}

is two "tokens".

I think expecting people to grok that

/a

{1\,10}/x

is the same as /a{1\,10}/x

is IMO just asking for trouble and for people to misunderstand what is going on.

Related\, would you expect

\(?#is this a \w match or not)w

to be the same as

\w

?

Its not\, but IMO if you think that a(?#foo){1\,10} should be the same as a{1\,10} then why shouldnt \(?#whatever)w?

Of course\, it wouldnt work\, as \( would mean that (?#whatever) is not treated as a comment at all.

So this idea just increases inconsistency in our regexes for IMO basically no value.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 9 years ago

From @abigail

On Mon\, Oct 19\, 2015 at 11:33:55AM +0200\, demerphq wrote:

On 19 October 2015 at 04:50\, Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* Karl Williamson \public@khwilliamson\.com [2015-10-18T00:16:44]

Just to make sure everyone understands.

Currently (?#...) comments are allowed even when there is no /x. We probably have to support that in the places where it's been that way all along\, but we could decide to not support them in the places that I just added\, when not under /x. Thus\, we could say that you can't split a quantifier from its atom except under /x.

Thanks\, I was confused.

I don't have an opinion on this.

I'm not strongly opinionated on this\, but: I think that I would find it useful to say:

If you want to put comments into a regular expression\, you have two options. You use /x and then insert space and any kind of comments between tokens\, or you can skip /x and use (?#...) between tokens.

That is: always allow (?#...) in those places where space and comments become allowed under /x.

I really dont like this. A) it complicates the regex engine\, and B) it makes a mockery of what an expert would consider to be one token.

so for instance to *me*: a{1\,10} is a single token.

To me\, it isn't.

To Perl\, it isn't either\, as

/a {1\,10}/x

matches up to 10 a's:

$ perl -wE '"aaaaaaaaaa" =~ /a {1\,10}/x; say $& // "UNDEF"' aaaaaaaaaa $

And it's not that /x just removes any whitespace it finds\, as it doesn't remove the whitespace inside the braces:

$ perl -wE '"aaaaaaaaaa" =~ /a { 1\,10}/x; say $& // "UNDEF"' UNDEF $

(In this case\, it tries to match the literal string "a{1\,10}" -- one can wonder whether that's the most useful thing it could do).

Compiling /a {1\,10}/x shows the tokes perl think there are:

1: CURLY {1\,10} (5) 3: EXACT \ (0) 5: END (0)

a(?#whatever){1\,10}

is two "tokens".

I don't see any benefit of suddenly disallowing a space (under /x) between 'a' and '{1\,10}'\, nor to allow space there\, but not a comment. I don't see myself easily use an inline comment there\, but I also don't think Perl should enforce a particular coding style upon its users. That just makes the language less flexible\, and prohibits creativity.

Abigail

p5pRT commented 9 years ago

From @khwilliamson

On 10/19/2015 08:26 AM\, Abigail wrote:

$ perl \-wE '"aaaaaaaaaa" =~ /a \{ 1\,10\}/x; say $& // "UNDEF"' UNDEF $

(In this case\, it tries to match the literal string "a{1\,10}" -- one can wonder whether that's the most useful thing it could do).

Note that this now give a warning\, even without the '-w' command line option:

$ blead -E '"aaaaaaaaaa" =~ /a { 1\,10}/x; say $& // "UNDEF"' Unescaped left brace in regex is deprecated\, passed through in regex; marked by \<-- HERE in m/a { \<-- HERE 1\,10}/ at -e line 1. UNDEF $

p5pRT commented 9 years ago

From @abigail

On Mon\, Oct 19\, 2015 at 08:33:51AM -0600\, Karl Williamson wrote:

On 10/19/2015 08:26 AM\, Abigail wrote:

$ perl \-wE '"aaaaaaaaaa" =~ /a \{ 1\,10\}/x; say $& // "UNDEF"' UNDEF $

(In this case\, it tries to match the literal string "a{1\,10}" -- one can wonder whether that's the most useful thing it could do).

Note that this now give a warning\, even without the '-w' command line
option:

$ blead \-E '"aaaaaaaaaa" =~ /a \{ 1\,10\}/x; say $& // "UNDEF"' Unescaped left brace in regex is deprecated\, passed through in regex;

marked by \<-- HERE in m/a { \<-- HERE 1\,10}/ at -e line 1. UNDEF $

Yes\, I know.

Is that useful? I'd wager that of all the people who write

/a { 1\,10}/x

more than 99.9% of want it to behave like

/a {1\,10}/x

and none of them actually want to match the literal string

"a{1\,10}"

Abigail (who has put a space after the \, in /x{n\,m}/ way too often).

p5pRT commented 9 years ago

From @ikegami

On Mon\, Oct 19\, 2015 at 5:33 AM\, demerphq \demerphq@gmail\.com wrote:

On 19 October 2015 at 04:50\, Ricardo Signes \perl\.p5p@rjbs\.manxome\.org wrote:

* Karl Williamson \public@khwilliamson\.com [2015-10-18T00:16:44]

Just to make sure everyone understands.

Currently (?#...) comments are allowed even when there is no /x. We probably have to support that in the places where it's been that way all along\, but we could decide to not support them in the places that I just added\, when not under /x. Thus\, we could say that you can't split a quantifier from its atom except under /x.

Thanks\, I was confused.

I don't have an opinion on this.

I'm not strongly opinionated on this\, but: I think that I would find it useful to say:

If you want to put comments into a regular expression\, you have two options. You use /x and then insert space and any kind of comments between tokens\, or you can skip /x and use (?#...) between tokens.

That is: always allow (?#...) in those places where space and comments become allowed under /x.

I really dont like this. A) it complicates the regex engine\, and B) it makes a mockery of what an expert would consider to be one token.

so for instance to *me*: a{1\,10} is a single token.

Whether whitespace is allowed within or not\, exports would not consider that a token. Tokens are indivisible\, yet a number of parts of "a{1\,10}" are optional. That pattern has 6 tokens ("a"\, "{"\, "1"\, "\,"\, "10"\, "}").

a(?#whatever){1\,10} is two "tokens".

Still the same 6 tokens.

Next

© Githubissues.

Githubissues is a development platform for aggregating issues.