p5pRT closed this ticket 14 years ago
This is a bug report for perl from public@khwilliamson.com, generated with the help of perlbug 1.39 running under perl 5.11.0.
Any character above 255 will not match itself when the pattern is the octal number of that character.
Flags: category=core severity=medium
Site configuration information for perl 5.11.0:
Configured by khw at Thu Sep 25 12:12:59 MDT 2008.

Summary of my perl5 (revision 5 version 11 subversion 0 patch 34418) configuration:
  Platform:
    osname=linux, osvers=2.6.24-19-generic, archname=i686-linux
    uname='linux karl 2.6.24-19-generic #1 smp wed aug 20 22:56:21 utc 2008 i686 gnulinux '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.2.3 (Ubuntu 4.2.3-2ubuntu7)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags=' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.7.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'
Locally applied patches: DEVEL
@INC for perl 5.11.0: /home/khw/perl5.11/lib/5.11.0/i686-linux /home/khw/perl5.11/lib/5.11.0 /home/khw/localperl/lib/site_perl/5.11.0/i686-linux /home/khw/localperl/lib/site_perl/5.11.0 /home/khw/localperl/lib/site_perl .
Environment for perl 5.11.0: HOME=/home/khw LANG=en_US.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset)
PATH=/home/khw/perl5.11/bin:/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/home/khw/cxoffice/bin PERL_BADLANG (unset) SHELL=/bin/ksh
On Thu, Sep 25, 2008 at 05:39:41PM -0700, karl williamson wrote:
# New Ticket Created by karl williamson
# Please include the string: [perl #59342]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=59342 >
-----------------------------------------------------------------
Any character above 255 will not match itself when the pattern is the octal number of that character.
-----------------------------------------------------------------
It's a bit more complicated than that.
The point is that \NN notation is ambiguous in a Perl regexp. It could mean the character with code point NN in octal, or a backreference to the NNth capture group.
I thought Perl uses the following heuristics:
\N (one digit following the \) is always a reference to the Nth capture group.

\NN (two digits following the \) is a reference to the NNth capture group if the engine has seen at least NN capture groups; otherwise, it's the character with NN (octal) as code point.

\NNN (three or more digits following the \) is always a reference to the NNNth capture group.
but that doesn't seem to be quite true:
$ perl -Mre=debug -wcE '/\400/'
Compiling REx "\400"
Final program:
   1: EXACT <\0> (3)
   3: END (0)
anchored "%0" at 0 (checking anchored isall) minlen 1
-e syntax OK
$ perl -wE 'say "\x00" =~ /^\400$/'
1
$
I would expect /\400/ to complain it hasn't seen 400 capture groups yet, instead of matching "\x00", which is just plain wrong. Compare:
$ perl -Mre=debug -wcE '/\4/'
Compiling REx "\4"
Reference to nonexistent group in regex; marked by <-- HERE in m/\4 <-- HERE / at -e line 1.
Freeing REx: "\4"
$
And this is suboptimal as well:
$ perl -Mre=debug -wcE '/\18/'
Compiling REx "\18"
Illegal octal digit '8' ignored at -e line 1.
Illegal octal digit '8' ignored at -e line 1.
Final program:
   1: EXACT <\0018> (3)
   3: END (0)
anchored "%0018" at 0 (checking anchored isall) minlen 2
-e syntax OK
Freeing REx: "\18"
$
If perl knows '8' is an illegal octal digit, it shouldn't go ahead and construct a regexp using \0018 (which, as far as I can determine, doesn't match anything). It should either die, or use it as a backreference to the 18th capture group - and then complain no such group exists.
Abigail
The RT System itself - Status changed from 'new' to 'open'
On Fri, Sep 26, 2008 at 10:42 AM, Abigail <abigail@abigail.be> wrote:
I thought Perl uses the following heuristics:
... \NNN (three or more digits following the \) is always a reference to the NNNth capture group.
but that doesn't seem to be quite true:
255 == 0xFF == 0377. Octal matching apparently works up to \377, rather than counting the digits, at least according to perl -Mre=debug -wce '/\377/'.
based on perldoc perlre:
There is no limit to the number of captured substrings that you may use. However Perl also uses \10, \11, etc. as aliases for \010, \011, etc. (Recall that 0 means octal, so \011 is the character at number 9 in your coded character set; which would be the 10th character, a horizontal tab under ASCII.) Perl resolves this ambiguity by interpreting \10 as a backreference only if at least 10 left parentheses have opened before it. Likewise \11 is a backreference only if at least 11 left parentheses have opened before it. And so on. \1 through \9 are always interpreted as backreferences.
one expects this to work all the way up.
$ perl -Mre=debug -wle 'chr(0477) =~ /\477/ and print "yes"'
prints "match failed"

$ perl -Mre=debug -wle 'chr(0477) =~ /\x{13f}/ and print "yes"'
prints "yes"
The problem is not with the regex engine, which can handle encoded multibyte characters just fine, even 0400 as \x{100}, but it is the parsing of \NN+ that is failing to properly construct the multibyte character when ((eval "0$NNplus") > 255).
I suspect I should have made this a low priority item. The changes I'm working on to regcomp.c solve this; so provided they're accepted, this gets fixed automatically. But in case they don't get accepted, it's still a bug.
This is my first time submitting a patch. Hopefully I'm doing it the way I should be.
The regcomp.c code failed to set a flag that it sets in all other cases when a character value is above 255.
Karl,
Thanks very much for the patch.
I confess guess I never *expected* "\400" to be "\x{100}", but rather "\x{20}0".
However, I'm a bit concerned about perl -0777, as it's documented to

    The special value 00 will cause Perl to slurp files in paragraph
    mode. The value 0777 will cause Perl to slurp files whole because
    there is no legal byte with that value.
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    Either put all the switches after the 32-character boundary (if
    applicable), or replace the use of B<-0>I<digits> by
    C<BEGIN{ $/ = "\0digits"; }>.
While your patch doesn't seem to affect that, it might lead one to question differing behaviors in different places.
regexp: \777 means naughty
string: \777 means... um what?
CLI:    \777 means undef $/
And so I wonder what's to be done about that.
But perhaps I under-understand?
--tom
--- regcomp.c.orig	2008-10-18 12:16:42.000000000 -0600
+++ regcomp.c	2008-10-24 10:22:24.000000000 -0600
@@ -7417,6 +7417,7 @@
 		    I32 flags = 0;
 		    STRLEN numlen = 3;
 		    ender = grok_oct(p, &numlen, &flags, NULL);
+		    if (ender > 0xff) RExC_utf8 = 1;
 		    p += numlen;
 		}
 		else {
--- t/op/re_tests.orig	2008-09-22 14:42:42.000000000 -0600
+++ t/op/re_tests	2008-10-24 10:51:35.000000000 -0600
@@ -1357,3 +1357,8 @@
 /^\s*i.*?o\s*$/s	io\n io	y	-	-
 # As reported in #59168 by Father Chrysostomos:
 /(.*?)a(?!(a+)b\2c)/	baaabaac	y	$&-$1	baa-ba
+
+# #59342
+/\377/	\377	y	$&	\377
+/\400/	\400	y	$&	\400
+/\777/	\777	y	$&	\777
Tom Christiansen wrote:
Karl,
Thanks very much for the patch.
I confess guess I never *expected* "\400" to be "\x{100}", but rather "\x{20}0".
I'm not wedded to the interpretation I submitted. It came out of testing against the documentation that a backslash in a regular expression can be followed by 1-3 octal digits (the first one not needing to be a 0). And the interpretation I submitted is the one that the pre-existing code meant. Witness:

    perl -le 'print "\x{100}\400" =~ /\x{100}\400/'
    perl -le 'print "\400\x{100}" =~ /\400\x{100}/'

both print 1.

That is, if any other character anywhere in the pattern causes regcomp to think that it should store the re as utf8, then \400 matches \400, and so on up to the maximum 3 digit octal: \777 matches \777. Failing that, /\400/ will match \01\00. It never matches "\x{20}0".

So we have an existing bug: sometimes \400 matches \400, and sometimes it matches \01\00, depending on what I would call spooky action at a distance. (This means that \777 sometimes already matches \777 now.) I'm trying to get rid of these inconsistencies. I think something should be done here, but perhaps it's not what I thought it should be. My patch follows what the code was intending to do, but perhaps we should change that intention. Please guide me.
On approximately 10/25/2008 7:32 AM, came the following characters from the keyboard of karl williamson:

So we have an existing bug: sometimes \400 matches \400, and sometimes it matches \01\00, depending on what I would call spooky action at a distance. ... My patch follows what the code was intending to do, but perhaps we should change that intention. Please guide me.
I understand where Tom is coming from, but he has no grounds for expecting "\400" to be the same as " 0".

I pulled out my old K&R, which is likely one of the earliest published books documenting the octal escape notation, and it explicitly says (section 2.3) that "an arbitrary byte-sized bit pattern" can be created with an octal escape, and also a "maximum of 3 octal digits". Section 2.2 talks about both 8 and 9 bit characters on example architectures.

So it would be wrong to limit the values to 8 bits. It is probably "platform specific" what interpretation should be applied to octal escapes that exceed the platform-specific size of a byte, but it is not correct to assume that the octal escape "\400" ends after 2 digits simply because its numeric value exceeds \377. Writing such code is probably non-portable, because of the possible variation in byte sizes.

In systems with larger character values, it seems that:
1) numbers greater than "\377" could be interpreted as larger character values, as Karl proposes, but doing so is likely to cause confusion. Also, it should be pointed out that the escape was intended to fill a "byte", so it is my belief that octal escapes producing values that exceed the value of a platform-specific byte size should be rejected with an error. I'm not sure if Perl supports systems with byte sizes other than 8 bits, but if it does, this would be a platform-specific check. Note that limiting the octal escape to 3 digits prevents the octal escape from being used to create all possible bit patterns for bytes larger than 12 bits (but I am unaware of any computer platform ever defining a byte larger than 10 bits).

2) Unicode values can clearly exceed 12 bits, so it seems that the octal escape is somewhat useless for creating all the possible values, so extending them to deal with values greater than the value of a platform-specific byte seems inappropriate, given all the documentation that
3) It seems much more likely, in my opinion, that an octal escape that exceeds the value of a platform-specific byte is an error rather than an extension feature.

4) It is easy to convert octal escapes into hex escapes if any existing programs presently misusing octal escapes that exceed the value of a platform-specific byte would encounter versions of Perl that suddenly reject such values. In fact, a clever error message might be crafted, something like:

    sprintf 'Octal escape sequence "%o" is invalid. You probably meant "\x{%02.2x}\x{%02.2x}" or "\x{%04.4x}"', ender, ender / 8, ender % 8, ender

to help the programmer quickly fix the problem.

On the other hand, MS VC++ 6.0 explicitly allows the use of the full 12 bits possible to represent in an octal escape as an initializer for a wchar_t constant.

So there is precedent for Karl's scheme, even if there is no precedent for Tom's (that I could find).
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
I forgot to mention that currently (5.8, 5.10),

    perl -le 'print "\400" =~ /[\400]/'

prints 1. So again the current implementation is inconsistent.
On approximately 10/25/2008 12:25 PM\, came the following characters from the keyboard of karl williamson:
I forgot to mention that currently (5.8, 5.10), perl -le 'print "\400" =~ /[\400]/' prints 1. So again the current implementation is inconsistent.

Indeed. So this increases the depth of my opinion that the solution should be to outlaw octal escapes greater than \377 on 8-bit platforms. But not by interpreting them as a two-character octal escape, followed by an ASCII 0-7 character (sorry Tom, I just can't find a precedent for that!).

(A little oops in my previous reply: I mentioned 12 bits where I should have said 9 bits, in three places...)
Dear Glenn and Karl,
+=============================+
| SUMMARY of Exposition Below |
v-----------------------------v
* I fully agree there's a bug.
* I believe Karl has produced a reasonable patch to fix it.
* I wonder what *else* might/should also change in tandem with this estimable amendment so as to:

  ? avoid evoking or astonishing any hobgoblins of foolish inconsistency (ie: breaking bad expectations)

  ? what (if any?) backwards-compat story might need spinning (ie: breaking code, albeit cum credible Apologia)
tom++ [-: Congrats\, you've hit 10% read; only 90% below! :-]
/*
 * BEGIN EXPOSITION
 */
On Saturday, 25 October 2008 at 13:20:20 PDT, Glenn Linderman <perl@NevCal.com> wrote:

But not by interpreting them as a two-character octal escape, followed by an ASCII 0-7 character (sorry Tom, I just can't find a precedent for that!).

Sorry? Thank you, Glenn, for your courtesy. Truth told, being human I do appreciate it. Yet at my hacker heart, I remain a meritocrat, or close to it.

Thus apologies, let alone "deference"(?), are never obligatory when someone has something reasoned to contribute, even if it be to gently contradict someone--OR ANYONE.

Technical arguments can, do, and must stand on their own, and no science-minded person should take the least offence for ever being shown he's been wrong in his calculations. Indeed, he should be thankful for the enlightenment.

And so I am; still, I appreciate your courtesy--and research.
I may've been a bit loose in "shooting from the hip" by writing:

    I confess[/]guess I never *expected* "\400" to be "\x{100}", but rather "\x{20}0".

That may come from having been myself recently bitten/burnt by all of "\1", "\11", and "\111" sometimes--and sometimes not--meaning octal specs for characters, even when I didn't intend this.
How so?
Because in true string-interpolation of the full qq!! variety, they all fit tidily into an 8-bit octet, wherein they *ALWAYS* mean chr(01), chr(011), and chr(0111).

However, in "faux" qq!!ish processing, per m// and qr//, whether they mean that in the regex depends on how many captures regcomp() has seen so far.

/*
 * Er, I *believe*. See, another of my hunches is that
 * (??{...}) *may* monkey-wrench these matters. I've
 * not looked into that, and rather prefer not to. :(
 */
So, when I saw "\40", knowing that *any* other digit would exceed the 0000 .. 0377 range, I hunched it would stop there.

Wrongly.

And whether "would" or "should" is the more operative, or at least more desired, modality is the heart of this entire discussion we're having.

This would somewhat follow how "\x123" stops at \x12 (?) rather than (ever(?)) generate a single string of length one containing a char >8 bits in length, giving instead the two-char string "\x{12}3", which is different from what the longer \x{123} would produce after encoding/decoding for UTF-8 etc. output.
I think the font of folly is found in the way that an *unbraced* \x takes TWO AND EXACTLY TWO CHARS following it... ...whereas \\ takes 1, 2, *or* 3 (and what's this about MS-pain about it taking more, anyway?) octal digits, and that therein lay the rubbish that afflicted me.

There's no way {say, braces} to delimit the octal escape's characters from what follows it, which seems to be the crux of the problem here. We can't put Z<> or \& strings, per POD or troff respectively; we have to break them up.

So you can't say "\{40}0" or "\0{40}0" or whatnot as you can with "\x{20}" and "\x{20}0".
Now 5.10 gives us m/(stuff)\g{1}1/ to save one from the trouble that m/(stuff)\11/ would otherwise give you had you meant stuff, followed by the stuff in cap-1, then a literal digit-1, neither capture #(decimal-)11, ie \g{11}; nor chr(011), either.
Because unbraced \x stops parsing after two hex digits (due both to compatibility with pre-Unicode days, when it would otherwise have generated a character whose code point would exceed U+0100, and because long strings really need \x{BADBEEF}ish delimiters for safety and clarity), I, putting no thought into it, cavalierly imagined that \0 might in this respect behave analogously to how \x does.
Not that that's how it works now, nor how I should DESIRE it to work. I'm just explaining my (lack-of-)thought-process; foolish hobgoblins of little minds and all, you know.
*PLEASE* misconstrue none of my chatterish kibitzing on this thread as somehow disapproving of more reliable\, more predictable\, more understandable\, and more explicable behavior.
Those are all admirable goals, and I support them--full stop.
Probably this naïveté derives from having no direct experience with "bytes" (native, non-utf[-]?8 wide/long chars) of length >8 bits. Even on the shorter end of that stick, I've only a wee, ancient bit of experience with bytes <8 bits. That is, long ago and far away, we used to pack six 6-bit "RAD-50" chars into a 36-bit word under EXEC-8, and sometimes used them even from DEC systems.
(See RAD-50 in BASIC-PLUS for one thing Larry *didn't* borrow from there; I guess we get pack()'s BER w* whatzitzes instead.)
Karl has clearly identified an area crying out for improvement (read: an indisputable bug), and even better, he's sacrificed his own mental elbow-grease to address the problem for the greater good of us all.

I can't see how to ask more--and so I strongly applaud his generous contribution toward making the world a better place.
I'm still a little skittish though, because as far as I noticed, perhaps peremptorily, Karl's patch provided for /\0777/ or /\400/ scenarios alone: ie, regex m/atche/s.

I meant only to say that addressing patterns alone, while leaving both strings and the CLI's -0 setting of $/ out of the changes, risked introducing a petty inconsistency: a conceptual break with `perl -0777` as an unimplementable "octal" byte spec that therefore means undef()ing $/.

Plus, there's how the docs equate $/ = "\0777" to undef $/.

This seems troublesome, and I'd wonder how it worked if that *were* a legal sequence. And I wonder about the ramifications of "breaking" it; it's really quite possible they're even worth doing, but I don't know.

Again, I've never been somewhere a character or byte's ever been more than 8 bits *NOR LESS THAN* 6, so I don't know what expectations or experience in such hypothetical places might be.

I'm sure some out there have broader experience than mine, and hope to hear from them.
^-----------------------------^
| SUMMARY of Exposition Above |
+=============================+
* I agree there's a bug.
* I believe Karl has produced a reasonable patch to fix it.
* I wonder what *else* might/should also change in tandem with this estimable amendment so as to:
? avoid evoking or astonishing any hobgoblins of foolish inconsistency (ie: breaking expectations)
? what (if any?) backwards-compat story might need spinning (ie: breaking code, albeit cum credible Apologia)
Hope this makes some sense now. :(
--tom
PS: And what about *Perl VI* in this treatment of "\0ctals"\, eh?!
And, I forgot to mention that setting $v = "\400" is the same thing as setting it to 256, or \x{100}.

Thus the only inconsistency I am aware of is in the area I patched. But, as is becoming increasingly obvious as I subscribe to this list, my knowledge of the Perl language is minuscule.

I don't understand Glenn's point about using octal to fill whatever a byte is on a machine, but no more. Suppose there were a nine-bit-byte machine (which one could argue is the maximum the C language specification allows, given that it limits a byte to what can be filled by 3 octal digits). What would those extra high-bit-set patterns represent if not characters? And what could one do with them if not?

So, it seems to me that one either limits an octal constant to \377, or one allows it up to \777, with them all potentially mapping into the characters (or code points if you prefer) whose ordinal number corresponds to the constant. If we limit them, there is the possibility that existing code will break, as in many places now they aren't limited. I don't know where all those places are. If my patch is accepted, then it gets rid of one place where there is an inconsistency; and I know of no others.
Maybe we should let some others weigh in on the matter.
Tom Christiansen wrote:
Dear Glenn and Karl\,
+=============================+ | SUMMARY of Exposition Below | v-----------------------------v
* I fully agree there's a bug.
* I believe Karl has produced a reasonable patch to fix it.
* I wonder what *else* might/should also change in tandem with this estimable amendment so as to:
? avoid evoking or astonishing any hobgoblins of foolish inconsistency \(ie​: breaking bad expectations\) ? what \(if any?\) backwards\-compat story might need spinning \(ie\, breaking code\, albeit cum credible Apologia\)
tom++ [-: Congrats\, you've hit 10% read; only 90% below! :-]
/* * BEGIN EXPOSITION */
On Saturday\, 25 October 2008 at 13:20:20 PDT\, Glenn Linderman \perl@​NevCal\.com wrote:
But not by interpreting them as a two-character octal escape, followed by an ASCII 0-7 character (sorry Tom, I just can't find a precedent for that!).
Sorry? Thank you, Glenn, for your courtesy. Truth told, being human I do appreciate it. Yet at my hacker heart, I remain a meritocrat, or close to it.
Thus apologies, let alone "deference"(?), are never obligatory when someone has something reasoned to contribute, even if it be to gently contradict someone--OR ANYONE.
Technical arguments can, do, and must stand on their own, and no science-minded person should take the least offence for ever being shown he's been wrong in his calculations. Indeed, he should be thankful for the enlightenment.
And so I am; still, I appreciate your courtesy--and research.
I may've been a bit loose in "shooting from the hip" by writing:
I confess[/]guess I never *expected* "\400" to be "\x{100}", but rather "\x{20}0".
That may come from having been myself recently bitten/burnt by all of "\1", "\11", and "\111" sometimes--and sometimes not--meaning octal specs for characters, even when I didn't intend this.
How so?
Because in true string-interpolation of the full qq!! variety, they all fit tidily into an 8-bit octet, wherein they *ALWAYS* mean chr(01), chr(011), and chr(0111).
However, in "faux" qq!!-ish processing, per m// and qr//, whether they mean that in the regex depends on how many captures regcomp() has seen so far.
/*
 * Er, I *believe*. See, another of my hunches is that
 * (??{...}) *may* monkey-wrench these matters. I've
 * not looked into that, and rather prefer not to. :(
 */
So, when I saw "\40", knowing that *any* other digit would exceed the 0000 .. 0377 range, I hunched it would stop there.
Wrongly.
And whether "would" or "should" is the more operative, or at least more desired, modality is the heart of this entire discussion we're having.
This would somewhat follow how "\x123" stops at \x12 (?) rather than (ever(?)) generating a single string of length one containing a char >8 bits in length, giving instead a two-char string "\x{12}3", which is different from the longer string \x{123} would produce after encoding/decoding for UTF-8 etc. output.
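Tom's description of unbraced \x is easy to check; a small sketch:

```perl
use strict;
use warnings;

# Unbraced \x consumes at most two hex digits, so "\x123" is TWO characters:
# chr(0x12) followed by a literal "3".
my $unbraced = "\x123";
my $braced   = "\x{123}";

printf "unbraced: length %d\n", length $unbraced;   # length 2
printf "braced:   length %d\n", length $braced;     # length 1

die "unexpected" unless $unbraced eq "\x{12}3";
die "unexpected" unless ord($braced) == 0x123;
```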
I think the font of folly is found in the way that an *unbraced* \x takes TWO AND EXACTLY TWO CHARS following it... ...whereas \\
takes 1, 2, *or* 3 (and what's this about MS-pain about it taking more, anyway?) octal digits, and that therein lay the rubbish that afflicted me. There's no way {say, braces} to delimit the octal escape's characters from what follows it, which seems to be the crux of the problem here. We can't put Z<> or \& strings, per POD or troff respectively; we have to break them up.
So you can't say "\{40}0" or "\0{40}0" or whatnot as you can with "\x{20}" and "\x{20}0".
Now 5.10 gives us m/(stuff)\g{1}1/ to save one from the trouble that m/(stuff)\11/ would otherwise give you had you meant stuff, followed by the stuff in cap-1, then a literal digit-1: neither capture #(decimal-)11, ie \g{11}, nor chr(011), either.
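The 5.10 \g{} escape does remove the ambiguity; a quick sketch:

```perl
use strict;
use warnings;
use 5.010;   # \g{N} backreferences arrived in 5.10

# Match "ab", then the contents of capture 1 again, then a literal digit 1.
# The braces leave no doubt about where the group number ends.
if ("abab1" =~ /(ab)\g{1}1/) {
    say "matched; capture 1 was '$1'";   # prints: matched; capture 1 was 'ab'
}
```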
Because unbraced \x stops parsing after two hex digits--due both to compatibility with pre-Unicode days, when it would've otherwise generated a character whose code point would exceed U+00FF, and because long strings really need \x{BADBEEF}-ish delimiters for safety and clarity--I, putting no thought into it, cavalierly imagined that \0 might in this behave analogously to how \x does.
Not that that's how it works now, nor how I should DESIRE it to work. I'm just explaining my (lack-of-)thought-process; foolish hobgoblins of little minds and all, you know.
*PLEASE* misconstrue none of my chatterish kibitzing on this thread as somehow disapproving of more reliable, more predictable, more understandable, and more explicable behavior.
Those are all admirable goals, and I support them--full stop.
Probably this naïveté derives from having no direct experience with "bytes" (native, non-utf[-]?8 wide/long chars) of length >8 bits. Even on the shorter end of that stick, I've only a wee, ancient bit of experience with bytes <8 bits. That is, long ago and far away, we used to pack six 6-bit "RAD-50" chars into a 36-bit word under EXEC-8, and sometimes used them even from DEC systems.
(See RAD-50 in BASIC-PLUS for one thing Larry *didn't* borrow from there; I guess we get pack()'s BER w* whatzitzes instead.)
Karl has clearly identified an area crying out for improvement (read: an indisputable bug), and even better, he's sacrificed his own mental elbow-grease to address the problem for the greater good of us all.
I can't see how to ask more--and so I strongly applaud his generous contribution toward making the world a better place.
I'm still a little skittish though, because as far as I noticed, perhaps peremptorily, Karl's patch provided for /\0777/ or /\400/ scenarios alone: ie, regex m/atche/s.
I meant only to say that addressing patterns alone, while leaving both strings and the CLI's -0 setting of $/ out of the changes, risked introducing a petty inconsistency: a conceptual break with `perl -0777`, whose "octal" byte spec is unimplementable and therefore means undef()ing $/.
Plus, there's how the docs equate $/ = "\0777" to undef $/.
This seems troublesome, and I'd wonder how it worked if that *were* a legal sequence. And I wonder about the ramifications of "breaking" it; it's really quite possible they're even worth doing, but I don't know.
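For readers who haven't met the -0777 idiom: perlrun documents that an octal value above 0777 on -0 means slurp mode, i.e. $/ becomes undef. A sketch of the in-code equivalent (the in-memory filehandle is only for demonstration):

```perl
use strict;
use warnings;

# `perl -0777` puts perl into slurp mode: $/ is undef, since 0777 is not a
# representable one-byte input-record separator. Equivalent in code:
my $data = "one\ntwo\nthree\n";
open my $fh, '<', \$data or die "open: $!";

my $all = do { local $/; <$fh> };   # $/ undef => read everything at once
die "not slurped" unless $all eq $data;
printf "slurped %d bytes in a single read\n", length $all;
```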
Again, I've never been somewhere a character or byte's ever been more than 8 bits *NOR LESS THAN* 6, so I don't know what expectations or experience in such hypothetical places might be.
I'm sure some out there have broader experience than mine, and hope to hear from them.
^-----------------------------^
| SUMMARY of Exposition Above |
+=============================+
* I agree there's a bug.
* I believe Karl has produced a reasonable patch to fix it.
* I wonder what *else* might/should also change in tandem with this estimable amendment so as to:
? avoid evoking or astonishing any hobgoblins of foolish inconsistency (ie: breaking expectations)
? what (if any?) backwards-compat story might need spinning (ie: breaking code, albeit cum credible Apologia)
Hope this makes some sense now. :(
--tom
PS: And what about *Perl VI* in this treatment of "\0ctals", eh?!
On approximately 10/25/2008 7:47 PM, came the following characters from the keyboard of karl williamson:
And, I forgot to mention that setting $v = "\400" is the same thing as setting it to 256 or \x{100}.
Thus the only inconsistency I am aware of is in the area I patched. But, as is becoming increasingly obvious as I subscribe to this list, my knowledge of the Perl language is minuscule.
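Karl's equivalence is easy to demonstrate; a sketch (double-quoted strings only, since behavior in patterns is the thing in dispute):

```perl
use strict;
use warnings;

# In a double-quoted string, "\400" is a single character whose ordinal is
# octal 400, i.e. 256, i.e. U+0100 -- exactly as described above.
my $v = "\400";
printf "length %d, ordinal %d (0x%X)\n", length $v, ord $v, ord $v;
# length 1, ordinal 256 (0x100)

die unless $v eq "\x{100}";
die unless $v eq chr 256;
```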
I don't understand Glenn's point about using octal to fill whatever a byte is on a machine, but no more. Suppose there were a nine-bit-byte machine (which one could argue is the maximum the C language specification allows, given that it limits a byte to what can be filled by 3 octal digits). What would those extra high-bit-set patterns represent, if not characters? And what could one do with them, if not?
Surely on a 9-bit-byte machine, if there were any character-wise benefit to creating such characters, the users of those machines would understand that, and use the octal escape necessary to create the appropriate ordinals for those characters.
I rather expect that the primary use of 9-bit byte values would have been to initialize bytes to binary values, rather than to deal in characters: I've never seen any character encoding that speaks of 9-bit character values, until I just found http://tools.ietf.org/html/rfc4042 via a Google search. As the article states:
By comparison, UTF-9 uses one to two nonets to represent codepoints in the BMP, three nonets to represent [UNICODE] codepoints outside the BMP, and three or four nonets to represent non-[UNICODE] codepoints. There are no wasted bits, and as the examples in this document demonstrate, the computational processing is minimal.
So, it seems to me that one either limits an octal constant to \377, or one allows it up to \777 with them all potentially mapping into the characters (or code points, if you prefer) whose ordinal number corresponds to the constant. If we limit them, there is the possibility that existing code will break, as in many places now they aren't limited. I don't know where all those places are. If my patch is accepted, then it gets rid of one place where there is an inconsistency; and I know of no others.
Tom did point out two inconsistencies with octal values of 0777 below, and I suspect there will still be people that have latent bugs due to thinking like Tom did, that \400 should maybe be two characters, or might yet produce unexpected behaviors when coding \400 - \777 accidentally, and not understanding why their string suddenly has the UTF-8 bit set!
But it is also true that there could be some code depending on this behavior, using it to generate 9-bit numbers, as \400 is shorter than \x{100} and would fit better on one line!
It might perhaps be handy to enhance the documentation to point out that octal escapes in the range \400 through \777 do not fit within a single byte on 8-bit platforms, and can be used to generate characters with Unicode codepoints in the full range from \0 through \777, but that because Unicode codepoints are expressed in hexadecimal in all the documentation for Unicode, using octal notation for Unicode characters is unlikely to be quickly understood by the average programmer.
I still think it would be better to treat octal escapes greater than \377 as errors, or to deprecate octal escape syntax completely.
Maybe we should let some others weigh in on the matter.
I waited two days, and no one else has weighed in... it would be nice if several additional opinions could be obtained.
Tom Christiansen wrote:
Dear Glenn and Karl,
This would somewhat follow how "\x123" stops at \x12 (?) rather than (ever(?)) generating a single string of length one containing a char >8 bits in length, giving instead a two-char string "\x{12}3", which is different from the longer string \x{123} would produce after encoding/decoding for UTF-8 etc. output.
I agree that a Perl-self-consistent behaviour could be achieved by limiting octal escapes to values in the range \0 - \377 and treating \400 as two characters; however, that would never be consistent with what K&R defined... and there are still today a fair number of programmers alive for whom K&R was their first programming book, or at least their first to introduce the octal escape. Whether Larry considered that when making the unbraced \x escape, I couldn't say... but the unbraced \x escape clearly is limited to values in the range \0 - \377: does that mean that Larry defined Perl to be an 8-bit-byte architecture? Are there other examples, documentation, or history that prove that Perl has never run on a 9-bit architecture?
I think the font of folly is found in the way that an *unbraced* \x takes TWO AND EXACTLY TWO CHARS following it... ...whereas \\
takes 1, 2, *or* 3 (and what's this about MS-pain about it taking more, anyway?) octal digits, and that therein lay the rubbish that afflicted me.
I'm not enamored of having either Perl or MS-VC++ 6.0 (and probably later versions, and maybe earlier too) define a wide-character meaning for octal escapes in the range \400 - \777. I think that, except on an architecture with 9-bit bytes (or larger), their value in creating obfuscated Unicode codepoints is far exceeded by the confusion that would result when accidentally used. Hence, my suggestion for an error.
There's no way {say, braces} to delimit the octal escape's characters from what follows it, which seems to be the crux of the problem here. We can't put Z<> or \& strings, per POD or troff respectively; we have to break them up.
So you can't say "\{40}0" or "\0{40}0" or whatnot as you can with "\x{20}" and "\x{20}0".
While it might be possible to invent \{ooo} syntax, I think it would be better to enhance the documentation for octal escapes to recommend using hex escapes instead, pointing out the deficiencies and ambiguities in using octal escapes.
Deprecating octal escapes might be an even better solution... leaving the \n syntax to be unambiguously used in substitutions.
Probably this naïveté derives from having no direct experience with "bytes" (native, non-utf[-]?8 wide/long chars) of length >8 bits. Even on the shorter end of that stick, I've only a wee, ancient bit of experience with bytes <8 bits. That is, long ago and far away, we used to pack six 6-bit "RAD-50" chars into a 36-bit word under EXEC-8, and sometimes used them even from DEC systems.
My understanding of 36-bit architectures was that characters were generally stored either as 9-bit bytes (but using only 7 of the bits, since that is all ASCII needed), or as a "useful" 6-bit subset of ASCII, which was more efficient in systems with limited RAM. (Looking back, it seems that RAM (core, then) was always limited!) However, that was just from documentation I read once for a CDC (I think) machine, not from personal experience with one.
(See RAD-50 in BASIC-PLUS for one thing Larry *didn't* borrow from there; I guess we get pack()'s BER w* whatzitzes instead.)
Karl has clearly identified an area crying out for improvement (read: an indisputable bug), and even better, he's sacrificed his own mental elbow-grease to address the problem for the greater good of us all.
I can't see how to ask more--and so I strongly applaud his generous contribution toward making the world a better place.
I'm still a little skittish though, because as far as I noticed, perhaps peremptorily, Karl's patch provided for /\0777/ or /\400/ scenarios alone: ie, regex m/atche/s.
I meant only to say that addressing patterns alone, while leaving both strings and the CLI's -0 setting of $/ out of the changes, risked introducing a petty inconsistency: a conceptual break with `perl -0777`, whose "octal" byte spec is unimplementable and therefore means undef()ing $/.
Plus, there's how the docs equate $/ = "\0777" to undef $/.
Note that
$/ = "\0777"
would produce a two-character string...
These uses of -0777 and \777 seem to be special cases, but they do add inconsistency to an otherwise consistent world resulting from Karl's patch, eh? It seems clear that octal 777 was not expected to be used as a valid character value at the time it was defined with these other meanings; yet the acceptance of octal escapes greater than \377 as characters has produced the expectation, and even the reality, for Unicode, that it can sometimes be interpreted that way. This is, perhaps, a still-remaining inconsistency in the syntax. Should \777 be rejected as a character, to eliminate this inconsistency? Or why not the whole range from \400 - \777?
This seems troublesome, and I'd wonder how it worked if that *were* a legal sequence. And I wonder about the ramifications of "breaking" it; it's really quite possible they're even worth doing, but I don't know.
Again, I've never been somewhere a character or byte's ever been more than 8 bits *NOR LESS THAN* 6, so I don't know what expectations or experience in such hypothetical places might be.
I'm sure some out there have broader experience than mine, and hope to hear from them.
^-----------------------------^
| SUMMARY of Exposition Above |
+=============================+
* I agree there's a bug.
* I believe Karl has produced a reasonable patch to fix it.
* I wonder what *else* might/should also change in tandem with this estimable amendment so as to:
? avoid evoking or astonishing any hobgoblins of foolish inconsistency (ie: breaking expectations)
? what (if any?) backwards-compat story might need spinning (ie: breaking code, albeit cum credible Apologia)
Hope this makes some sense now. :(
--tom
PS: And what about *Perl VI* in this treatment of "\0ctals", eh?!
Not sure what Perl VI is? An implementation of VI in Perl? An interface between Perl and VI? I'm not a VI user, nor likely to become one.
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
On Mon, 27 Oct 2008 14:19:37 PDT Glenn Linderman <perl@NevCal.com> wrote, replying to Karl:
Tom did point out two inconsistencies with octal values of 0777 below, and I suspect there will still be people that have latent bugs due to thinking like Tom did, that \400 should maybe be two characters, or might yet produce unexpected behaviors when coding \400 - \777 accidentally, and not understanding why their string suddenly has the UTF-8 bit set!
Maybe we should let some others weigh in on the matter.
I waited two days, and no one else has weighed in... it would be nice if several additional opinions could be obtained.
Indeed. I, too, have been waiting.
--tom
In article <49063069.8090001@NevCal.com>, perl@NevCal.com (Glenn Linderman) wrote:
I rather expect that the primary use of 9-bit byte values would have been to initialize bytes to binary values, rather than to deal in characters: I've never seen any character encoding that speaks of 9-bit character values, until I just found http://tools.ietf.org/html/rfc4042 via a Google search.
Did you notice the release date of that RFC?
/Bo Lindbergh
On Sat, Oct 25, 2008 at 07:23:29PM -0600, Tom Christiansen wrote:
Dear Glenn and Karl\,
+=============================+
| SUMMARY of Exposition Below |
v-----------------------------v
* I fully agree there's a bug.
* I believe Karl has produced a reasonable patch to fix it.
* I wonder what *else* might/should also change in tandem with this estimable amendment so as to:
? avoid evoking or astonishing any hobgoblins of foolish inconsistency (ie: breaking bad expectations)
? what (if any?) backwards-compat story might need spinning (ie: breaking code, albeit cum credible Apologia)
I am seldom in favour of new warnings for existing code, but perhaps use of \NNN, with NNN > 377, in a regexp should trigger a warning, as its behaviour is surprising - not to mention some code out there may rely on the current (buggy) behaviour.
Abigail
Abigail wrote:
On Sat, Oct 25, 2008 at 07:23:29PM -0600, Tom Christiansen wrote:
Dear Glenn and Karl,
+=============================+
| SUMMARY of Exposition Below |
v-----------------------------v
* I fully agree there's a bug.
* I believe Karl has produced a reasonable patch to fix it.
* I wonder what *else* might/should also change in tandem with this estimable amendment so as to:
? avoid evoking or astonishing any hobgoblins of foolish inconsistency (ie: breaking bad expectations)
? what (if any?) backwards-compat story might need spinning (ie: breaking code, albeit cum credible Apologia)
I am seldom in favour of new warnings for existing code, but perhaps use of \NNN, with NNN > 377, in a regexp should trigger a warning, as its behaviour is surprising - not to mention some code out there may rely on the current (buggy) behaviour.
Abigail
So what's the answer? I don't think it should be an error for >= 400.
I think a warning would be OK, and I know enough to put such a warning into regcomp.c. And at compile time one could handle >8-bit machines by testing sizeof(char). But what to do besides warn? Assume they meant Unicode, as character classes do now and as my patch would do? And I don't know the rest of the code. The only other place that grok_oct() is called is from the oct() function. My limited knowledge of how perl works suggests that a call to this is put on the stack when evaluating a double-quotish constant. I don't know enough about the code at this time to know how to add a warning to that.
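For comparison, the oct() builtin (the other caller of grok_oct(), per the above) is at least internally consistent; a sketch:

```perl
use strict;
use warnings;

# oct() treats its argument as octal unless it starts with 0x (hex) or
# 0b (binary), and imposes no one-byte ceiling on the result.
die unless oct("377") == 255;
die unless oct("400") == 256;          # no truncation at \377
die unless oct("777") == 511;
die unless oct("0x100") == 256;
die unless oct("0b100000000") == 256;
print "oct() happily returns values above 255\n";
```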
I think we're all agreed that there is a bug. And that there should be consistency of handling, unlike now. I await your responses.
On approximately 11/3/2008 7:04 PM, came the following characters from the keyboard of karl williamson:
Abigail wrote:
On Sat, Oct 25, 2008 at 07:23:29PM -0600, Tom Christiansen wrote:
Dear Glenn and Karl,
+=============================+
| SUMMARY of Exposition Below |
v-----------------------------v
* I fully agree there's a bug.
* I believe Karl has produced a reasonable patch to fix it.
* I wonder what *else* might/should also change in tandem with this estimable amendment so as to:
? avoid evoking or astonishing any hobgoblins of foolish inconsistency (ie: breaking bad expectations)
? what (if any?) backwards-compat story might need spinning (ie: breaking code, albeit cum credible Apologia)
I am seldom in favour of new warnings for existing code, but perhaps use of \NNN, with NNN > 377, in a regexp should trigger a warning, as its behaviour is surprising - not to mention some code out there may rely on the current (buggy) behaviour.
Abigail
So what's the answer? I don't think it should be an error for >= 400.
I think a warning would be OK, and I know enough to put such a warning into regcomp.c. And at compile time one could handle >8-bit machines by testing sizeof(char). But what to do besides warn? Assume they meant Unicode, as character classes do now and as my patch would do? And I don't know the rest of the code. The only other place that grok_oct() is called is from the oct() function. My limited knowledge of how perl works suggests that a call to this is put on the stack when evaluating a double-quotish constant. I don't know enough about the code at this time to know how to add a warning to that.
I think we're all agreed that there is a bug. And that there should be consistency of handling, unlike now. I await your responses.
I think it should be an error on an 8-bit machine (but you'd need to test MAXCHAR, not sizeof(char), which is 1), because it is a useless way of specifying a Unicode codepoint: useless because it can only be used for a small fraction of the possible codepoints, and the documentation for codepoints is all in hexadecimal, not octal.
But if you don't think it should be an error, then my number-two position is to let it silently be a Unicode codepoint, which is what your patch already does.
I'm not in favor of warnings for this sort of thing, but regexps and quoted strings should handle it the same way, and either both or neither should produce the warning.
I'm quite happy to let the pumpking decide when there are questions of whether or not to add new warnings or errors.
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
2008/10/25 karl williamson <public@khwilliamson.com>:
So we have an existing bug: sometimes \400 matches \400, and sometimes it matches \01\00, depending on what I would call spooky action at a distance. (This means that \777 sometimes already matches \777 now.) I'm trying to get rid of these inconsistencies.
I've been out of touch for a while, so I'm not entirely up to speed on what exactly you have in mind. But I think that you cannot eliminate the inconsistencies in octal/backref escapes. Especially some aspects of the spooky action at a distance, as that interacts with the number of capture buffers in the LAST pattern matched.
So basically the docs should say, if they do not already, "it is *strongly* recommended that you do NOT use octal in any form in a regex" (except perhaps in a charclass definition).
And I would argue that in 5.12 we should make them warn, and then in 5.14 make them illegal or ONLY mean capture buffers.
Consider: /\1/ means the first capture buffer of the previous match, and \17 means the _seventeenth_ capture buffer of the previous match IFF the previous match contains 17 or more capture buffers; otherwise it means \x{F}.
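This context-dependence can be shown directly; a sketch (whether such patterns also warn varies by perl version):

```perl
use strict;
use warnings;

# With no capture groups in the pattern, \17 falls back to octal: chr(017).
die "octal fallback failed" unless "\017" =~ /\17/;

# With a capture group present, \1 is a backreference, never a character.
die "backref failed" unless "abcabc" =~ /(abc)\1/;

print "same-looking escapes, two different meanings\n";
```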
In short: resolving the inconsistencies of octal notation in regexes would appear to be impossible.
Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
2008/11/4 demerphq <demerphq@gmail.com>:
2008/10/25 karl williamson <public@khwilliamson.com>:
So we have an existing bug: sometimes \400 matches \400, and sometimes it matches \01\00, depending on what I would call spooky action at a distance. (This means that \777 sometimes already matches \777 now.) I'm trying to get rid of these inconsistencies.
I've been out of touch for a while, so I'm not entirely up to speed on what exactly you have in mind. But I think that you cannot eliminate the inconsistencies in octal/backref escapes. Especially some aspects of the spooky action at a distance, as that interacts with the number of capture buffers in the LAST pattern matched.
So basically the docs should say, if they do not already, "it is *strongly* recommended that you do NOT use octal in any form in a regex" (except perhaps in a charclass definition).
And I would argue that in 5.12 we should make them warn, and then in 5.14 make them illegal or ONLY mean capture buffers.
I agree with Yves here.
I don't think it's worth changing the meaning of \400 in double-quoted strings, or making it warn. However, in regexps it's too dangerously inconsistent and should be deprecated. First, a deprecation warning seems in order.
However, I see some value in still allowing [\000-\377] character ranges, for example. Do we really want to deprecate that as well? This doesn't seem necessary.
Consider: /\1/ means the first capture buffer of the previous match, and \17 means the _seventeenth_ capture buffer of the previous match IFF the previous match contains 17 or more capture buffers; otherwise it means \x{F}.
In short: resolving the inconsistencies of octal notation in regexes would appear to be impossible.
Error messages are a mess, too. This one is correct:

    $ perl -wE '/\8/'
    Reference to nonexistent group in regex; marked by <-- HERE in m/\8 <-- HERE / at -e line 1.

This one shows clearly that we're using a regexp that matches "\x{1}8", but why is there a duplicated warning? Double magic?

    $ perl -wE '/\18/'
    Illegal octal digit '8' ignored at -e line 1.
    Illegal octal digit '8' ignored at -e line 1.
    Use of uninitialized value $_ in pattern match (m//) at -e line 1.
On approximately 11/12/2008 7:03 AM, came the following characters from the keyboard of Rafael Garcia-Suarez:
However, I see some value in still allowing [\000-\377] character ranges, for example. Do we really want to deprecate that as well? This doesn't seem necessary.
I personally see no value in octal notation now that Unicode uses hex, and most programmers are familiar with it. The octal notation was somewhat useful when there were 9-bit/18-bit machines, and when hex was foreign (what are those letters doing in my numbers?). I daresay that hex is about the second thing most programmers learn, these days. "This is a computer... this is the hexadecimal numbering system... there are lots of computer languages..."
Another approach would be to change the escape from \nnn to \o{nnnnn...}
\o is available (though it would use up one of the 5 escape-letter pairs that are left)
The {} provide explicit delimiters, so octal numbers could then achieve parity with hex in the range of numbers available.
If people think octal is still worth supporting, this looks like a better syntax to support it wholeheartedly.
Python 3.0 has moved to 0onnnnn for its octal integers (zero, oh, digit-sequence) after concluding that leading zeros alone are just too problematical, so the "o" indicator has a precedent (albeit recent), in addition to reasonably intuitively meaning octal to anyone who understands hexadecimal notation and has ever heard of octal. The 0o syntax could also be added to Perl integer constants outside of strings/regices.
The above items could be added to the language immediately, during the deprecation cycle for \nnn octal notation, giving people an extremely simple way to convert their octal constants: inside of strings/regices, insert o after \ and wrap the digits with {}; outside of strings/regices, insert o after the leading 0.
This would allow people that still wish to use octal a way to do so, without ambiguity.
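For the record, the proposed escape is sketchable as it would be used; this assumes a perl that supports \o{} (it was in fact added in perl 5.14, after this discussion):

```perl
use strict;
use warnings;

# Braced octal: the delimiters make the end of the digits explicit,
# with no arbitrary three-digit limit.
my $c = "\o{400}";
die unless length($c) == 1;
die unless ord($c) == 0400;      # octal 400 == 256
die unless $c eq "\x{100}";

# Because the digits are delimited, a following digit is unambiguous:
die unless "\o{40}0" eq " 0";    # chr(040) is a space, then a literal "0"
print "\\o{} gives octal the same clarity as \\x{}\n";
```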
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Glenn Linderman <perl@NevCal.com> wrote:
| On approximately 11/12/2008 7:03 AM, came the following characters
| from the keyboard of Rafael Garcia-Suarez:
«« However, I see some value in still allowing [\000-\377] character »»
«« ranges, for example. Do we really want to deprecate that as well? »»
«« This doesn't seem necessary. »»
| The [below] items could be added to the language immediately, during the
| deprecation cycle for \nnn octal notation, giving people an extremely
| simple way to convert their octal constants: inside of strings/regices,
| insert o after \ and wrap the digits with {}; outside of strings/regices,
| insert o after leading 0.
I have been obliged to change my Perl code at three and *only* three instances over the last TWENTY-ONE YEARS:
1) When log() became a keyword, the inverse of exp(). If parens had been mandatory on function calls of no arguments, a very wise practice, this wouldn't have been a problem. This was in 1988, or perhaps 1989.
2) When perl5 made arrays interpolate in "@strings" unconditionally. This was in 1994. This was the right thing to do.
3) When perl5.010 finally blew away $* (and $#). This too was the right thing to do. This was early this year, 2008, and it was in the following singleton program, written before /m existed:
#!/usr/local/bin/perl
$/ = '';
while (<>) {
    #$* = 1;
    s/^-- ?$//m       if eof;
    s/^[-+]{2}\w+$//m if eof;
    next unless split(/\n/);
    $max = 0;
    #$* = 0;
    for (@_) {
        1 while s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
        $max = ($max > length) ? $max : length;
    }
    $edge = "+" . "-" x ($max + 2) . "+\n";
    print $edge;
    for (@_) {
        printf "| %-${max}s |\n", $_;
    }
    print $edge, "\n";
}
I find the notion of rendering illegal the existing octal syntax of "\33" an *EXTRAÖRDINARILY* bad idea, a position I am prepared to defend at laborious length--and, if necessary, appeal to the Decider-in-Chief, who's always done everything possible to *NOT* break others' code without *VERY* *STRONG* reason. I submit that that very high bar has *NOT* been met; far from it. I'm rather hoping I shan't have to do any of that, but I certainly shall if I must.
There's no reason at all to delete it: because regexes have \g{1} now, and strings need never be written "\333" if you mean "\33" . "3".
There is GREAT reason *not* to delete it, as the quantity of code you would see casually rendered illegal is incomprehensibly large, and the work involved in updating code, databases, and config files, and in educating programmers and users, incalculably great. To add insult to injury, this work you would see thrust upon others, not taken on yourself.
There is nothing fundamentally broken here, as there was for $*. This is trying to create a language where it is impossible to "think bad thoughts". One cannot succeed at that.
| I personally see no value in octal notation now that Unicode uses hex,

Good to see the prefatory warning that this is your *personal* view. :-)

As for "Unicode using hex": me, I've always thought of it as using bits. Rather, I think of the various standards specifying code points in the U+XXXXXX notation to mean the code point at that hexadecimal number. Not the same thing at all. That's why I always write

    sub uchar(_) { pack("U*", shift()) }
because that way all of these
say "chr $_ is " => uchar for 181, 223, 231, 240, 241, 254;
say "chr $_ is " => uchar for 0265, 0337, 0347, 0360, 0361, 0376;
say "chr $_ is " => uchar for 0xb5, 0xdf, 0xe7, 0xf0, 0xf1, 0xfe;
say "chr $_ is " => uchar for 0b10110101, 0b11011111, 0b11100111,
                              0b11110000, 0b11110001, 0b11111110;
correctly say:
chr 181 is µ
chr 223 is ß
chr 231 is ç
chr 240 is ð
chr 241 is ñ
chr 254 is þ
and similarly
say "uc ", uchar, " is ", uc uchar for 181, 0xDF, 0347, 3*2**4*5, 0361, 0b11111110;
says
uc µ is M
uc ß is SS
uc ç is Ç
uc ð is Ð
uc ñ is Ñ
uc þ is Þ
Because I'd be really annoyed if
sub uchar(_) { pack("U*", hex shift()) }
say "chr $_ has ord " => ord uchar for 181, 223, 231, 240, 241, 254;
were giving me answers like:
chr 181 has ord 385
chr 223 has ord 547
chr 231 has ord 561
chr 240 has ord 576
chr 241 has ord 577
chr 254 has ord 596
| and most programmers are familiar with it. [···] I daresay that hex
| is about the second thing most programmers learn, these days. "This
| is a computer... this is the hexadecimal numbering system... there
| are lots of computer languages..."
Hm\, ok. If you say so. Hadn't noticed it myself.
| Another approach would be to change the escape from \nnn to | \o{nnnnn...} [···] The {} provide explicit delimiters\, so octal | numbers could then achieve parity with hex in the range of numbers | available. If people think octal is still worth supporting\, this looks | like a better syntax to support it wholeheartedly.
That's not needed, unless you really want to promote octal for Unicode strings. In a pattern, \g{1} now handles the situation you're talking about. For DQ-strings, one can always avoid it.
Type "man ascii"; note that the table given first is octal.
| Python 3.0 has moved to 0onnnnn for its octal integers (zero oh digit-
| sequence) after concluding that leading zeros alone are just too
| problematical, so the "o" indicator has a precedent (albeit recent) in
| addition to reasonably intuitively meaning octal to anyone that
| understands the hexadecimal notation and has ever heard of octal. The
| 0o syntax could also be added to Perl integer constants outside of
| strings/regexes.
My only trouble with the 0o notation is on fonts without cross 0's, and its gratuitous superfluousness.
--tom
--
+------------------------------------------------------------+
|                  SINGULAR              PLURAL              |
+-------------+----------------------------------------------+
| NOMINATIVE  | magnus rex            magni reges            |
| VOCATIVE    | magne rex             magni reges            |
| GENITIVE    | magni regis           magnorum regum         |
| ACCUSATIVE  | magnum regem          magnos reges           |
| DATIVE      | magno regi            magnis regibus         |
| ABLATIVE    | magno rege            magnis regibus         |
| LOCATIVE    | magni regi (or rege)  magnis regibus         |
+-------------+----------------------------------------------+
% man ascii
ASCII(7) OpenBSD Reference Manual ASCII(7)
NAME ascii - octal, hexadecimal and decimal ASCII character sets
DESCRIPTION The octal set:
000 nul  001 soh  002 stx  003 etx  004 eot  005 enq  006 ack  007 bel
010 bs   011 ht   012 nl   013 vt   014 np   015 cr   016 so   017 si
020 dle  021 dc1  022 dc2  023 dc3  024 dc4  025 nak  026 syn  027 etb
030 can  031 em   032 sub  033 esc  034 fs   035 gs   036 rs   037 us
040 sp   041  !   042  "   043  #   044  $   045  %   046  &   047  '
050  (   051  )   052  *   053  +   054  ,   055  -   056  .   057  /
060  0   061  1   062  2   063  3   064  4   065  5   066  6   067  7
070  8   071  9   072  :   073  ;   074  <   075  =   076  >   077  ?
100  @   101  A   102  B   103  C   104  D   105  E   106  F   107  G
110  H   111  I   112  J   113  K   114  L   115  M   116  N   117  O
120  P   121  Q   122  R   123  S   124  T   125  U   126  V   127  W
130  X   131  Y   132  Z   133  [   134  \   135  ]   136  ^   137  _
140  `   141  a   142  b   143  c   144  d   145  e   146  f   147  g
150  h   151  i   152  j   153  k   154  l   155  m   156  n   157  o
160  p   161  q   162  r   163  s   164  t   165  u   166  v   167  w
170  x   171  y   172  z   173  {   174  |   175  }   176  ~   177 del
The hexadecimal set:
00 nul  01 soh  02 stx  03 etx  04 eot  05 enq  06 ack  07 bel
08 bs   09 ht   0a nl   0b vt   0c np   0d cr   0e so   0f si
10 dle  11 dc1  12 dc2  13 dc3  14 dc4  15 nak  16 syn  17 etb
18 can  19 em   1a sub  1b esc  1c fs   1d gs   1e rs   1f us
20 sp   21  !   22  "   23  #   24  $   25  %   26  &   27  '
28  (   29  )   2a  *   2b  +   2c  ,   2d  -   2e  .   2f  /
30  0   31  1   32  2   33  3   34  4   35  5   36  6   37  7
38  8   39  9   3a  :   3b  ;   3c  <   3d  =   3e  >   3f  ?
40  @   41  A   42  B   43  C   44  D   45  E   46  F   47  G
48  H   49  I   4a  J   4b  K   4c  L   4d  M   4e  N   4f  O
50  P   51  Q   52  R   53  S   54  T   55  U   56  V   57  W
58  X   59  Y   5a  Z   5b  [   5c  \   5d  ]   5e  ^   5f  _
60  `   61  a   62  b   63  c   64  d   65  e   66  f   67  g
68  h   69  i   6a  j   6b  k   6c  l   6d  m   6e  n   6f  o
70  p   71  q   72  r   73  s   74  t   75  u   76  v   77  w
78  x   79  y   7a  z   7b  {   7c  |   7d  }   7e  ~   7f del
The decimal set:
  0 nul    1 soh    2 stx    3 etx    4 eot    5 enq    6 ack    7 bel
  8 bs     9 ht    10 nl    11 vt    12 np    13 cr    14 so    15 si
 16 dle   17 dc1   18 dc2   19 dc3   20 dc4   21 nak   22 syn   23 etb
 24 can   25 em    26 sub   27 esc   28 fs    29 gs    30 rs    31 us
 32 sp    33  !    34  "    35  #    36  $    37  %    38  &    39  '
 40  (    41  )    42  *    43  +    44  ,    45  -    46  .    47  /
 48  0    49  1    50  2    51  3    52  4    53  5    54  6    55  7
 56  8    57  9    58  :    59  ;    60  <    61  =    62  >    63  ?
 64  @    65  A    66  B    67  C    68  D    69  E    70  F    71  G
 72  H    73  I    74  J    75  K    76  L    77  M    78  N    79  O
 80  P    81  Q    82  R    83  S    84  T    85  U    86  V    87  W
 88  X    89  Y    90  Z    91  [    92  \    93  ]    94  ^    95  _
 96  `    97  a    98  b    99  c   100  d   101  e   102  f   103  g
104  h   105  i   106  j   107  k   108  l   109  m   110  n   111  o
112  p   113  q   114  r   115  s   116  t   117  u   118  v   119  w
120  x   121  y   122  z   123  {   124  |   125  }   126  ~   127 del
FILES /usr/share/misc/ascii
HISTORY An ascii manual page appeared in Version 2 AT&T UNIX.
OpenBSD 4.4                     May 31, 2007                               2
On Wed, Nov 12, 2008 at 06:43:25PM -0700, Tom Christiansen wrote:
Glenn Linderman <perl@NevCal.com> wrote:
| The [below] items could be added to the language immediately, during the
| deprecation cycle for \nnn octal notation [...]
I find the notion of rendering illegal the existing octal syntax of "\33" is an *EXTRAÖRDINARILY* bad idea, a position I am prepared to defend at laborious length--and, if necessary, appeal to the Decider-in-Chief [...]
I am happy to mark my return to p5p by singing in harmony with Tom C.
Perl's octal escapes are of venerable origin, coming as they do from C -- not the newfangled ANSI and ISO dialects, let alone Bjarne's heresy, but the earliest and purest syntax, which sprang fully-formed from Ken's, Brian's and Dennis's foreheads. Breaking octal escapes would piss off lots of people, and break lots of code, for no sufficiently valuable purpose. -- Chip Salzenberg <chip@pobox.com>
On approximately 11/12/2008 5:43 PM, came the following characters from the keyboard of Tom Christiansen:
Glenn Linderman <perl@NevCal.com> wrote:
I find the notion of rendering illegal the existing octal syntax of "\33" is an *EXTRAÖRDINARILY* bad idea, a position I am prepared to defend at laborious length--and, if necessary, appeal to the Decider-in-Chief, who's always done everything possible to *NOT* break others' code without *VERY* *STRONG* reason. I submit that that very high bar has *NOT* been met; far from it. I'm rather hoping I shan't have to do any of that, but I certainly shall if I must.
Sure, I figured someone would say that. It might as well be you :)
There's no reason at all to delete it: because regexes have \g{1} now, and strings need never be written "\333" if you mean "\33" . "3".
That argument is specious; it is exactly the same as me saying that you don't need to write "\333" if you mean "\x{1b}3".
My understanding is that in a regex, if you have 3 matches, that "\333" might be more ambiguous than you are assuming.
There is GREAT reason *not* to delete it, as the quantity of code you would see casually rendered illegal is incomprehensibly large, with the work involved in updating code, databases, config files, and educating programmers and users incalculably great. To add insult to injury, this work you would see thrust upon others, not taken on yourself.
Yep, that's a great reason.
There is nothing fundamentally broken here, as there was for $*. This is trying to create a language where it is impossible to "think bad thoughts". One cannot succeed at that.
So you wish to convert people to using \g{3}, but if \333 is not outlawed, it is still ambiguous.
| I personally see no value in octal notation now that Unicode uses hex,
  ^^^^^^^^^^                                          ^^^^^^^^
Good to see the prefatory warning that this is your *personal* view. :-)
Yep. I was rather sure that someone would bring up octal notation used for Unix file permission bits, where it is somewhat helpful in reading the bits. But the -rwxrwxrwx notation is better anyway.
As for "Unicode using hex", me, I've always thought of it as using bits. Rather, I think of the various standards specifying code points in the U+XXXXXX notation to mean the code point at that hexadecimal number. Not the same thing at all.
Indeed, it does mean that, but I fail to see the distinction that you laboriously coded. The number is the number regardless of the notation; however, the documentation for the number is in hex, so that form is much easier to find and use. Unlike the ASCII chart, Unicode charts are generally not produced in octal or decimal, only in hex.
| Another approach would be to change the escape from \nnn to
| \o{nnnnn...} [···] The {} provide explicit delimiters, so octal
| numbers could then achieve parity with hex in the range of numbers
| available. If people think octal is still worth supporting, this looks
| like a better syntax to support it wholeheartedly.
That's not needed, unless you really want to promote octal for Unicode strings.
Really, the only thing Unicode has to do with this is that it inspired Perl to support characters with ord > 255. Once that support is there (and it is), it need not be used only for Unicode characters; it can, in fact, be used for binary number sequences, and there exists code that uses it just that way. If the coder of such code were enamored of octal, they might prefer octal notation to express binary values greater than 511 (as well as those less than or equal to 511).
In a pattern, \g{1} now handles the situation you're talking about. For DQ-strings, one can always avoid it.
Indeed, one can always avoid the ambiguous notation, using \g{n} for pattern matches and \x{n} for characters. Note the lack of octal notation. Even if the existing octal notation is left intact, perhaps it should be documented as "not preferred" so that people get the habit of using unambiguous notations. On the other hand, providing an additional notation that is unambiguously octal seems like it could be useful, if there really are people out there that like octal.
Type "man ascii"; note that the table given first is octal.
So what? Google "ASCII chart" and the first hit is asciitable.com, which gives decimal and hex before octal.
| Python 3.0 has moved to 0onnnnn for its octal integers (zero oh digit-
| sequence) after concluding that leading zeros alone are just too
| problematical, so the "o" indicator has a precedent (albeit recent) in
| addition to reasonably intuitively meaning octal to anyone that
| understands the hexadecimal notation and has ever heard of octal. The
| 0o syntax could also be added to Perl integer constants outside of
| strings/regexes.
My only trouble with the 0o notation is on fonts without cross 0's, and its gratuitous superfluousness.
You can pick your font, others can pick theirs, so that seems to be irrelevant. The 0o notation is somewhat superfluous, and was only suggested to be consistent with the two forms of hex notation \x{} and 0x if \o{} were to be invented. It seems to be a more consistent proposal if 0o is included, than if not.
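For reference, the Python 3 precedent behaves like this (a sketch; assumes a python3 on PATH -- there, 0o is the only accepted spelling of an octal literal, and a bare leading zero is a syntax error):

```shell
# Python 3 octal literal and the equivalent explicit-base conversion.
python3 -c 'print(0o33)'             # 27
python3 -c 'print(int("0o33", 0))'   # 27; base 0 honours the 0o prefix
```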
To summarize:
1) There is a real problem in regex notation: it is ambiguous whether \n is octal or a backreference. Certain octal numbers cannot be expressed, depending on the number of backreferences in the regex.
2) A suggestion to add \o{} notation would permit octal numbers to be unambiguously specified in regex notation regardless of the number of backreferences, over the full existing range of supported octal numbers, and would also permit extending the range. The 0o notation seems useful for consistency with the existing hex notations if the \o{} notation is added.
3) Given 2, it becomes possible to deprecate and eventually remove \n as octal notation, either in regex or also in strings; it becomes possible to remove 0n as octal from numeric constants.
Point 2 would help address point 1, providing an alternate octal notation that isn't ambiguous with backrefs. Point 3 would eliminate the ambiguity by making it an error. Point 3 may break existing code, admittedly.
I'd be just as happy removing octal notation from everywhere except oct and %o, but I make these other suggestions because I figure some people still use, and want to use, octal notation.
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
[The body of this message was too large for import; see the attachment in RT.]
2008/11/13 Tom Christiansen \tchrist@​perl\.com:
My understanding is that in a regex, if you have 3 matches, that "\333" might be more ambiguous than you are assuming.
There is GREAT reason *not* to delete it, as the quantity of code you would see casually rendered illegal is incomprehensibly large, with the work involved in updating code, databases, config files, and educating programmers and users incalculably great. To add insult to injury, this work you would see thrust upon others, not taken on yourself.
Yep, that's a great reason.
I'm glad you agree; that easily suffices to shut off the rathole.
And in case it doesn't, the output below will convince anyone that we ***CANNOT*** remove \0ctal notation. Larry would never allow you to break so many people's code. It would be the worst thing Perl has ever done to its users. It verges upon the insane.
First, please separate what Glenn said from what Rafael and I said, which is that it might be a good idea to deprecate octal IN REGULAR EXPRESSIONS.
I spoke perhaps more harshly than I meant originally, which is what kicked this off. I should have said "strongly discouraged" and not "deprecated".
Obviously from a back-compat viewpoint we can't actually remove octal completely FROM THE REGEX ENGINE. At the very least there is a large amount of code that either generates octal sequences or contains them IN REGULAR EXPRESSIONS.
But we sure can say in the docs that "it is recommended that you do not use octal in regular expressions in new code, as it is ambiguous as to how it will be interpreted; especially low-value octal (excepting \0) can easily be mistaken for a backreference".
Grepping for \\\d *ONLY* in the indented code segments of the standard pods:
Oh, c'mon! You of all people must know a whole whack of ways to count them. You don't have to include them all in a mail. Gmail didn't even let me see the full list. The list also is a bit off-topic* as very few of those are actually in regular expressions, and amusingly the second item in your list isn't octal. Illustrating the problem nicely.
Personally I dislike ambiguous syntax and think it should in general be avoided, and that maybe we should do something to make it easier to see when there is ambiguous syntax. And I especially dislike ambiguous syntax that can be made to change meaning by action at a distance. If I concatenate a pattern that contains an octal sequence to a pattern that contains a bunch of capture buffers, the meaning of the "octal" changes. That is bad.
Assuming that grok_oct() consumes at most 3 octal digits, I think we can apply Karl's patch. However I do think we should recommend against using octal IN REGULAR EXPRESSIONS. And should note that while you CAN use octal to represent codepoints up to 511, it is strongly recommended that you don't.
Also I have a concern that Karl's patch merely modifies the behaviour in the regular expression engine. It doesn't do the same for other strings. If it is going to be legal, it should be legal everywhere.
Anyway, there's no need to flood the list with grep output or proclaim that if people don't get your point that you will appeal to the BDFL. We are all nice rational people here and in general if you point out the flaws in our logic we will admit it. And you have made your point, and would have made your point regardless of the hyperbole and drama.
Cheers,
yves

* Glenn changed the topic of this subthread somewhat by taking an idea and seeing how far he could run with it. But the original topic was octal IN REGULAR EXPRESSIONS, so let's keep it on that subject.

--
perl -Mre=debug -e "/just|another|perl|hacker/"
On approximately 11/12/2008 11:30 PM, came the following characters from the keyboard of demerphq:
First, please separate what Glenn said from what Rafael and I said, which is that it might be a good idea to deprecate octal IN REGULAR EXPRESSIONS.
I spoke perhaps more harshly than I meant originally, which is what kicked this off. I should have said "strongly discouraged" and not "deprecated".
Obviously from a back-compat viewpoint we can't actually remove octal completely FROM THE REGEX ENGINE. At the very least there is a large amount of code that either generates octal sequences or contains them IN REGULAR EXPRESSIONS.
But we sure can say in the docs that "it is recommended that you do not use octal in regular expressions in new code, as it is ambiguous as to how it will be interpreted; especially low-value octal (excepting \0) can easily be mistaken for a backreference".
So providing an alternate octal syntax, such as \o{n}, might be a nice way of encouraging the avoidance of ambiguity, while providing an alternative that enhances the ability to use octal notation for those that like it. Suggesting not using the current octal notation forces people to convert bases, which may not be a pleasant choice for them.
Oh, c'mon! You of all people must know a whole whack of ways to count them. You don't have to include them all in a mail. Gmail didn't even let me see the full list. The list also is a bit off-topic* as very few of those are actually in regular expressions, and amusingly the second item in your list isn't octal. Illustrating the problem nicely.
Yes, that is illustrative. And that one is not the only one in Tom's list that is a backref, either.
* Glenn changed the topic of this subthread somewhat by taking an idea and seeing how far he could run with it. But the original topic was octal IN REGULAR EXPRESSIONS, so let's keep it on that subject.
Exactly. Exploring the boundaries of an idea can be educational, which can help make better decisions.
I remain a proponent of adding \o{n} and 0onnnn notations to perl, because they add capability to octal notation for people that like and use octal, and for the few situations where octal is more interpretable than hex; they add consistency to the language (compared to hex and binary notations); as well, the notation would allow coders to remove ambiguity from regex notation.
Clearly, deprecating or removing the existing octal syntax (ambiguous in regex notation), whether in perl as a whole or only within regexes, would force some people to change code if they choose to upgrade to that version of perl.
As Chip mentioned off-line, perhaps I am looking for a "use strict" option that would allow people to choose to be forced to change such code. Or a "use re" option that would do similarly. Or a "use re" option that would warn only when the notation actually is ambiguous (i.e. count the captures, and warn about \n notation that is in the range of the number of captures. Of course, \8 and \9 are not ambiguous, but \10 could be, etc.). Documenting such an option right along with regex syntax, and highlighting the ambiguity, could help convince people to either use a new octal notation, or use the ambiguity detector, or both.
Saying "don't do that" without offering a palatable alternative isn't always very effective. With an alternative syntax, I'm sure Tom could write a regex in octal notation to convert existing octal notation in existing regexes in existing documentation to the new notation. But watch out for those backrefs, Tom! :)
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
demerphq wrote:
2008/11/13 Tom Christiansen <tchrist@perl.com>:
My understanding is that in a regex, if you have 3 matches, that "\333" might be more ambiguous than you are assuming.
There is GREAT reason *not* to delete it, as the quantity of code you would see casually rendered illegal is incomprehensibly large, with the work involved in updating code, databases, config files, and educating programmers and users incalculably great. To add insult to injury, this work you would see thrust upon others, not taken on yourself. Yep, that's a great reason. I'm glad you agree; that easily suffices to shut off the rathole.
And in case it doesn't, the output below will convince anyone that we ***CANNOT*** remove \0ctal notation. Larry would never allow you to break so many people's code. It would be the worst thing Perl has ever done to its users. It verges upon the insane.
First, please separate what Glenn said from what Rafael and I said, which is that it might be a good idea to deprecate octal IN REGULAR EXPRESSIONS.
I spoke perhaps more harshly than I meant originally, which is what kicked this off. I should have said "strongly discouraged" and not "deprecated".
Obviously from a back-compat viewpoint we can't actually remove octal completely FROM THE REGEX ENGINE. At the very least there is a large amount of code that either generates octal sequences or contains them IN REGULAR EXPRESSIONS.
But we sure can say in the docs that "it is recommended that you do not use octal in regular expressions in new code, as it is ambiguous as to how it will be interpreted; especially low-value octal (excepting \0) can easily be mistaken for a backreference".
Grepping for \\\d *ONLY* in the indented code segments of the standard pods:
Oh, c'mon! You of all people must know a whole whack of ways to count them. You don't have to include them all in a mail. Gmail didn't even let me see the full list. The list also is a bit off-topic* as very few of those are actually in regular expressions, and amusingly the second item in your list isn't octal. Illustrating the problem nicely.
Personally I dislike ambiguous syntax and think it should in general be avoided, and that maybe we should do something to make it easier to see when there is ambiguous syntax. And I especially dislike ambiguous syntax that can be made to change meaning by action at a distance. If I concatenate a pattern that contains an octal sequence to a pattern that contains a bunch of capture buffers, the meaning of the "octal" changes. That is bad.
Assuming that grok_oct() consumes at most 3 octal digits, I think we can apply Karl's patch. However I do think we should recommend against using octal IN REGULAR EXPRESSIONS. And should note that while you CAN use octal to represent codepoints up to 511, it is strongly recommended that you don't.
Also I have a concern that Karl's patch merely modifies the behaviour in the regular expression engine. It doesn't do the same for other strings. If it is going to be legal, it should be legal everywhere.
grok_oct() itself consumes as many octal digits as there are in its parameter, as long as the result doesn't overflow a UV. It is used for general-purpose octal conversion, such as from the oct() function.
My patch was to bring consistency to the handling of \400-\777. Outside re's, putting them into a string variable will cause the string to be converted to utf8, and so they will be converted into two utf8 bytes as part of that string. Similarly, using any of these octal values in an re charclass will cause the re to be converted to utf8, and will match the corresponding Unicode code point. But when values in this range appear in an re outside a charclass, there is an inconsistency. On an 8-bit character machine (if there aren't 256 or so parenthetical subexpressions in the re) they will match a two-character sequence, but not the same utf8 sequence matched if they had instead appeared in a charclass. I'm not sure what would happen on a 9-bit machine. It might very well be what Glenn suggests, the corresponding 9 bits.
Tom has pointed out that \777 is a reserved value in some contexts.
It seems to me to be a bad idea to remove acceptance of octal numbers in re's.
It seems like a good idea to add something to the language so one can express them unambiguously. Even I with my limited knowledge of regcomp.c could do it easily (fools rush in...).
And it seems like an even better idea to handle them consistently. I see two ways to do that: 1) accept my patch; or 2) forbid or warn about the use of those larger than a single character in the machine architecture, in both strings and re's, including char classes.
Perhaps I've forgotten something in this thread. If so\, I'm sorry.
2008/11/13 karl williamson <public@khwilliamson.com>:
demerphq wrote: [snip]
Assuming that grok_oct() consumes at most 3 octal digits, I think we can apply Karl's patch. However I do think we should recommend against using octal IN REGULAR EXPRESSIONS. And should note that while you CAN use octal to represent codepoints up to 511, it is strongly recommended that you don't.
Also I have a concern that Karl's patch merely modifies the behaviour in the regular expression engine. It doesn't do the same for other strings. If it is going to be legal, it should be legal everywhere.
grok_oct() itself consumes as many octal digits as there are in its parameter, as long as the result doesn't overflow a UV. It is used for general-purpose octal conversion, such as from the oct() function.
Somewhere, though, we have to have a limit on the number of digits, don't we?
(I'm very tired right now and haven't looked)
My patch was to bring consistency to the handling of \400-\777. Outside re's, putting them into a string variable will cause the string to be converted to utf8, and so they will be converted into two utf8 bytes as part of that string. Similarly, using any of these octal values in an re charclass will cause the re to be converted to utf8, and will match the corresponding Unicode code point. But when values in this range appear in an re outside a charclass, there is an inconsistency. On an 8-bit character machine (if there aren't 256 or so parenthetical subexpressions in the re) they will match a two-character sequence, but not the same utf8 sequence matched if they had instead appeared in a charclass. I'm not sure what would happen on a 9-bit machine. It might very well be what Glenn suggests, the corresponding 9 bits.
Tom has pointed out that \777 is a reserved value in some contexts.
Oh? I missed that.
It seems to me to be a bad idea to remove acceptance of octal numbers in re's.
Yes, I think that's been well demonstrated.
It seems like a good idea to add something to the language so one can express them unambiguously. Even I with my limited knowledge of regcomp.c could do it easily (fools rush in...).
Sometimes being able to do something is simply not having the fear that it might be too hard. :-)
And it seems like an even better idea to handle them consistently. I see two ways to do that 1) accept my patch; or
That's pretty much a given. I just haven't had the time yet.
And well while it was being contentiously debated I wanted to wait and see a bit. :-)
2) forbid or warn about the use of those larger than a single character in the machine architecture in both strings and re's\, including char classes.
I need to think about this one.
Perhaps I've forgotten something in this thread. If so\, I'm sorry.
Please don't be sorry. For me you are a welcome breath of fresh air. It's wonderful to have you on board.
Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
demerphq wrote:
2008/11/13 karl williamson <public@khwilliamson.com>:
demerphq wrote: [snip]
Assuming that grok_oct() consumes at most 3 octal digits, I think we can apply Karl's patch. However I do think we should recommend against using octal IN REGULAR EXPRESSIONS. And should note that while you CAN use octal to represent codepoints up to 511, it is strongly recommended that you don't.
Also I have a concern that Karl's patch merely modifies the behaviour in the regular expression engine. It doesn't do the same for other strings. If it is going to be legal, it should be legal everywhere.
grok_oct() itself consumes as many octal digits as there are in its parameter, as long as the result doesn't overflow a UV. It is used for general-purpose octal conversion, such as from the oct() function.
Somewhere, though, we have to have a limit on the number of digits, don't we?
(I'm very tired right now and haven't looked)
You pass it a maximum length, and regcomp.c passes it 3. The oct function passes it the length the string actually is.
My patch was to bring consistency to the handling of \400-\777. Outside re's, putting them into a string variable will cause the string to be converted to utf8, and so they will be converted into two utf8 bytes as part of that string. Similarly, using any of these octal values in an re charclass will cause the re to be converted to utf8, and will match the corresponding Unicode code point. But when values in this range appear in an re outside a charclass, there is an inconsistency. On an 8-bit character machine (if there aren't 256 or so parenthetical subexpressions in the re) they will match a two-character sequence, but not the same utf8 sequence matched if they had instead appeared in a charclass. I'm not sure what would happen on a 9-bit machine. It might very well be what Glenn suggests, the corresponding 9 bits.
Tom has pointed out that \777 is a reserved value in some contexts.
Oh? I missed that.
It seems to me to be a bad idea to remove acceptance of octal numbers in re's.
Yes, I think that's been well demonstrated.
It seems like a good idea to add something to the language so one can express them unambiguously. Even I with my limited knowledge of regcomp.c could do it easily (fools rush in...).
Sometimes being able to do something is simply not having the fear that it might be too hard. :-)
And it seems like an even better idea to handle them consistently. I see two ways to do that 1) accept my patch; or
That's pretty much a given. I just haven't had the time yet.
And well while it was being contentiously debated I wanted to wait and see a bit. :-)
2) forbid or warn about the use of those larger than a single character in the machine architecture in both strings and re's\, including char classes.
I need to think about this one.
Perhaps I've forgotten something in this thread. If so\, I'm sorry.
Please don't be sorry. For me you are a welcome breath of fresh air. It's wonderful to have you on board.
Yves
Thank you
On Tue, Oct 28, 2008 at 01:38:52PM +0100, Bo Lindbergh wrote:
In article <49063069.8090001@NevCal.com>, perl@NevCal.com (Glenn Linderman) wrote:
I rather expect that the primary use of 9-bit byte values would have been to initialize bytes to binary values, rather than deal in characters: I've never seen any character encodings that speak of 9-bit character values, until I just found http://tools.ietf.org/html/rfc4042 via a Google search.
Did you notice the release date of that RFC?
Classic pwnage. Next up: Socket::CarrierPigeon -- Chip Salzenberg <chip@pobox.com>
Replying to Chip Salzenberg's message of "Wed, 12 Nov 2008 18:18:57 PST" and to Karl Williamson's of "Thu, 13 Nov 2008 11:38:48 MST":
SUMMARY:
* There exist in octal character notation both implementation bugs as well as built-in, by-design bugs, particularly when used in regular expressions.
* A few of these we've brought on ourselves, because we relaxed the octal-char definition in ways that the designers of these things never did, and so some of our troubles with them are our own fault.
* The implementation bugs we can fix, if we're careful and consistent, but the design bugs we cannot.
* Nor can we eliminate the notation altogether, due to the existing massive code base that relies upon it.
* The best we can do is generate, under certain circumstances, a warning related to an ambiguous \XXX being interpreted as either a backreference or a character.
That's probably as far as many people may care to read\, and that's fine.
However, I do provide new info below that comes straight from the horse's mouth about the historical ambiguity--and I mean those horses once stabled at Murray Hill, not at JPL.
First\, what came before:
:rafael> I don't think it's worth changing the meaning of \400 in
:rafael> double quoted strings, or making it warn. However, in
:rafael> regexps, it's too dangerously inconsistent and should be
:rafael> deprecated. First, a deprecation warning seems in order.

:rafael> However, I see some value in still allowing [\000-\377]
:rafael> character ranges, for example. Do we really want to
:rafael> deprecate that as well? This doesn't seem necessary.

:yves>> Consider /\1/ means the first capture buffer of the previous
:yves>> match, \17 means the _seventeenth_ capture buffer of the
:yves>> previous match IFF the previous match contains 17 or more
:yves>> capture buffers, otherwise it means \x{F}.

:yves>> In short: resolving the inconsistencies in octal notation in
:yves>> regex notation would appear to be impossible.

:rafael> Error messages are a mess, too. This one is correct:
:rafael>     $ perl -wE '/\8/'
:rafael>     Reference to nonexistent group in regex; marked by <-- HERE
:rafael>     in m/\8 <-- HERE / at -e line 1.

:rafael> This one shows clearly that we're using a regexp that matches
:rafael> "\x{1}8", but why is there a duplicated warning? Double magic?

:rafael>     $ perl -wE '/\18/'
:rafael>     Illegal octal digit '8' ignored at -e line 1.
:rafael>     Illegal octal digit '8' ignored at -e line 1.
And also:
In-Reply-To: Chip's of "Wed\, 12 Nov 2008 18:18:57 PST." \20081113021857\.GJ2062@​tytlal\.topaz\.cx
glenn>>> The [below] items could be added to the language immediately\, glenn>>> during the deprecation cycle for \nnn octal notation [...]
tchrist>> I find the notion of rendering illegal the existing octal tchrist>> syntax of "\33" is an *EXTRAÖRDINARILY* bad idea\, a position I tchrist>> am prepared to defend at laborious length--and\, if necessary\, tchrist>> appeal to the Decider-in-Chief [...]
chip> I am happy to mark my return to p5p by singing in harmony with chip> Tom C.
chip> Perl's octal escapes are of venerable origin\, coming as they do chip> from C -- not the newfangled ANSI and ISO dialects\, let alone chip> Bjarne's heresy\, but the earliest and purest syntax\, which sprang chip> fully-formed from Ken's\, Brian's and Dennis's foreheads. Breaking chip> octal escapes would piss off lots of people\, and break lots of chip> code\, for no sufficiently valuable purpose.
I'm at USENIX right now\, and while Ken and Dennis aren't here\, Andrew Hume *is*. Andrew long worked in the fabled research group there at Murray Hill\, along with Brian and Rob and the rest of that seminal crew who charted much of this out. Andrew wrote the first Plan9 grep program\, gre(1)\, which was interesting because it internally broke up the pattern into nice DFA parts and unnice backtracking parts and attacked them separately. Rob and Ken later wrote purely DFA versions (no backtracking\, no backreferencing) when they added UTF-8 support.
So absent Ken\, Andrew is probably the next best to ask this of\, as he is particularly well-versed with regexes in all aspects: historical\, current\, standardized\, etc. It's he whom we refer to in the Camel's pattern-matching section when we write in the footnote:
It has been said(*) that programs that write programs are the happiest programs in the world.
* By Andrew Hume\, the famous Unix philosopher.
I've just come from speaking with Andrew about all this\, to whom I posed these questions:
1. *Why* did Ken (et alios) select the same notation for backreferences in regex as for octally notated characters?
2. *How* did you guys cope with its inherent ambiguity?
Andrew said that they coped with the ambiguity mostly by always requiring exactly 3 octal digits for characters. You couldn't say \33 for ESC; you had to say \033. If someone wanted the third capture buffer\, they wrote \3; if they wanted ETX\, they wrote \003; \3 and \003 therefore *meant* *different* *things* in regexes\, and the first was disallowed in strings.
Andrew admits that this is not a perfect fix\, as a theoretical hole remains\, but he asserts that in practice\, forcing 3 digits for octal chars covered *almost* all the real-world situations where ambiguity might in practice raise its heisenhead. Although the early pattern-matchers only had \1 .. \9 for captures\, later ones dispensed with this restriction. But still the 3-digit rule seemed safe enough.
So that hole was deemed small enough\, and also infrequent and unlikely (at least in non-program-generated programs)\, that Ken&Co. just lived with it\, preferring clarity and brevity (simple to read and write) over a more complex yet bullet-proof notation.
Andrew said\, sure\, it's a bit messy\, or untidy\, but if you're looking for pristine perfection\, you're looking for the wrong thing. Or something like that.
The only exception to this was \0\, which saw frequent enough use that making folks always specify \000 to mean NUL was deemed unduly onerous. Also\, the original pattern-matchers didn't handle nulls\, plus some of them treated \0 as "the whole match"\, much as we now use (?0) to recurse on the whole pattern.
One last thing: Andrew\, upon being told about the TRIE regex optimization\, suggests we might look into splay trees for this instead. He thinks they have properties that might make them even faster/smaller\, but says we'd have to benchmark the two carefully\, because it was just an informed hunch.
Now Henry isn't here\, so I can't ask him about that source of his that Larry long ago started out from. Important aspects of that include that Henry admitted only \1 .. \9 for backrefs *AND* that the 3-digit octal-character backslash escapes had already been processed by the time the regex compiler had to think about things. That means it didn't have to think about both. This is somewhat how \U is handled during variable interpolation\, not by the regex compiler.
Some of the Spencerian sources and derivatives are available at
http://arglist.com/regex/
Some can be quite educative.
One thing I found especially amusing was this change log comment:
Fix for a serious bug that affected REs using many [] (including REG_ICASE REs because of the way they are implemented)\, *sometimes*\, depending on memory-allocation patterns.
Sound familiar\, anybody :-) [HINT: think of /(\337)\1/i ]
You can look up more on the history of regexes\, from Ken's original 1968 paper to Rob and Ken's 1992 spec'ing out of UTF-8\, at:
http://swtch.com/~rsc/regexp/
Historical sources of interest here include
Ken's original paper to CACM\, 4 dense pages: http://doi.acm.org/10.1145/363347.363387
Ken's UTF-8 version of grep\, w/o backtracking: http://swtch.com/usr/local/plan9/src/cmd/grep/
Rob's regexp (no backtracking) library that handles UTF-8: http://swtch.com/plan9port/unix/ Its section 3 manpage: http://swtch.com/plan9port/unix/man/regexp93.html Its section 7 manpage: http://swtch.com/plan9port/unix/man/regexp97.html Its code: http://swtch.com/plan9port/unix/libregexp9.tgz
Rob's paper on Structured Regular Expressions: http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf
Rob's "sam" editor http://netlib.bell-labs.com/sys/doc/sam/sam.html
Code to implement Perl's regexp rules: http://swtch.com/~rsc/regexp/nfa-perl.y.txt
One thing I found amusing in Rob's sam paper was:
The regular expression code in sam is an interpreted\, rather than compiled on-the-fly\, implementation of Thompson's non-deterministic finite automaton algorithm.[12] The syntax and semantics of the expressions are as in the UNIX program egrep\, including alternation\, closures\, character classes\, and so on. The only changes in the notation are two additions: \n is translated to\, and matches\, a newline character\, and @ matches any character. In egrep\, the character . matches any character except newline\, and in sam the same rule seemed safest\, to prevent idioms like .* from spanning newlines. Egrep expressions are arguably too complicated for an interactive editor -- certainly it would make sense if all the special characters were two- character sequences\, so that most of the punctuation characters wouldn't have peculiar meanings -- but for an interesting command language\, full regular expressions are necessary\, and egrep defines the full regular expression syntax for UNIX programs. Also\, it seemed superfluous to define a new syntax\, since various UNIX programs (ed\, egrep and vi) define too many already.
There's a bunch going on with standardization\, widechars\, utf-8\, etc\, right now. If only UTF-8 had been around earlier ("What\, 1992 isn't early enough?")\, a lot of trouble would have been averted. That Perl settled on UTF-8 internally early on was applauded by the Association's current standards rep as clearly the right way to go.
It's really sad that the C std committee looks likely to accept Microsoft's char16 datatype for wide characters. This locks you into UCS-2/UTF-16\, which means surrogates to get off the primary plane\, and a very long/bad recovery if you poke your head in the wrong place in the stream. This is going to make problems for people. Java has the problem. EXIF has the problem.
And now on to Karl's message.
karl> yves wrote:
yves>> 2008/11/13 Tom Christiansen \tchrist@​perl\.com:
glenn>>>> My understanding is that in a regex\, if you have 3 matches\, glenn>>>> that "\333" might be more ambiguous than you are assuming.
It could mean
\g{3} followed by "33"
ubyte 219: "@{ [pack C => 219] }"
uchar 219: "@{ [pack U => 219] }"
tchrist>>>>> There is GREAT reason *not* to delete it\, as the quantity of tchrist>>>>> code you would see casually rendered illegal is tchrist>>>>> incomprehensibly large\, with the work involved in updating tchrist>>>>> code\, databases\, config files\, and educating programmers and tchrist>>>>> users incalculably great. To add insult to injury\, this work tchrist>>>>> you would see thrust upon others\, not taken on yourself.
glenn>>>> Yep\, that's a great reason.
tchrist>>> I'm glad you agree\, easily suffices to shut off the rathole.
tchrist>>> And in case it doesn't\, the output below will convince anyone tchrist>>> that we ***CANNOT*** remove \0ctal notation. Larry would never tchrist>>> allow you to break so many people's code. It would be the worst tchrist>>> thing Perl has ever done to its users. It verges upon the tchrist>>> insane.
yves>> First please separate what Glenn said from what Rafael and I said\, yves>> which is that it might be a good idea to deprecate octal IN REGULAR yves>> EXPRESSIONS.
I believe I have now done as you have asked. It's useful for more than just accuracy of attribution\, too.
yves>> I spoke perhaps more harshly than I meant originally\, which yves>> is what kicked this off. I should have said "strongly yves>> discouraged" and not "deprecated".
It was indeed Glenn's suggestion that these first be deprecated and then in the release following\, AND THEN REMOVED ALTOGETHER\, that I found to be utterly untenable.
His later messages seem to say that he was just flying a strawman to see how far he could push it\, testing the boundaries via hypotheticals. If so\, that seems to say it wasn't an honest suggestion made in good faith\, just something there to "stir the bucket" (or the hornets' nest). Perhaps he finds this useful as a general principle; but here\, I do not.
yves>> Obviously from a back compat viewpoint we can't actually yves>> remove octal completely FROM THE REGEX ENGINE. At the very yves>> least there is a large amount of code that either generates yves>> octal sequences or contains them IN REGULAR EXPRESSSIONS.
You say "obviously"\, and I think it obvious\, too\, but then either Glenn did not\, or he was not arguing in good faith\, only secretly playing devil's advocate. That's far too complicated for me.
I take what people say for what they mean and vice versa\, without attempting doublethink\, triplethink\, etc. It's not my strength\, and it's a waste of time to try to figure out what people mean when they are intentionally saying things they DON'T mean without labelling those statements as clearly of that nature.
I don't appreciate it\, and that is the very most courteous way I can think of expressing a sentiment I have plenty of less courteous words for.
yves>> But we sure can say in the docs that "it is recommended that yves>> you do not use octal in regular expressions in new code as it yves>> is ambiguous as to how they will be interpreted\, especially yves>> low value octal (excepting \0) can easily be mistaken for a yves>> backreference".
It seems that we got into trouble by allowing one- and two-digit octal character escapes. This is not something that the original designers (Ken; Dennis and Brian; Rob) ever did\, and by requiring three digits they circumvented much of our trouble.
Perhaps what should happen is that we should encourage 3-digit octal notation only.
tchrist>> Grepping for \\\d *ONLY* in the indented code segments of tchrist>> the standard pods:
yves>> Oh cmon! You of all people must know a whole whack of ways to yves>> count them.
I did this because I'd taken Glenn at his literal word\, and I wanted everyone to see how extensive this use was. I also wanted to demonstrate the historical difference that seems to have cropped up as we went from C programmers as our main programmer base\, to non-C-programmers. This meant that we started to get 1- and 2-digit octal escapes where we'd never before had them.
yves>> The list also is a bit off-topic* as very few of those are yves>> actually in regular expressions\, and amusingly the second yves>> item in your list isn't octal. Illustrating the problem yves>> nicely.
I was perfectly aware it was a reference. I didn't dump the data on you dumbly. I could have summarized it\, described trends\, but that doesn't have the impact of seeing the raw data\, which is what I was aiming for to bat down the crazy idea of breaking uncountably many programs. Having to change my code due to a Perl upgrade thrice in 21 years is nothing like what Glenn feigned contemplating.
yves>> Personally I dislike ambiguous syntax
As do I. Larry is actually a lot more comfortable with it than I am\, because he realizes due to his work with natural language that humans are good with ambiguity and that one can\, if one is clever enough\, use surrounding clues to figure out what was meant.
yves>> and think it should in general be avoided\, and that maybe we yves>> should do something to make it easier to see when there is yves>> ambiguous syntax.
That seems pretty reasonable\, too.
yves>> And I especially dislike ambiguous syntax that can be made to yves>> change meaning by action at a distance. If I concatenate a yves>> pattern that contains an octal sequence to a pattern that yves>> contains a bunch of capture buffers the meaning of the "octal" yves>> changes. That is bad.
Yes\, it is bad\, but there are worse problems. You can't do it in a general and useful way at all\, because which capture buffer means what is going to renumber. The new \g{-1} helps a good bit here\, as does \g{BUFNAME}\, but it's still a sticky problem requiring more overall knowledge than you'd like it to require.
yves>> Assuming that grok_oct() consumes at most 3 octal digits\, I think yves>> we can apply Karl's patch. However I do think we should recommend yves>> against using octal IN REGULAR EXPRESSIONS. And should note that yves>> while you CAN use octal to represent codepoints up to 511 it is yves>> strongly recommended that you don't.
I'd like to see three-digit octal always mean an 8-bit character\, and discourage things like \3 and \33. I don't think we should bother extending octal to allow for code points above "\377". That it "works" at all there is a problem.
I included the older code because you'll see a pattern in it. For example:
scripts/badman: grep(/[^\001]+\001[^\001]+\001${ext}\001/ || /[^\001]+${ext}\001/\,
scripts/badman: if ( /^([^\001]*)\002/ || /^([^\002]*)\001/ ) {
scripts/badman: if (/\001/) {
scripts/badman: if ($last eq "\033") {
scripts/badman: last if $idx_topic eq "\004";
scripts/badman: last if $idx_topic eq "\004" || $idx_topic eq '0';
scripts/badman: s/\033\+/\001/;
scripts/badman: s/\033\\,/\002/;
scripts/badman: @tmplist = split(/\002/\, $entry);
scripts/badman: $winsize = "\0" x 8;
Notice something? Being once-and-always a C programmer at heart\, by habit I always used to use a single digit for a NUL *only* and 3 digits otherwise. The Camel's use of
camel3-examples:if ( ("fred" & "\1\2\3\4") =~ /[^\0]/ ) { ... }
is just something I do *not* like. It does no good to warn in strings\, where there are no backrefs (save in s///)\, but I'm not sure what level of warning is appropriate for regexes.
Speaking of which\, isn't it time to tell the s/(...)/\1\1/g people to get their acts together?
yves>> Also I have a concern that Karl's patch merely modifies the yves>> behaviour in the regular expression engine. It doesn't do the yves>> same for other strings. If it is going to be legal it should yves>> be legal everywhere.
Yes\, this is a real issue\, the first one I raised.
karl> grok_oct() itself consumes as many octal digits as there are karl> in its parameter\, as long as the result doesn't overflow a UV. karl> It is used for general purpose octal conversion\, such as from karl> the oct() function.
Hm.
The oct() function is already bizarre enough. It's weird that it takes a number of any sort\, converts it to decimal digits\, then treats those digits as octal\, and returns the new value. Even calling it dec2oct might have helped.
oct("0755") oct("755")
ok\, but
oct(755) oct(700 + 50 + 5)
doing the same thing is\, well\, just not what people think it does.
karl> Tom has pointed out that \777 is a reserved value in some karl> contexts.
It's the $/ issue.
karl> It seems to me to be a bad idea to remove acceptance of octal karl> numbers in re's.
And to me. Remember\, I'm the one who gets testy about
$h{date} = "foo"; $h{time} = "bar";
$h{time()} = "oh\, right";
$h{033} = "foo"; $h{27} .= "bar"; # now foobar
$h{"033"} = "not again";
karl> It seems like a good idea to add something to the language so karl> one can express them unambiguously. Even I with my limited karl> knowledge of regcomp.c could do it easily (fools rush in...).
I could live with \o{...} if I had to\, but I'm nervous where it goes.
karl> And it seems like an even better idea to handle them karl> consistently. I see two ways to do that: 1) accept my patch; or karl> 2) forbid or warn about the use of those larger than a single karl> character in the machine architecture in both strings and karl> re's\, including char classes. Perhaps I've forgotten something in karl> this thread. If so\, I'm sorry.
Karl\, you've nothing to be sorry about. Your courtesy\, conscientiousness\, and can-do attitude are very welcome.
--tom
Tom Christiansen wrote:
Replying to Chip Salzenberg's message of "Wed\, 12 Nov 2008 18:18:57 PST" and to Karl Williamson's of "Thu\, 13 Nov 2008 11:38:48 MST":
SUMMARY:
* There exist in octal character notation both implementation bugs as well as built-in\, by-design bugs\, particularly when used in regular expressions.
* A few of these we've brought on ourselves\, because we relaxed the octal-char definition in ways that the designers of these things never did\, and so some of our troubles with them are our own fault.
* The implementation bugs we can fix\, if we're careful and consistent\, but design bugs we cannot.
* Nor can we eliminate the notation altogether\, due to the existing massive code base that relies upon it.
* The best we can do is generate\, under certain circumstances\, a warning related to an ambiguous \XXX being interpreted as either a backreference or a character.
That's probably as far as many people may care to read\, and that's fine.
However\, I do provide new info below that comes straight from the horse's mouth about the historical ambiguity--and I mean those horses once stabled at Murray Hill\, not at JPL.
First\, what came before:
:rafael> I don't think it's worth changing the meaning of \400 in :rafael> double quoted strings\, or making it warn. However\, in :rafael> regexps\, it's too dangerously inconsistent and should be :rafael> deprecated. First\, a deprecation warning seems in order.
:rafael> However\, I see some value in still allowing [\000-\377] :rafael> character ranges\, for example. Do we really want to :rafael> deprecate that as well? This doesn't seem necessary.
:yves>> Consider /\1/ means the first capture buffer of the previous :yves>> match\, \17 means the _seventeenth_ capture buffer of the :yves>> previous match IFF the previous match contains 17 or :yves>> more capture buffers\, otherwise it means \x{F}.
:yves>> In short: resolving the inconsistencies in octal notation in :yves>> regex notation would appear to be impossible.
:rafael> Error messages are a mess\, too. This one is correct: :rafael> $ perl -wE '/\8/' :rafael> Reference to nonexistent group in regex; marked by \<-- HERE in m/\8 :rafael> \<-- HERE / at -e line 1.
:rafael> This one shows clearly that we're using a regexp that matches :rafael> "\x{1}8"\, but why is there a duplicated warning? Double magic?
:rafael> $ perl -wE '/\18/' :rafael> Illegal octal digit '8' ignored at -e line 1. :rafael> Illegal octal digit '8' ignored at -e line 1.
And also:
In-Reply-To: Chip's of "Wed\, 12 Nov 2008 18:18:57 PST." \20081113021857\.GJ2062@​tytlal\.topaz\.cx
glenn>>> The [below] items could be added to the language immediately\, glenn>>> during the deprecation cycle for \nnn octal notation [...]
tchrist>> I find the notion of rendering illegal the existing octal tchrist>> syntax of "\33" is an *EXTRAÖRDINARILY* bad idea\, a position I tchrist>> am prepared to defend at laborious length--and\, if necessary\, tchrist>> appeal to the Decider-in-Chief [...]
chip> I am happy to mark my return to p5p by singing in harmony with chip> Tom C.
chip> Perl's octal escapes are of venerable origin\, coming as they do chip> from C -- not the newfangled ANSI and ISO dialects\, let alone chip> Bjarne's heresy\, but the earliest and purest syntax\, which sprang chip> fully-formed from Ken's\, Brian's and Dennis's foreheads. Breaking chip> octal escapes would piss off lots of people\, and break lots of chip> code\, for no sufficiently valuable purpose.
I'm at USENIX right now\, and while Ken and Dennis aren't here\, Andrew Hume *is*. Andrew long worked in the fabled research group there at Murray Hill\, along with Brian and Rob and the rest of that seminal crew who charted much of this out. Andrew wrote the first Plan9 grep program\, gre(1)\, which was interesting because it internally broke up the pattern into nice DFA parts and unnice backtracking parts and attacked them separately. Rob and Ken later wrote purely DFA versions (no backtracking\, no backreferencing) when they added UTF-8 support.
So absent Ken\, Andrew is probably the next best to ask this of\, as he is particularly well-versed with regexes in all aspects: historical\, current\, standardized\, etc. It's he whom we refer to in the Camel's pattern-matching section when we write in the footnote:
It has been said\(\*\) that programs that write programs are the happiest programs in the world\. \* By Andrew Hume\, the famous Unix philosopher\.
I've just come from speaking with Andrew about all this\, to whom I posed these questions:
1. *Why* did Ken (et alios) select the same notation for backreferences in regex as for octally notated characters?
2. *How* did you guys cope with its inherent ambiguity?
Andrew said that they coped with the ambiguity mostly by always requiring exactly 3 octal digits for characters. You couldn't say \33 for ESC; you had to say \033. If someone wanted the third capture buffer\, they wrote \3; if they wanted ETX\, they wrote \003; \3 and \003 therefore *meant* *different* *things* in regexes\, and the first was disallowed in strings. [snip]
As a point of reference\, the C standard has always allowed 1\, 2\, or 3 octal digits as a character constant (and asymmetrically as many as you want for hex). But\, as an old C programmer\, I just wouldn't think of specifying one without exactly 3 digits. I was somewhat surprised when I was researching acceptable Perl syntax to discover that a leading 0 was not required\, and now I discover that it also wasn't required in C all along. (Although\, I learned C before it was standardized and changed by that. A leading 0 may have been so required before standardization. I now regret throwing my first edition K&R away when the standard came out.) The only character I would likely have used that could be expressed in one octal digit would be BEL\, and I would automatically express it as \007. (I think the \a came along later\, though I may just not have been aware of it.) So C programmers likely will use 3 digits for octal character constants\, for what that's worth.
On Thu\, Nov 13\, 2008 at 11:02:50PM -0700\, karl williamson wrote:
changed by that. A leading 0 may have been so required before standardization. I now regret throwing my first edition K&R away when the standard came out.)
p35 K&R first edition:
"... In addition\, an arbitrary byte-sized bit pattern can be generated by writing
'\ddd'
where ddd is one to three octal digits\, as in:
#define FORMFEED '\014' /* ASCII form feed */
"
Andrew said that using all three digits usually dodged the ambiguity. Perhaps I misconstrued him to mean that you had to use them all\, whereas he may have meant that one should.
He just walked past me\, but I'm not ready to chase down that subject again.
Talking about collation sequences was another big topic here\, but well\, more on that later.
--tom
2008/11/14 Tom Christiansen \tchrist@​perl\.com:
Replying to Chip Salzenberg's message of "Wed\, 12 Nov 2008 18:18:57 PST" and to Karl Williamson's of "Thu\, 13 Nov 2008 11:38:48 MST":
SUMMARY:
* There exist in octal character notation both implementation bugs as well as built-in\, by-design bugs\, particularly when used in regular expressions.
* A few of these we've brought on ourselves\, because we relaxed the octal-char definition in ways that the designers of these things never did\, and so some of our troubles with them are our own fault.
* The implementation bugs we can fix\, if we're careful and consistent\, but design bugs we cannot.
* Nor can we eliminate the notation altogether\, due to the existing massive code base that relies upon it.
Yes\, this is absolutely clear (now). I misspoke when I suggested this.
* The best we can do is generate\, under certain circumstances\, a warning related to an ambiguous \XXX being interpreted as either a backreference or a character.
As you have said\, \g{} makes these much less important.
That's probably as far as many people may care to read\, and that's fine.
However\, I do provide new info below that comes straight from the horse's mouth about the historical ambiguity--and I mean those horses once stabled at Murray Hill\, not at JPL.
First\, what came before:
:rafael> I don't think it's worth changing the meaning of \400 in :rafael> double quoted strings\, or making it warn. However\, in :rafael> regexps\, it's too dangerously inconsistent and should be :rafael> deprecated. First\, a deprecation warning seems in order.
:rafael> However\, I see some value in still allowing [\000-\377] :rafael> character ranges\, for example. Do we really want to :rafael> deprecate that as well? This doesn't seem necessary.
:yves>> Consider /\1/ means the first capture buffer of the previous :yves>> match\, \17 means the _seventeenth_ capture buffer of the :yves>> previous match IFF the previous match contains 17 or :yves>> more capture buffers\, otherwise it means \x{F}.
I misspoke here too. Backrefs are to captures in the current pattern\, not the previous one.
Meaning that the real danger arises when one concatenates patterns\, or programmatically manipulates them. That wasn't a problem for grep or editors or whatnot\, but it is a problem when regexes become integrated into the language as tightly as they are in Perl.
[snip]
:rafael> This one shows clearly that we're using a regexp that matches :rafael> "\x{1}8"\, but why is there a duplicated warning? Double magic?
:rafael> $ perl -wE '/\18/' :rafael> Illegal octal digit '8' ignored at -e line 1. :rafael> Illegal octal digit '8' ignored at -e line 1.
The double warning comes because Perl does two passes\, and if the pattern was /\18\x{100}/ maybe even three.
And each time we try to grok_oct() on the same sequence and so generate the same warning. Anyway\, it's a bug that needs to be fixed. Sigh.
And also:
In-Reply-To: Chip's of "Wed\, 12 Nov 2008 18:18:57 PST." \20081113021857\.GJ2062@​tytlal\.topaz\.cx
glenn>>> The [below] items could be added to the language immediately\, glenn>>> during the deprecation cycle for \nnn octal notation [...]
tchrist>> I find the notion of rendering illegal the existing octal tchrist>> syntax of "\33" is an *EXTRAÖRDINARILY* bad idea\, a position I tchrist>> am prepared to defend at laborious length--and\, if necessary\, tchrist>> appeal to the Decider-in-Chief [...]
chip> I am happy to mark my return to p5p by singing in harmony with chip> Tom C.
chip> Perl's octal escapes are of venerable origin\, coming as they do chip> from C -- not the newfangled ANSI and ISO dialects\, let alone chip> Bjarne's heresy\, but the earliest and purest syntax\, which sprang chip> fully-formed from Ken's\, Brian's and Dennis's foreheads. Breaking chip> octal escapes would piss off lots of people\, and break lots of chip> code\, for no sufficiently valuable purpose.
Don't worry\, both of you. Just pointing out how much could break snapped some sense into my head. Mea culpa and all that.
I'm at USENIX right now\, and while Ken and Dennis aren't here\, Andrew Hume *is*. Andrew long worked in the fabled research group there at Murray Hill\, along with Brian and Rob and the rest of that seminal crew who charted much of this out. Andrew wrote the first Plan9 grep program\, gre(1)\, which was interesting because it internally broke up the pattern into nice DFA parts and unnice backtracking parts and attacked them separately. Rob and Ken later wrote purely DFA versions (no backtracking\, no backreferencing) when they added UTF-8 support.
I'll have to take a look at gre as it sounds like it is right along the lines of what we need. AFAIU we can't go to full DFA construction in perl\, at least not for every pattern\, simply because our patterns support recursive constructs\, which AFAIK cannot be represented as DFAs.
So absent Ken\, Andrew is probably the next best to ask this of\, as he is particularly well-versed with regexes in all aspects: historical\, current\, standardized\, etc. It's he whom we refer to in the Camel's pattern-matching section when we write in the footnote:
It has been said(*) that programs that write programs are the happiest programs in the world.
\* By Andrew Hume\, the famous Unix philosopher\.
It's interesting you quote that\, as it's primarily when programs are writing other programs that the octal/backref problem occurs. There is no ambiguity in octal/backrefs in static patterns\, a given escape sequence is either one or the other. But when you concatenate two patterns together....
[snip]
So that hole was deemed small enough\, and also infrequent and unlikely (at least in non-program-generated programs)\, that Ken&Co. just lived with it\, preferring clarity and brevity (simple to read and write) over a more complex yet bullet-proof notation.
Andrew said\, sure\, it's a bit messy\, or untidy\, but if you're looking for pristine perfection\, you're looking for the wrong thing. Or something like that.
Especially in Perl. :-)
The only exception to this was \0\, which saw frequent enough use that making folks always specify \000 to mean NUL was deemed unduly onerous. Also\, the original pattern-matchers didn't handle nulls\, plus some of them treated \0 as "the whole match"\, much as we now use (?0) to recurse on the whole pattern.
One last thing: Andrew\, upon being told about the TRIE regex optimization\, suggests we might look into splay trees for this instead. He thinks they have properties that might make them even faster/smaller\, but says we'd have to benchmark the two carefully\, because it was just an informed hunch.
Hmm\, maybe it's worth researching that a bit. The trie logic could definitely be improved. We use compressed transition tables when we probably shouldn't\, making each transition significantly more expensive than it should be -- mostly because of the concern that Unicode could make the number of transitions grow explosively large.
Now Henry isn't here, so I can't ask him about the source of his that Larry long ago started out from. Important aspects of that include that Henry admitted only \1 .. \9 for backrefs *AND* that the 3-digit octal-character backslash escapes had already been processed by the time the regex compiler had to think about things. That means it didn't have to think about both. This is somewhat like how \U is handled during variable interpolation, not by the regex compiler.
Some of the Spencerian sources and derivatives are available at
http://arglist.com/regex/
Some of them can be quite educative.
One thing I found especially amusing was this change log comment:
Fix for a serious bug that affected REs using many [] (including REG_ICASE REs because of the way they are implemented)\, *sometimes*\, depending on memory-allocation patterns.
Sound familiar\, anybody :-) [HINT: think of /(\337)\1/i ]
I'm probably too stupid to get this one. Feel up to spelling it out to me offlist?
You can look up more on the history of regexes\, from Ken's original 1968 paper to Rob and Ken's 1992 specking out of UTF-8\, at:
http://swtch.com/~rsc/regexp/
Historical sources of interest here include
Ken's original paper to CACM\, 4 dense pages: http://doi.acm.org/10.1145/363347.363387
Ken's UTF-8 version of grep\, w/o backtracking: http://swtch.com/usr/local/plan9/src/cmd/grep/
Rob's regexp (no backtracking) library that handles UTF-8: http://swtch.com/plan9port/unix/ Its section 3 manpage: http://swtch.com/plan9port/unix/man/regexp93.html Its section 7 manpage: http://swtch.com/plan9port/unix/man/regexp97.html Its code: http://swtch.com/plan9port/unix/libregexp9.tgz
Rob's paper on Structured Regular Expressions: http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf
Rob's "sam" editor http://netlib.bell-labs.com/sys/doc/sam/sam.html
Code to implement Perl's regexp rules: http://swtch.com/~rsc/regexp/nfa-perl.y.txt
Sigh. So much to learn. So little time. The latter sounds interesting\, I haven't looked but i wonder how it handles recursive patterns.
[snip]
tchrist>>> And in case it doesn't\, the output below will convince anyone tchrist>>> that we ***CANNOT*** remove \0ctal notation. Larry would never tchrist>>> allow you to break so many people's code. It would be the worst tchrist>>> thing Perl has ever done to its users. It verges upon the tchrist>>> insane.
yves>> First please separate what Glenn said from what Rafael and I said\, yves>> which is that it might be a good idea to deprecate octal IN REGULAR yves>> EXPRESSIONS.
I apologise for the shouting.
[snip]
yves>> Obviously from a back compat viewpoint we can't actually yves>> remove octal completely FROM THE REGEX ENGINE. At the very yves>> least there is a large amount of code that either generates yves>> octal sequences or contains them IN REGULAR EXPRESSSIONS.
You say "obviously"\, and I think it obvious\, too\, but either Glenn advocate did not or was not arguing in good faith\, only secretly playing devil's advocate. That's far too complicated for me.
Well, at the time I made the suggestion (about the regex engine) I was not thinking clearly. Again, I apologize.
[snip]
yves>> But we sure can say in the docs that "it is recommended that yves>> you do not use octal in regular expressions in new code as it yves>> is ambiguous as to how they will be interpreted, especially yves>> low value octal (excepting \0) can easily be mistaken for a yves>> backreference".
It seems that we got into trouble by allowing one- and two-digit octal character escapes. This is not something that the original designers (Ken; Dennis and Brian; Rob) ever did, and they thereby circumvented much of our trouble.
Perhaps what should happen is that we should encourage 3-digit octal notation only.
At this point, though, the main advantage of using octal at all, and the reason it is used in many places that I have seen, seems to be brevity. So encouraging people to use 3 digits is not really much of a gain, as it means there is no compelling reason to use octal instead of \xHH.
I do think that Glenn did have at least one good point in his mail\, I think he was right when he suggested that not too many of the "newer generation" are comfortable with octal\, outside perhaps *nix sysadmins who seem to absorb it from chmod and related tools.
tchrist>> Grepping for \\\d *ONLY* in the indented code segments of tchrist>> the standard pods:
yves>> Oh cmon! You of all people must know a whole whack of ways to yves>> count them.
I was just mad because your mail was truncated by gmail. Whereas a count along with a few selected items would have made the same point, been shorter, and I would have known for sure that I saw the full content of your mail. TBH I have no idea if there was any commentary after the list. If there was, I never saw it.
[snip]
I was perfectly aware it was a reference. I didn't dump the data on you dumbly. I could have summarized it, described trends, but that doesn't have the impact of seeing the raw data, which is what I was aiming for in order to bat down the crazy idea that would have broken uncountably many programs. Having to change my code due to a Perl upgrade thrice in 21 years is nothing like what Glenn feigned contemplating.
Understood, but hopefully you see the point I mention above in this reply as well. Yes, I could probably use a different mail client. But it seems that there is something about mail programs that makes them particularly hateful, and gmail seems to be the least painful option I have encountered so far. At least for my needs.
yves>> Personally I dislike ambiguous syntax
As do I. Larry is actually a lot more comfortable with it than I am\, because he realizes due to his work with natural language that humans are good with ambiguity and that one can\, if one is clever enough\, use surrounding clues to figure out what was meant.
Yes, and it's one of the cool things about Perl in my book (along with the amazingly well integrated regex features ;-)
yves>> and think it should in general be avoided\, and that maybe we yves>> should do something to make it easier to see when there is yves>> ambiguous syntax.
That seems pretty reasonable\, too.
Yeah, I think that's where this is heading: some kind of regex lint.
yves>> And I especially dislike ambiguous syntax that can be made to yves>> change meaning by action at a distance. If I concatenate a yves>> pattern that contains an octal sequence to a pattern that yves>> contains a bunch of capture buffers the meaning of the "octal" yves>> changes. That is bad.
Yes, it is bad, but there are worse problems. You can't do it at all in a general and useful way, because which capture buffer means what is going to change as the groups renumber. The new \g{-1} helps a good bit here, as does \g{BUFNAME}, but it's still a sticky problem requiring more overall knowledge than you'd like it to require.
Just for the record: the \g{} syntax was added to make it possible to safely use backrefs in generated patterns by eliminating the ambiguity of the old syntax, and to normalize the various capture buffer syntaxes implemented in other languages. The .Net, Python, and Java syntaxes all were/are different (although those implementations that use PCRE now support \g{} too :-), and despite Perl 5.10 supporting them all, the \g{} thing seemed a good idea. The relative backref syntax was specifically added to make it easier to construct patterns that use backrefs. \g{BUFFNAME} actually doesn't help much, although I recall Abigail had some thoughts on how to make it more powerful.
yves>> Assuming that grok_oct() consumes at most 3 octal digits\, I think yves>> we can apply Karls patch. However I do think we should recommend yves>> against using octal IN REGULAR EXPRESSIONS. And should note that yves>> while you CAN use octal to represent codepoints up to 511 it is yves>> strongly recommended that you don't.
I'd like to see three-digit octal always mean an 8-bit character\, and discourage things like \3 and \33. I don't think we should bother extending octal to allow for code points above "\377". That it "works" at all there is a problem.
Well if we don't allow it then we have to forbid it. I think at this point allowing it is the least worst option.
I included the older code because you'll see a pattern in it. For example:
scripts/badman: grep(/[^\001]+\001[^\001]+\001${ext}\001/ || /[^\001]+${ext}\001/,
scripts/badman: if ( /^([^\001]*)\002/ || /^([^\002]*)\001/ ) {
scripts/badman: if (/\001/) {
scripts/badman: if ($last eq "\033") {
scripts/badman: last if $idx_topic eq "\004";
scripts/badman: last if $idx_topic eq "\004" || $idx_topic eq '0';
scripts/badman: s/\033\+/\001/;
scripts/badman: s/\033\,/\002/;
scripts/badman: @tmplist = split(/\002/, $entry);
scripts/badman: $winsize = "\0" x 8;
Notice something? Being once-and-always a C programmer at heart, by habit I always used to use a single digit for a NUL *only* and 3 digits otherwise. The Camel's use of
camel3-examples:if ( ("fred" & "\1\2\3\4") =~ /[^\0]/ ) { ... }
is just something I do *not* like. It does no good to warn in strings, where there are no backrefs (save in s///), but I'm not sure what level of warning is appropriate for regexes.
Speaking of which\, isn't it time to tell the s/(...)/\1\1/g people to get their acts together?
Probably. Do you have a list? :-)
yves>> Also I have a concern that Karls patch merely modifies the yves>> behaviour in the regular expression engine. It doesn't do the yves>> same for other strings. If it is going to be legal it should yves>> be legal everywhere.
Yes\, this is a real issue\, the first one I raised.
Well, Karl suggested it is legal, and if I understand it right it does the right thing (in other words, causing the string to be upgraded).
karl> grok_oct() itself consumes as many octal digits as there are karl> in its parameter\, as long as the result doesn't overflow a UV. karl> It is used for general purpose octal conversion\, such as from karl> the oct() function.
Hm.
The oct() function is already bizarre enough. It's weird that it takes digits of any sort, converts them to decimal, then treats that decimal as octal, and returns new digits. Even calling it oct2dec might have helped.
oct("0755") oct("755")
ok\, but
oct(755) oct(700 + 50 + 5)
doing the same thing is\, well\, just not what people are thinking it does.
I guess this is just a side effect of numbers and strings being effectively interchangeable.
karl> Tom has pointed out that \777 is a reserved value in some karl> contexts.
It's the $/ issue.
I don't understand; can you expand on that a bit?
karl> It seems to me to be a bad idea to remove acceptance of octal karl> numbers in re's.
And to me. Remember\, I'm the one who gets testy about
$h{date} = "foo"; $h{time} = "bar";
$h{time()} = "oh\, right";
I had to think about this one.
$h{033} = "foo"; $h{27} .= "bar"; # now foobar
$h{"033"} = "not again";
Ah. You gotta love those bizarre passageways and secret doors in perl dontcha? Just like one of those castles in a good horror flick.
karl> It seems like a good idea to add something to the language so karl> one can express them unambiguously. Even I with my limited karl> knowledge of regcomp.c could do it easily (fools rush in...).
I could live with \o{...} if I had to\, but I'm nervous where it goes.
I don't see any real problem with this. Given we have \x{} I don't really see the point, but if it made the octal folks happy then so be it.
karl> And it seems like an even better idea to handle them karl> consistently. I see two ways to do that 1) accept my patch; or karl> 2) forbid or warn about the use of those larger than a single karl> character in the machine architecture in both strings and karl> re's\, including char classes. Perhaps I've forgotten something in karl> this thread. If so\, I'm sorry.
Karl\, you've nothing to be sorry about. Your courtesy\, conscientiousness\, and can-do attitude are very welcome.
Definitely. And mails like this are too. Very much so.
cheers\, Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
On Thu\, Nov 13\, 2008 at 05:19:43PM -0700\, Tom Christiansen quoted the Camel:
It has been said(*) that programs that write programs are the happiest programs in the world. * By Andrew Hume, the famous Unix philosopher.
Programs that write programs may be happy ... but people who have to read and\, heaven forfend\, _maintain_ programs written by programs are\, without doubt\, the damned of Computania. -- Chip Salzenberg \chip@​pobox\.com
On approximately 11/13/2008 4:19 PM\, came the following characters from the keyboard of Tom Christiansen:
:rafael> This one shows clearly that we're using a regexp that matches :rafael> "\x{1}8"\, but why is there a duplicated warning? Double magic?
:rafael> $ perl -wE '/\18/' :rafael> Illegal octal digit '8' ignored at -e line 1. :rafael> Illegal octal digit '8' ignored at -e line 1.
This one confuses me: clearly \18\, in this regex\, is not a backref\, because there are no captures\, and clearly 8 is not an octal digit\, so by the 1\, 2\, or 3 octal digits rule\, the 8 should be silently ignored\, and the expression should be equivalent to /\x{1}8/ by my reading of the documentation.
I've just come from speaking with Andrew about all this\, of whom I posed the question:
1. *Why* did Ken (et alios) select the same notation for backreferences in regex as for octally notated characters?
2. *How* did you guys cope with its inherent ambiguity?
Andrew said that they coped with the ambiguity mostly by always requiring exactly 3 octal digits for characters. You couldn't say \33 for ESC; you had to say \033. If someone wanted the third capture buffer, they wrote \3; if they wanted ETX, they wrote \003; \3 and \003 therefore *meant* *different* *things* in regexes, and the first was disallowed in strings.
Apparently that was something that was done only in grep, et alia. K&R clearly states 1-3 octal digits for an octal escape in a string constant.
glenn>>>> My understanding is that in a regex\, if you have 3 matches\, glenn>>>> that "\333" might be more ambiguous than you are assuming.
It could mean
\g{3} followed by "33"
ubyte 219: "@{ [pack C => 219] }"
uchar 219: "@{ [pack U => 219] }"
It was indeed Glenn's suggestion that these first be deprecated and then in the release following\, AND THEN REMOVED ALTOGETHER\, that I found to be utterly untenable.
His later messages seem to say that he was just trying to fly a strawman, to see how far he could push it, testing the boundaries via hypotheticals. If so, that seems to say it wasn't an honest suggestion made in good faith, just something there to "stir the bucket" (or the hornets' nest). Perhaps he finds this useful as a general principle; but here, I do not.
It was an attempt to see if there was an acceptable solution that could resolve the ambiguity by removing the offending syntax; I didn't really think it would fly on its own, but because major versions do occasionally accept backward-incompatible changes, I thought there might be some chance, given an alternative syntax that is more consistent with the syntax for constants in other number bases. But I did rather expect that the weight of existing ambiguous code would kill the idea of removing the \nnn syntax; I still have some hope that a new, more useful octal syntax might be made available, in addition to the limited, ambiguous one. Then documentation could nudge people towards using the new syntax.
The fact that you have only been forced into 3 source changes by incompatible Perl changes may well indicate a canny intuition on your part as to the parts of the language that might change incompatibly, or perhaps contentment with a subset of the language that happens not to have changed, more than the absence of such changes that you imply in your arguments. On the other hand, for the short few years that I've been following this list, it has been true that most of the incompatible changes I've seen accepted were in areas that were rather buggy, where there was little way forward other than an incompatible change.
I did this because I'd taken Glenn at his literal word, and I wanted everyone to see how extensive this use was. I also wanted to demonstrate the historical difference that seems to have cropped up as we went from C programmers as our main programmer base to non-C-programmers. This meant that we started to get 1- and 2-digit octal escapes where we'd never before had them.
You're rather confused here; as stated above\, C has accepted one\, two and three digit octal escapes from the beginning. Your quotes from Andrew regarding \0 and three-digit escapes only are clearly referring to other programs\, such as grep\, and perhaps egrep and awk? I'm not going to take the time to research what programs Andrew was referring to\, if you didn't clarify that in your discussion with him\, you can research it. But the wording in K&R clearly permits varying length octal constants.
yves>> Assuming that grok_oct() consumes at most 3 octal digits\, I think yves>> we can apply Karls patch. However I do think we should recommend yves>> against using octal IN REGULAR EXPRESSIONS. And should note that yves>> while you CAN use octal to represent codepoints up to 511 it is yves>> strongly recommended that you don't.
I'd like to see three-digit octal always mean an 8-bit character\, and discourage things like \3 and \33. I don't think we should bother extending octal to allow for code points above "\377". That it "works" at all there is a problem.
I included the older code because you'll see a pattern in it. For example:
scripts/badman: grep(/[^\001]+\001[^\001]+\001${ext}\001/ || /[^\001]+${ext}\001/,
scripts/badman: if ( /^([^\001]*)\002/ || /^([^\002]*)\001/ ) {
scripts/badman: if (/\001/) {
scripts/badman: if ($last eq "\033") {
scripts/badman: last if $idx_topic eq "\004";
scripts/badman: last if $idx_topic eq "\004" || $idx_topic eq '0';
scripts/badman: s/\033\+/\001/;
scripts/badman: s/\033\,/\002/;
scripts/badman: @tmplist = split(/\002/, $entry);
scripts/badman: $winsize = "\0" x 8;
Notice something? Being once-and-always a C programmer at heart, by habit I always used to use a single digit for a NUL *only* and 3 digits otherwise. The Camel's use of
camel3-examples:if ( ("fred" & "\1\2\3\4") =~ /[^\0]/ ) { ... }
is just something I do *not* like. It does no good to warn in strings, where there are no backrefs (save in s///), but I'm not sure what level of warning is appropriate for regexes.
Again\, you are seriously confusing things in your arguments; the C programmer was quite welcome to use 1\, 2\, or 3 octal digits. So if you were\, indeed\, once-and-always a C programmer at heart\, that alone wouldn't have convinced you to always use 3 digits. So clearly you were also something else\, besides a C programmer\, if you picked up such a habit.
Now I learned C and Unix at approximately the same time (as did most of the older generation of C programmers)\, and it is true that the exactly-three-digit octal escape was used in other programs\, although I couldn't name them now\, likely grep was one of them; I encountered the phenomenon sometime in the first few months\, certainly. So I'm not suggesting that you didn't acquire the habit of using 3 digit octal escapes (except for NUL) in the early days and may well persist in using it\, only that it wasn't C that taught you that\, and it wasn't Perl that taught you that\, and the documented K&R C and Perl definitions for octal escapes are remarkably similar.
Karl\, you've nothing to be sorry about. Your courtesy\, conscientiousness\, and can-do attitude are very welcome.
Indeed; I think we all appreciate new blood working on the code and a willing spirit.
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire\, Apple Computer\, regarding Zero Configuration Networking
Chip Salzenberg schreef:
Tom Christiansen quoted the Camel:
It has been said(*) that programs that write programs are the happiest programs in the world.
* By Andrew Hume\, the famous Unix philosopher.
Programs that write programs may be happy ... but people who have to read and\, heaven forfend\, _maintain_ programs written by programs are\, without doubt\, the damned of Computania.
My best coding memories are still attached to metaprogramming\, specifically Clarion for DOS. http://en.wikipedia.org/wiki/Clarion_(programming_language)
Maintenance was real easy: change the model (templates) and possibly some include files (embeds), hit the generate button, and the full application was regenerated and compiled. We even did that while demonstrating the application: when an idea from the audience was real good, we could adapt the model, regenerate, and proceed with the demo. s/rails/infrastructure/
-- Affijn\, Ruud
"Gewoon is een tijger."
On Sat\, Nov 15\, 2008 at 02:55:43PM +0100\, Dr.Ruud wrote:
Chip Salzenberg schreef:
Tom Christiansen quoted the Camel:
It has been said(*) that programs that write programs are the happiest programs in the world.
* By Andrew Hume\, the famous Unix philosopher.
Programs that write programs may be happy ... but people who have to read and\, heaven forfend\, _maintain_ programs written by programs are\, without doubt\, the damned of Computania.
My best coding memories are still attached to metaprogramming\, specifically Clarion for DOS. http://en.wikipedia.org/wiki/Clarion_(programming_language)
Ah\, but you neither had to read nor maintain\, only regenerate. That's a fine model; it's more like a compiler\, with debugging of its output only an occasional burden. Code-generation wizards\, on the other hand\, which require you to own and maintain their output ... they are the creation of somebody kinda like Satan. Only evil. -- Chip Salzenberg \chip@​pobox\.com
Chip wrote:
Ah\, but you neither had to read nor maintain\, only regenerate. That's a fine model; it's more like a compiler\, with debugging of its output only an occasional burden. Code-generation wizards\, on the other hand\, which require you to own and maintain their output ... they are the creation of somebody kinda like Satan. Only evil.
[WARNING: off topic]
For numerical analysis\, we *had* to use FORTRAN. I hated it.
But I had my revenge.
See\, I talked the prof into letting me use Ratfor\, a pre-processor that allowed C-like syntax and linked to libc for printf etc.
However, as his TAs knew only FORTRAN, not C, they insisted that the post-processed output be turned in instead. I therefore complied.
Boy\, they really hated me. :-)
I've similar stories from my undergrad compiler class with lex and yacc, since I already knew them long before taking that class (they weren't even covered in it).
Again\, TAs gnashed their teeth\, but the profs thought it neat that I used these "advanced tools" to get my work done.
Actually\, it's worse than that.
I wrote awk programs that wrote lex programs to generate the state tables
needed for the larger C programs we had to do. So it was these DOUBLY-meta-
programmed outputs that the TAs received as part of the larger C programs.
They quickly gave up trying to understand them\, and just put top marks on
them provided they passed the test suites.
I think these would count as "evil" to you\, eh\, Chip? :-)
As a precocious undergrad (moi?)\, I never did get on well with the boring dumb TAs\, who weren't bright enough to score RAships. The profs were cool\, though. As a grad student\, I didn't have to suffer being a TA myself (I was a PA\, and got to run the dept computers instead)\, but I did have to suffer their dullness of lightbulbery. An advanced degree is no guarantee of any sense or skill or experience\, especially but not uniquely at the MS level. Give me experience over paper any day.
--tom
Chip Salzenberg schreef:
Dr.Ruud:
Chip Salzenberg:
Tom Christiansen quoted the Camel:
It has been said(*) that programs that write programs are the happiest programs in the world.
* By Andrew Hume\, the famous Unix philosopher.
Programs that write programs may be happy ... but people who have to read and\, heaven forfend\, _maintain_ programs written by programs are\, without doubt\, the damned of Computania.
My best coding memories are still attached to metaprogramming\, specifically Clarion for DOS. http://en.wikipedia.org/wiki/Clarion_(programming_language)
Ah\, but you neither had to read nor maintain\, only regenerate.
Well\, that is never entirely true of course. Out of the produced source code you learned how to put even cleverer tricks into the model file which would give you even more hooks to tweak with\, etc. It was great fun.
That's a fine model; it's more like a compiler\, with debugging of its output only an occasional burden. Code-generation wizards\, on the other hand\, which require you to own and maintain their output ... they are the creation of somebody kinda like Satan. Only evil.
Now you make me think of brain-dead macro recorders. Yes, I hated those, then forgot all about them.
-- Affijn\, Ruud
"Gewoon is een tijger."
2008/11/16 Chip Salzenberg \chip@​pobox\.com:
On Sat\, Nov 15\, 2008 at 02:55:43PM +0100\, Dr.Ruud wrote:
Chip Salzenberg schreef:
Tom Christiansen quoted the Camel:
It has been said(*) that programs that write programs are the happiest programs in the world.
* By Andrew Hume\, the famous Unix philosopher.
Programs that write programs may be happy ... but people who have to read and\, heaven forfend\, _maintain_ programs written by programs are\, without doubt\, the damned of Computania.
My best coding memories are still attached to metaprogramming\, specifically Clarion for DOS. http://en.wikipedia.org/wiki/Clarion_(programming_language)
Ah\, but you neither had to read nor maintain\, only regenerate. That's a fine model; it's more like a compiler\, with debugging of its output only an occasional burden. Code-generation wizards\, on the other hand\, which require you to own and maintain their output ... they are the creation of somebody kinda like Satan. Only evil.
The Pragmatic Programmer has an interesting comment on this: "Never use a code wizard you didn't write yourself."
Whether this applies to SQL or not is debatable. :-)
Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
In-Reply-To Message from demerphq \demerphq@​gmail\.com of "Sat\, 15 Nov 2008 00:23:34 +0100."
2008/11/14 Tom Christiansen \tchrist@​perl\.com:
Replying to Chip Salzenberg's message of "Wed\, 12 Nov 2008 18:18:57 PST" and to Karl Williamson's of "Thu\, 13 Nov 2008 11:38:48 MST":
SUMMARY:
* There exist in octal character notation both implementation bugs as well as built-in, by-design bugs, particularly when used in regular expressions.
* A few of these we've brought on ourselves, because we relaxed the octal-char definition in ways that the designers of these things never did, and so some of our troubles with them are our own fault.
* The implementation bugs we can fix\, if we're careful and consistent\, but design bugs we cannot.
* Nor can we eliminate the notation altogether\, due to the existing massive code base that relies upon it.
Yes\, this is absolutely clear (now). I misspoke when I suggested this.
* The best we can do is generate\, under certain circumstances\, a warning related to an ambiguous \XXX being interpreted as either a backreference or a character.
As you have said\, \g{} makes these much less important.
glenn>>> The [below] items could be added to the language glenn>>> immediately\, during the deprecation cycle for \nnn octal glenn>>> notation [...]
tchrist>> I find the notion of rendering illegal the existing octal tchrist>> syntax of "\33" is an *EXTRAÖRDINARILY* bad idea\, a tchrist>> position
tchrist>> I am prepared to defend at laborious length--and, if tchrist>> necessary, appeal to the Decider-in-Chief [...]
chip> I am happy to mark my return to p5p by singing in harmony chip> with Tom C.
Don't worry both of you. Just pointing out how much could break snapped some sense into my head. Mea-culpa and all that.
I'll have to take a look at gre, as it sounds like it is right along the lines of what we need. AFAIU we can't go to full DFA construction in perl, at least not for every pattern, simply because our patterns support recursive constructs, which AFAIK cannot be represented as DFAs.
I don't think they can. But an adaptive mechanism for those patterns that
Andrew said\, sure\, it's a bit messy\, or untidy\, but if you're looking for pristine perfection\, you're looking for the wrong thing. Or something like that.
Especially in Perl. :-)
One last thing: Andrew\, upon being told about the TRIE regex optimization\, suggests we might look into splay trees for this instead. He thinks they have properties that might make them even faster/smaller\, but says we'd have to benchmark the two carefully\, because it was just an informed hunch.
Hmm, maybe it's worth researching that a bit. The trie logic could definitely be improved. We use compressed transition tables when we probably shouldn't, making each transition significantly more expensive than it should be -- mostly because of the concern that Unicode could make the number of transitions grow explosively large.
I see\, I think.
One thing I found especially amusing was this change log comment:
Fix for a serious bug that affected REs using many [] (including REG_ICASE REs because of the way they are implemented), *sometimes*, depending on memory-allocation patterns.
Sound familiar\, anybody :-) [HINT: think of /(\337)\1/i ]
I'm probably too stupid to get this one. Feel up to spelling it out to me offlist?
The problem is one of tricky folding\, which you know plenty about already. ß => SS\, etc.
( NEVER cut yourself down; there are always plenty of others who are only too happy to do that for you\, and you shouldn't help them. )
Oh\, the other thing you forgot was $/. It's how perl -0777 is
equiv to undef $/\, "because \777 is an illegal octal octet".
That's also precedent for restricting it to \377.
Sigh. So much to learn. So little time. The latter sounds interesting\, I haven't looked but i wonder how it handles recursive patterns.
Doubtful\, but haven't looked.
I apologise for the shouting.
Well, at the time I made the suggestion (about the regex engine) I was not thinking clearly. Again, I apologize.
Accepted\, twice.
I was just mad because your mail was truncated by gmail.
Truncates\, eh? I just bounce instead.
--tom
In-Reply-To: A snarky message from Glenn Linderman \perl@​NevCal\.com of "Fri\, 14 Nov 2008 20:52:25 PST." \491E5589\.6000705@​NevCal\.com
§ If I didn't know better (and I don't), I'd wonder whether Dan Bernstein or Richard Stallman or many another net.pest with an incurable chip on their shoulder hadn't sneaked into your login just so that they might abuse and pick on people with uncalled-for and barely-veiled viciousness.
§ *I* don't need it\, and I bet nobody else does\, either.
§ And that's *all* the time I'm going to waste on *you*\, Mister Linderman.
--tom
On Sun\, Nov 16\, 2008 at 01:33:59PM +0100\, demerphq wrote:
2008/11/16 Chip Salzenberg \chip@​pobox\.com:
Code-generation wizards\, on the other hand\, which require you to own and maintain their output ... they are the creation of somebody kinda like Satan. Only evil.
Pragmatic programmer has an interesting comment on this: "Never use a code wizard you didn't write yourself"
That's a fine motto\, which like many\, must have appended: "... unless the pay is good."
Whether this applies to SQL or not is debatable. :-)
I think not. Otherwise\, SQL would be "code"\, and we know better. :-\, -- Chip Salzenberg \chip@​pobox\.com
On Sat\, Nov 15\, 2008 at 05:11:53PM -0700\, Tom Christiansen wrote:
See\, I talked the prof into letting me use Ratfor\, a pre-processor that allowed C-like syntax and linked to libc for printf etc. However\, as his TAs knew only FORTRAN\, not C\, they insisted on the post-processed output turned in instead. I therefore complied. Boy\, they really hated me. :-)
[...more evil elided...]
I think these would count as "evil" to you, eh, Chip? :-)
Indeed so. :-) -- Chip Salzenberg <chip@pobox.com>
On approximately 11/16/2008 5:52 PM, came the following characters from the keyboard of Tom Christiansen:
In-Reply-To: A snarky message from Glenn Linderman <perl@NevCal.com> of "Fri, 14 Nov 2008 20:52:25 PST." <491E5589.6000705@NevCal.com>
§ If I didn't know better (and I don't), I'd wonder whether Dan Bernstein or Richard Stallman or many another net.pest with an incurable chip on their shoulder hadn't sneaked into your login just that they might abuse and pick on people with uncalled-for and barely-veiled viciousness.
§ *I* don't need it, and I bet nobody else does, either.
§ And that's *all* the time I'm going to waste on *you*, Mister Linderman.
--tom
Well, excuse me all to pieces. I point out a fallacy in your reasoning, and I get this? You didn't hesitate to point out fallacies in my reasoning, with great flourish and tons of raw data. One of the problems here is that the problem being discussed is almost beyond the ability of a human brain to understand, because of the many facets it presents -- and we are all likely to do some fallacious reasoning.
My only purpose in commenting on the Unicode threads is to try to make Perl a useful language for writing programs in Unicode.
While octal escapes are only tangentially related to Unicode (because Unicode characters up to 255, or maybe 511, can be represented using octal escapes in regex and/or string constants), they have other problems, being discussed here.
Anyway, for the record, I don't know, and have never met or communicated with, Dan Bernstein or Richard Stallman, as far as I can remember. I do use emacs though.
Your apology will be accepted if you make one; you have contributed a lot to Perl, and no doubt still can. Certainly way more than I have or can.
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
I've done some more research and thought about this, and have come up with the enclosed straw proposal. I hope I haven't shortchanged anyone's previous ideas.
I now know enough about perl internals that I could implement all of this as-is (or with modifications) in short order.
1. There shall be a new pragma "legacy"
Things like "use legacy 'octals'" can be used for various behaviors we change but that we want to allow the old way of doing things to still be possible. The list of legacy operations can be expanded in the future as necessary.
2. A new syntax, \o{...}, will be created for octal constants in regular expressions, so that a writer may choose to avoid the existing ambiguities.
3. This syntax will also be accepted in any string constant, for consistency.
4. In 5.12, the maximum octal constant accepted as part of strings (as opposed to numbers) without the above syntax will be \377, unless the writer uses the new legacy pragma to override it. A writer in 5.10 can use "no legacy" to get this behavior earlier.
5. In 5.10, the bug whereby octals above \377 in regular expressions do not do the expected thing at all will be changed to generate an error. Since it doesn't work right, we don't have to worry about breaking existing code.
6. The maximum octal constant using \o{...} will not be limited. Numbers over \377 will be treated as the corresponding Unicode code points. (I don't see a reason not to allow this.)
7. In 5.10.1 an #if will be inserted into perl so that it won't compile unless UCHAR_MAX is 255.
UCHAR_MAX is in the standard limits.h header file. It gives the largest unsigned char that the installation can handle.
The C language requires UCHAR_MAX to be at least 255. In reading the code (and I haven't read that much of it), I have found several places where it likely doesn't work if UCHAR_MAX is greater than 255. There are several left shifts where the bits are assumed to vanish above the 8th, and several cases of arrays of size 256 which use an index of an unsigned char.
I have a friend who has been on the ISO C committee for a long time. (BTW, anybody can join the committee if they're willing to pay the steep annual fee.) He told me that companies that tried making larger character sizes retracted them because of portability problems with the data files they wrote. The only non-DSP one he knows of still in use is the Unisys 2200 mainframe, which has 9-bit chars in a 36-bit word. (He thinks the DSP ones have a 16-bit address space, so perl couldn't run on them.)
By doing this now, we could become certain that no one wants perl on an architecture whose chars allow octals above \377 -- one on which perl probably doesn't work well for some constructs anyway. I don't know if those problems would show up in our tests or not.
If someone complained, and their perl had been mostly working, we could then remove the #if, work to fix the bugs, and change our minds about what to do about octals.
Migrated from rt.perl.org#59342 (status was 'resolved')
Searchable as RT59342$