rt@cpan:
https://rt.cpan.org/Ticket/Display.html?id=66952
git bisect:
c11ff9433950cda8448b773418d1cb2592eea29d is the first bad commit
commit c11ff9433950cda8448b773418d1cb2592eea29d
Author: Karl Williamson <public@khwilliamson.com>
Date:   Thu Feb 17 14:43:10 2011 -0700

    handy.h: isIDFIRST_utf8() changed to use XIDStart

    Previously this used a home-grown definition of an identifier start,
    stemming from a bug in some early Unicode versions. This led to some
    problems, fixed by #74022.

    But the home-grown solution did not track Unicode, and allowed for
    characters, like marks, to begin words when they shouldn't. This
    change brings this macro into compliance with Unicode going-forward.
perl -V:
Summary of my perl5 (revision 5 version 13 subversion 9) configuration:
  Commit id: c11ff9433950cda8448b773418d1cb2592eea29d
  Platform:
    osname=linux, osvers=2.6.32-5-xen-amd64, archname=x86_64-linux
    uname='linux k81 2.6.32-5-xen-amd64 #1 smp wed jan 12 05:46:49 utc 2011 x86_64 gnulinux '
    config_args='-Dprefix=/home/src/perl/repoperls/installed-perls/perl/v5.13.9-536-gc11ff94 -Dinstallusrbinperl=n -Uversiononly -Dusedevel -des -Ui_db -Uuseithreads -Uuselongdouble -DDEBUGGING=-g'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.5.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.11.2.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.11.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl):
  Compile-time options: PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP PERL_USE_DEVEL USE_64_BIT_ALL USE_64_BIT_INT USE_LARGE_FILES USE_PERLIO USE_PERL_ATOF
  Built under linux
  Compiled at Mar 27 2011 00:05:05
  @INC:
    /home/src/perl/repoperls/installed-perls/perl/v5.13.9-536-gc11ff94/lib/site_perl/5.13.9/x86_64-linux
    /home/src/perl/repoperls/installed-perls/perl/v5.13.9-536-gc11ff94/lib/site_perl/5.13.9
    /home/src/perl/repoperls/installed-perls/perl/v5.13.9-536-gc11ff94/lib/5.13.9/x86_64-linux
    /home/src/perl/repoperls/installed-perls/perl/v5.13.9-536-gc11ff94/lib/5.13.9
    .
-- andreas
I tracked this down, and I don't know what to do about it. The problem is in the file Syllable.pm. Three lines there are:
33: ລ => 'l',
34: ຼ => 'l',
35: ຫ => 'h',
(This is in the middle of initializing a hash.) The problem is on line 34: the first non-blank character is U+0EBC: LAO SEMIVOWEL SIGN LO. The rules for initializing a hash AFAIK state that the fat comma operator quotes the lhs, but only if that lhs is an identifier; otherwise it must be explicitly quoted. U+0EBC does not begin an identifier, and so legally must be quoted. In previous versions of Perl, the checking was not done properly, and so this was allowed through. There are a number of other characters in the hash that have the same problem; this is just the first. If I quote the character, it parses.
My tendency is to not go back but to move towards the Unicode standard, as we are now catching problematic constructs that we weren't. I think blead need not change, but the CPAN module should change to quote those characters properly, and we should add something in the perldelta saying that we now find illegal constructs that previously we didn't.
But others may disagree.
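A minimal sketch of the quoting fix described above, assuming the three keys are the code points U+0EA5, U+0EBC and U+0EAB shown in the excerpt. The hash name `%romanization` is made up for illustration and is not taken from Syllable.pm. The two property checks only report how the running perl classifies U+0EBC.

```perl
use strict;
use warnings;

# Hypothetical hash name; the keys are written as explicit quoted strings,
# so the fat comma never has to decide whether they are identifiers.
my %romanization = (
    "\x{0EA5}" => 'l',   # ລ
    "\x{0EBC}" => 'l',   # U+0EBC LAO SEMIVOWEL SIGN LO, a combining mark
    "\x{0EAB}" => 'h',   # ຫ
);

# Ask the running perl how it classifies U+0EBC.
my $mark = "\x{0EBC}";
printf "XID_Start:    %s\n", $mark =~ /\p{XID_Start}/    ? 'yes' : 'no';
printf "XID_Continue: %s\n", $mark =~ /\p{XID_Continue}/ ? 'yes' : 'no';
printf "keys in hash: %d\n", scalar keys %romanization;
```

Quoted keys behave the same on perls before and after the XIDStart change, which is why quoting is the portable fix for the module.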
On 03/26/2011 11:37 PM, (Andreas J. Koenig) (via RT) wrote:
# New Ticket Created by (Andreas J. Koenig)
# Please include the string: [perl #87110]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=87110>
The RT System itself - Status changed from 'new' to 'open'
Karl wrote:
I tracked this down, and I don't know what to do about it. The problem is in the file Syllable.pm. Three lines there are:
33: ລ => 'l',
34: ຼ => 'l',
35: ຫ => 'h',
(This is in the middle of initializing a hash.)
That's *very* interesting. I have a program I wrote just this morning that notably has this in it:
use Unicode::UCD;                     # UAX#24 et alios
use Unicode::Normalize qw[ NFC NFD ]; # UAX#15
use Unicode::Unihan;                  # UAX#38
use Unicode::GCString;                # UAX#29
use Unicode::LineBreak qw(:all);      # UAX#14-C2

use Lingua::JA::Romanize::Japanese;
use Lingua::ZH::Romanize::Pinyin;
use Lingua::KO::Romanize::Hangul;
use Lingua::KO::Hangul::Util qw[ :all ];

And I am getting strange failures. These aren't compile-time failures, but certain functions returning things they shouldn't be at times.
(time passes)
Oh drat, it probably isn't the same problem. Lingua::LO::Romanize is some supergigantic module written by a completely different author from those above. I ^C'd its install after it was taking forever with recursive dependencies.
The problem is on line 34: the first non-blank character is U+0EBC: LAO SEMIVOWEL SIGN LO. The rules for initializing a hash AFAIK state that the fat comma operator quotes the lhs, but only if that lhs is an identifier; otherwise it must be explicitly quoted. U+0EBC does not begin an identifier, and so legally must be quoted. In previous versions of Perl, the checking was not done properly, and so this was allowed through. There are a number of other characters in the hash that have the same problem; this is just the first. If I quote the character, it parses.
My tendency is to not go back but to move towards the Unicode standard, as we are now catching problematic constructs that we weren't. I think blead need not change, but the CPAN module should change to quote those characters properly, and we should add something in the perldelta saying that we now find illegal constructs that previously we didn't.
But others may disagree.
They may\, but they might be wrong. :)
As you see from my UAX comments above, standards conformance seems pretty important to me.
I'd be really nervous about skipping the quotes on non-ASCII strings, just because I can't pretend to have memorized all the ID_Start vs ID_Continue etc code points. Then again, I would probably be slightly more upset if I suddenly got error messages about variables that are no longer considered legit. At least at first. I hope I would come to see the light.
--tom
On Sun Mar 27 11:34:13 2011, public@khwilliamson.com wrote:
I tracked this down, and I don't know what to do about it. The problem is in the file Syllable.pm. Three lines there are:
33: ລ => 'l',
34: ຼ => 'l',
35: ຫ => 'h',
(This is in the middle of initializing a hash.) The problem is on line 34: the first non-blank character is U+0EBC: LAO SEMIVOWEL SIGN LO. The rules for initializing a hash AFAIK state that the fat comma operator quotes the lhs, but only if that lhs is an identifier; otherwise it must be explicitly quoted. U+0EBC does not begin an identifier, and so legally must be quoted. In previous versions of Perl, the checking was not done properly, and so this was allowed through. There are a number of other characters in the hash that have the same problem; this is just the first. If I quote the character, it parses.
My tendency is to not go back but to move towards the Unicode standard, as we are now catching problematic constructs that we weren't. I think blead need not change, but the CPAN module should change to quote those characters properly, and we should add something in the perldelta saying that we now find illegal constructs that previously we didn't.
But others may disagree.
That was why I raised some mild concern via private e-mail about syntax drift when you proposed switching to (X)IDStart.
Now that concern is not so mild, as my seemingly unfounded fears have proven real.
While it would be nice to follow Unicode for identifiers, I think it’s too late. Since we are dealing with Perl’s syntax here, we will be causing subtle breakage whenever we change anything.
For those who are not familiar with the issue, Perl’s identifiers, for historical reasons, use a set of characters that differs slightly from Unicode’s identifiers. Perl allows any character *not* in an identifier to be a string delimiter, as in q·foo· (which particular example is a single identifier in Unicode). That means that every addition to the set of identifier characters subtracts from the set of delimiter characters, and vice versa. So every change breaks existing syntax.
For those who are not familiar with the issue, Perl’s identifiers, for historical reasons, use a set of characters that differs slightly from Unicode’s identifiers. Perl allows any character *not* in an identifier to be a string delimiter, as in q·foo· (which particular example is a single identifier in Unicode).
I *have* been surprised by that one before. It's "the Catalan problem". This is legal:
use utf8; my $metaŀlúrgica = 1;
but this is not:
use utf8; my $metal·lúrgica = 2;
The second is the NFKD of the first.
I know NFKD is not necessarily NFD, but I kinda don't like that one NF is legal and another is not.
It's even more annoying because supposedly the creation of "ŀ" was an error in the first place, and that one should always (try to) write "l·" instead. But we aren't allowed to.
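The normalization claim can be checked with the core Unicode::Normalize module. This is only a sketch; the helper `code_points` is a name introduced here for display purposes.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

my $l_middle_dot = "\x{140}";   # LATIN SMALL LETTER L WITH MIDDLE DOT

# Illustrative helper: render a string as a list of U+XXXX code points.
sub code_points { join ' ', map { sprintf 'U+%04X', ord } split //, shift }

printf "NFD:  %s\n", code_points(NFD($l_middle_dot));    # U+0140
printf "NFKD: %s\n", code_points(NFKD($l_middle_dot));   # U+006C U+00B7
```

Only the compatibility decomposition splits the letter into "l" plus MIDDLE DOT, so an identifier that is fine in NFC or NFD stops being one after NFKD.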
That means that every addition to the set of identifier characters subtracts from the set of delimiter characters, and vice versa. So every change breaks existing syntax.
I hadn't thought of it that way before. Ug.
In 5.12, these 15 IDC codepoints were \W:
 1 · 00B7 GC=Po MIDDLE DOT
 2 · 0387 GC=Po GREEK ANO TELEIA
 3 ፩ 1369 GC=No ETHIOPIC DIGIT ONE
 4 ፪ 136A GC=No ETHIOPIC DIGIT TWO
 5 ፫ 136B GC=No ETHIOPIC DIGIT THREE
 6 ፬ 136C GC=No ETHIOPIC DIGIT FOUR
 7 ፭ 136D GC=No ETHIOPIC DIGIT FIVE
 8 ፮ 136E GC=No ETHIOPIC DIGIT SIX
 9 ፯ 136F GC=No ETHIOPIC DIGIT SEVEN
10 ፰ 1370 GC=No ETHIOPIC DIGIT EIGHT
11 ፱ 1371 GC=No ETHIOPIC DIGIT NINE
12 ℘ 2118 GC=So SCRIPT CAPITAL P
13 ℮ 212E GC=So ESTIMATED SYMBOL
14 ゛ 309B GC=Sk KATAKANA-HIRAGANA VOICED SOUND MARK
15 ゜ 309C GC=Sk KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
And in blead, there *are* no such code points -- which seems like the way it ought to be.
But wait!
Even in blead, this appears to be (gotten) wrong:
use utf8;
our $metaŀlúrgica = 1;
our $metal·lúrgica = 2;
So apparently I *can't* trust the apparent empty intersection of \W and \p{IDC}. So could anyone please tell me which \w characters are not legal in Perl identifiers? And why? :( Because this matters to people doing all sorts of different things, including laundering.
Those, for the record, were:
our $meta\N{LATIN SMALL LETTER L WITH MIDDLE DOT}l\N{LATIN SMALL LETTER U WITH ACUTE}rgica = 1;
our $metal\N{MIDDLE DOT}l\N{LATIN SMALL LETTER U WITH ACUTE}rgica = 2;
And even though blead says there are no IDC code points that are \W chars, perl's parser seems to have its own ideas.
Ok fine, so it doesn't hang the parser anymore like it did in 5.12, but this is certainly wrong:
% blead metal
Unrecognized character \xC2; marked by <-- HERE after our $metal<-- HERE near column 11 at metal line 9.
There is no \x{C2} there at all:
our $meta\x{140}l\x{FA}rgica = 1;
our $metal\x{B7}l\x{FA}rgica = 2;
Only when considered as raw octets is there a C2 sitting around:
our $meta\xC5\x80l\xC3\xBArgica = 1;
our $metal\xC2\xB7l\xC3\xBArgica = 2;
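To see where the C2 comes from, the code points above can be encoded to UTF-8 by hand. A small sketch using only the built-in `utf8::encode`; the helper name `utf8_octets` is introduced here for illustration.

```perl
use strict;
use warnings;

# Illustrative helper: the UTF-8 octets of one code point, as hex pairs.
sub utf8_octets {
    my ($code_point) = @_;
    my $octets = chr $code_point;
    utf8::encode($octets);          # in-place conversion from characters to UTF-8 octets
    return join ' ', map { sprintf '%02X', ord } split //, $octets;
}

printf "U+%04X -> %s\n", $_, utf8_octets($_) for 0x140, 0xFA, 0xB7;
# U+0140 -> C5 80
# U+00FA -> C3 BA
# U+00B7 -> C2 B7
```

The \xC2 in the error message is therefore the lead octet of the encoded MIDDLE DOT, not a character that appears in the source.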
Here's another\, even uglier problem:
use utf8;
our $über_metaŀlúrgica = 5;
our $über_metal·lúrgica = 6;
Buys me:
Unrecognized character \xC2; marked by <-- HERE after Ìber_metal<-- HERE near column 18 at metal line 13.
which is
Unrecognized character \xC2; marked by <-- HERE after \x{CC}\x{88}ber_metal<-- HERE near column 18 at metal line 13.
That's just completely evil and wrong:
(1) It has generated an illegible message using code points that simply do not appear in my code at *any* point.
* \x{88} is CHARACTER TABULATION SET, which is a non-printing control character. We should never emit raw non-printing control characters like this.
* That Ì it's blathering on about is code point \x{CC}, which is no more in my code than is \x{C2}; see next.
(2) What C2 code point? There is no C2 code point. It's a B7 code point. That's because it's the first of these:
our $u\x{308}ber_metal\x{B7}lu\x{301}rgica = 6;
Not the second:
our $u\xCC\x88ber_metal\xC2\xB7lu\xCC\x81rgica = 6;
(3) That B7 code point is quite plainly at column 16, *not* at column 18; I think that makes it a calumny. :) Notice:
                |
#        1     |   2         3         4
#1234567890123456789012345678901234567890
 our $über_metal·lúrgica = 6;
#1234567890123456789012345678901234567890
#        1     |   2         3         4
                |
                | <-- that's column 16
Perl has forgotten how to count.
That's because this time I wrote it in NFD not NFC:
our $u\N{COMBINING DIAERESIS}ber_metal\N{MIDDLE DOT}lu\N{COMBINING ACUTE ACCENT}rgica = 6;
Which should make no difference whatsoever. Instead, it generates really really bogus messages.
If it's going to bother talking about columns, it certainly has to do so *AT LEAST* from the point of view of graphemes, not of mere code-point counts. That's actually not enough: there are other concerns than just graphemes, so it really has to use the sort of thing that you get from Unicode::GCString::columns().
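The three ways of counting can be compared directly. The sketch below uses the NFD line from the message and the core regex escape \X for grapheme clusters. It is a simplification: it ignores display width, which is what Unicode::GCString would add.

```perl
use strict;
use warnings;

# The NFD form of the offending line, written with escapes.
my $line   = "our \$u\x{308}ber_metal\x{B7}lu\x{301}rgica = 6;";
my $prefix = substr $line, 0, index($line, "\x{B7}");   # everything before MIDDLE DOT

# Column by code points.
my $code_point_col = length($prefix) + 1;

# Column by extended grapheme clusters (\X matches one cluster).
my $grapheme_count = () = $prefix =~ /\X/g;
my $grapheme_col   = $grapheme_count + 1;

# Column by UTF-8 octets.
my $octets = $prefix;
utf8::encode($octets);
my $byte_col = length($octets) + 1;

print "bytes: $byte_col, code points: $code_point_col, graphemes: $grapheme_col\n";
# bytes: 18, code points: 17, graphemes: 16
```

The reported "column 18" matches the octet count, while the column a reader sees on screen is the grapheme count.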
But I'll save that for another day. :)
--tom
On Mar 27, 2011, at 2:38 PM, Tom Christiansen wrote:
For those who are not familiar with the issue, Perl’s identifiers, for historical reasons, use a set of characters that differs slightly from Unicode’s identifiers. Perl allows any character *not* in an identifier to be a string delimiter, as in q·foo· (which particular example is a single identifier in Unicode).
I *have* been surprised by that one before. It's "the Catalan problem".
(Actually what I typed was not a middle dot, but an ἄνω τελεία, which is a punctuation mark. Unicode considers them canonically equivalent, which causes all sorts of headaches when programs normalise things without asking, especially when the two glyphs look different in some fonts. But that’s unrelated to perl.)
As you demonstrate, the current situation is a mess.
I think this is something that Jesse will have to decide. It’s a choice between switching to Unicode identifier definitions (which Perl did not use at first, because back then they weren’t very good) or preserving backward compatibility. At least now we know the backward-compatibility issue is not theoretical.
So apparently I *can't* trust the apparent empty intersection of \W and \p{IDC}. So could anyone please tell me which \w characters are not legal in Perl identifiers?
All of \w is legal in Perl identifiers. The middle dot is not in \w.
Ok fine, so it doesn't hang the parser anymore like it did in 5.12, but this is certainly wrong:
% blead metal
Unrecognized character \xC2; marked by <-- HERE after our $metal<-- HERE near column 11 at metal line 9.
That, and the rest of your message, demonstrate a separate problem: one with error reporting.
All of \w is legal in Perl identifiers. The middle dot is not in \w.
I appear to have gotten confused by somehow thinking that \w holds identifier characters. But it doesn't. Yes, I know tr18's RL1.2a definition for \w: there's no talk of IDS/IDC stuff there at all.
% blead -E 'say "\x{B7}" =~ /\p{IDC}/ || 0'
1
% blead -E 'say "\x{B7}" =~ /\w/ || 0'
0
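The two one-liners generalize to a scan for every code point that is ID_Continue but not \w. The sketch below assumes it is enough to look at the Basic Multilingual Plane, a restriction chosen here only to keep the loop short. The /u flag (perl 5.14 or later) forces Unicode rules for \w on code points below 256. The exact list depends on the Unicode version bundled with the running perl, so no expected output is shown.

```perl
use strict;
use warnings;

# Collect BMP code points that are ID_Continue but not \w on this perl.
my @idc_not_word;
for my $cp (0x0000 .. 0xFFFF) {
    next if $cp >= 0xD800 && $cp <= 0xDFFF;   # skip surrogates
    my $ch = chr $cp;
    push @idc_not_word, $cp if $ch =~ /\p{IDC}/ && $ch !~ /\w/u;
}

printf "%d code points: %s\n",
    scalar @idc_not_word,
    join ' ', map { sprintf 'U+%04X', $_ } @idc_not_word;
```

Any code point this prints is one where "matches \p{IDC}" and "matches \w" disagree, which is exactly the gap behind the confusion above.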
Sorry 'bout that!
--tom
khw confirms that the module author will be updating his module to work with 5.14. Consequently, this module no longer blocks the release of 5.14.
On Sun 27.Mar'11 at 13:15:08 -0700, Father Chrysostomos via RT wrote:
That was why I raised some mild concern via private e-mail about syntax drift when you proposed switching to (X)IDStart.
Now that concern is not so mild, as my seemingly unfounded fears have proven real.
While it would be nice to follow Unicode for identifiers, I think it’s too late. Since we are dealing with Perl’s syntax here, we will be causing subtle breakage whenever we change anything.
For those who are not familiar with the issue, Perl’s identifiers, for historical reasons, use a set of characters that differs slightly from Unicode’s identifiers. Perl allows any character *not* in an identifier to be a string delimiter, as in q·foo· (which particular example is a single identifier in Unicode). That means that every addition to the set of identifier characters subtracts from the set of delimiter characters, and vice versa. So every change breaks existing syntax.
How plausible is it (at a technical level) to make this behavior user-configurable? Going forward, I'd love it if we could match the Unicode definition, but I'd at least like us to be able to provide an out for legacy code.
On 04/04/2011 03:22 AM, Jesse Vincent wrote:
How plausible is it (at a technical level) to make this behavior user-configurable? Going forward, I'd love it if we could match the Unicode definition, but I'd at least like us to be able to provide an out for legacy code.
Do you mean for 5.14? I would have to think about it. Now it is a #define constant.
On Mon, Apr 04, 2011 at 08:20:16AM -0600, Karl Williamson wrote:
Do you mean for 5.14? I would have to think about it. Now it is a #define constant.
For 5.14, I think we're stuck with what we've got. But sprout's concern seems valid and I want to see what our options are as we continue along this path in the future.
Release 0.09 fixed this.
@cpansprout - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#87110 (status was 'resolved')
Searchable as RT87110$