Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

Bleadperl v5.13.9-536-gc11ff94 breaks JOKKE/Lingua-LO-Romanize-0.08.tar.gz #11213

Closed p5pRT closed 12 years ago

p5pRT commented 13 years ago

Migrated from rt.perl.org#87110 (status was 'resolved')

Searchable as RT87110$

p5pRT commented 13 years ago

From @andk

rt@​cpan​:


https://rt.cpan.org/Ticket/Display.html?id=66952

git bisect​:


c11ff9433950cda8448b773418d1cb2592eea29d is the first bad commit
commit c11ff9433950cda8448b773418d1cb2592eea29d
Author: Karl Williamson <public@khwilliamson.com>
Date: Thu Feb 17 14:43:10 2011 -0700

  handy.h: isIDFIRST_utf8() changed to use XIDStart

  Previously this used a home-grown definition of an identifier start,
  stemming from a bug in some early Unicode versions. This led to some
  problems, fixed by #74022.

  But the home-grown solution did not track Unicode, and allowed for
  characters, like marks, to begin words when they shouldn't. This change
  brings this macro into compliance with Unicode going-forward.

perl -V​:


Summary of my perl5 (revision 5 version 13 subversion 9) configuration:
  Commit id: c11ff9433950cda8448b773418d1cb2592eea29d
  Platform:
    osname=linux, osvers=2.6.32-5-xen-amd64, archname=x86_64-linux
    uname='linux k81 2.6.32-5-xen-amd64 #1 smp wed jan 12 05:46:49 utc 2011 x86_64 gnulinux '
    config_args='-Dprefix=/home/src/perl/repoperls/installed-perls/perl/v5.13.9-536-gc11ff94 -Dinstallusrbinperl=n -Uversiononly -Dusedevel -des -Ui_db -Uuseithreads -Uuselongdouble -DDEBUGGING=-g'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.5.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.11.2.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.11.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl):
  Compile-time options: PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP PERL_USE_DEVEL
                        USE_64_BIT_ALL USE_64_BIT_INT USE_LARGE_FILES
                        USE_PERLIO USE_PERL_ATOF
  Built under linux
  Compiled at Mar 27 2011 00:05:05
  @INC:
    /home/src/perl/repoperls/installed-perls/perl/v5.13.9-536-gc11ff94/lib/site_perl/5.13.9/x86_64-linux
    /home/src/perl/repoperls/installed-perls/perl/v5.13.9-536-gc11ff94/lib/site_perl/5.13.9
    /home/src/perl/repoperls/installed-perls/perl/v5.13.9-536-gc11ff94/lib/5.13.9/x86_64-linux
    /home/src/perl/repoperls/installed-perls/perl/v5.13.9-536-gc11ff94/lib/5.13.9
    .

-- andreas

p5pRT commented 13 years ago

From @khwilliamson

I tracked this down, and I don't know what to do about it. The problem is in the file Syllable.pm. Three lines there are:

33: ລ => 'l',
34: ຼ => 'l',
35: ຫ => 'h',

(This is in the middle of initializing a hash.) The problem is on line 34: the first non-blank character is U+0EBC: LAO SEMIVOWEL SIGN LO. The rules for initializing a hash AFAIK state that the fat comma operator quotes the lhs, but only if that lhs is an identifier; otherwise it must be explicitly quoted. U+0EBC does not begin an identifier, and so legally must be quoted. In previous versions of Perl, the checking was not done properly, and so this was allowed through. There are a number of other characters in the hash that have the same problem; this is just the first. If I quote the character, it parses.
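The rule described above can be checked with Unicode properties. This is only a sketch: it assumes `\p{XID_Start}` is a reasonable stand-in for the parser's `isIDFIRST_utf8()` test, and the helper name `starts_identifier` and the hash `%romanize` are made up for illustration, not taken from Syllable.pm.

```perl
use strict;
use warnings;

# The three keys from lines 33-35, written as escapes so the source stays ASCII.
my @keys = ("\x{0EA5}", "\x{0EBC}", "\x{0EAB}");

# Hypothetical helper: true if the first character has the XID_Start property.
# Assumption: this approximates what the tokenizer checks for a bareword key.
sub starts_identifier {
    my ($str) = @_;
    return $str =~ /\A\p{XID_Start}/ ? 1 : 0;
}

for my $key (@keys) {
    printf "U+%04X %s\n", ord($key),
        starts_identifier($key) ? "can begin a bareword key" : "must be quoted";
}

# Quoting every key sidesteps the question, whatever the parser's rule is.
my %romanize = (
    "\x{0EA5}" => 'l',
    "\x{0EBC}" => 'l',
    "\x{0EAB}" => 'h',
);
```

Only U+0EBC, a combining mark, is reported as needing quotes; the two letters on either side of it pass.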

My tendency is to not go back but to move towards the Unicode standard, as we are now catching problematic constructs that we weren't. I think blead need not change, but the CPAN module should change to quote those characters properly, and we add something in the perldelta that we now find illegal constructs that previously we didn't.

But others may disagree.

On 03/26/2011 11:37 PM, (Andreas J. Koenig) (via RT) wrote:


p5pRT commented 13 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 13 years ago

From tchrist@perl.com

Karl wrote​:

On 03/26/2011 11:37 PM, (Andreas J. Koenig) (via RT) wrote:

  handy.h: isIDFIRST_utf8() changed to use XIDStart

  Previously this used a home-grown definition of an identifier start,
  stemming from a bug in some early Unicode versions. This led to some
  problems, fixed by #74022.

  But the home-grown solution did not track Unicode, and allowed for
  characters, like marks, to begin words when they shouldn't. This change
  brings this macro into compliance with Unicode going-forward.

I tracked this down, and I don't know what to do about it. The problem is in the file Syllable.pm. Three lines there are:

33: ລ => 'l',
34: ຼ => 'l',
35: ຫ => 'h',

(This is in the middle of initializing a hash.)

That's *very* interesting. I have a program I wrote just this morning that notably has this in it​:

    use Unicode::UCD;                       # UAX#24 et alios
    use Unicode::Normalize qw[ NFC NFD ];   # UAX#15
    use Unicode::Unihan;                    # UAX#38
    use Unicode::GCString;                  # UAX#29
    use Unicode::LineBreak qw(:all);        # UAX#14-C2

    use Lingua::JA::Romanize::Japanese;
    use Lingua::ZH::Romanize::Pinyin;
    use Lingua::KO::Romanize::Hangul;
    use Lingua::KO::Hangul::Util qw[ :all ];

And I am getting strange failures. These aren't compile-time failures\, but certain functions returning things they shouldn't be at times.

(time passes)

Oh drat, it probably isn't the same problem. Lingua::LO::Romanize is some supergigantic module written by a completely different author from those above. I ^C'd its install after it was taking forever with recursive dependencies.

The problem is on line 34: the first non-blank character is U+0EBC: LAO SEMIVOWEL SIGN LO. The rules for initializing a hash AFAIK state that the fat comma operator quotes the lhs, but only if that lhs is an identifier; otherwise it must be explicitly quoted. U+0EBC does not begin an identifier, and so legally must be quoted. In previous versions of Perl, the checking was not done properly, and so this was allowed through. There are a number of other characters in the hash that have the same problem; this is just the first. If I quote the character, it parses.

My tendency is to not go back but to move towards the Unicode standard, as we are now catching problematic constructs that we weren't. I think blead need not change, but the CPAN module should change to quote those characters properly, and we add something in the perldelta that we now find illegal constructs that previously we didn't.

But others may disagree.

They may\, but they might be wrong. :)

As you see from my UAX comments above, standards conformance seems pretty important to me.

I'd be really nervous about skipping the quotes on non-ASCII strings, just because I can't pretend to have memorized all the ID_Start vs ID_Continue etc code points. Then again, I would probably be slightly more upset if I suddenly got error messages about variables that are no longer considered legit. At least at first. I hope I would come to see the light.

--tom

p5pRT commented 13 years ago

From @cpansprout

On Sun Mar 27 11:34:13 2011, public@khwilliamson.com wrote:

I tracked this down, and I don't know what to do about it. The problem is in the file Syllable.pm. Three lines there are:

33: ລ => 'l',
34: ຼ => 'l',
35: ຫ => 'h',

(This is in the middle of initializing a hash.) The problem is on line 34: the first non-blank character is U+0EBC: LAO SEMIVOWEL SIGN LO. The rules for initializing a hash AFAIK state that the fat comma operator quotes the lhs, but only if that lhs is an identifier; otherwise it must be explicitly quoted. U+0EBC does not begin an identifier, and so legally must be quoted. In previous versions of Perl, the checking was not done properly, and so this was allowed through. There are a number of other characters in the hash that have the same problem; this is just the first. If I quote the character, it parses.

My tendency is to not go back but to move towards the Unicode standard, as we are now catching problematic constructs that we weren't. I think blead need not change, but the CPAN module should change to quote those characters properly, and we add something in the perldelta that we now find illegal constructs that previously we didn't.

But others may disagree.

That was why I raised some mild concern via private e-mail about syntax drift when you proposed switching to (X)IDStart.

Now that concern is not so mild, as my seemingly unfounded fears have proven real.

While it would be nice to follow Unicode for identifiers, I think it’s too late. Since we are dealing with Perl’s syntax here, we will be causing subtle breakage whenever we change anything.

For those who are not familiar with the issue, Perl’s identifiers, for historical reasons, use a set of characters that differs slightly from Unicode’s identifiers. Perl allows any character *not* in an identifier to be a string delimiter, as in q·foo· (which particular example is a single identifier in Unicode). That means that every addition to the set of identifier characters subtracts from the set of delimiter characters, and vice versa. So every change breaks existing syntax.
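The trade-off between identifier characters and delimiter characters can be sketched with two classification rules. Both subs are hypothetical illustrations: `unicode_ident_char` applies Unicode's XID_Continue on its own, while `strict_ident_char` additionally requires `\w`, which is an assumed approximation of a narrower, Perl-like rule rather than the tokenizer's actual code.

```perl
use strict;
use warnings;

# Unicode's rule: any XID_Continue character may continue an identifier.
sub unicode_ident_char {
    my ($c) = @_;
    return $c =~ /\A\p{XID_Continue}\z/ ? 1 : 0;
}

# Assumed narrower rule: the character must also match \w.
sub strict_ident_char {
    my ($c) = @_;
    return ($c =~ /\A\w\z/ && $c =~ /\A\p{XID_Continue}\z/) ? 1 : 0;
}

# '!' is never an identifier character, 'A' always is, and U+00B7 MIDDLE DOT
# is the case where the two rules disagree.
for my $cp (0x21, 0x41, 0xB7) {
    my $c = chr $cp;
    printf "U+%04X  unicode rule: %d  strict rule: %d\n",
        $cp, unicode_ident_char($c), strict_ident_char($c);
}
```

A character for which the two rules disagree is exactly one that stops being available as a delimiter if the wider rule is adopted.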

p5pRT commented 13 years ago

From tchrist@perl.com

For those who are not familiar with the issue, Perl’s identifiers, for historical reasons, use a set of characters that differs slightly from Unicode’s identifiers. Perl allows any character *not* in an identifier to be a string delimiter, as in q·foo· (which particular example is a single identifier in Unicode).

I *have* been surprised by that one before. It's "the Catalan problem". This is legal:

    use utf8;
    my $metaŀlúrgica = 1;

but this is not:

    use utf8;
    my $metal·lúrgica = 2;

The second is the NFKD of the first.

I know NFKD is not necessarily NFD, but I kinda don't like that one NF is legal and another is not.

It's even more annoying because supposedly the creation of "ŀ" was an error in the first place, and that one should always (try to) write "l·" instead. But we aren't allowed to.
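The relationship between "ŀ" and "l·" can be inspected with the core Unicode::Normalize module. This is a small sketch; the helper `code_points` is a hypothetical name introduced here only to print strings as code point lists.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

# Hypothetical helper: render a string as a space-separated list of code points.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

my $l_middle_dot = "\x{140}";    # LATIN SMALL LETTER L WITH MIDDLE DOT

my $nfd  = NFD($l_middle_dot);   # canonical decomposition
my $nfkd = NFKD($l_middle_dot);  # compatibility decomposition

print "NFD:  ", code_points($nfd),  "\n";
print "NFKD: ", code_points($nfkd), "\n";
```

NFD leaves U+0140 alone, while NFKD splits it into "l" followed by U+00B7, which is why one spelling of the word parses as a variable name and the other does not.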

That means that every addition to the set of identifier characters subtracts from the set of delimiter characters, and vice versa. So every change breaks existing syntax.

I hadn't thought of it that way before. Ug.

In 5.12, these 15 IDC codepoints were \W:

     1  ·  00B7  GC=Po  MIDDLE DOT
     2  ·  0387  GC=Po  GREEK ANO TELEIA
     3  ፩  1369  GC=No  ETHIOPIC DIGIT ONE
     4  ፪  136A  GC=No  ETHIOPIC DIGIT TWO
     5  ፫  136B  GC=No  ETHIOPIC DIGIT THREE
     6  ፬  136C  GC=No  ETHIOPIC DIGIT FOUR
     7  ፭  136D  GC=No  ETHIOPIC DIGIT FIVE
     8  ፮  136E  GC=No  ETHIOPIC DIGIT SIX
     9  ፯  136F  GC=No  ETHIOPIC DIGIT SEVEN
    10  ፰  1370  GC=No  ETHIOPIC DIGIT EIGHT
    11  ፱  1371  GC=No  ETHIOPIC DIGIT NINE
    12  ℘  2118  GC=So  SCRIPT CAPITAL P
    13  ℮  212E  GC=So  ESTIMATED SYMBOL
    14  ゛  309B  GC=Sk  KATAKANA-HIRAGANA VOICED SOUND MARK
    15  ゜  309C  GC=Sk  KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

And in blead, there *are* no such code points -- which seems like the way it ought to be.

But wait!

Even in blead, this appears to be (gotten) wrong:

    use utf8;
    our $metaŀlúrgica = 1;
    our $metal·lúrgica = 2;

So apparently I *can't* trust the apparent empty intersection of \W and \p{IDC}. So could anyone please tell me which \w characters are not legal in Perl identifiers? And why? :( Because this matters to people doing all sorts of different things, including laundering.

Those, for the record, were:

    our $meta\N{LATIN SMALL LETTER L WITH MIDDLE DOT}l\N{LATIN SMALL LETTER U WITH ACUTE}rgica = 1;
    our $metal\N{MIDDLE DOT}l\N{LATIN SMALL LETTER U WITH ACUTE}rgica = 2;

And even though blead says there are no IDC code points that are \W chars, perl's parser seems to have its own ideas.

Ok fine, so it doesn't hang the parser anymore like it did in 5.12, but this is certainly wrong:

    % blead metal
    Unrecognized character \xC2; marked by <-- HERE after our $metal<-- HERE near column 11 at metal line 9.

There is no \x{C2} there at all:

    our $meta\x{140}l\x{FA}rgica = 1;
    our $metal\x{B7}l\x{FA}rgica = 2;

Only when considered as raw octets is there a C2 sitting around:

    our $meta\xC5\x80l\xC3\xBArgica = 1;
    our $metal\xC2\xB7l\xC3\xBArgica = 2;
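Where the C2 comes from can be seen by encoding the individual code points to UTF-8 with the core Encode module. This is a minimal sketch; `utf8_octets` is a hypothetical helper name used only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the UTF-8 octets of a character string, as hex pairs.
sub utf8_octets {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# The three non-ASCII code points used in the two variable names above.
for my $cp (0x140, 0xB7, 0xFA) {
    printf "U+%04X -> %s\n", $cp, utf8_octets(chr $cp);
}
```

U+00B7 encodes as the two octets C2 B7, so a message that names \xC2 is describing the first octet of the middle dot, not a code point that appears in the source.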

Here's another\, even uglier problem​:

  use utf8;   our $über_metaŀlúrgica = 5;   our $über_metal·lúrgica = 6;

Buys me:

    Unrecognized character \xC2; marked by <-- HERE after ̈ber_metal<-- HERE near column 18 at metal line 13.

which is

    Unrecognized character \xC2; marked by <-- HERE after \x{CC}\x{88}ber_metal<-- HERE near column 18 at metal line 13.

That's just completely evil and wrong:

  (1) It has generated an illegible message using code points that simply do not appear in my code at *any* point.

   * \x{88} is CHARACTER TABULATION SET, which is a non-printing control character. We should never emit raw non-printing control characters like this.

   * That Ì it's blathering on about is code point \x{CC}, which is no more in my code than is \x{C2}; see next.

  (2) What C2 code point? There is no C2 code point. It's a B7 code point. That's because it's the first of these:

        our $u\x{308}ber_metal\x{B7}lu\x{301}rgica = 6;

  Not the second:

        our $u\xCC\x88ber_metal\xC2\xB7lu\xCC\x81rgica = 6;

  (3) That B7 code point is quite plainly at column 16, *not* at column 18; I think that makes it a calumny. :) Notice:

                        |
        #         1     |   2         3         4
        #1234567890123456789012345678901234567890
         our $über_metal·lúrgica = 6;
        #1234567890123456789012345678901234567890
        #         1     |   2         3         4
                        |
                        | <-- that's column 16

  Perl has forgotten how to count.

  That's because this time I wrote it in NFD not NFC:

        our $u\N{COMBINING DIAERESIS}ber_metal\N{MIDDLE DOT}lu\N{COMBINING ACUTE ACCENT}rgica = 6;

  Which should make no difference whatsoever. Instead, it generates really really bogus messages.

If it's going to bother talking about columns, it certainly has to do so *AT LEAST* from the point of view of graphemes, not of mere code-point counts. That's actually not enough: there are other concerns than just graphemes, so it really has to use the sort of thing that you get from Unicode::GCString::columns().
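The three ways of counting can be compared on the NFD line quoted above. This is a sketch using only core features (`\X` for extended grapheme clusters and Encode for octets); it does not use Unicode::GCString, so it ignores display width, and the variable names are chosen here for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The NFD form of the offending line, written with escapes.
my $line = "our \$u\x{308}ber_metal\x{B7}lu\x{301}rgica = 6;";

# Everything that precedes the middle dot.
my $prefix = substr($line, 0, index($line, "\x{B7}"));

# 1-based column of the middle dot, counted three different ways.
my $octet_col     = length(encode('UTF-8', $prefix)) + 1;
my $codepoint_col = length($prefix) + 1;
my $grapheme_count = () = $prefix =~ /\X/g;
my $grapheme_col  = $grapheme_count + 1;

printf "octets: %d, code points: %d, graphemes: %d\n",
    $octet_col, $codepoint_col, $grapheme_col;
```

Counting octets gives 18, the figure in the error message; counting graphemes gives 16, the column a reader sees.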

But I'll save that for another day. :)

--tom

p5pRT commented 13 years ago

From @cpansprout

On Mar 27, 2011, at 2:38 PM, Tom Christiansen wrote:

For those who are not familiar with the issue, Perl’s identifiers, for historical reasons, use a set of characters that differs slightly from Unicode’s identifiers. Perl allows any character *not* in an identifier to be a string delimiter, as in q·foo· (which particular example is a single identifier in Unicode).

I *have* been surprised by that one before. It's "the Catalan problem".

(Actually what I typed was not a middle dot, but an ἄνω τελεία, which is a punctuation mark. Unicode considers them canonically equivalent, which causes all sorts of headaches when programs normalise things without asking, especially when the two glyphs look different in some fonts. But that’s unrelated to perl.)

This is legal:

use utf8; my $metaŀlúrgica = 1;

but this is not:

use utf8; my $metal·lúrgica = 2;

The second is the NFKD of the first.

I know NFKD is not necessarily NFD, but I kinda don't like that one NF is legal and another is not.

It's even more annoying because supposedly the creation of "ŀ" was an error in the first place, and that one should always (try to) write "l·" instead. But we aren't allowed to.

As you demonstrate, the current situation is a mess.

I think this is something that Jesse will have to decide. It’s a choice between switching to Unicode identifier definitions (which Perl did not use at first, because back then they weren’t very good) or preserving backward compatibility. At least now we know the backward-compatibility issue is not theoretical.

That means that every addition to the set of identifier characters subtracts from the set of delimiter characters, and vice versa. So every change breaks existing syntax.

I hadn't thought of it that way before. Ug.

In 5.12, these 15 IDC codepoints were \W:

 1  ·  00B7  GC=Po  MIDDLE DOT
 2  ·  0387  GC=Po  GREEK ANO TELEIA
 3  ፩  1369  GC=No  ETHIOPIC DIGIT ONE
 4  ፪  136A  GC=No  ETHIOPIC DIGIT TWO
 5  ፫  136B  GC=No  ETHIOPIC DIGIT THREE
 6  ፬  136C  GC=No  ETHIOPIC DIGIT FOUR
 7  ፭  136D  GC=No  ETHIOPIC DIGIT FIVE
 8  ፮  136E  GC=No  ETHIOPIC DIGIT SIX
 9  ፯  136F  GC=No  ETHIOPIC DIGIT SEVEN
10  ፰  1370  GC=No  ETHIOPIC DIGIT EIGHT
11  ፱  1371  GC=No  ETHIOPIC DIGIT NINE
12  ℘  2118  GC=So  SCRIPT CAPITAL P
13  ℮  212E  GC=So  ESTIMATED SYMBOL
14  ゛  309B  GC=Sk  KATAKANA-HIRAGANA VOICED SOUND MARK
15  ゜  309C  GC=Sk  KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

And in blead, there *are* no such code points -- which seems like the way it ought to be.

But wait!

Even in blead, this appears to be (gotten) wrong:

use utf8; our $metaŀlúrgica = 1; our $metal·lúrgica = 2;

So apparently I *can't* trust the apparent empty intersection of \W and \p{IDC}. So could anyone please tell me which \w characters are not legal in Perl identifiers?

All of \w is legal in Perl identifiers. The middle dot is not in \w.

And why? :( Because this matters to people doing all sorts of different things, including laundering.

Those, for the record, were:

our $meta\N{LATIN SMALL LETTER L WITH MIDDLE DOT}l\N{LATIN SMALL LETTER U WITH ACUTE}rgica = 1; our $metal\N{MIDDLE DOT}l\N{LATIN SMALL LETTER U WITH ACUTE}rgica = 2;

And even though blead says there are no IDC code points that are \W chars, perl's parser seems to have its own ideas.

Ok fine, so it doesn't hang the parser anymore like it did in 5.12, but this is certainly wrong:

% blead metal
Unrecognized character \xC2; marked by <-- HERE after our $metal<-- HERE near column 11 at metal line 9.

That, and the rest of your message, demonstrate a separate problem, one with error-reporting.


p5pRT commented 13 years ago

From tchrist@perl.com

All of \w is legal in Perl identifiers. The middle dot is not in \w.

I appear to have gotten confused by somehow thinking that \w holds identifier characters. But it doesn't. Yes, I know tr18's RL1.2a definition for \w: there's no talk of IDS/IDC stuff there at all.

    % blead -E 'say "\x{B7}" =~ /\p{IDC}/ || 0'
    1

    % blead -E 'say "\x{B7}" =~ /\w/ || 0'
    0

Sorry 'bout that!

--tom

p5pRT commented 13 years ago

From @obra

khw confirms that the module author will be updating his module to work with 5.14. Consequently, this module no longer blocks the release of 5.14.

p5pRT commented 13 years ago

From @obra

On Sun 27.Mar'11 at 13:15:08 -0700, Father Chrysostomos via RT wrote:


That was why I raised some mild concern via private e-mail about syntax drift when you proposed switching to (X)IDStart.

Now that concern is not so mild, as my seemingly unfounded fears have proven real.

While it would be nice to follow Unicode for identifiers, I think it’s too late. Since we are dealing with Perl’s syntax here, we will be causing subtle breakage whenever we change anything.

For those who are not familiar with the issue, Perl’s identifiers, for historical reasons, use a set of characters that differs slightly from Unicode’s identifiers. Perl allows any character *not* in an identifier to be a string delimiter, as in q·foo· (which particular example is a single identifier in Unicode). That means that every addition to the set of identifier characters subtracts from the set of delimiter characters, and vice versa. So every change breaks existing syntax.

How plausible is it (at a technical level) to make this behavior user-configurable? Going forward, I'd love it if we could match the Unicode definition, but I'd at least like us to be able to provide an out for legacy code.

p5pRT commented 13 years ago

From @khwilliamson

On 04/04/2011 03:22 AM, Jesse Vincent wrote:


How plausible is it (at a technical level) to make this behavior user-configurable? Going forward, I'd love it if we could match the Unicode definition, but I'd at least like us to be able to provide an out for legacy code.

Do you mean for 5.14? I would have to think about it. Right now it is a #define constant.

p5pRT commented 13 years ago

From @obra

On Mon, Apr 04, 2011 at 08:20:16AM -0600, Karl Williamson wrote:

How plausible is it (at a technical level) to make this behavior user-configurable? Going forward, I'd love it if we could match the Unicode definition, but I'd at least like us to be able to provide an out for legacy code.

Do you mean for 5.14? I would have to think about it. Now it is a #define constant.

For 5.14, I think we're stuck with what we've got. But sprout's concern seems valid and I want to see what our options are as we continue along this path in the future.

p5pRT commented 12 years ago

From @cpansprout

Release 0.09 fixed this.

p5pRT commented 12 years ago

@cpansprout - Status changed from 'open' to 'resolved'