Perl / perl5

Bogus error message "Malformed UTF-8 character" when using a non-word Unicode character #9862

Closed p5pRT closed 12 years ago

p5pRT commented 15 years ago

Migrated from rt.perl.org#69032 (status was 'resolved')

Searchable as RT69032$

p5pRT commented 15 years ago

From @moritz

Created by moritz@faui2k3.org

As pointed out on <http://www.perlmonks.org/?node_id=793800>, a program which tries to use a non-ASCII non-alphanumeric character in a variable name throws an error "Malformed UTF-8 character (unexpected end of string) at", even though the file is in perfectly fine UTF-8.

Example:

  $ cat foo.pl
  use utf8;
  my $» = 1;
  $ perl foo.pl
  Malformed UTF-8 character (unexpected end of string) at foo.pl line 2.
  Unrecognized character \xBB in column 5 at foo.pl line 2.

A better error message might be "Character '%s' not allowed in identifier", or something like that.

Perl Info ``` Flags: category=core severity=low Site configuration information for perl 5.10.0: Configured by Debian Project at Fri Aug 28 22:23:22 UTC 2009. Summary of my perl5 (revision 5 version 10 subversion 0) configuration: Platform: osname=linux, osvers=2.6.30.5-dsa-amd64, archname=x86_64-linux-gnu-thread-multi uname='linux brahms 2.6.30.5-dsa-amd64 #1 smp mon aug 17 02:18:43 cest 2009 x86_64 gnulinux ' config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=define, use64bitall=define, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2 -g', cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include' ccversion='', gccversion='4.3.2', gccosandvers='' intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16 ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='cc', ldflags =' -L/usr/local/lib' 
libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64 libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt perllibs=-ldl -lm -lpthread -lc -lcrypt libc=/lib/libc-2.7.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0 gnulibc_version='2.7' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E' cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib' Locally applied patches: @INC for perl 5.10.0: /home/moritz/cpan/lib /etc/perl /usr/local/lib/perl/5.10.0 /usr/local/share/perl/5.10.0 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.10 /usr/share/perl/5.10 /usr/local/lib/site_perl . Environment for perl 5.10.0: HOME=/home/moritz LANG=en_US.UTF-8 LANGUAGE=C LC_CTYPE=de_DE.UTF-8 LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/bin:/sbin:/usr/bin:/usr/sbin:/home/moritz/bin:/usr/games:/usr/local/Eiffel54/studio/spec/linux-glibc2.1/bin:/usr/bin/X11:/usr/local/bin:/usr/local/Wolfram/Mathematica/5.0/Executables/:/mnt/ex/moritz/matlab/bin PERL5LIB=/home/moritz/cpan/lib PERL6LIB=/home/moritz/src/svg-plot/lib:/home/moritz/src/svg/lib PERL_BADLANG (unset) SHELL=/bin/bash ```
p5pRT commented 15 years ago

From john.imrie@vodafoneemail.co.uk

Moritz Lenz (via RT) wrote:

> # New Ticket Created by Moritz Lenz
> # Please include the string: [perl #69032]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=69032 >
>
> This is a bug report for perl from moritz@faui2k3.org, generated with the help of perlbug 1.36 running under perl 5.10.0.
>
> As pointed out on <http://www.perlmonks.org/?node_id=793800>, a program which tries to use a non-ASCII non-alphanumeric character in a variable name throws an error "Malformed UTF-8 character (unexpected end of string) at", even though the file is in perfectly fine UTF-8.
>
> Example:
>
>   $ cat foo.pl
>   use utf8;
>   my $» = 1;
>   $ perl foo.pl
>   Malformed UTF-8 character (unexpected end of string) at foo.pl line 2.
>   Unrecognized character \xBB in column 5 at foo.pl line 2.

Hmm. Do we actually allow *any* Unicode code point in an identifier, or only those matching \p{ID_Start}\p{ID_Continue}*?


p5pRT commented 15 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 15 years ago

From @ikegami

On Sun, Sep 6, 2009 at 2:35 PM, Moritz Lenz <perlbug-followup@perl.org> wrote:

> As pointed out on <http://www.perlmonks.org/?node_id=793800>, a program which tries to use a non-ASCII non-alphanumeric character in a variable name throws an error "Malformed UTF-8 character (unexpected end of string) at", even though the file is in perfectly fine UTF-8.
>
> Example:
>
>   $ cat foo.pl
>   use utf8;
>   my $» = 1;
>   $ perl foo.pl
>   Malformed UTF-8 character (unexpected end of string) at foo.pl line 2.
>   Unrecognized character \xBB in column 5 at foo.pl line 2.
>
> A better error message might be "Character '%s' not allowed in identifier", or something like that.

There are two problems.

If the character that follows "$" is not normally allowed as part of an identifier, it is taken to be the entire name of a special package var (like $_, $$, $[, etc). The first problem is that the tokeniser treats the *byte* following "$" as the name of the special variable, leaving a partial UTF-8 character for the tokeniser to find. That accounts for the first error message.

The second problem is the poor error message "Unrecognized character %s". Even "Unknown operator %s" would be more helpful. It would be even more useful to assume an 8+ bit character following an identifier was meant to be part of the identifier, in which case the message should be as Moritz suggested.
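
The first problem can be imitated outside the tokeniser. What follows is only a sketch of the byte-versus-character mix-up described above, not perl's actual tokeniser code: `split_after_first_byte` and `is_valid_utf8` are hypothetical helpers, and a strict `Encode::decode` stands in for the tokeniser's UTF-8 check.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper (not part of perl): encode a character as UTF-8 and
# split it the way the report describes, one byte for the variable name and
# the remaining bytes left behind in the input.
sub split_after_first_byte {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);
    return (substr($bytes, 0, 1), substr($bytes, 1));
}

# Returns 1 if the byte string is well-formed UTF-8 under a strict decode.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when CHECK is set
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my ($name, $rest) = split_after_first_byte("\x{BB}");    # U+00BB, the » from the report
printf "byte taken as the variable name: \\x%02X\n", ord $name;
printf "byte left for the tokeniser:     \\x%02X\n", ord $rest;
print  "leftover is ", (is_valid_utf8($rest) ? "valid" : "malformed"), " UTF-8\n";
```

The leftover byte is \xBB, a lone continuation byte, which is the byte named in the "Unrecognized character \xBB" message from the original report.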

p5pRT commented 15 years ago

From @ikegami

On Mon, Sep 7, 2009 at 4:53 PM, Eric Brine <ikegami@adaelis.com> wrote:


> There are two problems.
>
> If the character that follows "$" is not normally allowed as part of an identifier, it is taken to be the entire name of a special package var (like $_, $$, $[, etc). The first problem is that the tokeniser treats the *byte* following "$" as the name of the special variable, leaving a partial UTF-8 character for the tokeniser to find. That accounts for the first error message.
>
> The second problem is the poor error message "Unrecognized character %s". Even "Unknown operator %s" would be more helpful. It would be even more useful to assume an 8+ bit character following an identifier was meant to be part of the identifier, in which case the message should be as Moritz suggested.

  perl -CO -E"say qq{use utf8; my \$i\x{2660};}" | perl
  Unrecognized character \xE2 in column 16 at - line 1.

Moritz suggests "Character \x2660 not allowed in identifier in column 16 at - line 1."

  perl -CO -E"say qq{use utf8; 0+\x{2660};}" | perl
  Unrecognized character \xE2 in column 13 at - line 1.

The identified character is wrong (E2 instead of 2660). Otherwise, this is consistent with "use utf8" absent.

  perl -CO -E"say qq{use utf8; my \$\x{2660};}" | perl
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \x99 in column 15 at - line 1.

If consistent with "use utf8" absent, it would be "Can't use global $♠ in "my" at - line 1, near "my $♠""

  perl -CO -E"say qq{use utf8; \$\x{2660};}" | perl
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \x99 in column 12 at - line 1.

If consistent with "use utf8" absent, this would not be an error at all.

  perl -CO -E"say qq{use utf8; my \$\x{2660}i;}" | perl
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \x99 in column 15 at - line 1.

If consistent with "use utf8" absent, it would be "Bareword found where operator expected at - line 1, near "$♠i" (Missing operator before i?)"
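
The \xE2 and \x99 in these messages are easier to place once the UTF-8 bytes of the characters are written out. A small sketch, where `utf8_escapes` is a hypothetical helper that formats each byte like the escapes in perl's messages:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the UTF-8 bytes of a code point, each formatted as \xNN.
sub utf8_escapes {
    my ($code_point) = @_;
    return map { sprintf('\\x%02X', ord($_)) }
           split //, encode('UTF-8', chr $code_point);
}

# U+00BB is the » from the original report, U+2660 the ♠ used above.
for my $cp (0xBB, 0x2660) {
    printf "U+%04X => %s\n", $cp, join ' ', utf8_escapes($cp);
}
# prints:
# U+00BB => \xC2 \xBB
# U+2660 => \xE2 \x99 \xA0
```

So \xE2 is the first byte of U+2660, reported when the character follows an identifier or operator, and \x99 is its second byte, reported when the first byte has already been consumed as a variable name after "$".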

- Eric "ikegami" Brine

p5pRT commented 13 years ago

From perl-diddler@tlinx.org

Created by perl-diddler@tlinx.org

I was trying to use the UTF-8 character U+2424, called 'SYMBOL FOR NEWLINE', as an identifier containing "\n" as follows:

  0 #!/usr/bin/perl -w
  1 use strict;
  2 use utf8;
  3 use Readonly my $␤ => "\n";
  4 print "line1$␤line2\n";

The Unicode character (which may not display correctly in whatever viewer you are using, so 'beware') is on lines 3 and 4. On line 3, it's right after the "$" and before the "=>". On line 4, it's also after the "$", between "line1" and "line2".

A hexdump of the above:

  00000000  23 21 2f 75 73 72 2f 62 69 6e 2f 70 65 72 6c 20  |#!/usr/bin/perl |
  00000010  2d 77 0a 75 73 65 20 73 74 72 69 63 74 3b 0a 75  |-w.use strict;.u|
  00000020  73 65 20 75 74 66 38 3b 0a 75 73 65 20 52 65 61  |se utf8;.use Rea|
  00000030  64 6f 6e 6c 79 20 6d 79 20 24 e2 90 a4 09 3d 3e  |donly my $....=>|
  00000040  20 22 5c 6e 22 3b 0a 70 72 69 6e 74 20 22 6c 69  | "\n";.print "li|
  00000050  6e 65 31 24 e2 90 a4 6c 69 6e 65 32 5c 6e 22 3b  |ne1$...line2\n";|
  00000060  0a 0a                                            |..|
  00000062

This shows U+2424 correctly encoded as "0xe290a4" at hex addresses 0x3A and 0x54.

However, when I try to run this, I get:

  Malformed UTF-8 character (unexpected end of string) at /tmp/ptest.pl line 4.
  Unrecognized character \x90 in column 18 at /tmp/ptest.pl line 4.

Note, FWIW, I've successfully used other characters in the same way in another program like this (a fragment from another prog):

  ----
  use utf8;
  binmode STDOUT, 'encoding(UTF-8)';
  use Readonly;
  BEGIN{*RO=\&Readonly::Readonly}

  my %constants = (
    'Phi' => .5*(5.**.5-1),
    'Φ' => .5*(5.**.5-1),
    'phi' => .5*(1.+5.**.5),
    'ɸ' => .5*(1.+5.**.5),
    'pi' => 4*atan2(1,1),
    'π' => 4*atan2(1,1),
  );

  sub init_constants (;$) {
    my $no_banner=$_[0];
    print "Constants: " unless $no_banner;
    my $sep="";

    foreach my $k (keys %constants){
      my $v=$constants{$k};
      print $sep,"\$",$k unless $no_banner;
      $sep=", ";
      RO $$k => $v;
    }

    print "\n" unless $no_banner;
  }
  &init_constants;
  ----

So I'm surprised at this specific failure, since in looking at the hex, the character encoding appears correct.

Let me know if you have any questions. (An excellent UTF-8 character util that, unfortunately, is Windows only can be gotten from http://www.babelstone.co.uk/Software/BabelMap.html.)

Perl Info ``` Flags: category=core severity=medium This perlbug was built using Perl 5.10.0 - Fri Jul 30 00:12:10 UTC 2010 It is being executed now by Perl 5.10.0 - Thu Sep 16 16:14:28 UTC 2010. Site configuration information for perl 5.10.0: Configured by abuild at Thu Sep 16 16:14:28 UTC 2010. Summary of my perl5 (revision 5 version 10 subversion 0) configuration: Platform: osname=linux, osvers=2.6.31, archname=x86_64-linux-thread-multi uname='linux build35 2.6.31 #1 smp 2010-01-06 16:07:25 +0100 x86_64 x86_64 x86_64 gnulinux ' config_args='-ds -e -Dprefix=/usr -Dvendorprefix=/usr -Dinstallusrbinperl -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Duseshrplib=true -DEBUGGING=both -Doptimize=-fmessage-length=0 -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -Wall -pipe -Accflags=-DPERL_USE_SAFE_PUTENV' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=define, use64bitall=define, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DPERL_USE_SAFE_PUTENV -DDEBUGGING -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-fmessage-length=0 -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -Wall -pipe -g', cppflags='-D_REENTRANT -D_GNU_SOURCE -DPERL_USE_SAFE_PUTENV -DDEBUGGING -fno-strict-aliasing -pipe' ccversion='', gccversion='4.4.1 [gcc-4_4-branch revision 150839]', gccosandvers='' intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16 ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='cc', ldflags =' -L/usr/local/lib64' libpth=/lib64 /usr/lib64 /usr/local/lib64 libs=-lm -ldl -lcrypt -lpthread perllibs=-lm 
-ldl -lcrypt -lpthread libc=/lib64/libc-2.10.1.so, so=so, useshrplib=true, libperl=libperl.so gnulibc_version='2.10.1' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.10.0/x86_64-linux-thread-multi/CORE' cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib64' Locally applied patches: @INC for perl 5.10.0: /usr/local/lib/perl/5.8 /usr/lib/perl5/5.10.0/x86_64-linux-thread-multi /usr/lib/perl5/5.10.0 /usr/lib/perl5/site_perl/5.10.0/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.10.0 /usr/lib/perl5/vendor_perl/5.10.0/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.10.0 /usr/lib/perl5/vendor_perl . Environment for perl 5.10.0: HOME=/home/law LANG=en_US.UTF-8 LANGUAGE (unset) LC_CTYPE=en_US.UTF-8 LD_LIBRARY_PATH=/usr/lib64/mpi/gcc/openmpi/lib64 LOGDIR (unset) PATH=.:/sbin:/usr/local/sbin:/usr/lib64/mpi/gcc/openmpi/bin:/home/law/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/lib/qt3/bin:/usr/sbin PERL5LIB=/usr/local/lib/perl/5.8 PERL_BADLANG (unset) SHELL=/bin/bash ```
p5pRT commented 13 years ago

From tchrist@perl.com

Linda Walsh (via RT) <perlbug-followup@perl.org> wrote on Sun, 17 Apr 2011 01:22:25 PDT:

> I was trying to use the utf-8 character U+2424, called 'SYMBOL FOR NEWLINE' as an identifier containing "\n" as follows:
>
>   0 #!/usr/bin/perl -w
>   1 use strict;
>   2 use utf8;
>   3 use Readonly my $␤ => "\n";
>   4 print "line1$␤line2\n";

That isn't allowed. U+2424 isn't an ID_Start character (IDS) nor even an ID_Continue character. In fact, it's not a \w but a \p{Symbol}, which is not legal in an identifier.

  % perl -lE 'say "\x{2424}" =~ /\p{IDS}/ || 0'
  0
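
The same check can be run over all the characters that come up in this thread. A minimal sketch, where `id_properties` is a hypothetical helper and the results depend on the Unicode tables shipped with the perl that runs it:

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a code point has the ID_Start and
# ID_Continue properties, using perl's own Unicode tables.
sub id_properties {
    my ($code_point) = @_;
    my $char = chr $code_point;
    return {
        id_start    => ($char =~ /\A\p{ID_Start}\z/    ? 1 : 0),
        id_continue => ($char =~ /\A\p{ID_Continue}\z/ ? 1 : 0),
    };
}

# π, ɸ, SYMBOL FOR NEWLINE, » and £
for my $cp (0x03C0, 0x0278, 0x2424, 0x00BB, 0x00A3) {
    my $p = id_properties($cp);
    printf "U+%04X ID_Start=%d ID_Continue=%d\n",
        $cp, $p->{id_start}, $p->{id_continue};
}
```

The two letters should report 1 for both properties, while the symbol and punctuation characters should report 0 for both.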

At http://training.perl.com/scripts/uniprops, you can get a tool that may help for this; it's now updated for 5.14.

  % uniprops -a 2424
  U+2424 ‹␤› \N{SYMBOL FOR NEWLINE}
      \pS \p{So}
      All Any Assigned InControlPictures Common Zyyy Control_Pictures So S Gr_Base Grapheme_Base Graph GrBase Other_Symbol
      Pat_Syn Pattern_Syntax PatSyn Print Symbol
      Age=1.1 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=Control_Pictures Canonical_Combining_Class=0
      Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None
      DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
      Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup
      Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
      Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0
      Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0
      Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 SC=Zyyy Script=Zyyy Sentence_Break=Other SB=XX Sentence_Break=XX
      Word_Break=Other WB=XX Word_Break=XX _X_Begin


> Shows U+2424 correctly encoded as "0xe290a4" at hexaddrs 03A and 0x54.

Eek! Hexdumps! Non-logical characters! The horror!

At http://training.perl.com/scripts/uniquote, you can get a tool that will help with this, with the first form being the best. Isn't that much easier to read??

  % uniquote -v /tmp/lw
  #!/usr/bin/perl -w
  use strict;
  use utf8;
  use Readonly my $\N{SYMBOL FOR NEWLINE} => "\n";
  print "line1$\N{SYMBOL FOR NEWLINE}line2\n";

  % uniquote -x /tmp/lw
  #!/usr/bin/perl -w
  use strict;
  use utf8;
  use Readonly my $\x{2424} => "\n";
  print "line1$\x{2424}line2\n";

  % uniquote -b /tmp/lw
  #!/usr/bin/perl -w
  use strict;
  use utf8;
  use Readonly my $\xE2\x90\xA4 => "\n";
  print "line1$\xE2\x90\xA4line2\n";

> However, when I try to run this, I get:
>
>   Malformed UTF-8 character (unexpected end of string) at /tmp/ptest.pl line 4.
>   Unrecognized character \x90 in column 18 at /tmp/ptest.pl line 4.

The bug, and there is a bug, is that it should be reporting that U+2424 is not a valid identifier character. It should not be grinching about \x90. This is a known problem, although I don't know its bugno.

> Note, FWIW, I've successfully used other characters in the same way in another program like this (a fragment from another prog):
>
>   use utf8;
>   binmode STDOUT, 'encoding(UTF-8)';
>   use Readonly;
>   BEGIN{*RO=\&Readonly::Readonly}
>
>   my %constants = (
>     'Phi' => .5*(5.**.5-1),
>     'Φ' => .5*(5.**.5-1),
>     'phi' => .5*(1.+5.**.5),
>     'ɸ' => .5*(1.+5.**.5),
>     'pi' => 4*atan2(1,1),
>     'π' => 4*atan2(1,1),
>   );
>
>   sub init_constants (;$) {
>     my $no_banner=$_[0];
>     print "Constants: " unless $no_banner;
>     my $sep="";
>
>     foreach my $k (keys %constants){
>       my $v=$constants{$k};
>       print $sep,"\$",$k unless $no_banner;
>       $sep=", ";
>       RO $$k => $v;
>     }
>
>     print "\n" unless $no_banner;
>   }
>   &init_constants;

First of all, those are both identifier (IDS) characters:

  % uniprops pi phi
  U+03C0 ‹π› \N{GREEK SMALL LETTER PI}
      \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
      All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InGreek Cased Cased_Letter LC Changes_When_Casemapped CWCM
      Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic ID_Continue
      IDC ID_Start IDS Letter L_ Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS
  U+0278 ‹ɸ› \N{LATIN SMALL LETTER PHI}
      \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
      All Any Alnum Alpha Alphabetic Assigned InIPA_Extensions Cased Cased_Letter LC Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue

Secondly, you're using $$k. That's a symbolic dereference. That means you can get at the symbol table without the persnickety lexer kvetching all over your lunch:

  % perl -E '$a = `cat /bin/cat`; $$a = length($a); say $$a'
  43296

I use this all the time.

  my $file = "/tmp/foo";
  open($file, "<", $file);
  while (<$file>) {
      if ( ... ) {
          warn "crudola";
      }
  }

so that I get proper filenames in my warn/die messages.

That doesn't change that U+2424 isn't an identifier character. You can still use it as a variable name, provided you use symbolic dereferences to get at it:

  % perl -E '$name = "\x{2424}"; $$name = `whoami`; print $$name'
  tchrist

  % perl -E '$name = "\x{2424}"; say $name'
  ␤
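
For a script that runs under `use strict`, the same trick needs a scoped `no strict 'refs'`. A minimal sketch of the approach, where `set_by_name` and `get_by_name` are hypothetical helpers and the variable is placed in `main::` by assumption:

```perl
use strict;
use warnings;

# Hypothetical helpers: set and get a package variable in main:: by name.
# The name never passes through the lexer as an identifier, so characters
# that are not ID_Start/ID_Continue are accepted.
sub set_by_name {
    my ($name, $value) = @_;
    no strict 'refs';
    ${"main::$name"} = $value;
    return;
}

sub get_by_name {
    my ($name) = @_;
    no strict 'refs';
    return ${"main::$name"};
}

my $name = "\x{2424}";    # SYMBOL FOR NEWLINE, not an identifier character
set_by_name($name, "\n");
print "line1", get_by_name($name), "line2\n";
```

This gives the effect the original script was after (a variable named ␤ holding a newline), at the cost of writing every access as a helper call instead of a literal `$␤`.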

> So I'm surprised at this specific failure, since in looking at the hex, the character encoding appears correct.

Don't look at hex. Look at code points, with uniquote -x or -v.

> Let me know if you have any questions. (An excellent UTF-8 character util that, unfortunately, is Windows only can be gotten from http://www.babelstone.co.uk/Software/BabelMap.html.)

I have a lot of to-me-excellent Unicode tools in http://training.perl.com/scripts/:

  leo nfd rename unichars uniquote
  macroman nfkc tcgrep uninames uwc
  nfc nfkd ucsort uniprops

No guarantees\, though. :)

--tom

p5pRT commented 13 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 13 years ago

From @cpansprout

Tom Christiansen wrote​:

> That isn't allowed. U+2424 isn't an ID_Start character (IDS) nor even an ID_Continue character. In fact, it's not a \w but a \p{Symbol}, which is not legal in an identifier.

But what about punctuation variables?

In a Latin-1 script, one can write $£. It doesn’t work in a utf8 script.

In a UTF-8 terminal:

  $ perl -e 'use utf8; print q\$£\' | perl
  $ perl -CO -e 'use utf8; print q\$£\' | perl -Mutf8
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3 in column 2 at - line 1.

So yes, this is a bug.

p5pRT commented 13 years ago

From tchrist@perl.com

> But what about punctuation variables?

Oh blech. Yes, I've known of this "hole". I chose not to complain about it. :)

> In a Latin-1 script, one can write $£. It doesn’t work in a utf8 script.
>
> In a UTF-8 terminal:

Um, why should that matter?

>   $ perl -e 'use utf8; print q\$£\' | perl
>   $ perl -CO -e 'use utf8; print q\$£\' | perl -Mutf8
>   Malformed UTF-8 character (unexpected end of string) at - line 1.
>   Unrecognized character \xA3 in column 2 at - line 1.
>
> So yes, this is a bug.

Are you saying that just because Perl allows one-character ASCII (and Latin-1) punctuation characters, it should allow any single code point variable no matter what it is?

Are you sure?

Or are you just saying that Latin-1 should be grandfathered, since a

Anyway, that backslash as a delimiter for q// is simply wicked. This is a much clearer demo:

  Given that:
    A DOLLAR SIGN is code point U+0024.
    A POUND SIGN is code point U+00A3.

  % perl -C0 -E 'say "\x{24}\x{A3} = 1"' | uniquote -b
  $\xA3 = 1

  % perl -C0 -E 'say "\x{24}\x{A3} = 1"' | perl -C0 -c
  - syntax OK

but

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | uniquote -v
  $\N{POUND SIGN} = 1

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | uniquote -x
  $\x{A3} = 1

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | uniquote -b
  $\xC2\xA3 = 1

So then​:

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | perl -C0 -Mutf8 | & uniquote -b
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3; marked by <-- HERE after $\xC2<-- HERE near column 2 at - line 1.
  Exit 255

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | perl -CS -Mutf8 | & uniquote -b
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3; marked by <-- HERE after $\xC2<-- HERE near column 2 at - line 1.
  Exit 255

Notice that we are generating illegal UTF-8. That's wrong. But at least that bug is fixed in blead, kinda:

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | perl -CS -Mutf8 | & uniquote -x
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  /home/tchrist/scripts/uniquote: utf8 "\xC2" does not map to Unicode at standard input line 2
  Exit 1

That was bad.

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -x
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3; marked by <-- HERE after $\x{C2}<-- HERE near column 2 at - line 1.
  Exit 255

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -C0 -Mutf8 | & uniquote -b
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3; marked by <-- HERE after $\xC2<-- HERE near column 2 at - line 1.
  Exit 255

But it's fixed so as not to generate illegal UTF-8 anymore when the std streams are in that encoding:

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -b
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3; marked by <-- HERE after $\xC3\x82<-- HERE near column 2 at - line 1.
  Exit 255

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -x
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3; marked by <-- HERE after $\x{C2}<-- HERE near column 2 at - line 1.
  Exit 255

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -v
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3; marked by <-- HERE after $\N{LATIN CAPITAL LETTER A WITH CIRCUMFLEX}<-- HERE near column 2 at - line 1.
  Exit 255

So it is a *slight* improvement\, eh? :)
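
One plausible reading of the `$\xC3\x82` and the LATIN CAPITAL LETTER A WITH CIRCUMFLEX in the blead output: the message contains the lone lead byte \xC2, and the UTF-8 output layer encodes that byte as if it were the character U+00C2. That mechanism is an assumption, not something confirmed in this thread; the sketch below only shows that the bytes are consistent with it. `bytes_as_chars_to_utf8` is a hypothetical helper.

```perl
use strict;
use warnings;
use charnames ();
use Encode qw(encode decode);

# Hypothetical helper: treat each input byte as a Latin-1 character and
# encode the result as UTF-8, as an output layer would for a byte string.
sub bytes_as_chars_to_utf8 {
    my ($bytes) = @_;
    return encode('UTF-8', decode('ISO-8859-1', $bytes));
}

my $lead_byte = "\xC2";    # first byte of the UTF-8 encoding of U+00A3
my $written   = bytes_as_chars_to_utf8($lead_byte);

printf "bytes written: %s\n",
    join ' ', map { sprintf('\\x%02X', ord($_)) } split //, $written;
printf "read back as:  U+00C2 %s\n", charnames::viacode(0xC2);
# prints:
# bytes written: \xC3 \x82
# read back as:  U+00C2 LATIN CAPITAL LETTER A WITH CIRCUMFLEX
```

Either way, the output is well-formed UTF-8 but names the wrong character, which fits the "slight improvement" verdict.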

--tom

p5pRT commented 13 years ago

From @cpansprout

On Apr 17, 2011, at 6:41 PM, Tom Christiansen wrote:

>> But what about punctuation variables?
>
> Oh blech. Yes, I've known of this "hole". I chose not to complain about it. :)
>
>> In a Latin-1 script, one can write $£. It doesn’t work in a utf8 script.
>>
>> In a UTF-8 terminal:
>
> Um, why should that matter?

Just so you know what I’m feeding to perl.

>>   $ perl -e 'use utf8; print q\$£\' | perl
>>   $ perl -CO -e 'use utf8; print q\$£\' | perl -Mutf8
>>   Malformed UTF-8 character (unexpected end of string) at - line 1.
>>   Unrecognized character \xA3 in column 2 at - line 1.
>>
>> So yes, this is a bug.
>
> Are you saying that just because Perl allows one-character ASCII (and Latin-1) punctuation characters, it should allow any single code point variable no matter what it is?

Yes.

> Are you sure?

No, but it makes sense to me that way. Non-\w vars should also be forced into main. Er, maybe this is not such a good idea, because of the whole IDS vs XIDS vs alphanumeric mess. :-)

> Or are you just saying that Latin-1 should be grandfathered, since a

Maybe they should, but I don’t

> Anyway, that backslash as a delimiter for q// is simply wicked.

:-)

It stands out, doesn’t it?


p5pRT commented 13 years ago

From lawalsh@tlinx.org

tchrist1 via RT wrote​:

Linda Walsh (via RT) \perlbug\-followup@&#8203;perl\.org wrote on Sun\, 17 Apr 2011 01​:22​:25 PDT​:

I was trying to use the utf-8 character U+2424\, called 'SYMBOL FOR NEWLINE' as an identifier containing "\n" as follows​:

0 #!/usr/bin/perl -w 1 use strict; 2 use utf8; 3 use Readonly my $␤ => "\n"; 4 print "line1$␤line2\n";

That isn't allowed. U+2424 isn't an ID_Start character (IDS) nor even an ID_Continue character. In fact\, it's not a \w but a \p{Symbol}\, which is not legal in an identifier.


  I already thought of that and rejected it as irrelevant.

  Your reasoning doesn't jive with the error message.

  If it wasn't allowed as an identifier it would say 'invalid identifier. That's not what this is. It's a UTF-8 parsing error "Malformed UTF-8 character". It's not a malformed UTF-8 character. That's what the bug is reporting.

  As for whether or not a symbol can be in a variable name\, the variables\, perlvar lists a whole bunch of variable names that are $\\, like​: $' $+ $. $/ $| $\ (do I have to put in the whole list?). They are *syntactically* valid as variable names. From a practical standpoint\, other than a few(1?) like $£\, there aren't any LATIN1 symbols that aren't reserved. But as Father Chrysostomos mentions\, it is a valid variable in LATIN1 - but fails in UTF-8​: not because it is an invalid variable name\, but because the parser thinks it is invalid UTF-8\, which it is not.

  BTW\, later\, you write​:

Secondly\, you're using $$k. That's a symbolic dereference. That means you can get at the symbol table without the persnickety lexer kvetching all over your lunch....


  That was just the way that program was written. It's not essential for any of those characters. (You might try testing actual behavior before commenting about why it is allowed (or wouldn't be) in an actual variable name...)


perl -e ' use utf8; my $π="pi as id"; print $π . "\n"; ' pi as id


  The lexer is happy with '$π' and the other ones (phi and capital phi). Using '$$' was just a clean way for me to pre-install them as symbols usable in a calculator\, i.e. I can type $pi or $π and it will give the value for pi.

  Actually\, I want to type just 'π'\, but that's currently broken due to a bug in 'use constant'...(not my day for UTF-8)...

I have a lot of to-me-excellent Unicode tools in ...


  I'm sure!

  :-)

p5pRT commented 13 years ago

From tchrist@perl.com

Secondly\, you're using $$k. That's a symbolic dereference. That means you can get at the symbol table without the persnickety lexer kvetching all over your lunch....

That was just the way that program was written. It's not essential for any of those characters. (You might try testing actual behavior before commenting about why it is allowed (or wouldn't be) in an actual variable name...

I beg your pardon\, but I most certainly have "tested actual behavior". Whyever would you think I hadn't? I happen to use UTF-8 identifiers all the time. I also knew about the issue with non-IDS/IDC chars not giving good error messages.

perl -e ' use utf8; my $π="pi as id"; print $π . "\n"; ' pi as id

The lexer is happy with '$π' and the other ones (phi and capital phi).

--tom

p5pRT commented 13 years ago

From perl-diddler@tlinx.org

Father Chrysostomos via RT wrote​:

So yes\, this is a bug.


  Is this a separate bug or another instance of the same bug?

I tried​:


#!/usr/bin/perl -w use strict; use Readonly; sub RO(\[$@​%]@​) {goto &Readonly}; use utf8;

RO my $Tclear﹠home => `tput clear`; # U+FE60 Small Ampersand
RO my $Tclear＆home2 => `tput clear`; # U+FF06 FullWidth Ampersand


Neither work\, and both fail with​:

Unrecognized character \xEF in column 14 at ./amp.pl line [6|7].

p5pRT commented 13 years ago

From tchrist@perl.com

Linda Walsh \<perl-diddler@tlinx.org> wrote on Fri\, 22 Apr 2011 08:54:52 PDT:

RO my $Tclear﹠home => `tput clear`; # U+FE60 Small Ampersand
RO my $Tclear＆home2 => `tput clear`; # U+FF06 FullWidth Ampersand

----- Neither work\, and both fail with​:

Unrecognized character \xEF in column 14 at ./amp.pl line [6|7].

The bug is not that they fail; they *should* fail. The only bug is that the error message is in bytes not characters.

via `uniquote`​:

RO my $Tclear\N{U+FE60}home => `tput clear`; # U+FE60 Small Ampersand RO my $Tclear\N{U+FF06}home2 => `tput clear`; # U+FF06 FullWidth Ampersand

or via `uniquote -v`​:

RO my $Tclear\N{SMALL AMPERSAND}home => `tput clear`; # U+FE60 Small Ampersand RO my $Tclear\N{FULLWIDTH AMPERSAND}home2 => `tput clear`; # U+FF06 FullWidth Amp

  % uniprops fe60 ff06
  U+FE60 ‹﹠› \N{SMALL AMPERSAND}
      \pP \p{Po}
      All Any Assigned InSmallFormVariants Changes_When_NFKC_Casefolded CWKCF Common Zyyy Po
      P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Print Punctuation
      Small_Form_Variants
  U+FF06 ‹＆› \N{FULLWIDTH AMPERSAND}
      \pP \p{Po}
      All Any Assigned InHalfwidthAndFullwidthForms Changes_When_NFKC_Casefolded CWKCF
      Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Halfwidth_And_Fullwidth_Forms
      Other_Punctuation Punct Print Punctuation

As you see\, those are not IDC code points\, so do not belong in an identifier. This is made clear here​:

  % perl -E 'say "\x{fe60}" =~ /\p{IDC}/ || 0'
  0
  % perl -E 'say "\x{ff06}" =~ /\p{IDC}/ || 0'
  0

So what bug are you thinking this is?

It is not a bug that those are illegal characters. They are.

It is *only* a bug saying that the character is \xEF instead of saying it is \x{FE60} or \x{FF06}.
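To see where the `\xEF` in the message comes from, it may help to look at the UTF-8 encoding of the two code points. This is a minimal sketch using the core `Encode` module; `utf8_bytes` is a hypothetical helper name, not something from the ticket.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 encoding of a code point as a list of hex byte strings.
# utf8_bytes is an illustrative helper, not part of any Perl API.
sub utf8_bytes {
    my ($cp) = @_;
    return map { sprintf '%02X', $_ } unpack 'C*', encode('UTF-8', chr $cp);
}

for my $cp (0xFE60, 0xFF06) {
    printf "U+%04X => %s\n", $cp, join ' ', utf8_bytes($cp);
}
# U+FE60 => EF B9 A0
# U+FF06 => EF BC 86
```

Both encodings start with the byte EF, so a message that reports only the first byte cannot tell the two characters apart, whereas `\x{FE60}` and `\x{FF06}` would.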

--tom

p5pRT commented 13 years ago

From perl-diddler@tlinx.org

tchrist1 via RT wrote​:

Linda Walsh \<perl-diddler@tlinx.org> wrote on Fri\, 22 Apr 2011 08:54:52 PDT:

RO my $Tclear﹠home => `tput clear`; # U+FE60 Small Ampersand
RO my $Tclear＆home2 => `tput clear`; # U+FF06 FullWidth Ampersand

----- Neither work\, and both fail with​:

Unrecognized character \xEF in column 14 at ./amp.pl line [6|7].

The bug is not that they fail; they *should* fail. The only bug is that the error message is in bytes not characters.

I really think you are getting hung up on the props for the characters.

I don't see them as being useful to enforce in the context we are using them.

When someone goes and looks at unicode characters\, all the 'props' are NOT listed. They are detailed arcana that will confuse most users.

I know you may not like that answer\, since it seems to be something that you think is really important\, but I don't think the vast majority of users will see it that way -- they will just wonder why a perfectly valid value doesn't work.

Example -- I want to use ":" in song titles -- so I use the FULL Width "：" -- imagine if MS enforced your props and threw it out for no good reason. ":" is banned because it is used/needed.

The other symbols I am mentioning are ones that I'm using in place of ones that perl has claimed for its own operator set. It's bad precedent and **non-perl**\, to reserve a bunch of things needlessly.

The basic design philosophy of perl is "Do what I mean"(perlsyn)\, not "do the right thing". To adhere to doing the 'right thing' over doing the 'useful thing' that would be what 99% of the users would expect and want it to do would be a harmful design decision with no apparent benefit.

p5pRT commented 12 years ago

From @cpansprout

On Sun Sep 06 11​:35​:30 2009\, moritz wrote​:

This is a bug report for perl from moritz@​faui2k3.org\, generated with the help of perlbug 1.36 running under perl 5.10.0.

----------------------------------------------------------------- [Please enter your report here]

As pointed out on \<http​://www.perlmonks.org/?node_id=793800>\, a program which tries to use a non-ASCII non-alphanumeric character in a variable name throws an error "Malformed UTF-8 character (unexpected end of string) at"\, even though the file is in perfectly fine UTF-8.

Example: $ cat foo.pl use utf8; my $» = 1; $ perl foo.pl Malformed UTF-8 character (unexpected end of string) at foo.pl line 2. Unrecognized character \xBB in column 5 at foo.pl line 2.

Unicode punctuation variables work now\, as of dfb182850 and the preceding commits.

But another issue that came up later in the ticket\, that $♠♠ produces ‘Unrecognized character \xE2’ instead of mentioning the Unicode code point\, is still not fixed.

p5pRT commented 12 years ago

From @Hugmeir

On Thu\, Oct 6\, 2011 at 6​:50 PM\, Father Chrysostomos via RT \< perlbug-followup@​perl.org> wrote​:

Unicode punctuation variables work now\, as of dfb182850 and the preceding commits.

Basically\, as of right now in blead\, variables of length one match (?&sigil) \p{Any} (?=\z) instead of (?&sigil) \C (?=\z). Do we really want \p{Any}\, which allows a whole bunch of problematic characters? If not\, what do we restrict it to?

But another issue that came up later in the ticket\, that $♠♠ produces ‘Unrecognized character \xE2’ instead of mentioning the Unicode code point\, is still not fixed.

This is fixed in the other gsoc branch thingy\, so maybe in a couple of whiles it will. Hopefully!

Incidentally\, Father C\, mad props for cleaning up the gv/stash stuff!

p5pRT commented 12 years ago

From @cpansprout

On Thu Oct 06 19​:52​:11 2011\, Hugmeir wrote​:

On Thu\, Oct 6\, 2011 at 6​:50 PM\, Father Chrysostomos via RT \< perlbug-followup@​perl.org> wrote​:

Unicode punctuation variables work now\, as of dfb182850 and the preceding commits.

Basically\, as of right now in blead\, variables of length one match (?&sigil) \p{Any} (?=\z) instead of (?&sigil) \C (?=\z). Do we really want \p{Any}\, which allows a whole bunch of problematic characters? If not\, what do we restrict it to?

\S or whatever Unicode equivalent Tom Christiansen says is more appropriate.
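The difference between the candidate classes can be sketched as below. This is an illustration only, not the check the tokenizer performs: `accepted_by` is a hypothetical helper, the class labels are made up for the example, and the `/u` modifier assumes Perl 5.14 or later.

```perl
use 5.014;
use warnings;

# List which candidate classes would accept a one-character variable name.
# accepted_by and its labels are illustrative, not taken from toke.c.
sub accepted_by {
    my ($cp) = @_;
    my $ch = chr $cp;
    my @hits;
    push @hits, 'Any'      if $ch =~ /\A\p{Any}\z/u;
    push @hits, 'NonSpace' if $ch =~ /\A\S\z/u;
    push @hits, 'ID_Start' if $ch =~ /\A\p{ID_Start}\z/u;
    return @hits;
}

# U+2660 BLACK SPADE SUIT, U+00A0 NO-BREAK SPACE, U+03C0 GREEK SMALL LETTER PI
for my $cp (0x2660, 0x00A0, 0x03C0) {
    printf "U+%04X: %s\n", $cp, join ', ', accepted_by($cp);
}
```

Under these assumptions, `\p{Any}` accepts all three characters, `\S` rejects only the non-breaking space, and `\p{ID_Start}` accepts only π.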

I probably pushed the changes too soon\, but I didn’t discover this till afterwards.

Also\, my $♠ is now permitted\, which is a bug.

And $  (that’s a non-breaking space\, but Firefox is untrustworthy)\, too.

But another issue that came up later in the ticket\, that $♠♠ produces ‘Unrecognized character \xE2’ instead of mentioning the Unicode code point\, is still not fixed.

This is fixed in the other gsoc branch thingy\, so maybe in a couple of whiles it will. Hopefully!

Incidentally\, Father C\, mad props for cleaning up the gv/stash stuff!

I still need to write a summary explaining why some parts were modified or omitted.

p5pRT commented 12 years ago

From @cpansprout

On Thu Oct 06 19​:52​:11 2011\, Hugmeir wrote​:

On Thu\, Oct 6\, 2011 at 6​:50 PM\, Father Chrysostomos via RT \< perlbug-followup@​perl.org> wrote​:

Unicode punctuation variables work now\, as of dfb182850 and the preceding commits.

Basically\, as of right now in blead\, variables of length one match (?&sigil) \p{Any} (?=\z) instead of (?&sigil) \C (?=\z). Do we really want \p{Any}\, which allows a whole bunch of problematic characters? If not\, what do we restrict it to?

But another issue that came up later in the ticket\, that $♠♠ produces ‘Unrecognized character \xE2’ instead of mentioning the Unicode code point\, is still not fixed.

This is fixed in the other gsoc branch thingy\, so maybe in a couple of whiles it will. Hopefully!

OK\, where do I start? (I actually want to finish reimplementing $[ first\, so it may be a while.)

Incidentally\, Father C\, mad props for cleaning up the gv/stash stuff!

p5pRT commented 12 years ago

From @cpansprout

On Thu Oct 06 20​:39​:42 2011\, sprout wrote​:

Also\, my $♠ is now permitted\, which is a bug.

I’ve made a separate ticket for that\, #111980.

And $  (that’s a non-breaking space\, but Firefox is untrustworthy)\, too.

When we deal with Unicode brackets\, we can deal with Unicode whitespace\, too. See my note at https://rt-archive.perl.org/perl5/Ticket/Display.html?id=89032#txn-1097256.

But another issue that came up later in the ticket\, that $♠♠ produces ‘Unrecognized character \xE2’ instead of mentioning the Unicode code point\, is still not fixed.

This is fixed in the other gsoc branch thingy\, so maybe in a couple of whiles it will. Hopefully!

It was integrated recently. See https://rt-archive.perl.org/perl5/Ticket/Display.html?id=107008#txn-1099872. I’m not sure which patch did it\, but this bug is now fixed.

--

Father Chrysostomos

p5pRT commented 12 years ago

@cpansprout - Status changed from 'open' to 'resolved'

p5pRT commented 12 years ago

From @nwc10

On Thu Mar 29 00​:14​:09 2012\, sprout wrote​:

On Thu Oct 06 20​:39​:42 2011\, sprout wrote​:

But another issue that came up later in the ticket\, that $♠♠ produces ‘Unrecognized character \xE2’ instead of mentioning the Unicode code point\, is still not fixed.

This is fixed in the other gsoc branch thingy\, so maybe in a couple of whiles it will. Hopefully!

It was integrated recently. See https://rt-archive.perl.org/perl5/Ticket/Display.html?id=107008#txn-1099872. I’m not sure which patch did it\, but this bug is now fixed.

While I was doing something else\, I set off a bisect run. The answer is​:

HEAD is now at 734ab32 toke.c: S_no_op cleanup
good - non-zero exit from ./perl -Ilib -e eval "my \$\x{2660}\x{2660}"; die $@ unless $@ =~ /Unrecognized character \\x\{2660\}/
e2f06df0a8c96f7d9a5f3214fc5bf2daf34588c3 is the first bad commit
commit e2f06df0a8c96f7d9a5f3214fc5bf2daf34588c3
Author: Brian Fraser \<fraserbn@gmail.com>
Date: Sat Aug 6 07:55:06 2011 +0100

  toke.c: 'Unrecognized character' croak cleanup.

:040000 040000 cab624cfbcf5d9693603b516d54d74126e2db1e6 2ac92c7f6ba76f070c2a2f2680c9a3fa909a2104 M t
:100644 100644 3a3cddb7606c1ccee8fe60d376dedec91459d7c2 c0a5cdaf09292fd3ed2e484b9526e7f087371080 M toke.c
bisect run success
That took 1528 seconds

Nicholas Clark