Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

\p{Letter} not matching unicode input when followed by $ #9445

Closed p5pRT closed 16 years ago

p5pRT commented 16 years ago

Migrated from rt.perl.org#57800 (status was 'resolved')

Searchable as RT57800$

p5pRT commented 16 years ago

From mark@blackmans.org

Created by mark@blackmans.org

/^[\p{Letter}]+$/ doesn't match a cedilla C (utf8)

perl -ne 'chomp; if (/^[\p{Letter}]+$/) { print "letter-->",$_,"\n"; }'

and with a utf8 terminal, enter a cedilla C. U+0037

if you drop the final end-of-string match, the match succeeds with a single cedilla C.

- Mark

Perl Info

```
Flags:
    category=core
    severity=low

Site configuration information for perl v5.8.8:

Configured by MBlackman at Fri Dec 1 11:27:50 GMT 2006.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=darwin, osvers=8.8.1, archname=darwin-2level
    uname='darwin markimac.local 8.8.1 darwin kernel version 8.8.1: mon sep 25 19:42:00 pdt 2006; root:xnu-792.13.8.obj~1release_i386 i386 i386 '
    config_args='-des -Dprefix=/opt/local -Dccflags=-I'/opt/local/include' -Dldflags=-L/opt/local/lib -Dvendorprefix=/opt/local -Dcc=/usr/bin/gcc-4.0'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='/usr/bin/gcc-4.0', ccflags ='-I/opt/local/include -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/opt/local/include',
    optimize='-O3',
    cppflags='-no-cpp-precomp -I/opt/local/include -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/opt/local/include'
    ccversion='', gccversion='4.0.1 (Apple Computer, Inc. build 5363)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags ='-L/opt/local/lib'
    libpth=/opt/local/lib /usr/lib
    libs=-ldbm -ldl -lm -lc
    perllibs=-ldl -lm -lc
    libc=/usr/lib/libc.dylib, so=dylib, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-L/opt/local/lib -bundle -undefined dynamic_lookup'

Locally applied patches:

@INC for perl v5.8.8:
    /opt/local/lib/perl5/5.8.8/darwin-2level
    /opt/local/lib/perl5/5.8.8
    /opt/local/lib/perl5/site_perl/5.8.8/darwin-2level
    /opt/local/lib/perl5/site_perl/5.8.8
    /opt/local/lib/perl5/site_perl
    /opt/local/lib/perl5/vendor_perl/5.8.8/darwin-2level
    /opt/local/lib/perl5/vendor_perl/5.8.8
    /opt/local/lib/perl5/vendor_perl
    .

Environment for perl v5.8.8:
    DYLD_LIBRARY_PATH (unset)
    HOME=/Volumes/cs/MBlackman
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/opt/local/apache2/bin:/opt/local/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/gwTeX/bin/i386-apple-darwin-current
    PERL_BADLANG (unset)
    SHELL=/bin/zsh
```

p5pRT commented 16 years ago

From zefram@fysh.org

Mark Blackman wrote:

/^[\p{Letter}]+$/ doesn't match a cedilla C (utf8)

It matches a C-cedilla ("\xc7"). It does not match the UTF-8 encoding of C-cedilla ("\xc3\x87"), because "\x87" is a control character, not a letter.

if you drop the final end-of-string match, the match succeeds with a single cedilla C.

The first octet of the UTF-8 encoding, "\xc3", is a letter, A-tilde.
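
The octets-versus-characters distinction can be reproduced without a terminal. The following is a minimal sketch, assuming a reasonably modern Perl with the core `Encode` module; the helper names `all_letters` and `starts_with_letters` are illustrative and do not come from the original report.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two octets a UTF-8 terminal sends for C-cedilla (U+00C7).
my $octets = "\xc3\x87";

# Decoding turns those two octets into a single character.
my $chars = decode('UTF-8', $octets);

# Hypothetical helper: does the whole string consist of letters?
sub all_letters { return $_[0] =~ /^\p{Letter}+$/ ? 1 : 0 }

# Hypothetical helper: same pattern without the end-of-string anchor.
sub starts_with_letters { return $_[0] =~ /^\p{Letter}+/ ? 1 : 0 }

printf "octets: length=%d all_letters=%d starts_with_letters=%d\n",
    length($octets), all_letters($octets), starts_with_letters($octets);
printf "chars:  length=%d all_letters=%d\n",
    length($chars), all_letters($chars);
```

The undecoded string fails the anchored match because its second octet is not a letter, yet passes the unanchored one because its first octet is, which matches the behaviour reported in the ticket.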

To demonstrate what's going on:

$ perl -MData::Dumper -pe '$Data::Dumper::Useqq=1; chomp; $_= Dumper($_)'

If you type in a UTF-8-encoded C-cedilla, which is what your terminal evidently does, you'll get the output "\303\207". If you type in a Latin-1-encoded C-cedilla, you'll get the output "\307". perl is seeing the octets of your input, and doesn't know that you're using UTF-8.

To make perl see the UTF-8-encoded characters, rather than the octets, add the "-CI" option to the command line. This turns on UTF-8 decoding of stdin. There are similar options to apply a UTF-8 encoding layer to other I/O streams. Try:

$ perl -CI -MData::Dumper -pe '$Data::Dumper::Useqq=1; chomp; $_= Dumper($_)'

If you type in a UTF-8-encoded C-cedilla, this time it'll output "\x{c7}", indicating that it's interpreted it as one character rather than two octets. If you type in a Latin-1-encoded C-cedilla, you'll get a fatal error, because it's not valid UTF-8.

-zefram
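
The same effect can be had inside a script by putting a decoding layer on the input handle, for example with `binmode(STDIN, ':encoding(UTF-8)')`. The sketch below is an illustration under assumptions, not part of the original reply: it uses an in-memory filehandle in place of STDIN so it can run unattended, and `read_line_via` is a made-up helper name.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;   # show non-ASCII data as escapes
$Data::Dumper::Terse = 1;   # omit the "$VAR1 = " prefix

# Hypothetical helper: read one line from a byte string through the
# given PerlIO layer. The in-memory handle stands in for STDIN.
sub read_line_via {
    my ($layer, $bytes) = @_;
    open(my $fh, "<$layer", \$bytes)
        or die "cannot open in-memory handle: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

my $input = "\xc3\x87\n";   # what a UTF-8 terminal sends for C-cedilla

my $as_octets = read_line_via(':raw', $input);              # no decoding
my $as_chars  = read_line_via(':encoding(UTF-8)', $input);  # decoded

print Dumper($as_octets);   # dumps two octets
print Dumper($as_chars);    # dumps one character
```

Reading through `:raw` corresponds to the first one-liner in the message (perl sees octets); reading through `:encoding(UTF-8)` corresponds roughly to running with `-CI`.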

p5pRT commented 16 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 16 years ago

From mark@exonetric.com

Many thanks. I was under the impression that utf-8 encoding
interpretation was on by default in 5.8.5.

- Mark


p5pRT commented 16 years ago

From tchrist@perl.com

/^[\p{Letter}]+$/ doesn't match a cedilla C (utf8)

perl -ne 'chomp; if (/^[\p{Letter}]+$/) { print "letter-->",$_,"\n"; }'

and with a utf8 terminal, enter a cedilla C. U+0037

if you drop the final end-of-string match, the match succeeds with a [single] cedilla C.

That's a mildly odd pattern. I wonder why you're embracketing a single property like that? These aren't icky POSIX character classes; you don't need the brackets.

It all depends on how you enter things. The precomposed Unicode character LATIN CAPITAL LETTER C WITH CEDILLA is at code point 0xC7, while the LATIN SMALL LETTER C WITH CEDILLA is at 0xE7. I guess if you flip your 3 the other way it looks like an E, but ....

The problem is surely how you're entering data (code point such and such says nothing about the physical representation of the logical number), something you did *not* specify, which makes it harder to know for sure. But your envariables don't look culpable.

You really should still consult the perlrun manpage:

  "-C" on its own (not followed by any number or option
  list), or the empty string "" for the "PERL_UNICODE"
  environment variable, has the same effect as "-CSDL".
  In other words, the standard I/O handles and the
  default "open()" layer are UTF-8-fied but only if the
  locale environment variables indicate a UTF-8 locale.
  This behaviour follows the implicit (and problematic)
  UTF-8 behaviour of Perl 5.8.0.

So at the initial 8.0 release of perl5, we went through a time when things were um, a little too quick to jump at your envariables, and many a train-wreck ensued. You *probably* don't want that, any more than you want the L flag, which is documented to

  L 64 normally the "IOEioA" are unconditional, the L makes
  them conditional on the locale environment variables
  (the LC_ALL, LC_TYPE, and LANG, in the order of decreasing
  precedence) -- if the variables indicate UTF-8, then the
  selected "IOEioA" are in effect
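
To make the "-CSDL" arithmetic concrete, here is a small sketch that converts between the option letters and the number perl stores in `${^UNICODE}`. Only the value 64 for `L` is quoted in this thread; the remaining values are taken from the perlrun table as best recalled and should be checked against your own perlrun. The helper names are invented for illustration.

```perl
use strict;
use warnings;

# Letter => numeric value, following the -C table in perlrun
# (assumption: only "L 64" is quoted in the thread itself).
my @FLAGS = (
    [ I => 1 ], [ O => 2 ],  [ E => 4 ],  [ i => 8 ],
    [ o => 16 ], [ A => 32 ], [ L => 64 ], [ a => 256 ],
);

# Composite letters are shorthand: S = I+O+E, D = i+o.
my %COMPOSITE = ( S => 7, D => 24 );

# Hypothetical helper: turn an option string such as "SDL" into its number.
sub unicode_option_value {
    my ($letters) = @_;
    my %value = ( %COMPOSITE, map { $_->[0] => $_->[1] } @FLAGS );
    my $total = 0;
    for my $letter (split //, $letters) {
        die "unknown -C letter '$letter'\n" unless exists $value{$letter};
        $total |= $value{$letter};
    }
    return $total;
}

# Hypothetical helper: list the basic letters set in a numeric value.
sub unicode_option_letters {
    my ($number) = @_;
    return join '', map { $_->[0] } grep { $number & $_->[1] } @FLAGS;
}

my $sdl = unicode_option_value('SDL');
printf "-CSDL is %d, i.e. %s\n", $sdl, unicode_option_letters($sdl);
printf "this process runs with \${^UNICODE} = %d\n", ${^UNICODE};
```

Running the script with and without `-C` options shows how the last line changes, which is a quick way to confirm what a given command line actually enabled.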

For your education, edification, and indeed, even amusement, I offer up the following program that explicitly sets things (read: streams; I/O layers) in a Unicodey way--something I can't see that you did--and only then goes about sniffing around to decide whether a string is "letterishlike".

See, even then, you're still going to need to be just a *wee* bit more snerpickety in your inspection. But fear not, for the key to getting this final part right is provided in a comment by itself right at the very top of the demo program below.

*DO* please enjoy! I sure know *I* did. You may lay the blame to this "artistic" coding whim on my recent (ahem) reading material, which either you already know (of)--or else, surely don't care to. (:->

--tom

#!/bin/sh

# gcb - judge letterishness, plus count
#       Graphemes, Characters, and Bytes
#
# Tom Christiansen tchrist@perl.com
# Mon Aug 11 23:09:44 MDT 2008

  #======================----------->vvvvvvvvvvvvvvvvvvvv<---#
  # The *KEY* to it all is simply =~ /\A(?:(?=\pL)\X)+\z/ #
  #=====================----------->#^^^^^^^^^^^^^^^^^^^^<---#

#############################################################################
# Embedded cryptojest notwithstanding, the well-commented, ASCII-art demo   #
# program [ it's something of a ship if you turn your monitor sideways or   #
# run it through my rot90 filter :-] I enclose below can be expected to     #
# produce the following clear and illuminating Unicode output:              #
#############################################################################

#  1: G⁼1 C⁼ 1 B⁼ 1 M has but \pL in U+004d
#  2: G⁼1 C⁼ 1 B⁼ 2 Μ has but \pL in U+039c
#  3: G⁼1 C⁼ 1 B⁼ 2 µ has but \pL in U+00b5
#  4: G⁼1 C⁼ 1 B⁼ 2 μ has but \pL in U+03bc
#  5: G⁼1 C⁼ 1 B⁼ 1 C has but \pL in U+0043
#  6: G⁼1 C⁼ 1 B⁼ 2 Ç has but \pL in U+00c7
#  7: G⁼1 C⁼ 2 B⁼ 3 Ç has but \pL in U+0043.0327
#  8: G⁼1 C⁼ 3 B⁼ 5 Ç̌ has but \pL in U+0043.0327.030c
#  9: G⁼1 C⁼ 2 B⁼ 4 Ç̌ has but \pL in U+00c7.030c
# 10: G⁼1 C⁼ 3 B⁼ 5 Ç̌ has but \pL in U+0043.030c.0327
# 11: G⁼1 C⁼ 2 B⁼ 5 ℯ̧ has but \pL in U+212f.0327
# 12: G⁼1 C⁼ 1 B⁼ 3 ℯ has but \pL in U+212f
# 13: G⁼1 C⁼ 1 B⁼ 3 ℛ has but \pL in U+211b
# 14: G⁼1 C⁼ 2 B⁼ 6 ℛ⃠ has but \pL in U+211b.20e0
# 15: G⁼1 C⁼ 1 B⁼ 2 Π has but \pL in U+03a0
# 16: G⁼1 C⁼ 2 B⁼ 5 ψ⃗ has but \pL in U+03c8.20d7
# 17: G⁼1 C⁼ 1 B⁼ 1 ? LACKS \pL in U+003f
# 18: G⁼1 C⁼ 1 B⁼ 2 ʔ has but \pL in U+0294
# 19: G⁼1 C⁼ 2 B⁼ 4 ʔ̴ has but \pL in U+0294.0334
# 20: G⁼1 C⁼ 1 B⁼ 2 ¿ LACKS \pL in U+00bf
# 21: G⁼2 C⁼ 2 B⁼ 5 πℯ has but \pL in U+03c0.212f
# 22: G⁼2 C⁼ 3 B⁼ 7 Φℯ̄ has but \pL in U+03a6.212f.0304
# 23: G⁼2 C⁼ 2 B⁼ 3 ¿? LACKS \pL in U+00bf.003f
# 24: G⁼2 C⁼ 2 B⁼ 4 ʕʖ has but \pL in U+0295.0296
# 25: G⁼2 C⁼ 2 B⁼ 4 ʕʔ has but \pL in U+0295.0294
# 26: G⁼2 C⁼ 2 B⁼ 5 Ⅱª LACKS \pL in U+2161.00aa
# 27: G⁼3 C⁼ 3 B⁼ 4 IIª has but \pL in U+0049.0049.00aa
# 28: G⁼4 C⁼ 4 B⁼10 Ψℯ⁻¹ LACKS \pL in U+03a8.212f.207b.00b9
# 29: G⁼4 C⁼ 4 B⁼ 5 Cómo has but \pL in U+0043.00f3.006d.006f
# 30: G⁼4 C⁼ 5 B⁼ 6 Cómo has but \pL in U+0043.006f.0301.006d.006f
# 31: G⁼6 C⁼ 7 B⁼ 9 ¿Cómo? LACKS \pL in U+00bf.0043.006f.0301.006d.006f.003f
# 32: G⁼6 C⁼ 7 B⁼10 ʖCómoʔ has but \pL in U+0296.0043.006f.0301.006d.006f.0294
# 33: G⁼6 C⁼14 B⁼24 ʖ̲C̲ó̲̲m̲o̲ʔ̲ has but \pL in U+0296.0332.0043.0332.006f.0332.0301.0332.006d.0332.006f.0332.0294.0332
# 34: G⁼6 C⁼ 6 B⁼ 9 wrǽþþu has but \pL in U+0077.0072.01fd.00fe.00fe.0075
# 35: G⁼6 C⁼ 6 B⁼ 9 WRǼÞÞU has but \pL in U+0057.0052.01fc.00de.00de.0055
# 36: G⁼6 C⁼ 7 B⁼11 wrǽþþu has but \pL in U+0077.0072.00e6.0301.00fe.00fe.0075
# 37: G⁼6 C⁼ 7 B⁼11 WRǼÞÞU has but \pL in U+0057.0052.00c6.0301.00de.00de.0055
# 38: G⁼7 C⁼ 7 B⁼ 8 laȝamon has but \pL in U+006c.0061.021d.0061.006d.006f.006e
# 39: G⁼7 C⁼ 7 B⁼ 8 LAȜAMON has but \pL in U+004c.0041.021c.0041.004d.004f.004e
# 40: G⁼6 C⁼ 6 B⁼ 8 tschüß has but \pL in U+0074.0073.0063.0068.00fc.00df
# 41: G⁼6 C⁼ 7 B⁼ 9 tschüß has but \pL in U+0074.0073.0063.0068.0075.0308.00df
# 42: G⁼6 C⁼ 6 B⁼ 8 ßühcsT has but \pL in U+00df.00fc.0068.0063.0073.0054
# 43: G⁼7 C⁼ 7 B⁼ 8 TSCHÜSS has but \pL in U+0054.0053.0043.0048.00dc.0053.0053
# 44: G⁼7 C⁼ 8 B⁼ 9 TSCHÜSS has but \pL in U+0054.0053.0043.0048.0055.0308.0053.0053
# 45: G⁼7 C⁼ 8 B⁼ 9 Ss̈uhcst has but \pL in U+0053.0073.0308.0075.0068.0063.0073.0074
# 46: G⁼7 C⁼ 8 B⁼ 9 Ssühcst has but \pL in U+0053.0073.0075.0308.0068.0063.0073.0074
# 47: G⁼8 C⁼ 8 B⁼10 coŀleció has but \pL in U+0063.006f.0140.006c.0065.0063.0069.00f3
# 48: G⁼8 C⁼ 8 B⁼10 ÓiceĿloC has but \pL in U+00d3.0069.0063.0065.013f.006c.006f.0043
# 49: G⁼8 C⁼ 9 B⁼11 coŀleció has but \pL in U+0063.006f.0140.006c.0065.0063.0069.006f.0301
# 50: G⁼8 C⁼ 9 B⁼11 COĿLECIÓ has but \pL in U+0043.004f.013f.004c.0045.0043.0049.004f.0301
# 51: G⁼9 C⁼10 B⁼12 col·leció LACKS \pL in U+0063.006f.006c.00b7.006c.0065.0063.0069.006f.0301
# 52: G⁼9 C⁼10 B⁼12 COL·LECIÓ LACKS \pL in U+0043.004f.004c.00b7.004c.0045.0043.0049.004f.0301

perl -CS -Mcharnames=:full,:short -le 'print for(
"M","\N{Greek:Mu}","\x{B5}","\N{Greek:mu}","C",
"\N{Latin:C WITH CEDILLA}","C\N{COMBINING CEDILLA}",
"C\N{COMBINING CEDILLA}\N{COMBINING CARON}",
"\N{Latin:C WITH CEDILLA}\N{COMBINING CARON}",
"C\N{COMBINING CARON}\N{COMBINING CEDILLA}",
"\N{SCRIPT SMALL E}\N{COMBINING CEDILLA}",
"\N{SCRIPT SMALL E}","\N{SCRIPT CAPITAL R}",
"\N{SCRIPT CAPITAL R}\N{COMBINING ENCLOSING CIRCLE BACKSLASH}",
"\N{Greek:Pi}","\N{Greek:psi}\N{COMBINING RIGHT ARROW ABOVE}","?",
"\N{LATIN LETTER GLOTTAL STOP}","\N{LATIN LETTER GLOTTAL STOP}".
"\N{COMBINING TILDE OVERLAY}","\x{bf}","\N{Greek:pi}\N{SCRIPT SMALL E}",
"\N{Greek:Phi}\N{SCRIPT SMALL E}\N{COMBINING MACRON}","\x{bf}?",
"\N{LATIN LETTER PHARYNGEAL VOICED FRICATIVE}".
"\N{LATIN LETTER INVERTED GLOTTAL STOP}",
"\N{LATIN LETTER PHARYNGEAL VOICED FRICATIVE}".
"\N{LATIN LETTER GLOTTAL STOP}","\x{2161}\x{aa}","II\x{aa}",
"\N{Greek:Psi}\N{SCRIPT SMALL E}\N{SUPERSCRIPT MINUS}\N{SUPERSCRIPT ONE}",
"C\N{Latin:o with acute}mo","Co\N{COMBINING ACUTE ACCENT}mo",
"\x{bf}Co\N{COMBINING ACUTE ACCENT}mo?",
"\N{LATIN LETTER INVERTED GLOTTAL STOP}Co\N{COMBINING ACUTE ACCENT}".
"mo\N{LATIN LETTER GLOTTAL STOP}",
"\N{LATIN LETTER INVERTED GLOTTAL STOP}\x{332}C\x{332}o\x{332}".
"\N{COMBINING ACUTE ACCENT}\x{332}m\x{332}o\x{332}".
"\N{LATIN LETTER GLOTTAL STOP}\x{332}",
"wr\x{1fd}\N{Latin:thorn}\N{Latin:thorn}u",
"\Uwr\x{1fd}\N{Latin:thorn}\N{Latin:thorn}u",
"wr\x{e6}\x{301}\N{Latin:thorn}\N{Latin:thorn}u",
"\Uwr\x{e6}\x{301}\N{Latin:thorn}\N{Latin:thorn}u",
"la\N{Latin:yogh}amon",uc"La\N{Latin:yogh}amon",
"tsch\N{Latin:u with diaeresis}\x{df}",
"tschu\N{COMBINING DIAERESIS}\x{df}",scalar reverse(
"\utsch\N{Latin:u with diaeresis}\x{df}"),
"\Utsch\N{Latin:u with diaeresis}\x{df}",
"\Utschu\N{COMBINING DIAERESIS}\x{df}",ucfirst scalar reverse(
"tschu\N{COMBINING DIAERESIS}\x{df}"),ucfirst reverse(scalar reverse
reverse "tschu\N{COMBINING DIAERESIS}\x{df}"=~/(?#YANETUT)\X/g),
"co\x{140}leci\N{Latin:o with acute}",ucfirst reverse(
"\ucol\u\x{140}eci\N{Latin:o with acute}"),
"co\x{140}lecio\N{COMBINING ACUTE ACCENT}",
"\Uco\x{140}lecio\N{COMBINING ACUTE ACCENT}",
"col\x{b7}lecio\N{COMBINING ACUTE ACCENT}",
"\Ucol\x{b7}lecio\N{COMBINING ACUTE ACCENT}",
)'|perl -CS -mbytes -nle '(($m)=m=\A(\X+)\z=)||die;
  printf"%2d: G\x{207C}%d C\x{207C}%2d B\x{207C}%2d".
  "\t%s\t%s\t\\pL in U+%v04x \n",++$i,scalar(()=
  $m=~m~\X~g),(length$m,bytes::length$m,$m),$m=~m
  "\A(?:(?=\pL)\X)+\z"?"has but":"LACKS",$m, ;';

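For readers who would rather not untangle the one-liner, the two ideas in it (the "key" pattern, and counting graphemes, characters, and bytes) can be sketched in plain Perl. This is a simplified illustration rather than a rewrite of the program above: it assumes the core `Encode` module, and the names `is_letterish` and `gcb_counts` are invented here.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The key pattern from the header comment: every grapheme cluster (\X)
# in the string must begin with a letter (\pL).
sub is_letterish { return $_[0] =~ /\A(?:(?=\pL)\X)+\z/ ? 1 : 0 }

# Returns (graphemes, characters, bytes when encoded as UTF-8).
sub gcb_counts {
    my ($s) = @_;
    my $graphemes = () = $s =~ /\X/g;   # count matches in list context
    return ($graphemes, length($s), length(encode('UTF-8', $s)));
}

# Three samples from the table: precomposed C-cedilla, C plus a
# combining cedilla, and a phrase wrapped in punctuation.
for my $s ("\x{c7}", "C\x{327}", "\x{bf}Co\x{301}mo?") {
    printf "G=%d C=%d B=%d letterish=%d\n", gcb_counts($s), is_letterish($s);
}
```

The lookahead matters: a bare `\X+` would accept punctuation, while `\pL+` alone would reject a letter followed by combining marks, since those marks are not themselves letters.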
p5pRT commented 16 years ago

p5p@spam.wizbit.be - Status changed from 'open' to 'resolved'

p5pRT commented 16 years ago

From @nwc10

On Mon, Aug 11, 2008 at 11:32:59PM -0600, Tom Christiansen wrote:

So at the initial 8.0 release of perl5, we went through a time when things were um, a little too quick to jump at your envariables, and many a train-wreck ensued. You *probably* don't want that, any more

IIRC, honouring the user's environment variables was exactly what either the Unicode folks, or the Linux folks, recommended. Of course, around this time various Linux distributions also changed their default install to *be* UTF-8 locales (with UTF-8 environment variables) without actually giving very much (if any) warning to the users/system installers that this was what was happening.

So we did what was recommended, and then the messenger got blamed. (Partly - it did also reveal that the core still had quite a lot of UTF-8 related bugs, and that the model adopted was inconsistent)

Nicholas Clark

p5pRT commented 16 years ago

From mark@blackmans.org

On 12 Aug 2008, at 06:32, Tom Christiansen wrote:

/^[\p{Letter}]+$/ doesn't match a cedilla C (utf8)

perl -ne 'chomp; if (/^[\p{Letter}]+$/) { print "letter-->", $_,"\n"; }'

and with a utf8 terminal, enter a cedilla C. U+0037

if you drop the final end-of-string match, the match succeeds with a [single] cedilla C.

That's a mildly odd pattern. I wonder why you're embracketing a single property like that? These aren't icky POSIX character classes; you don't need the brackets.

The pattern chosen was for maximum clarity of intent; certainly \pL was preferable.

It all depends on how you enter things. The precomposed Unicode character LATIN CAPITAL LETTER C WITH CEDILLA is at code point 0xC7, while the LATIN SMALL LETTER C WITH CEDILLA is at 0xE7. I guess if you flip your 3 the other way it looks like an E, but ....

I just misremembered the exact code point despite having looked it up a couple of minutes before.

The problem is surely how you're entering data (code point such and such says nothing about the physical representation of the logical number), something you did *not* specify, which makes it harder to know for sure. But your envariables don't look culpable.

Cut-n-paste of the cedilla C from a UTF-8 webpage to a MacOS X UTF-8 terminal. I was under the erroneous impression that under these circumstances, perl 5.8.8 would use character rather than byte semantics.

*DO* please enjoy! I sure know *I* did. You may lay the blame to this "artistic" coding whim on my recent (ahem) reading material, which either you already know (of)--or else, surely don't care to. (:->

:) thanks.


p5pRT commented 16 years ago

From tchrist@perl.com

On Mon, Aug 11, 2008 at 11:32:59PM -0600, Tom Christiansen wrote:

So at the initial 8.0 release of perl5, we went through a time when things were um, a little too quick to jump at your envariables, and many a train-wreck ensued. You *probably* don't want that, any more

IIRC, honouring the user's environment variables was exactly what either the Unicode folks, or the Linux folks, recommended. Of course, around this time various Linux distributions also changed their default install to *be* UTF-8 locales (with UTF-8 environment variables) without actually giving very much (if any) warning to the users/system installers that this was what was happening.

You're quite right, Nick. For some reason, I remember the Linux filesystem woes as being more acutely painful than the CPAN build woes. People who'd put 8bit data in their filenames were suddenly accosted by reams of warnings about these malformed UTF-8 sequences in their now-untypable/unreachable files.

So we did what was recommended, and then the messenger got blamed.

Please accept my sincere and complete apologies if I in any fashion gave you the impression that I was holding you (or Jarkko, or anybody) at fault here. I very expressly *am* *not*.

(Partly - it did also reveal that the core still had quite a lot of UTF-8 related bugs, and that the model adopted was inconsistent)

Indeed. And that was something we needed to understand. It was new territory, a problem-space being explored and experimented with. I don't see ((m)any?) other languages with as good a Unicode-vs-Legacy story as Perl has, whatever folks here may say.

I always recommend something later than 8.0 for Unicode work. Even 8.1 is quite a bit different -- and better. I'm also glad 10.0 has the v5 UCD.

But I have to tell you something. I don't know whether you have ideas what to do about it, as we anglophones are a bit deaf to the topic in general, but I figure you should have a better chance than I would.

The problem is that I *JUST* *CAN* *NOT* get (recalcitrant, provincial, blahetc) Americans to give a rat sass about Unicode. They cannot see how what they think of as "foreign languages" matter in their daily work. It's wrongheaded, of course, even to think of it in that light, but they still do.

At Boston USENIX this summer, the reviews had several Americans writing that I should "delete the Unicode section; don't waste time telling us about Unicode; it doesn't affect us or our jobs" while the Europeans and Asians in the class gave even more reviews saying "Unicode section too short; tell us more."

Besides illustrating how hard it is to please all the people all the time, the starkly opposite write-ups in the US vs non-US students really left me scratching my head. Somehow I'm not doing a good job at "selling" Americans on it, and I'm not sure why. I must have a mental blind-spot, because to me it's really important. I talk about mixing Latin, Greek, and Cyrillic, and they yawn. I cite IPA, and the nonlinguists just stare at me in total incomprehension (unlike "foreign" dictionaries, most if not all American dictionaries' pronunciation sections use their own in-house home-grown crock of total silliness instead of IPA). I talk about various symbols and such, and they say, "Oh, Word has a Dingbats font for that."

It's very frustrating.

The class this week, which is at a national lab doing serious and important climate-change analysis, is filled with programmers aged anywhere from 20 to 50. They're not dull people, nor slow. They mostly crunch data, or help the PhD scientists who do. But not a single one of them ever knowingly touches any textual data that's more than even 7-bit, let alone 8-.

They've *heard* of Unicode. But they don't see the need for knowing about it, and their instincts are wrong. Here is a literal exchange from yesterday's class, after which I abandoned talking about Unicode.

  Me: If length(S) tells how many characters are in S, and
  chr(N) gives back the *single* *character* at code point N,
  then what does length(chr(400)) give you?

  Student: 3.

Sigh. I even told them the answer in my phrasing of the question, but they guessed wrong (no matter how you look at it, I think).
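
For the record, the classroom question can be checked directly. This is a minimal sketch assuming the core `Encode` module; the variable names are illustrative. It shows why "3" is wrong on either reading: the string holds one character, and that character encodes to two octets in UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = chr(400);    # the single character at code point 400 (U+0190)

# length() counts characters, not digits of the code point and not bytes.
my $characters = length($char);

# To count octets, encode first and measure the encoded byte string.
my $utf8_bytes = length(encode('UTF-8', $char));

print "characters:  $characters\n";
print "UTF-8 bytes: $utf8_bytes\n";
```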

Being able to write

  D1 = 2 * pi * r**1
  D2 = pi * r**2
  D3 = 4/3 * pi * r**3

with a real pi didn't seem to motivate them much. "We can do that in Word or HTML." I know, I should chase the HTML lead.

I wager I could more easily interest them in the Middle-English poem, Sir Gawain and the Green Knight (but hey, eth and thorn are in ISO-8859-1, that is, "Latin-1"), or even Vergil's poem famously starting Arma Virumque, than I could in Unicode (and hey look, that's still in Latin :-).

At least these aren't guys who talk about the "El 9-0 Effect" like most Americans. We've a town in southern Colorado, Cañon City, that refuses to strip the tilde or respell it to Canyon City, and the road to the new airport is Peña Blvd after the transportation secretary of the time.

But few publications spell either correctly, especially national ones (_The New York Times_ and _The Economist_ excepted).

Then I ask them which comes first: color or chocolate. They say, "why chocolate of course." I say, well.... sometimes, in some countries, during certain years.

Because my Spanish is halfway ok and Colorado still has huge haciendas with original land grants from the King of Spain, I try to use that for an example, but nobody can tell me how to correctly order pino, pintura, and piñata (answer: as written) let alone the longer (and already well-ordered) list:

  radio ráfaga rana ranúnculo raña rápido rastrillo

And then I let down their spirits by explaining Unicode doesn't even address this issue at all, no more than it explains how to sort phone numbers for people whose surnames may begin Mc- or Mac-. And they think this récherché and irrelevant to their lives in the USA, all the while they're padding their CVs/résumés. :-)

  (FTR, the correct answer is that one disregards all (acute)
  accents and diaereses (as in pingüine for penguin), but
  places the letter "ñ" *after* the letter "n" and *before*
  the letter "o"; before 1997, place "ch" as a single-character
  digraph between "c" and "d", but afterwards, don't.)
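
Ordering of this kind is not determined by code points; it comes from collation rules applied on top of them. As a hedged illustration only: Perl ships the `Unicode::Collate` and `Unicode::Collate::Locale` modules (in core in later releases than the 5.8.8 discussed here), and the sketch below assumes they are installed and that the `es` tailoring behaves as described in their documentation.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

# The three words from the message, deliberately out of order.
# \N{U+00F1} is LATIN SMALL LETTER N WITH TILDE.
my @words = ("pi\N{U+00F1}ata", "pintura", "pino");

# Untailored collation treats the tilde as a secondary difference on "n".
my @default = Unicode::Collate->new->sort(@words);

# The Spanish tailoring treats n-tilde as its own letter after "n".
my @spanish = Unicode::Collate::Locale->new(locale => 'es')->sort(@words);

binmode(STDOUT, ':encoding(UTF-8)');
print "default: @default\n";
print "Spanish: @spanish\n";
```

The same module documents an `es__traditional` locale for the older convention in which "ch" sorts as a unit between "c" and "d".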

Probably if I had some good English examples, and preferably ones of more recent vintage than:

  When that Averylle with his shoures soote
  The droughte of March / hath perced to the roote
  ...
  To ferne halwes / kouthe in sondry londes
  And specially / from euery shyres ende
  Of Engelond / to Caunterbury they wende
  The holy blisful martir / for to seke
  That hem hath holpen whan Þat they weere seeke.

then maybe it might holpen--er, help. :-)

--tom
--

  +------------------------------+
  | Græcum est: non potest legi! |
  +------------------------------+