Perl / perl5


started out as doc clarification needed in 'eval...but... #13160

Closed p5pRT closed 11 years ago

p5pRT commented 11 years ago

Migrated from rt.perl.org#119239 (status was 'resolved')

Searchable as RT119239$

p5pRT commented 11 years ago

From perl-diddler@tlinx.org

Created by perl-diddler@tlinx.org

In my copy of the perlfunc man page, under the eval keyword, the 3rd paragraph, 2nd sentence says:

  In the absence of the "unicode_eval" feature, the string will sometimes be treated as characters and sometimes as bytes, depending on the internal encoding, and source filters activated within the "eval" exhibit the erratic, but historical, behaviour of affecting some outer file scope that is still compiling.

--- The above sentence is about as clear as something a doctoral computer scientist might write.

1) Doesn't it, at all, depend on the context of where it is called? I.e. if "use utf8" is in effect, and I say:

----
  #!/usr/bin/perl
  #use utf8;  #doesn't seem to be necessary for utf8 in source
              # and nothing needs to be done for utf8 on output?
  #
  my $string="“犬夜叉”";
  our $value=int rand 2;
  our $newvalue;
  our $newvalue2;
  use P;
  eval q($newvalue="_$string_, val=$value";);
  @_ and die @_;
  P "value=%s, newvalue=%s, string=%s", $value, $newvalue, $string;
  eval qq($newvalue2="_$string\_, val=$value");
  @_ and die @_;
  P "value=%s, newvalue2=%s, string=%s", $value, $newvalue2, $string;
------------------

ok... so which of my file scope was it supposed to affect and why doesn't the print have any value for $newvalue or $newvalue2?

doesn't eval use the context from which it is called or is that only the block eval?

So what does something like​:

-----
  my $d=int rand 2;
  eval q(use Dbg($d,$d,$d));
  # (vs.)
  eval qq(use Dbg($d,$d,$d));
---
do (or what should it do)?

Does it "do the eval"* at run time and call "use Dbg(...)" at run time, *(including, or not, based on previous inclusion, but calling 'import', if present, regardless)?

If it did them at run time, shouldn't one of the "newvalue{,}" have a value?

-----

1.B) The doc paragraph is unclear, and that is the main reason I filed this. In writing test cases (see what it caused!?), I came across:

1.B.1): Question of why it is outputting UTF8 without some wide char warning? Does that mean it reads UTF8 as well?

hmmmm....would that mean the prog could write something it couldn't read?

1.B.2) Does a dynamic eval only work in its context when it is not a run-time evaluation (as required by a string), but is a {block eval}?

So if I have:

  {my,our} $d=int rand 2;
  eval {use Dbg($d,$d,$d)};

--- wouldn't the block form make the "use Dbg" be eval'ed at compile time, with "$d" being undefined?

($d represents user input in some command line option).

----

Suppose (1) is the main question, 1.B is very curious, and I'm wondering if 1.B (1+2) is perl version dependent?...

Perl Info
```
Flags:
    category=docs
    severity=medium

This perlbug was built using Perl 5.16.2 - Fri Feb 15 01:17:37 UTC 2013
It is being executed now by Perl 5.16.2 - Fri Feb 15 01:12:05 UTC 2013.

Site configuration information for perl 5.16.2:

Configured by abuild at Fri Feb 15 01:12:05 UTC 2013.

Summary of my perl5 (revision 5 version 16 subversion 2) configuration:

  Platform:
    osname=linux, osvers=3.4.6-2.10-default, archname=x86_64-linux-thread-multi
    uname='linux build34 3.4.6-2.10-default #1 smp thu jul 26 09:36:26 utc 2012 (641c197) x86_64 x86_64 x86_64 gnulinux '
    config_args='-ds -e -Dprefix=/usr -Dvendorprefix=/usr -Dinstallusrbinperl -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Dd_dbm_open -Duseshrplib=true -Doptimize=-fmessage-length=0 -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -g -Wall -pipe -Accflags=-DPERL_USE_SAFE_PUTENV -Dotherlibdirs=/usr/lib/perl5/site_perl'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DPERL_USE_SAFE_PUTENV -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-fmessage-length=0 -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -g -Wall -pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DPERL_USE_SAFE_PUTENV -fno-strict-aliasing -pipe -fstack-protector'
    ccversion='', gccversion='4.7.2 20130108 [gcc-4_7-branch revision 195012]', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib64 -fstack-protector'
    libpth=/lib64 /usr/lib64 /usr/local/lib64
    libs=-lm -ldl -lcrypt -lpthread
    perllibs=-lm -ldl -lcrypt -lpthread
    libc=/lib64/libc-2.17.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.17'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.16.2/x86_64-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib64 -fstack-protector'

Locally applied patches:

@INC for perl 5.16.2:
    /usr/lib/perl5/site_perl/5.16.2/x86_64-linux-thread-multi
    /usr/lib/perl5/site_perl/5.16.2
    /usr/lib/perl5/vendor_perl/5.16.2/x86_64-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.16.2
    /usr/lib/perl5/5.16.2/x86_64-linux-thread-multi
    /usr/lib/perl5/5.16.2
    /usr/lib/perl5/site_perl/5.16.2/x86_64-linux-thread-multi
    /usr/lib/perl5/site_perl/5.16.2
    /usr/lib/perl5/site_perl
    .

Environment for perl 5.16.2:
    HOME=/home/law
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LC_COLLATE=C
    LC_CTYPE=en_US.UTF-8
    LD_LIBRARY_PATH=/usr/lib64/mpi/gcc/openmpi/lib64
    LOGDIR (unset)
    PATH=/home/law/bin/lib:/sbin:/usr/local/sbin:/usr/lib64/mpi/gcc/openmpi/bin:/home/law/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:.:/usr/lib/qt3/bin:/opt/dell/srvadmin/bin:/usr/sbin:/etc/local/func_lib:/home/law/lib
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 11 years ago

From @rjbs

* Linda Walsh <perlbug-followup@perl.org> [2013-08-12T01:25:21]

> 1) Doesn't it, at all, depend on the context of where it is called? I.e. if "use utf8", is in effect, and I say:

All that utf8.pm does is indicate that your source code is encoded in UTF-8, so that if your source document has this:

  use utf8;
  my $band = "Queensrÿche";

...the length will be 11, not 12, because the "ÿ" will be one codepoint (the ÿ character encoded in the file) rather than two (the UTF-8-encoded bytes stored in the file).
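
The 11-versus-12 difference can be reproduced without depending on how a source file is saved. This is only a sketch: the octets are spelled out with escapes, and `Encode::decode` stands in for what `use utf8` does to string literals in the source. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 octets a file would hold for the band name
# (y-diaeresis, U+00FF, is stored as the two bytes C3 BF).
my $octets = "Queensr\xC3\xBFche";

# Without decoding, Perl sees one string element per byte.
my $byte_len = length $octets;

# Decoding is roughly what "use utf8" does for literals in the source.
my $text     = decode('UTF-8', $octets);
my $char_len = length $text;

printf "octets=%d characters=%d eighth=U+%04X\n",
    $byte_len, $char_len, ord substr($text, 7, 1);
# prints "octets=12 characters=11 eighth=U+00FF"
```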

Perl's design is that, as much as possible, performing texty operations on strings treats the strings as strings of Unicode codepoints. The utf8 pragma is not meant to do anything but tell perl(1) to decode the input document.

Some functions in perl have, historically, behaved based on weird guesses or bad heuristics as to context. This is "The Unicode Bug." It was fixed for many operations by "use feature 'unicode_strings'", telling even more of the language "seriously, perl, it's all text."

"eval" had The Unicode Bug, which is now fixed in the scope of "use feature 'unicode_eval'". So, in fact, the behavior of "eval" *does* depend on the context of where it is called... but the thing that matters is the unicode_eval feature rather than the utf8 pragma.

Meanwhile, for those cases where one has read a bytestream and wants perl to evaluate it as if it was reading those bytes from a file, evalbytes was added.
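
A small sketch of the two features side by side, assuming perl 5.16 or later. The variable names are invented for illustration, and the last case (a `use utf8` inside the `evalbytes` text) reflects my reading of perlfunc rather than anything stated in this ticket, so no result is claimed for it.

```perl
use strict;
use warnings;
use feature qw(unicode_eval evalbytes);   # both features need perl 5.16+

# A character string: the literal inside it holds the single code point U+00FF.
my $char_code = qq{length("Queensr\x{FF}che")};
my $n_chars   = eval $char_code;          # 11
die $@ if $@;

# A byte string: the same literal as it would sit in a UTF-8 encoded file.
my $byte_code = qq{length("Queensr\xC3\xBFche")};
my $n_bytes   = evalbytes $byte_code;     # 12, one element per byte
die $@ if $@;

# evalbytes behaves like reading a file, so a "use utf8" inside the text
# should make the parser decode the literal (assumption based on perlfunc).
my $n_decoded = evalbytes "use utf8; $byte_code";
die $@ if $@;

print "chars=$n_chars bytes=$n_bytes decoded=$n_decoded\n";
```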

I hope this has clarified things. The perl string model can be a big pain, but we are fairly stuck with it at the moment.

> #use utf8; #doesn't seem to be necessary for utf8 in source # and nothing needs to be done for utf8 on output?

I'm not sure I understand the above comment, but I hope that the question is answered by my text, above. "use utf8" will not affect output. Actually, here's an example of how it will, in some ways:

  ~$ perl -E 'use warnings; my $band = "Queensrÿche"; say $band'
  Queensrÿche
  ~$ perl -E 'use warnings; use utf8; my $band = "Queensrÿche"; say $band'
  Queensr?che

In the first case, we have that 12-element string which contains the raw UTF-8. If we tried checking it for /\xFF/ it would fail, since it doesn't have that character. Similarly, /\p{Latin_1_Supplement}/ would fail. On the other hand, it prints back out correctly because my terminal is also UTF-8.

In the second one, /\xFF/ would match (huzzah! and also \p{Latin_1_Supplement}), but the output is screwed up because it emits octet 0xFF. Oops!

If we want to get "worse," we can just pick a worse band!

  ~$ perl -E 'use warnings; use utf8; my $band = "Spın̈al Tap"; say $band'
  Wide character in say at -e line 1.
  Spın̈al Tap

Now we have a string of Unicode codepoints. There are 11 (rather than the 13 octets in the input to -e), including 9 ASCII characters, the dotless i, and the combining diaeresis. The two non-ASCII characters are above 0xFF, so when they're printed, perl can't just emit the byte with the value of the character. It punts, emitting the in-memory representation, which happily is UTF-8, so it *seems* like the program did the right thing. To remind us that we got lucky (because we prefer English metal to West Coast USA metal), it emits a warning: "I just printed a character bigger than 0xFF so you probably forgot to encode."

So, output behaves the same way under "use utf8" other than the fact that your output-producing code is getting different inputs!

> my $string="“犬夜叉”";
> our $value=int rand 2;
> our $newvalue;
> our $newvalue2;
> use P;
> eval q($newvalue="_$string_, val=$value";);
> @_ and die @_;
> P "value=%s, newvalue=%s, string=%s", $value, $newvalue, $string;
> eval qq($newvalue2="_$string\_, val=$value");
> @_ and die @_;
> P "value=%s, newvalue2=%s, string=%s", $value, $newvalue2, $string;

I don't have P installed, and I wasn't sure it would be worth the trouble.

I'm hoping that explanations above have rendered this section "answered."

-- rjbs

p5pRT commented 11 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 11 years ago

From @rjbs

* Ricardo Signes <perl.p5p@rjbs.manxome.org> [2013-08-13T22:10:15]

> I'm hoping that explanations above have rendered this section "answered."

I'd like to close this ticket unless there are further questions.

-- rjbs

p5pRT commented 11 years ago

From perl-diddler@tlinx.org

On Tue Aug 13 19:11:01 2013, perl.p5p@rjbs.manxome.org wrote:

> * Linda Walsh <perlbug-followup@perl.org> [2013-08-12T01:25:21]
>
> > 1) Doesn't it, at all, depend on the context of where it is called? I.e. if "use utf8", is in effect, and I say:
>
> All that utf8.pm does is indicate that your source code is encoded in UTF-8, so that if your source document has this:
>
>   use utf8;
>   my $band = "Queensrÿche";


First, I want to be sure to start with saying the depth of your answer was suburb -- didn't feel it slighted the [rambling] question in the slightest. Certainly even though I now reside on the west coast, I can't disagree with the lilting UTF-8 examples you used.

Second, it's hard for me to get into a head space to write a response worthy as a successor to such while also addressing the issues that still plagued my overfocused ADHD (a specific subtype of ADHD that is the most difficult to treat, as the meds that normally treat ADHD can make the overfocused part worse), so realtime responses are difficult for me at times.

Points) A) - "(to myself as much as anyone)" - use utf8 only applies to source code, not content strings, so my wonderings why the Japanese INU(dog) YA(night) SHA(dividing point) came out "ok" were 1) due to it being in a content string, and (among other things) none of the chars being in the range 0x80-0xff.

B) This part is still unclear (and is not really unicode related), but it relates to the scoping rules referred to obliquely as "the erratic, but historical, behaviour of affecting ''some'' [emphasis mine] outer file scope that is still compiling."

The scoping rules of the eval were what had me curious, and as to why an 'our' scoped value wasn't seen inside either of the string evals that would theoretically be inside the package that contains those variables.

If I used a non-string (not really interpreted) eval, it seems to reference the package vars... but why do the string evals not reference the surrounding package vars? (or vars declared w/my in which the eval is scoped)...

The scoping rules of the eval are unclear from the text, though reading it -- maybe the specific piece I quoted was only talking about scoping as related to UTF8?

Still have the question of why the string evals didn't see their package scoped vars...


Hope this more explains the central issue, though I think any side UTF8 issues are likely, fully, if not entirely lyrically, explained...
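
For what it's worth, the scoping question can be probed with a minimal sketch like the one below. It assumes nothing beyond core Perl; the variable names are invented for the test, and it is not the original script (which needs the P module). It suggests that a string eval does see both lexical and package variables of the enclosing scope, that `qq()` interpolates before the eval ever runs, and that compile errors are reported in `$@` rather than `@_`.

```perl
use strict;
use warnings;

our $value  = 42;        # package variable
my  $string = "inu";     # lexical variable
our ($seen_q, $seen_qq);

# q{}: nothing is interpolated now. The text is compiled at run time and can
# see $string and $value from here. The braces in ${string} matter: a bare
# "$string_" would look up a different variable named $string_.
eval q{ $seen_q = "_${string}_, val=$value"; 1 } or die "eval failed: $@";

# qq{}: variables are interpolated *before* eval sees the text, so the
# assignment target has to be protected with a backslash.
eval qq{ \$seen_qq = "_${string}_, val=$value"; 1 } or die "eval failed: $@";

# A syntax error inside a string eval lands in $@ and the eval returns undef.
my $ok  = eval q{ my $x = ; 1 };
my $err = $ok ? '' : $@;

print "q:  $seen_q\n";
print "qq: $seen_qq\n";
print "error caught: ", ($err ne '' ? "yes" : "no"), "\n";
```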

p5pRT commented 11 years ago

From @ikegami

On Thu, Aug 15, 2013 at 3:43 PM, Linda Walsh via RT <perlbug-followup@perl.org> wrote:

> Points) A) - "(to myself as much as anyone)" - use utf8 only applies to source code not content strings, so my wonderings why the Japanese INU(dog) YA(night) SHA(dividing point) came out "ok"

  $ perl -Mutf8 -E'say "\x{72AC}\x{591C}\x{53C9}"'
  Wide character in say at -e line 1.
  犬夜叉

File handles take bytes unless you arrange for an encoding layer to provide those bytes. That error message is a polite way of saying "I was able to detect that you gave me garbage, and I guessed at what you wanted".

Now let's look at the following:

  $ perl -Mutf8 -E'say "\x51\x75\x65\x65\x6E\x73\x72\xFF\x63\x68\x65"'
  Queensr?che

Perl issued no error message because you provided bytes (values in 0..255) as expected. Perl has no way to know you meant to output UTF-8.

- Eric

p5pRT commented 11 years ago

From @ikegami

On Thu, Aug 15, 2013 at 4:32 PM, Eric Brine <ikegami@adaelis.com> wrote:

> On Thu, Aug 15, 2013 at 3:43 PM, Linda Walsh via RT <perlbug-followup@perl.org> wrote:
>
> > Points) A) - "(to myself as much as anyone)" - use utf8 only applies to source code not content strings, so my wonderings why the Japanese INU(dog) YA(night) SHA(dividing point) came out "ok"
>
>   $ perl -Mutf8 -E'say "\x{72AC}\x{591C}\x{53C9}"'
>   Wide character in say at -e line 1.
>   犬夜叉
>
> File handles take bytes unless you arrange for an encoding layer to provide those bytes. That error message is a polite way of saying "I was able to detect that you gave me garbage, and I guessed at what you wanted".
>
> Now let's look at the following:
>
>   $ perl -Mutf8 -E'say "\x51\x75\x65\x65\x6E\x73\x72\xFF\x63\x68\x65"'
>   Queensr?che
>
> Perl issued no error message because you provided bytes (values in 0..255) as expected. Perl has no way to know you meant to output UTF-8.

The logic is simply equivalent to:

  if ($str =~ /[^\x00-\xFF]/) {
    # ERROR
    warn("Wide character in say");
    utf8::encode($str); # This might be what they meant
  }
  _write($str);

p5pRT commented 11 years ago

From perl-diddler@tlinx.org

On Thu Aug 15 13:32:51 2013, ikegami@adaelis.com wrote:

> On Thu, Aug 15, 2013 at 3:43 PM, Linda Walsh via RT <perlbug-followup@perl.org> wrote:
>
> > Points) A) - "(to myself as much as anyone)" - use utf8 only applies to source code not content strings, so my wonderings why the Japanese INU(dog) YA(night) SHA(dividing point) came out "ok"
>
>   $ perl -Mutf8 -E'say "\x{72AC}\x{591C}\x{53C9}"'
>   Wide character in say at -e line 1.
>   犬夜叉


Urg...um... so the original example that I had, which printed $string="“犬夜叉”";, didn't flag those as wide characters, nor did it flag it as an error for not having "use utf8;" in the source yet having UTF-8 in the source, which was interpreted as a byte string and output as a byte string, thus no warning from perl.

So if I used hex it would fail, but I don't need the

  "use utf8;"

that Ricardo included, which in this example is a red herring? (That's a bit confusing)....

But:

  perl -e 'use P;

  my $name= [qw (犬夜叉)];
  my $band={band => "Queensrÿche"};

  P "string=%s", $name;
  P "band=%s", $band;

  { use feature "say";
    say "string=%s", $name;
    say "band=%s", $band;
  }
  '
  string=["犬夜叉"]
  band={band=>"Queensrÿche"}
  string=%sARRAY(0x1d8fa68)
  band=%sHASH(0x1daf018)


Hmmm... no utf8 warnings on any of those. But if perl had taken it as a byte-string, the

  = e7 8a ac e5 a4 9c e5 8f 89 <<--- why wouldn't those have been taken as latin1 (as the source wasn't listed as utf8), and been "encoded", internally, to their UTF-8 encodings?
I.e. "E7" = 0xc3 0xa7, "8A" = 0xc2 0x8a;


But if I try to use 'ÿ' in an identifier, as in changing 'band' to bandÿ, I get:

  Unrecognized character \xC3; marked by <-- HERE after my $band<-- HERE near column 9 at -e line 4.

Isn't that inconsistent? The program source seems to be taken as UTF8 encoded if it is in a string, but not ...

What do I get if I read from <package::DATA>? Would the UTF8 encoded strings be read as UTF8 ... I'm guessing not?


Note -- original unclarity regarding scope of what is affected by eval "stuff here" or 'stuff here', still exists...

Though the utf8 stuff doesn't seem entirely consistent now that you mention it..

p5pRT commented 11 years ago

From @ikegami

On Thu, Aug 15, 2013 at 9:38 PM, Linda Walsh via RT <perlbug-followup@perl.org> wrote:

> On Thu Aug 15 13:32:51 2013, ikegami@adaelis.com wrote:
>
> > On Thu, Aug 15, 2013 at 3:43 PM, Linda Walsh via RT <perlbug-followup@perl.org> wrote:
> >
> > > Points) A) - "(to myself as much as anyone)" - use utf8 only applies to source code not content strings, so my wonderings why the Japanese INU(dog) YA(night) SHA(dividing point) came out "ok"
> >
> >   $ perl -Mutf8 -E'say "\x{72AC}\x{591C}\x{53C9}"'
> >   Wide character in say at -e line 1.
> >   犬夜叉
>
> ---- Urg...um... so the original example that I had that printed " $string="“犬夜叉”";

You didn't use C<< use utf8; >>, which means your code couldn't possibly have contained

  $string="“犬夜叉”"; # "\x{201C}\x{72AC}\x{591C}\x{53C9}\x{201D}"

It actually contains

  $string="â��ç�¬å¤�å��â��"; # "\xE2\x80\x9C\xE7\x8A\xAC\xE5\xA4\x9C\xE5\x8F\x89\xE2\x80\x9D"

You might have saved the program in UTF-8, but you told Perl it was iso-8859-1 (by not using C<< use utf8; >>).

p5pRT commented 11 years ago

From perl-diddler@tlinx.org

On Thu Aug 15 23:24:06 2013, ikegami@adaelis.com wrote:

> On Thu, Aug 15, 2013 at 9:38 PM, Linda Walsh via RT <perlbug-followup@perl.org> wrote:
>
> > On Thu Aug 15 13:32:51 2013, ikegami@adaelis.com wrote:
> >
> > > On Thu, Aug 15, 2013 at 3:43 PM, Linda Walsh via RT <perlbug-followup@perl.org> wrote:
> > >
> > > > Points) A) - "(to myself as much as anyone)" - use utf8 only applies to source code not content strings, so my wonderings why the Japanese INU(dog) YA(night) SHA(dividing point) came out "ok"
> > >
> > >   $ perl -Mutf8 -E'say "\x{72AC}\x{591C}\x{53C9}"'
> > >   Wide character in say at -e line 1.
> > >   犬夜叉
> >
> > ---- Urg...um... so the original example that I had that printed " $string="“犬夜叉”";
>
> You didn't use C<< use utf8; >>, which means your code couldn't possibly have contained
>
>   $string="“犬夜叉”"; # "\x{201C}\x{72AC}\x{591C}\x{53C9}\x{201D}"
>
> It actually contains
>
>   $string="�����"; # "\xE2\x80\x9C\xE7\x8A\xAC\xE5\xA4\x9C\xE5\x8F\x89\xE2\x80\x9D"
>
> You might have saved the program in UTF-8, but you told Perl it was iso-8859-1 (by not using C<< use utf8; >>).


You didn't read the rest of the note...

If that was true:

> Hmmm... no utf8 warnings on any of those. But if perl had taken it as a byte-string, the
>
>   = e7 8a ac e5 a4 9c e5 8f 89 <<--- why wouldn't those have been taken as latin1 (as the source wasn't listed as utf8), and been "encoded", internally, to their UTF-8 encodings?
> I.e. "E7" = 0xc3 0xa7, "8A" = 0xc2 0x8a;


If perl had taken that input as latin1, then why wouldn't I have seen the wide character warning on output?

OTOH, if I add "use utf8" to this program:

  #!/usr/bin/perl
  use 5.6.16;
  use utf8;
  #use P;
  use warnings;
  my $name= [qw (犬夜叉)];
  my $band={band => "Queensrÿche"};
  printf "string=%s, len=%s\n", $name->[0], length($name->[0]);
  printf "band=%s\n", $band->{band};


I get corrupted output:

  /tmp/s.pl
  Wide character in printf at /tmp/s.pl line 8.
  string=犬夜叉, len=3
  band=Queensr

  Ishtar:law/bin/lib> more s.pl


Isn't this sort of the opposite of what one would expect?

I guess I should file this under another bug, as this isn't really the doc bug about eval scoping...

p5pRT commented 11 years ago

From @rjbs

* Linda Walsh via RT <perlbug-followup@perl.org> [2013-08-16T03:13:14]

> If perl had taken that input as latin1, then why wouldn't I have seen the wide character warning on output?

Because Latin-1 has no wide characters.

Really, I don't like to bring up Latin-1. It just confuses things.

These days, Perl is pretty good about acting like it only knows about code points. (Let's put aside what is stored in the octets used internally for the scalar in memory.) A string is a sequence of codepoints, which are just non-negative integers.

"use utf8" says "while you read this source code in, decode it as UTF-8 and use THOSE codepoints for everything, rather than the octets encoding it."

When you print, this happens:

  codepoints-in-your-string => fh layers => output destination

One common layer is encoding, which will encode your codepoints into UTF-8 (or whatever) so that the output destination gets only octets, since UTF-8 encoding results in a sequence of 8-bit values.

If you leave out an encoding layer, and your codepoints include things >255, then there will be a warning, because you can't send 0x0100 to a bytestream.
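
The pipeline above can be watched from inside a script. This is a sketch under a few assumptions: in-memory filehandles stand in for STDOUT so the emitted octets can be inspected, and a `__WARN__` handler collects whatever perl warns about. All names are invented for illustration.

```perl
use strict;
use warnings;

my $text = "\x{72AC}\x{591C}\x{53C9}";   # three code points, all above 0xFF

# With an encoding layer: the code points are encoded on the way out.
my $encoded = '';
open my $out, '>:encoding(UTF-8)', \$encoded or die "open: $!";
print {$out} $text;
close $out;

# Without a layer: perl is expected to warn ("Wide character in print")
# and fall back to emitting its internal representation.
my @warnings;
my $raw = '';
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    open my $fh, '>', \$raw or die "open: $!";
    print {$fh} $text;
    close $fh;
}

printf "encoded=%d octets, raw=%d octets, warnings=%d\n",
    length $encoded, length $raw, scalar @warnings;
```

For real output, the same layer can be put on STDOUT with `binmode(STDOUT, ':encoding(UTF-8)')`.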

Consider this program​:

  use 5.18.0;
  {
    my $str = "“犬夜叉”";
    my @codepoints = split '', $str;
    say join q{ }, map {; sprintf 'U+%04X', ord } @codepoints;
    say $str;
    say $str =~ /\p{InCJK}/ ? "InCJK" : "Not InCJK";
  }

  say '-' x 78;

  {
    use utf8;
    my $str = "“犬夜叉”";
    my @codepoints = split '', $str;
    say join q{ }, map {; sprintf 'U+%04X', ord } @codepoints;
    say $str;
    say $str =~ /\p{InCJK}/ ? "InCJK" : "Not InCJK";
  }

In both cases, we're "say"-ing to STDOUT, which has no encoding layer applied.

The first block succeeds at sending the "right" thing to the terminal (assuming the terminal is in UTF-8). The regexp fails, though, because none of the *fifteen* codepoints in the string has the InCJK property, and it is *right* to fail. The string is clearly *binary* data, not a text string... but it's only clear to a human. Perl doesn't, and can't, know. It treats all strings like text when you do texty stuff like matching.

The second block also "succeeds" at sending the right thing, but it's really a guess. perl sees that you're trying to fit U+201C into a byte-wide output stream. It sighs, emits a warning, then sends U+00E2 U+0080 U+009C in its place. The sigh and the warning are because you should have explicitly encoded. Meanwhile, the regexp match in the second block *does* match, because most of those (> 0xFF) codepoints *do* match InCJK.

So, how do you know which string contains raw octets from files or terminal reads (like the string in the first block) versus strings that contain Unicode codepoints (like the string in the second block)?

** The only answer is: strict discipline

You have to keep track of what you've read in, either from the source code, a filehandle, the terminal (which is a filehandle), and so on. Then you need to never forget. The common practice is (or should be) to decode all input immediately upon reading from a bytestream, then to encode it immediately before outputting it to a bytestream.
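
A minimal sketch of that decode-early, encode-late discipline, using the core Encode module. The input octets are hard-coded here as a stand-in for data read from a file or socket, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from a bytestream: the band name with a
# dotless i (C4 B1) and a combining diaeresis (CC 88).
my $input_octets = "Sp\xC4\xB1n\xCC\x88al Tap";

# 1. Decode immediately on input: octets become code points.
my $text = decode('UTF-8', $input_octets);

# 2. Do the "texty" work on code points only.
my $non_ascii = () = $text =~ /[^\x00-\x7F]/g;

# 3. Encode immediately before output: code points become octets again.
my $output_octets = encode('UTF-8', $text);

printf "octets in=%d, code points=%d, non-ASCII=%d, octets out=%d\n",
    length $input_octets, length $text, $non_ascii, length $output_octets;
# prints "octets in=13, code points=11, non-ASCII=2, octets out=13"
```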

> use 5.6.16;

^--- I think you meant something else, but it's irrelevant. :-)

> use utf8;
> #use P;
> use warnings;
> my $name= [qw (犬夜叉)];
> my $band={band => "Queensrÿche"};
> printf "string=%s, len=%s\n", $name->[0], length($name->[0]);
> printf "band=%s\n", $band->{band};
>
> ---- I get corrupted output:
>
>   /tmp/s.pl
>   Wide character in printf at /tmp/s.pl line 8.
>   string=犬夜叉, len=3
>   band=Queensr
>
> ---- Isn't this sort of the opposite of what one would expect?

This is exactly right.

You have failed, like many, to grasp what makes Queensrÿche so great. It isn't Geoff Tate or their cool logo. It's that the Unicode codepoint for ÿ is U+00FF. That means that it fits into a byte, so when you try to print it out, Perl doesn't realize it needs to switch to emitting UTF-8.

This affects any codepoint between 0x80 and 0xFF inclusive, because they're too big to be in the part where codepoints UTF-8-encode to their own value in one octet, but not big enough to alert perl that the codepoint needs to be emitted as its (happily very-very-close-to-UTF-8) internal representation.
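
The silent case can be made visible by dumping the octets that actually get written. In this sketch the in-memory filehandles are a stand-in for a terminal, and the handle and variable names are invented.

```perl
use strict;
use warnings;

my $band = "Queensr\x{FF}che";   # 11 code points, the eighth is U+00FF

# No layer: U+00FF fits in one byte, so perl writes byte FF and says nothing.
my $raw = '';
open my $plain, '>', \$raw or die "open: $!";
print {$plain} $band;
close $plain;

# UTF-8 layer: the same code point goes out as the two octets C3 BF.
my $encoded = '';
open my $layered, '>:encoding(UTF-8)', \$encoded or die "open: $!";
print {$layered} $band;
close $layered;

printf "raw:     %s\n", unpack 'H*', $raw;       # 5175 6565 6e73 72 ff 636865, unspaced
printf "encoded: %s\n", unpack 'H*', $encoded;   # same, with c3bf in place of ff
```

A UTF-8 terminal receiving the lone FF byte has nothing valid to draw, which is where the "?" in the earlier output comes from.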

If you're curious as to what all these codepoints are:

  perl -Mcharnames -E 'printf "U+%04X: %s\n", $_, charnames::viacode($_) for (0x80 .. 0xFF)'

This program suggests that Mötley Crüe will also trigger The Heavy Metal Unicode Problem.

So, in conclusion: the solution to the heavy metal problem plaguing Perl programs is strict discipline. I guess Pastor Mangielo was right after all.

-- rjbs

p5pRT commented 11 years ago

From @mauke

On 16.08.2013 16:53, Ricardo Signes wrote:

> Consider this program:
>
>   use 5.18.0;
>   {
>     my $str = "“犬夜叉”";
>     my @codepoints = split '', $str;
>     say join q{ }, map {; sprintf 'U+%04X', ord } @codepoints;

This could be replaced by

  say sprintf "%*v04X", " ", $str;

(well, almost: the U+ bits are missing).

If you're debugging a problem and you just want to see what Perl thinks your codepoints are,

  printf "%vd\n", $str; # or printf "%vx\n", $str

can be extremely useful.

>     say $str;
>     say $str =~ /\p{InCJK}/ ? "InCJK" : "Not InCJK";
>   }
>
>   say '-' x 78;
>
>   {
>     use utf8;
>     my $str = "“犬夜叉”";
>     my @codepoints = split '', $str;
>     say join q{ }, map {; sprintf 'U+%04X', ord } @codepoints;
>     say $str;
>     say $str =~ /\p{InCJK}/ ? "InCJK" : "Not InCJK";
>   }

-- Lukas Mai <plokinom@gmail.com>

p5pRT commented 11 years ago

From @Smylers

Linda Walsh via RT writes:

> On Tue Aug 13 19:11:01 2013, perl.p5p@rjbs.manxome.org wrote:
>
> > All that utf8.pm does is indicate that your source code is encoded in UTF-8 ...
>
> First, I want to be sure to start with saying the depth of your answer was suburb

As in, “a little way out of town, but not completely rural”?

Smylers -- Stop drug companies hiding negative research results. Sign the AllTrials petition to get all clinical research results published. Read more: http://www.alltrials.net/blog/the-alltrials-campaign/

p5pRT commented 11 years ago

From @rjbs

* Ricardo Signes <perl.p5p@rjbs.manxome.org> [2013-08-16T10:53:58]

> So, in conclusion: the solution to the heavy metal problem plaguing Perl programs is strict discipline. I guess Pastor Mangielo was right after all.

If there are no further issues, I will close this ticket in a few days.

-- rjbs

p5pRT commented 11 years ago

@rjbs - Status changed from 'open' to 'resolved'