Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

utf8 problems (still, ) #12354

Closed p5pRT closed 12 years ago

p5pRT commented 12 years ago

Migrated from rt.perl.org#114602 (status was 'resolved')

Searchable as RT114602$

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

Created by perl-diddler@tlinx.org

I have PERL5OPT=-CSA in my env.

So STDIN/STDOUT/STDERR should be UTF-8. Right?

I have "use utf8" in my source. So my source is utf8.

I am using a function identifier prefixed with the function prefix 'ƒ' (U+0192)... but in an earlier bug I complained that characters in that range should be interpreted as UTF-8, because to do otherwise prevents "escaping" the output to get wide characters.

'ƒ' (U+0192) is encoded in UTF-8 as \xc6\x92.

I have a debug routine that prints out the function it was called from.

Instead of "ƒRegister_FStype", I get: "Æ’Register_FStype" (which I see in vim displayed as the capital Latin AE ligature, followed by an unprintable 0x92). Echoed to a hex dump, it's: \xc3 \x86, \xc2 \x92.

It's like it ignored the utf8 flag in my code and read the utf8-encoded bytes in as latin1, then transcoded them to utf8 again on output.
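For what it's worth, a minimal sketch of the double encoding being described here, using the core Encode module (the scenario is the reporter's hypothesis about what is happening, not a confirmed diagnosis):

```perl
# Sketch of the suspected failure mode: take the UTF-8 bytes for 'ƒ' (U+0192),
# misread them as Latin-1, then re-encode the result as UTF-8 on output.
use strict;
use warnings;
use Encode qw(decode encode);

my $utf8_bytes = "\xc6\x92";                         # 'ƒ' encoded as UTF-8
my $misread    = decode('ISO-8859-1', $utf8_bytes);  # read back as two Latin-1 characters (Æ and 0x92)
my $reencoded  = encode('UTF-8', $misread);          # transcoded to UTF-8 again on the way out

print unpack('H*', $reencoded), "\n";                # c386c292 -- the bytes seen in the hex dump
```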

I'd like to vote for "unless you are in 'use bytes', values between 128-255 are interpreted as UTF-8 encoded data."

I thought that's what I'd get if I used utf8 in my code, but utf8 compatibility is internally broken/inconsistent, as evidenced by code that is tagged as utf8 still getting re-translated when going to a UTF-8 output stream.

I hope no one will try to justify why this isn't a bug -- i.e. why UTF-8 source is not compatible with UTF-8... as that would just be depressing no matter the rationalization.

I just got done realizing that vim is not UTF-8 compatible in its RE engine -- so no wonder it has problems parsing a UTF-8 language -- but then people tried to tell me that it really was UTF-8 compatible, even though the RE engine is ASCII-only. I have suggested (now, as well as years ago) that they use the Perl RE engine, as trying to duplicate all the work that has gone into Perl's Unicode support seems like a waste, not to mention nearly impossible to get right.

Perl Info
```
Flags:
    category=core
    severity=medium

This perlbug was built using Perl 5.14.2 - Wed Feb 8 15:59:25 UTC 2012
It is being executed now by Perl 5.14.2 - Wed Feb 8 15:55:36 UTC 2012.

Site configuration information for perl 5.14.2:

Configured by abuild at Wed Feb 8 15:55:36 UTC 2012.

Summary of my perl5 (revision 5 version 14 subversion 2) configuration:
  Platform:
    osname=linux, osvers=3.1.0-1.2-default, archname=x86_64-linux-thread-multi
    uname='linux build09 3.1.0-1.2-default #1 smp thu nov 3 14:45:45 utc 2011 (187dde0) x86_64 x86_64 x86_64 gnulinux '
    config_args='-ds -e -Dprefix=/usr -Dvendorprefix=/usr -Dinstallusrbinperl -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Dd_dbm_open -Duseshrplib=true -Doptimize=-fmessage-length=0 -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -g -Wall -pipe -Accflags=-DPERL_USE_SAFE_PUTENV -Dotherlibdirs=/usr/lib/perl5/site_perl'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DPERL_USE_SAFE_PUTENV -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-fmessage-length=0 -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -g -Wall -pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DPERL_USE_SAFE_PUTENV -fno-strict-aliasing -pipe -fstack-protector'
    ccversion='', gccversion='4.6.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib64 -fstack-protector'
    libpth=/lib64 /usr/lib64 /usr/local/lib64
    libs=-lm -ldl -lcrypt -lpthread
    perllibs=-lm -ldl -lcrypt -lpthread
    libc=/lib64/libc-2.14.1.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.14.1'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.14.2/x86_64-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib64 -fstack-protector'

Locally applied patches:

@INC for perl 5.14.2:
    /usr/lib/perl5/site_perl/5.14.2/x86_64-linux-thread-multi
    /usr/lib/perl5/site_perl/5.14.2
    /usr/lib/perl5/vendor_perl/5.14.2/x86_64-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.14.2
    /usr/lib/perl5/5.14.2/x86_64-linux-thread-multi
    /usr/lib/perl5/5.14.2
    /usr/lib/perl5/site_perl/5.14.2/x86_64-linux-thread-multi
    /usr/lib/perl5/site_perl/5.14.2
    /usr/lib/perl5/site_perl
    .

Environment for perl 5.14.2:
    HOME=/home/law
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LC_COLLATE=C
    LC_CTYPE=en_US.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=.:/sbin:/usr/local/sbin:/home/law/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/lib/qt3/bin:/usr/sbin:/etc/local/func_lib:/home/law/lib:/home/law/bin/lib
    PERL5OPT=-CSA
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```
p5pRT commented 12 years ago

From @Leont

On Mon, Aug 27, 2012 at 3:32 AM, Linda Walsh perlbug-followup@perl.org wrote:

Hi Linda,

I have PERL5OPT=-CSA in my env.

So STDIN/STDOUT/STDERR should be UTF-8. Right?

I have "use utf8" in my source. So my source is utf8.

I am using a function identifier prefixed with the function prefix 'ƒ' (U+0192)... but in an earlier bug I complained that characters in that range should be interpreted as UTF-8, because to do otherwise prevents "escaping" the output to get wide characters.

'ƒ' (U+0192) is encoded in UTF-8 as \xc6\x92.

I have a debug routine that prints out the function it was called from.

Instead of "ƒRegister_FStype", I get: "Æ’Register_FStype" (which I see in vim displayed as the capital Latin AE ligature, followed by an unprintable 0x92). Echoed to a hex dump, it's: \xc3 \x86, \xc2 \x92.

It's like it ignored the utf8 flag in my code and read the utf8-encoded bytes in as latin1, then transcoded them to utf8 again on output.

Can you add a short code example that shows the issue to your bug report? Without that, we can only guess what's going on.

I'd like to vote for "unless you are in 'use bytes', values between 128-255 are interpreted as UTF-8 encoded data."

We can't automatically do that in the general case, for reasons of backwards compatibility. We may enable «use utf8» though in the future under some «use 5.XX».

Leon

p5pRT commented 12 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

Leon Timmermans via RT wrote​:

We can't automatically do that in the general case\, for reasons of backwards compatibility. We may enable «use utf8» though in the future under some «use 5.XX».

Got a great idea for solving that.

Why not say that if you don't "use 5.20.0", you won't get the new semantics.

That solves the backwards compat problem.

Any reason not to move forward with the rest of the world?

As for a simple and/or short test case... that's not easy, as it's debug output from a 3400-line perl program.

I can write a simple test case, but I don't know if it will duplicate the problem, meaning I don't know what part of perl is doing it...

My test routine has this sub:

```perl
package FileSystem;
  use 5.12.0;
  use warnings;
  use P;
  use Dbg(1, 1, 1);
  use utf8;

  our $supported_fstypes;
  BEGIN{ bless $supported_fstypes=($supported_fstypes//{}), __PACKAGE__;}

  sub ƒRegister_FStype($) {
    Tsub;
    my $p = shift if ref $_[0];
    my $c = ref $p || $p;
    my $type_name = $_[0];      #fs_type = package supporting the 'type_name'
    $supported_fstypes->{$type_name}=\%{$p} unless $supported_fstypes->{$type_name};
    $supported_fstypes->{$type_name};
  }
```

Tsub is located in Dbg... and Dbg uses P;


Trying to strip down Dbg...


```perl
#! /usr/bin/perl -w
#
# Depends on "P"
#include mem(empty class)
{
  BEGIN{$::INC{'mem.pm'}=1}
  package Dbg;  #{{{1
  our $Version=0.0.1;
  # 0.0.1 - convert to varname/filename instead of RE (perf)
  #
  use mem &{sub(){$::INC{'Dbg.pm'}='/usr/lib/perl5'}};
  #use warnings;
  use 5.12.0;
  our (@EXPORT, $Tracing, $Dumping, $Tracesub);
  use mem &{sub(){ @EXPORT = qw(Tracing TPe Tsub) } };
  use mem &{sub(){ $Dbg::Tracing=1; } };
  use mem &{sub(){ $Dbg::Dumping=1; } };
  use mem &{sub(){ $Dbg::Tracesub=1; } };
  use P;
  use utf8;
  my $path_sep;

  sub varname ($)  { substr $_[0], (1+rindex $_[0],':') }  #cc'ed from Vars
  sub filename ($) { substr $_[0], (1+rindex $_[0],'/') }  #dup of above for path

  sub import () {
    my $callee=caller;
    my $c = __PACKAGE__;
    no strict 'refs';
    foreach my $proc (@EXPORT) {
      *{$callee."::".$proc}=\&{$c."::".$proc}
    }
    my @params=@_;
    our ($Tracing, $Dumping, $Tracesub) = @params if @params;
  }

  our %Internals=(Mem=>1, Dbg=>1, ErrHandler=>1);

  our $stack_depth_max=0;

  sub stack_depth (;$) {  # count anything outside of /usr area
    my $level = 1; my ($f,$p);
    ++$level while (($p,$f)=caller $level) and
      !$Internals{$p} && ($f !~ m{^/usr});
    my $amount;
    $stack_depth_max=0 unless defined $stack_depth_max;
    if (($amount=$stack_depth_max-$level)<0 ) {
      $stack_depth_max = $level;
      ${$_[0]} = -$amount if $_[0];
    }
    $level-1;
  }

  sub non_internal_level () {
    my $level = 1; my $c;
    ++$level while $c=(caller $level)[0] and $Internals{$c};
    $level-1;
  }

  sub pkgsub () {
    my $level=non_internal_level;
    ((caller $level)[0], (caller $level+1)[3])
  }

  sub fileln () {
    my $level=non_internal_level;
    (filename((caller $level)[1]), (caller $level)[2,3]);
  }

  sub fileln_fmt (;$) {
    state ($pfile,$pline,$psub);
    state $same=0;
    my ($file,$line,$sub) = &fileln;
    my $display;
    if ($pfile && $pfile eq $file) {
      $display=++$same<14?'same':'';
      if (!($pline && $pline == $line && !length($same))) {
        $display.=sprintf "#%04d", $line;
      }
    } else {
      $same=0;
      $display = sprintf "%s#%04d", $file, $line;
    }
    @_ and $display.= '(' . $sub . ')';
    ($pfile,$pline,$psub) = ($file, $line,$sub);
    "[$display]";
  }

  sub _Dbg_Var (;$) {
    my $name = shift;
    return undef unless $name;      # value = 0, not a valid ref;
    my $ref_name = ref $name;
    my $m=undef;
    my $s = stack_depth(\$m);
    unless ($m) {
      print STDERR ">" x $s
    } else {
      $s-=$m;
      my $str=(">" x $s). &{sub() {"»" x $m}};
      print STDERR $str;
    }
    print STDERR &fileln_fmt;
    return 1 unless $ref_name;      # value != 0, and not a ref => true
    my ($clr_pkg, $clr_sub) = pkgsub();
    $clr_sub = ( $clr_sub =~ m{:([^:]+)$} )[0];
    my $pkgtrace = $name->{$clr_pkg};
    if ($pkgtrace) {
      my $refpkg=ref $pkgtrace;
      unless ($refpkg) {            # unless it is a ref (i.e. - not simple T/F)
        #Pe "***$name in $clr_pkg\::<*> ($clr_sub)***\n";
        return 1;
      } elsif ($refpkg eq 'HASH' && $pkgtrace->{$clr_sub} ) {
        #Pe "***$name in $clr_pkg\::$clr_sub***\n";
        return 1;
      }
    }
    return undef;
  }

  sub Tracing(;$) {
    my $_ = { _=>$_, c0=>[(caller 0)[0,1,2,3]],
              c13=>((caller 1)[3])};
    #Pe "Tracing c0=%s, c2=%s, c3=%s c4=%s,c13=%s", @{$_->{c0}}, $_->{c13};
    my $namep;
    my $ignore=0;
    #@_ and $ignore+=$_[0];
    { no strict 'refs';
      my $mod_name = varname($_->{c0}->[3]);
      #P "modname=%s",$mod_name;

      defined $mod_name or do {
        $_=$_->{_};
        return undef;
      };

      $namep = \${$mod_name};
    }
    unless ($namep && $$namep) {
      $_=$_->{_};
      return undef;
    }
    unshift @_,$$namep;
    goto &_Dbg_Var;
  }

  sub Tracesub(;$) {
    my $namep;
    my $ignore=0;
    @_ and $ignore+=$_[0];
    { no strict 'refs';
      my $lit_name = varname((caller 0+$ignore)[3]);
      defined $lit_name or return undef;
      $namep = \${$lit_name};
    }
    return undef unless $namep;
    return undef unless $$namep;
    unshift @_,$$namep;
    goto &_Dbg_Var;
  }

  sub Tsub (;@) { Tracesub && Pe ('('.varname((caller 1)[3]).')',@_) }

1;} #}}}1
```


And P... is mostly a glorified (say/printf/sprintf) (Pe -> stderr)... I can include those if needed... but it's not exactly finished either (as if 'Dbg' is? or a program still being written is?)...

I stripped out the dumping routines from Dbg as I don't think those get called.

Does that give any clues?

p5pRT commented 12 years ago

From @arc

Linda Walsh perlbug-followup@perl.org wrote:

I am using a function identifier prefixed with the function prefix 'ƒ' (U+0192). I have a debug routine that prints out the function it was called from. Instead of "ƒRegister_FStype", I get: "Æ’Register_FStype"

AFAICT, this is caused by older versions of Perl not really handling Unicode in symbol names. However, Perl 5.16 contains substantial changes to make all of that work, as mentioned in this perldelta entry:

https://metacpan.org/module/perl5160delta#Unicode-Symbol-Names

```
$ cat foo.pl
use strict;
use warnings;
use utf8;

sub dbg { print "called from ", (caller 1)[3], "\n" }
sub ƒblah { dbg() }

ƒblah();
$ for v in 5.8.9 5.10.0 5.10.1 5.12.4 5.14.2 5.16.0; do
    (perlbrew use $v; echo -n "$v: "; perl -CSA foo.pl)
  done
5.8.9: called from main::Æ’blah
5.10.0: called from main::Æ’blah
5.10.1: called from main::Æ’blah
5.12.4: called from main::Æ’blah
5.14.2: called from main::Æ’blah
5.16.0: called from main::ƒblah
```

So I think this bug can be resolved.

-- Aaron Crane ** http://aaroncrane.co.uk/

p5pRT commented 12 years ago

From @Leont

On Mon Aug 27 06:23:34 2012, perl@aaroncrane.co.uk wrote:

AFAICT, this is caused by older versions of Perl not really handling Unicode in symbol names. However, Perl 5.16 contains substantial changes to make all of that work, as mentioned in this perldelta entry:

https://metacpan.org/module/perl5160delta#Unicode-Symbol-Names

```
$ cat foo.pl
use strict;
use warnings;
use utf8;

sub dbg { print "called from ", (caller 1)[3], "\n" }
sub ƒblah { dbg() }

ƒblah();
$ for v in 5.8.9 5.10.0 5.10.1 5.12.4 5.14.2 5.16.0; do
    (perlbrew use $v; echo -n "$v: "; perl -CSA foo.pl)
  done
5.8.9: called from main::Æ’blah
5.10.0: called from main::Æ’blah
5.10.1: called from main::Æ’blah
5.12.4: called from main::Æ’blah
5.14.2: called from main::Æ’blah
5.16.0: called from main::ƒblah
```

So I think this bug can be resolved.

Excellent, will close it now :-)

Leon

p5pRT commented 12 years ago

@Leont - Status changed from 'open' to 'resolved'

p5pRT commented 12 years ago

From @ikegami

On Sun, Aug 26, 2012 at 9:32 PM, Linda Walsh perlbug-followup@perl.org wrote:

I'd like to vote for "unless you are in "use bytes"\, values between 128-255 are interpretted as UTF-8 encoded data.

Interpreted by what?

Things that interpret strings as characters (regex, uc, etc.) finally treat them as Unicode codepoints, as they should, when C<< use feature qw( unicode_strings ); >> or similar is used.

The requested change would be awful. It would cause Perl to think I provided garbage in C<< use 5.012; my $uc = uc(chr(0xE9)); >> instead of placing "É" in $uc.

- Eric

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

Eric Brine via RT wrote​:

On Sun\, Aug 26\, 2012 at 9​:32 PM\, Linda Walsh \perlbug\-followup@&#8203;perl\.orgwrote​:

I'd like to vote for "unless you are in "use bytes"\, values between 128-255 are interpretted as UTF-8 encoded data. Interpreted by what?

Interpreted by perl, in or after 5.20.0, in the absence of a "use bytes" or "use locale" directive.

Things that interpret strings as characters (regex\, uc\, etc) finally treat them as Unicode codepoints as it should when C\<\< use feature qw( unicode_strings ); >> or similar is used.


That's currently true for Perl v5.16 (not 5.14, which this bug was filed against). I will bet there are several places where it isn't true in 5.16 as well, because use feature 'unicode_strings' isn't the default at the boundary.

You have a habit of using unqualified or domain-specific terminology that obfuscates your claims and what you are saying. This enables you to make categorical statements that are untrue in many cases, but when clarified you'll admit they only apply under certain conditions -- thus making them untrue for all but the 'certain conditions' which you don't state. In some cases, the conditions for your statements to be true are much narrower than the general case, making what you say generally false, OR it comes across as you attempting to have everyone accept your narrow definition as the 'general definition'.

Your usage of conditionally true statements as categorical statements not only dissuades discussion about whether or not they are a good idea, but suppresses consideration of how narrowly your statements must be qualified in order to be true. Thus you give the impression of holding more knowledge than you do, and since most readers know of *some* circumstance where what you say is true, they let it slide, assuming, in your imprecision, that you are talking about that area.

However, when talking to a group this is a disservice to the group, as those who don't know which specific day of the week (or version, year, etc.) your statement is true for improperly take it as some universal truth, when it's really a quirk of truth under some narrow conditions.

At the same time, you penalize people questioning you about such, by pointing out the obviousness of your statements' truths using examples from the narrow domains where they apply, which draws attention away from those areas where they don't -- thus adding nothing to the discussion other than you making broad generalizations about things that you would like to be true "in general" while remaining relatively free from close attention to just how narrowly your statements apply.

The requested change would be awful.


For a narrow group in a narrow situation, but in general it would be wonderful. This is a prime example of how you speak for a much wider category than exists for most people.

It would cause Perl to think I provided garbage in C<< use 5.012; my $uc = uc(chr(0xE9)); >> instead of placing "É" in $uc.

Great example. If that works with the proposed change, then can I assume I have your support?

I.e. -- let's see, the man page says:

    chr NUMBER
    chr     Returns the character represented by that NUMBER in the
            character set.

            For example, "chr(65)" is "A" in either ASCII or Unicode, and
            chr(0x263a) is a Unicode smiley face.

So chr(0xe9) will always return "latin small letter e with acute".

'uc' works on characters -- and will return the uppercase version of the same -- so does this mean I can count on your full support, knowing your legacy programs won't have an issue?

Note -- this would fix the bug in 'chr', where characters '0x80-0xff' are currently always mapped into the latin1 charset, regardless of your locale and language settings -- ALL numbers outside of the range 0x80-0xff return the correct unicode character, but the bug in 'chr' would be changed, so when the character is written to your screen (which you have defined as being of "some charset" -- the default would be unicode/utf8) it will display correctly, unless you have overridden your terminal's default and didn't bother telling perl about it.

So in the case of broken usage -- it would still be broken, but for the case you describe there'd be no issue -- Ain't it great! You and me, in agreement -- what could be the problem?

p5pRT commented 12 years ago

From @ikegami

On Mon, Aug 27, 2012 at 5:43 PM, Linda W perl-diddler@tlinx.org wrote:

Eric Brine via RT wrote​:

On Sun\, Aug 26\, 2012 at 9​:32 PM\, Linda Walsh \perlbug\-followup@&#8203;perl\.org**wrote​:

I'd like to vote for "unless you are in "use bytes"\, values between 128-255 are interpretted as UTF-8 encoded data. Interpreted by what?

Interpreted by perl\, in or after 5.20.0 in the absence of a "use bytes"\, or "use locale \" directive.

"perl" doesn't interpret bytes (unless you're referring to source code\, but I doubt it). Are you talking about some operators\, or what? The question has not been answered.

  Things that interpret strings as characters (regex\, uc\, etc) finally

treat them as Unicode codepoints as it should when C\<\< use feature qw( unicode_strings ); >> or similar is used.

----

That's currently true for Perl V5.16 (not 5.14 as this bug was filed against.). I will bet there are several places where it isn't true in 5.16 as well\, because use feature unicode isn't the default at the boundary.

That sentence makes no sense.

1) unicode_strings has nothing to do with any "boundary".

2) unicode_strings is *never* the default. It must be activated (e.g. using C<< use v5.xxx; >>), and that can't change for backwards compatibility reasons.
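A minimal sketch of what that activation looks like in practice (assuming Perl 5.12 or later, where the unicode_strings feature exists):

```perl
use strict;
use warnings;

my $e_acute = chr(0xE9);           # é, stored as a single 8-bit character

{
    # Default (legacy) semantics: uc() only maps ASCII for a non-UTF8-flagged string.
    my $uc = uc($e_acute);
    printf "without unicode_strings: U+%04X\n", ord $uc;   # U+00E9 (unchanged)
}

{
    use feature 'unicode_strings'; # or implicitly via: use v5.12; / use v5.14;
    my $uc = uc($e_acute);
    printf "with unicode_strings:    U+%04X\n", ord $uc;   # U+00C9 (É)
}
```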

You have a habit of using unqualified or domain-specific terminology

that obfuscates your claims and what you are saying.

No, I don't. We are using the exact same words (specifically, "UTF-8" and "Unicode"). You happen to often use them incorrectly, to the point that it's impossible to make sense of what you say.

This enables you to make categoric statements\, that are untrue in many cases\, but when clarified you'll admit to only applying to certain conditions -- thus making them untrue for all but the 'certain conditions' which you don't state.

Anyone will tell you that's untrue. I'm always "accused" of the opposite.

The requested change would be awful.

For a narrow group in a narrow situation\, but in general would be wonderful. This is a prime example of how you speak for a much wider category than exists for most people.

No, I was speaking generally. But you know what, it's easy to do. Just get rid of C<< use utf8; >> and C<< -CSDA >> on your UTF-8 system. People use those because it's advantageous to use those.

  It would cause Perl to think I provided garbage in C\<\< use 5.012; my

$uc = uc(chr(0xE9)); >> instead of placing "É" in $uc.

Great example. If that works with the proposed change\, then can I assume I have your support?

You may not. This would have to continue working with this unspecified change you mention, but that's far from the only thing that must not change. How many test cases does perl have, for starters?

  So chr(0xe9) will always return "latin small letter E with Grave".

Correct.

'uc' works on characters -- and will return the uppercase version of the same

Correct.

-- so does this mean I can count on your full support\, knowing your legacy programs won't have an issue?

Won't have an issue with what?

Note -- this would fix the bug in 'chr'\, where characters '0x80-0xff' are currently always mapped into the latin1 charset\, regardless of your local and language settings -- ALL numbers outside of the range 0x80-0xff return return the correct unicode character\, but the bug in 'chr' would be changed\,

There's no bug in chr. It always returns the correct character, even for inputs in 0x80-0xFF.

so when the character is written to your screen (which you have defined as being of "some charset

-- default would be to use unicode/utf8)

Default has to be bytes. The default is currently overridden using the following or similar:

use open ':std', ':locale';
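A minimal sketch of that kind of override (the open pragma and binmode are standard; choosing an explicit UTF-8 layer rather than :locale here is just for the example):

```perl
use strict;
use warnings;
use utf8;                                  # the source file itself is UTF-8

# Make the standard handles encode to UTF-8 (similar in effect to running with -CS):
use open qw(:std :encoding(UTF-8));

# Alternatively, per handle:
# binmode STDOUT, ':encoding(UTF-8)';

print "ƒé\n";                              # characters are encoded to UTF-8 bytes on output, no warning
```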

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

Eric Brine wrote​:

On Mon, Aug 27, 2012 at 5:43 PM, Linda W perl-diddler@tlinx.org wrote:

Note \-\- this would fix the bug in 'chr'\, where characters '0x80\-0xff'
are currently always mapped into the latin1 charset\, regardless of
your
local and language settings \-\- ALL numbers outside of the range
0x80\-0xff return return the correct unicode character\, but the bug in
'chr' would be changed\,

There's no bug in chr. It always returns the correct character\, even for inputs in 0x80-0xFF.

so when the character is written to your screen
\(which you have defined as being of "some charset

\-\- default would be to use unicode/utf8\)

Default has to be bytes. The default is currently overridden using the following or similar​: use open '​:std'\, '​:locale';


Oh? That's not what I see...

```
perl -e 'use P; foreach my $i (0x65,0xe9,0x192) { my $c1=chr($i); my $uc1=uc($c1); P "i=%s, c1=%s, uc1=%s", $i, $c1, $uc1; }'
i=101, c1=e, uc1=E
i=233, c1=, uc1=
Wide character in print at /usr/lib/perl5/5.14.2/x86_64-linux-thread-multi/IO/Handle.pm line 417.
i=402, c1=ƒ, uc1=Ƒ
```


chr says it:

    Returns the character represented by that NUMBER in the
    character set. For example, "chr(65)" is "A" in either ASCII
    or Unicode, and chr(0x263a) is a Unicode smiley face.


But the character you present and want to work -- doesn't work in my current 5.14 perl.

This is the "compatible behavior" you need to keep ?

For the above example I unset PERL5OPT, relying on defaults --

It is only the characters in the range 0x80-0xff that fail with chr to produce a character in *THE* character set.

I don't switch charsets mid-program. Perl does. It violates any specification -- you said it had to use 'bytes'... but it doesn't. It returned an error for the char you want to use. If it expected bytes, it would have output the byte -- but it did not. Instead, it corrupts the data -- producing output that isn't LATIN1 compatible NOR Unicode compatible -- THE WORST possible choice.

Please don't claim that backwards compatibility requires this broken behavior forever into the future, because backwards compatibility is about the past -- NOT the future.

  That's currently true for Perl V5\.16 \(not 5\.14 as this bug was filed
against\.\)\.  I will bet there are several places where it isn't true in
5\.16 as well\, because use feature unicode isn't the default at the
boundary\.

That sentence makes no sense.

1) unicode_strings has nothing to do with any "boundary".


How much do you know about perl's internals? There is character data in perl and character data outside of perl. If you have perl data and want to convert it to Latin1, the man page says:

    "For example, to convert a string from Perl's internal format into
     ISO-8859-1, also known as Latin1:

       $octets = encode("iso-8859-1", $string);"

There is a perl internal format that is different from the output format. Similarly:

    "$string = decode(ENCODING, OCTETS[, CHECK])

     This function returns the string that results from decoding the
     scalar value OCTETS, assumed to be a sequence of octets in ENCODING,
     into Perl's internal form."

Again, the choice of how you tell perl to interpret the chars is up to you. But you decide on an interpretation, then tell perl what routine to use for it to cross the boundary into perl's internal form. This is called a 'boundary'.
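A minimal sketch of that decode-at-input / encode-at-output boundary crossing, using the core Encode module (the sample string is made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Into perl: octets from the outside world -> perl characters.
my $octets_in = "\xc6\x92Register_FStype";            # UTF-8 bytes, e.g. read from a file
my $string    = decode('UTF-8', $octets_in);          # 'ƒ' is now a single character
printf "characters: %d, octets: %d\n",
       length($string), length($octets_in);           # characters: 16, octets: 17

# Out of perl: perl characters -> octets for a byte-oriented handle.
my $octets_out = encode('UTF-8', $string);
print $octets_out, "\n";                              # safe to print without an :encoding layer
```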

Theoretically, perl's internal format, I'm told, approaches UTF-8 for some value of UTF-8 (except perhaps in the broken range). Characters in that range don't get fully treated like latin1 nor UTF-8 -- it's broken.

2) unicode_strings is *never* the default. It must be activated (e.g. using C\<\< use v5.xxx; >>)\, and that can't change for backwards compatibility reasons.

 You have a habit of using unqualified or domain\-specific terminology

that obfuscates your claims and what you are saying\.  

No\, I don't. We are using the exact same words ...

Please show me where I am using C<> in any form. I use the word boundary to describe a place where on one side there is one thing, and on the other side another. You think there are no boundaries in Perl, or that boundaries have nothing to do with unicode, yet clearly the perl man pages refer to characters needing to be encoded or decoded when crossing the boundary from some external interpretation into perl's internal form.

You somehow think that deciding what encoding a charset is in is not 'interpretation' -- likely in the same way you think that choosing to not try to read my writing as comprised of Japanese words and characters is also not an act of choosing a particular interpretation of the binary data that is transmitted.

The requested change would be awful.

 For a narrow group in a narrow situation\, but in general would be
wonderful\.  This is a prime example of how you speak for a much wider
category than exists for most people\.

No\, I was speaking generally. But you know what\, it's easy to do. Just get rid of C\<\< use ut8; >> and C\<\< -CSDA >> on your UTF-8 system. People use those because it's advantageous to use those.


I did -- and have lots of situations like "wide character in print at" because perl doesn't handle unicode.

=============

     It would cause Perl to think I provided garbage in C\<\< use
    5\.012; my
     $uc = uc\(chr\(0xE9\)\); >> instead of placing "É" in $uc\.

  Great example\.  If that works with the proposed change\, then can
I assume I have your support?

You may not. This would have to continue working with this unspecified change you mention\, but that far from the only thing that must not change. How many test cases does perl have\, for starters?


  Obviously it doesn't include the test case at the top -- based on your simple example.

If Perl assumed UTF-8 encoding and didn't do internal conversions in the range 0x80-0xff, then the above would work as well as my example. As it is, perl fails. This is your idea of backward compatibility? It sounds backwards -- right down to the bugs.

But fortunately, someone can say 'use 5.20.0' (or whenever perl becomes unicode compatible) and that's what they will get -- and there won't be any issues with backwards compatibility -- because no program could be using that construct and still be working with present versions of perl.

So can we lay the strawman of "backwards compatibility" to rest?

p5pRT commented 12 years ago

From @ikegami

On Mon, Aug 27, 2012 at 10:41 PM, Linda W perl-diddler@tlinx.org wrote:

Eric Brine wrote​:

On Mon\, Aug 27\, 2012 at 5​:43 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Note -- this would fix the bug in 'chr'\, where characters '0x80-0xff' are currently always mapped into the latin1 charset\, regardless of your local and language settings -- ALL numbers outside of the range 0x80-0xff return return the correct unicode character\, but the bug in 'chr' would be changed\,

There's no bug in chr. It always returns the correct character\, even for inputs in 0x80-0xFF.

so when the character is written to your screen (which you have defined as being of "some charset

-- default would be to use unicode/utf8)

Default has to be bytes. The default is currently overridden using the following or similar​: use open '​:std'\, '​:locale';

----- Oh? That's not what I see...

perl -e 'use P; foreach my $i (0x65\,0xe9\,0x192) { my $c1=chr($i); my $uc1=uc($c1); P "i=%s\, c1=%s\, uc1=%s"\, $i\, $c1\, $uc1; }' i=101\, c1=e\, uc1=E i=233\, c1=c1=wide character in print at /usr/lib/perl5/5.14.2/x86_64-linux-thread-multi/IO/Handle.pm line 417. i=402\, c1=ƒ\, uc1=Ƒ

---- chr says it Returns the character represented by that NUMBER in the

           character set\.  For example\, "chr\(65\)" is "A" in either ASCII
           or Unicode\, and chr\(0x263a\) is a Unicode smiley face\.

--- But the character you present and want to work -- doesn't work in my current 5.14 perl.

[Why do you always use such an awful quoting style?]

You showed a problem with *uc*, not with *chr*.

Furthermore, you're not using the 5.14 language. If you did, uc would work fine.

This is the "compatible behavior" you need to keep ?

Many programs expect uc to only change ASCII chars, so *we* need to preserve that in old versions of the language. New versions of the language (e.g. use 5.014;) do not have that behaviour, so it obviously does not need to be "kept".

I don't switch charsets mid-program.Perl does.

No it doesn't. Perl doesn't know anything about charsets, so it can't possibly switch them.

1) unicode_strings has nothing to do with any "boundary".

---- How much do know about perl's internals?

A lot.

Theoretically\, perl's internal format\, I'm told\, approaches UTF-8 for some value of UTF-8 (except perhaps in the broken range). Characters in that range don't get full treated like latin1 nor UTF-8 -- it's broken.

Just like Perl has multiple number formats (IV, UV, NV), Perl has more than one string format. It has one that can store 8-bit values, and it has one that can store 32/64-bit values (depending on the build). Perl can convert between the two seamlessly, just like it does with numbers.

Some functions cared about the internal format. Those were fixed in the 5.14 language (C<< use 5.014; >>). (There are some issues remaining with file names.)
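A minimal sketch of that seamlessness (assuming Perl 5.14+; utf8::is_utf8 and utf8::upgrade are core utilities that expose the internal format without changing the string's meaning):

```perl
use v5.14;                     # implies strict and the unicode_strings feature
use warnings;

my $narrow = "caf\xe9";        # é stored in the 8-bit internal format
my $wide   = "caf\x{e9}";      # same characters...
utf8::upgrade($wide);          # ...forced into the wide internal format

say utf8::is_utf8($narrow) ? 'narrow: wide format' : 'narrow: 8-bit format';  # 8-bit format
say utf8::is_utf8($wide)   ? 'wide: wide format'   : 'wide: 8-bit format';    # wide format
say $narrow eq $wide         ? 'strings compare equal' : 'strings differ';    # compare equal
say uc($narrow) eq uc($wide) ? 'uc() agrees'           : 'uc() differs';      # agrees
```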

Please show me where I am using C\<\> in any form.

huh? Never heard of xxyz.

you think there are no boundaries in Perl

Completely false. Where did you get that idea?????

You somehow think that deciding what encoding a charset is in is not 'interpretation

No idea what that means.

I did -- and have lots of situations like "wide character in print at" because perl doesn't handle unicode.

No, that very message is proof that you *didn't*. You're still using Unicode characters instead of UTF-8.

But fortunately\, someone can say 'use 5.20.0' (or whenever perl becomes unicode compatible)

duh, I told you that.

So can we lay the strawman "backwards compatibility " to rest"?

Please do and address what I actually said. It's like you don't even read that to which you reply.

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

Eric Brine wrote​:

On Mon\, Aug 27\, 2012 at 10​:41 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Eric Brine wrote​:

On Mon\, Aug 27\, 2012 at 5​:43 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Note -- this would fix the bug in 'chr'\, where characters '0x80-0xff' are currently always mapped into the latin1 charset\, regardless of your local and language settings -- ALL numbers outside of the range 0x80-0xff return return the correct unicode character\, but the bug in 'chr' would be changed\,

There's no bug in chr. It always returns the correct character\, even for inputs in 0x80-0xFF.

so when the character is written to your screen (which you have defined as being of "some charset

-- default would be to use unicode/utf8)

Default has to be bytes. The default is currently overridden using the following or similar​: use open '​:std'\, '​:locale';

----- Oh? That's not what I see...

perl -e 'use P; foreach my $i (0x65\,0xe9\,0x192) { my $c1=chr($i); my $uc1=uc($c1); P "i=%s\, c1=%s\, uc1=%s"\, $i\, $c1\, $uc1; }' i=101\, c1=e\, uc1=E i=233\, c1=c1=wide character in print at /usr/lib/perl5/5.14.2/x86_64-linux-thread-multi/IO/Handle.pm line 417. i=402\, c1=ƒ\, uc1=Ƒ

---- chr says it Returns the character represented by that NUMBER in the

           character set\.  For example\, "chr\(65\)" is "A" in either ASCII
           or Unicode\, and chr\(0x263a\) is a Unicode smiley face\.

--- But the character you present and want to work -- doesn't work in my current 5.14 perl.

[Why do you always use such an awful quoting style?]

You showed a problem with *uc*\, not with *chr*.


?? how do you arrive at that conclusion?

So why do I get the same error if I remove *uc*?

```
n> perl -e 'use P; foreach my $i (0x65,0xe9,0x192) { my $c1=chr($i); P "i=%s, c1=%s", $i, $c1; } '
i=101, c1=e
i=233, c1=
Wide character in print at /usr/lib/perl5/5.14.2/x86_64-linux-thread-multi/IO/Handle.pm line 417.
i=402, c1=ƒ
```

p5pRT commented 12 years ago

From @Leont

On Wed, Aug 29, 2012 at 9:36 AM, Linda W perl-diddler@tlinx.org wrote:

?? how do you arrive at that conclusion?

So why do I get the same error if I remove *uc*?

n> perl -e 'use P;

foreach my $i (0x65\,0xe9\,0x192) { my $c1=chr($i); P "i=%s\, c1=%s"\, $i\, $c1; } ' i=101\, c1=e i=233\, c1=ide character in print at /usr/lib/perl5/5.14.2/x86_64-linux-thread-multi/IO/Handle.pm line 417. i=402\, c1=ƒ

Because STDOUT hasn't been set to being utf8. It's printing the 0x192 that causes that error.
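A minimal sketch of the fix Leon is pointing at, with plain printf standing in for the P module (the :encoding layer is the documented way to mark a handle as UTF-8):

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';    # roughly what -CS / PERL5OPT=-CS does for the std handles

foreach my $i (0x65, 0xe9, 0x192) {
    my $c1 = chr($i);
    printf "i=%s, c1=%s\n", $i, $c1;   # now both é and ƒ come out as UTF-8 bytes, no warning
}
```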

Leon

p5pRT commented 12 years ago

From @demerphq

On 29 August 2012 09:36, Linda W perl-diddler@tlinx.org wrote:

Eric Brine wrote​:

On Mon\, Aug 27\, 2012 at 10​:41 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Eric Brine wrote​:

On Mon\, Aug 27\, 2012 at 5​:43 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Note -- this would fix the bug in 'chr'\, where characters '0x80-0xff' are currently always mapped into the latin1 charset\, regardless of your local and language settings -- ALL numbers outside of the range 0x80-0xff return return the correct unicode character\, but the bug in 'chr' would be changed\,

There's no bug in chr. It always returns the correct character\, even for inputs in 0x80-0xFF.

so when the character is written to your screen (which you have defined as being of "some charset

-- default would be to use unicode/utf8)

Default has to be bytes. The default is currently overridden using the following or similar​: use open '​:std'\, '​:locale';

----- Oh? That's not what I see...

perl -e 'use P; foreach my $i (0x65\,0xe9\,0x192) { my $c1=chr($i); my $uc1=uc($c1); P "i=%s\, c1=%s\, uc1=%s"\, $i\, $c1\, $uc1; }' i=101\, c1=e\, uc1=E i=233\, c1=c1=wide character in print at /usr/lib/perl5/5.14.2/x86_64-linux-thread-multi/IO/Handle.pm line 417. i=402\, c1=ƒ\, uc1=Ƒ

Please don't post examples using your pet module P. We have no idea what that module does and we don't care really, except to the extent that perhaps you can extract part of it and show it to be buggy.

If you use a utf8 terminal then you must make perl output utf8 representations of any characters you are using.

Here is an explicit way to do so:

```
perl -e '
  use Encode qw(encode_utf8);
  foreach my $i (0x65,0xe9,0x192) {
    my $c1=chr($i);
    print encode_utf8 "i=%s, c1=%s", $i, $c1;
  }
'
```

As Leon says, you could also use IO layers to achieve this.

chr(192) is going to be "encodingless" in perl until it is concatenated with a unicode string, or it is unicode "upgraded".

Any output you put through IO operators like print (absent IO layers) is expected to be BYTES. encode_utf8() ensures that your input string, whatever it may be, is converted to the appropriate byte sequence dictated by utf8 for each codepoint the string contains.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @demerphq

On 29 August 2012 10:22, demerphq demerphq@gmail.com wrote:

On 29 August 2012 09​:36\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Eric Brine wrote​:

On Mon\, Aug 27\, 2012 at 10​:41 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Eric Brine wrote​:

On Mon\, Aug 27\, 2012 at 5​:43 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Note -- this would fix the bug in 'chr'\, where characters '0x80-0xff' are currently always mapped into the latin1 charset\, regardless of your local and language settings -- ALL numbers outside of the range 0x80-0xff return return the correct unicode character\, but the bug in 'chr' would be changed\,

There's no bug in chr. It always returns the correct character\, even for inputs in 0x80-0xFF.

so when the character is written to your screen (which you have defined as being of "some charset

-- default would be to use unicode/utf8)

Default has to be bytes. The default is currently overridden using the following or similar​: use open '​:std'\, '​:locale';

----- Oh? That's not what I see...

perl -e 'use P; foreach my $i (0x65\,0xe9\,0x192) { my $c1=chr($i); my $uc1=uc($c1); P "i=%s\, c1=%s\, uc1=%s"\, $i\, $c1\, $uc1; }' i=101\, c1=e\, uc1=E i=233\, c1=c1=wide character in print at /usr/lib/perl5/5.14.2/x86_64-linux-thread-multi/IO/Handle.pm line 417. i=402\, c1=ƒ\, uc1=Ƒ

Please done post examples using your pet module P. We have no idea what that module does and we dont care really\, except to the extent that perhaps you can extract part of it and show it to be buggy.

If you use a utf8 terminal then you must make perl output utf8 representations of any characters you are using.

Here is an explicit way to do so​:

perl -e ' use Encode qw(encode_utf8); foreach my $i (0x65\,0xe9\,0x192) { my $c1=chr($i); print encode_utf8 "i=%s\, c1=%s"\, $i\, $c1;

That should be:

    print encode_utf8 sprintf "i=%s, c1=%s", $i, $c1;

Sorry. -- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @Leont

On Wed, Aug 29, 2012 at 10:22 AM, demerphq demerphq@gmail.com wrote:

Please done post examples using your pet module P. We have no idea what that module does and we dont care really\, except to the extent that perhaps you can extract part of it and show it to be buggy.

It seems to do something like «printf "$_[0]\n", @_[1..$#_]»

chr(192) is going to be "encodingless" in perl until it is concatenated with a unicode string\, or it is unicode "upgraded".

She's using 0x192, not 192 ;-)

Leon

p5pRT commented 12 years ago

From @demerphq

On 29 August 2012 11:15, Leon Timmermans fawaka@gmail.com wrote:

On Wed\, Aug 29\, 2012 at 10​:22 AM\, demerphq \demerphq@&#8203;gmail\.com wrote​:

Please done post examples using your pet module P. We have no idea what that module does and we dont care really\, except to the extent that perhaps you can extract part of it and show it to be buggy.

It seems to do something like «printf "$_[0]\n"\, @​_[1..$#_]»

Ah. Well neither of us should need to know that. She should spell it out so anyone reading this list can understand what she is talking about without additional context.

chr(192) is going to be "encodingless" in perl until it is concatenated with a unicode string\, or it is unicode "upgraded".

She's using 0x192\, not 192 ;-)

Ah, right, not enough coffee when I wrote my reply, obviously.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

Leon Timmermans wrote​:

On Wed\, Aug 29\, 2012 at 9​:36 AM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

?? how do you arrive at that conclusion?

So why do I get the same error if I remove *uc*?

n> perl -e 'use P;

foreach my $i (0x65\,0xe9\,0x192) { my $c1=chr($i); P "i=%s\, c1=%s"\, $i\, $c1; } ' i=101\, c1=e i=233\, c1=ide character in print at /usr/lib/perl5/5.14.2/x86_64-linux-thread-multi/IO/Handle.pm line 417. i=402\, c1=ƒ

Because STDOUT hasn't been set to being utf8. It's printing the 0x192 that causes that error.

Then let's reverse the order:

```
perl -e 'use P; foreach my $i (0x65, 0x192, 0xe9) { my $c1=chr($i); P "i=%s, c1=%s", $i, $c1; } '
i=101, c1=e
Wide character in print at /usr/lib/perl5/5.14.2/x86_64-linux-thread-multi/IO/Handle.pm line 417.
i=402, c1=ƒ
i=233, c1=
```

Why does it print the script ƒ correctly, but not the acute é? With the high bit of the char (0x80) set, they are both known to be non-ASCII. In both cases, on output, perl is treating them as non-ASCII. Yet it corrupts the acute é.

So, UTF-8 isn't set on my output... What is it set to, that it correctly prints the script ƒ but not the acute é?

p5pRT commented 12 years ago

From @Leont

On Wed, Aug 29, 2012 at 7:25 PM, Linda W perl-diddler@tlinx.org wrote:

Why does it print the script ƒ correctly\, but not the acute é ?

The é is printed as latin-1 to STDOUT, because STDOUT is not in any kind of unicode mode. The ƒ cannot be transformed into latin-1, so a warning is given and the sequence is written as UTF-8 as a fallback.

Your terminal is set up to be UTF-8, which means the ƒ will be interpreted correctly but the é will not. If you had set it to latin-1 it would be just the other way around.

With the high bit of the char (0x80) set\, they are both known to be non-ASCII. In both cases\, on output\, perl is treating them as non-ASCII. Yet it corrupts the acute é .

No it doesn't. Your terminal does.

So\, UTF-8 isn't set on my output...What is it set to? that it correctly prints the script ƒ but not the acute é ?

Bytes

Leon

p5pRT commented 12 years ago

From @ikegami

On Wed, Aug 29, 2012 at 3:36 AM, Linda W perl-diddler@tlinx.org wrote:

You showed a problem with *uc*\, not with *chr*.

---- ?? how do you arrive at that conclusion?

Sorry, a problem with your *print*. You passed garbage to it (non-bytes when you had told it to expect bytes).

p5pRT commented 12 years ago

From @ikegami

On Wed, Aug 29, 2012 at 1:25 PM, Linda W perl-diddler@tlinx.org wrote:

Why does it print the script ƒ correctly\,

I don't call issuing a warning and getting the right character by a fluke "printing correctly".

but not the acute é ?

You didn't print "é". You printed E9. E9 got printed correctly. It doesn't show up as é because E9 is not é to your terminal.

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

Eric Brine via RT wrote​:

On Wed\, Aug 29\, 2012 at 1​:25 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Why does it print the script ƒ correctly\,

I don't call issuing a warning and getting the right character by a fluke "printing correctly".

but not the acute é ?

You didn't print "é". You printed E9. E9 got printed correctly.

No -- I used:

    foreach my $i (0x65,0xe9,0x192) {P "i=%s, c1=%s", $i, chr($i)}

which prints chr(0xe9), not E9. This is the bug in chr that I'm referring to -- the manpage for chr says:

    "For example, "chr(65)" is "A" in either ASCII
     or Unicode, and chr(0x263a) is a Unicode smiley face."

So 0x65 -> code point 0x65 prints correctly, and 0x192 -> code point 0x192 also prints correctly. The bug is that printing chr(0xe9) doesn't print code point 0xE9. It prints garbage.

The second point is responding to Leon, who said in response to me:

Leon Timmermans wrote​:

On Wed\, Aug 29\, 2012 at 7​:25 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

So\, UTF-8 isn't set on my output...What is it set to? that it correctly prints the script ƒ but not the acute é ?

Bytes

If chr(0xe9) gave unicode "code-point 0xe9" in the same way that "chr(0x192)" and "chr(0x65)" do, I would see the acute e (é). But Eric claims that chr(0xe9) doesn't produce code point 0xe9, as it is documented to do, but produces the byte value E9. That is consistent with your statement that perl is generating "bytes".

However, if perl was printing out "bytes" I should see \x92\x01 where it prints the script ƒ -- which would be FINE with me, as I have not told it to convert to UTF-8 on output. It's not that I don't know how to do that. I do. But I am talking about **default** behavior.

In examining the hex output:

```
00000000 69 3d 31 30 31 2c 20 63 31 3d 65 0a 69 3d 32 33 |i=101, c1=e.i=23|
00000010 33 2c 20 63 31 3d e9 0a 69 3d 34 30 32 2c 20 63 |3, c1=..i=402, c|
00000020 31 3d c6 92 0a                                  |1=...|
```

We see that chr(0x192) doesn't print \x92\x01 (which would be the byte values as ordered in memory), but, contrary to Eric's claim that Perl does no interpretation, it has interpreted chr(0x192) to mean \u0192, i.e. code point 402 decimal (U+0192), and generates the representation of that on output.

However, it doesn't correctly interpret chr(0xe9) as \u00e9 (code point 233 decimal, U+00E9, the acute e) when generating output.

In the perlre manpage, under the search rules, the default is:

    On ASCII platforms, the native rules are ASCII, and on EBCDIC platforms
    (at least the ones that Perl handles), they are Latin-1.

Ok... I'm not on an EBCDIC platform.

But under unicode interpretation (/u) we see:

    "On ASCII platforms, this means that the code points between 128 and 255
     take on their Latin-1 (ISO-8859-1) meanings (which are the same as
     Unicode's), whereas in strict ASCII their meanings are undefined. Thus
     the platform effectively becomes a Unicode platform."

So the problem is that on output, perl isn't following the default rules -- to use either ASCII OR, for c>0x80, Unicode -- but is using the rules that would only be active under "/u" in a search string.

The default is the problem... it should behave like the default in search. To do otherwise is to create an internal inconsistency that is entirely within perl -- i.e. in the regex, the default is to treat it as a code point, but when I go to write it out, it converts it to latin1.

I am suggesting that after perl 5.20.0, this inconsistency be eliminated and it default to writing UTF-8 on output. Such would not be the case if a pragma was in effect (like use bytes), but as it is now, I get neither UTF-8 NOR byte compatibility on output -- but UTF-8 encoded output for code points > 0xff and < 0x7f, yet NOT between those values.

p5pRT commented 12 years ago

From @ikegami

On Wed, Aug 29, 2012 at 6:55 PM, Linda W perl-diddler@tlinx.org wrote:

Eric Brine via RT wrote​:

On Wed\, Aug 29\, 2012 at 1​:25 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Why does it print the script ƒ correctly\,

I don't call issuing a warning and getting the right character by a fluke "printing correctly".

but not the acute é ?

You didn't print "é". You printed E9. E9 got printed correctly.

No -- I used​:

"foreach my $i (0x65\,0xe9\,0x192) {P "i=%s\, c1=%s"\, $i\, chr($i)}"

which prints "chr(0xe9)"\, not E9. This is the bug in chr that I'm referring to -- the manpage for chr says​:

I didn't mean the chars "E" and "9", I meant the byte with value 0xE9.

So again, you told Perl to print E9, and Perl printed E9. No bug there.

So again, the bug is actually in your code when you tried to print chr(0x192): It doesn't map to a byte, and you didn't tell Perl how it should convert it to a byte. That's why you got a warning. That's why it wouldn't have worked in a different terminal.

Neither of these can be changed.

The default handling of what is sent to a handle cannot be changed either. It would be acceptable for a pragma to change the default, though, and such a pragma already exists (open).

"For example\, "chr(65)" is "A" in either ASCII or Unicode\, and chr(0x263a) is a Unicode smiley face."

Note the "in ASCII" and "in Unicode". Elsewhere (e.g. to a UTF-8 terminal like yours), it means other things.

So 0x65 -> code point 0x65

chr(0x65) is not code point 0x65. It's a character with value 0x65. It could mean U+0065, but it could also mean "violet". Perl does not, should not, and cannot assign any meaning to it.

- Eric

p5pRT commented 12 years ago

From @ikegami

On Wed, Aug 29, 2012 at 7:29 PM, Eric Brine ikegami@adaelis.com wrote:

So again\, you told Perl to print E9\, and Perl printed E9. No bug there.

So again\, the bug is actually in your code when you tried to print chr(0x192)​: It doesn't map to a byte\, and you didn't tell Perl how it should convert it to a byte. That's why you got a warning. That why it wouldn't have worked in a different terminal.

I forgot to mention: Keep in mind that handles can only transmit bytes.

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

Eric Brine wrote​:

On Wed\, Aug 29\, 2012 at 6​:55 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Eric Brine via RT wrote​:

On Wed\, Aug 29\, 2012 at 1​:25 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Why does it print the script ƒ correctly\,

I don't call issuing a warning and getting the right character by a fluke "printing correctly".

but not the acute é ?

You didn't print "é". You printed E9. E9 got printed correctly.

No -- I used​:

"foreach my $i (0x65\,0xe9\,0x192) {P "i=%s\, c1=%s"\, $i\, chr($i)}"

which prints "chr(0xe9)"\, not E9. This is the bug in chr that I'm referring to -- the manpage for chr says​:

I didn't mean the chars "E" and "9"\, I meant the byte with value 0xE9.


I didn't say you meant the chars E and 9; I was using your syntax for 0xe9 (as faulty as you seem to now think it is).

  You can't even understand your own syntax -- and every other point addressed below is you playing word games -- as though you were deliberately being an obstructive troll.

So again\, you told Perl to print E9\, and Perl printed E9. No bug there.

No... I told it to print chr(0xE9)... that's what the code says.

So again, the bug is actually in your code when you tried to print chr(0x192): It doesn't map to a byte, and you didn't tell Perl how it should convert it to a byte. That's why you got a warning. That's why it wouldn't have worked in a different terminal.


  Perl displayed a warning because it didn't know how to convert UTF-8 -- so it displayed UTF-8 and DID THE RIGHT THING -- if it was to display to UTF-8 -- IT did the wrong thing if it was to display it as bytes...

You and others have claimed the output is in bytes -- that's not what it did. It encoded 0x192 as UTF-8 and displayed that.

You can't have it both ways. Either don't convert the 0x192 to UTF-8 on output OR DO convert the 0xE9 -- right now it's broken no matter which way you look at it.

Neither of these can be changed.

You seem to think that because you don't want it, it can't be done. This isn't supported by physical or legislative law, and certainly not by common sense. It's a bug, and it can be changed; or it's a bad design, and it can still be changed. Why do you think it's called "software"? It isn't carved in stone.

The default handling of what is sent to handle cannot be changed either. It would be acceptable for a pragma to change the default\, though\, and such a pragma already exists (open).

"For example\, "chr(65)" is "A" in either ASCII or Unicode\, and chr(0x263a) is a Unicode smiley face."

Note the "in ASCII" and "in Unicode". Elsewhere (e.g. to a UTF-8 terminal like yours)\, it means other things.

But my terminal displays unicode. Not UTF-8. UTF-8 is not a character set, it is an "encoding". Characters are displayed -- not "encodings". "Encodings" are used to transport non-byte information over a byte-streamed data path.

So 0x65 -> code point 0x65

chr(0x65) is not code point 0x65. It's a character with value 0x65. It could mean U+0065\, but it could also mean "violet". Perl does not\, should not and cannot assign any meaning to it.

-- Sorry, but the man page says it is "A" -- a character in ASCII or UNICODE.

Why don't you fix the manpage and stop playing word games.

- Eric

p5pRT commented 12 years ago

From @dmcbride

Eric Brine wrote​:

On Wed\, Aug 29\, 2012 at 6​:55 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Eric Brine via RT wrote​:

On Wed\, Aug 29\, 2012 at 1​:25 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Why does it print the script ƒ correctly\,

I don't call issuing a warning and getting the right character by a fluke "printing correctly".

This is not nearly as straightforward as you might want to think it is.

but not the acute é ?

You didn't print "é". You printed E9. E9 got printed correctly.

No -- I used​:

"foreach my $i (0x65\,0xe9\,0x192) {P "i=%s\, c1=%s"\, $i\, chr($i)}"

which prints "chr(0xe9)"\, not E9. This is the bug in chr that I'm

referring to -- the manpage for chr says​: I didn't mean the chars "E" and "9"\, I meant the byte with value 0xE9.

What character is at unicode codepoint 0xe9?

```
$ perl -e 'print chr 0xe9'| hexdump
0000000 00e9
0000001
```

Looks like it's printing out properly to me. However, it doesn't print é, it prints �. Garbage. Is that something perl is doing? Let's check when we do it "right":

```
$ perl -CS -e 'print chr 0xe9'| hexdump
0000000 a9c3
0000002
```

And here we get our é. Notice that it didn't actually print out 0xe9. Instead, it was converted to UTF-8, as per the -CS, and printed out that way. And then the terminal interpreted the UTF-8, translated it to a unicode code point, and looked up the glyph to display.

So what happened when we printed out 0xe9 directly? The same thing: the terminal tried to interpret it as-is, but because the high bit is set, it's looking for something else. Since nothing else is printed, the unicode conversion fails, and we get a garbage glyph.

You can't even understand your own syntax \-\- and every other point

Actually, I'm guessing that Eric simply can't understand you.

addressed below is you playing word games -- as though you were deliberately being an obstructive troll.

No, I'm guessing that Eric is wondering about the same thing in reverse, but is too polite to say it. You should also try not to call people who are attempting to help you, even if they're failing, names of any kind.

So again\, the bug is actually in your code when you tried to print chr(0x192)​: It doesn't map to a byte\, and you didn't tell Perl how it should convert it to a byte. That's why you got a warning. That why it wouldn't have worked in a different terminal.

-----

Perl displayed a warning because it didn't know how to convert

UTF-8 -- so it displayed UTF-8 and DID THE RIGHT THING -- if it was

No\, Perl displayed the bytes as-is\, after warning you that it doesn't know what to do with them.

to display to UTF-8 -- IT did the wrong thing if it was to display it as bytes...

$ perl -e 'print chr 0x192' | hexdump
Wide character in print at -e line 1.
0000000 92c6
0000002

$ perl -CS -e 'print chr 0x192' | hexdump
0000000 92c6
0000002

Notice that the output of these two commands is identical except for the warning (on STDERR\, thankfully). Perl is spitting out the UTF-8-encoded text as-is either way\, but the first one gets a warning because we didn't explicitly set the output to be utf8\, but the second one sees that we made it explicit what encoding to use\, and is thusly happy with the wide characters.

You and others have claimed the output is in bytes -- that's not what it did. It encoded 0x192 as UTF-8 and displayed that.

Nope. Try installing hexdump or some equivalent hex viewer and see what is really showing up.

You can't have it both ways. Either don't convert the 0x192 to UTF-8 on output OR DO convert the 0xE9 -- right now it's broken no matter which way you look at it.

Perl is converting to UTF-8 when it can. The problem with 0xe9 appears to be backwards compatibility​: the idea that code that doesn't deal with unicode shouldn't need to change. The flaw here may simply be that at that time\, unicode terminals might have been rare\, whereas today they're common.

"For example\, "chr(65)" is "A" in either ASCII

or Unicode\, and chr(0x263a) is a Unicode smiley face."

Note the "in ASCII" and "in Unicode". Elsewhere (e.g. to a UTF-8 terminal like yours)\, it means other things.

But my terminal displays unicode. Not UTF-8. UTF-8 is not a character

I think there's a fundamental misunderstanding of the difference between "unicode" and "UTF-8" going on here.

Your terminal displays unicode. It accepts the unicode via the UTF-8 encoding\, because streams only hold bytes\, not characters.

set\, it is an "encoding". Characters are display -- not "encodings". "encodings" are used to transport non-byte information over a byte-streamed data path.

Such as stdout.

So 0x65 -> code point 0x65

Which is not the same as the example above which uses chr(65)\, which is the same as chr(0x41)\, not chr(0x65).

chr(0x65) is not code point 0x65. It's a character with value 0x65. It could mean U+0065\, but it could also mean "violet". Perl does not\, should not and cannot assign any meaning to it.

-- Sorry\, but the man page says it is "A" -- a character in ASCII or UNICODE.

Nope.

Why don't you fix the manpage and stop playing word games.

Calm down\, and try to read what is there in a charitable light. Eric is really a very helpful guy. When someone is misunderstanding you\, the right course of action is to ask what you can do to help correct the misunderstanding\, not to attack someone who obviously knows more about the subject at hand than you and I put together.

The man page is correct.

$ perl -e 'print chr 65' A

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

Darin McBride via RT wrote​:

This is not nearly as straight forward as you might want to think it is.

but not the acute é ?

You didn't print "é". You printed E9. E9 got printed correctly.

No -- I used​:

"foreach my $i (0x65\,0xe9\,0x192) {P "i=%s\, c1=%s"\, $i\, chr($i)}"

which prints "chr(0xe9)"\, not E9. This is the bug in chr that I'm referring to -- the manpage for chr says​: I didn't mean the chars "E" and "9"\, I meant the byte with value 0xE9.

What character is at unicode codepoint 0xe9?

$ perl -e 'print chr 0xe9' | hexdump
0000000 00e9
0000001

Looks like it's printing out properly to me. However\, it doesn't print é\, it prints � Garbage. Is that something perl is doing? Let's check when we do it "right"​:

$ perl -CS -e 'print chr 0xe9' | hexdump
0000000 a9c3
0000002

And here we get our é. Notice that it didn't actually print out 0xe9.
Instead\, it was converted to UTF-8\, as per the -CS\, and printed out that way.
And then the terminal interpreted the UTF-8\, translated it to a unicode code point\, and looked up the glyph to display.

So what happened when we printed out 0xe9 directly? The same thing​: the terminal tried to interpret it as-is\, but because the high bit is set\, it's looking for something else. Since nothing else is printed\, the unicode conversion fails\, and we get a garbage glyph.

You can't even understand your own syntax \-\- and every other point

Actually\, I'm guessing that Eric simply can't understand you.

addressed below is you playing word games -- as though you were deliberately being an obstructive troll.

No\, I'm guessing that Eric is wondering about the same thing in reverse\, but is too polite to say it. You should try not to call people attempting to help you\, even if they're failing\, names of any kind\, too.

So again\, the bug is actually in your code when you tried to print chr(0x192)​: It doesn't map to a byte\, and you didn't tell Perl how it should convert it to a byte. That's why you got a warning. That why it wouldn't have worked in a different terminal. -----

Perl displayed a warning because it didn't know how to convert

UTF-8 -- so it displayed UTF-8 and DID THE RIGHT THING -- if it was

No\, Perl displayed the bytes as-is\, after warning you that it doesn't know what to do with them.

to display to UTF-8 -- IT did the wrong thing if it was to display it as bytes...

$ perl -e 'print chr 0x192' | hexdump
Wide character in print at -e line 1.
0000000 92c6
0000002

$ perl -CS -e 'print chr 0x192' | hexdump
0000000 92c6
0000002

Notice that the output of these two commands is identical except for the warning (on STDERR, thankfully). Perl is spitting out the UTF-8-encoded text as-is either way,

Yes -- and I am saying when it encounters the 0xE9, which it knows to be a wide character (a _unicode_ character), it should spit it out in the UTF-8 encoded text AS WELL!

In one case, it takes the wide char 0x192 and spits out a UTF-8 encoding of it.

In the other case, it takes a wide character 0xe9 and spits it out as binary.

These are inconsistent.

This is the bug I am referring to.

Nope. Try installing hexdump or some equivalent hex viewer and see what is really showing up.

In the previous email, I displayed the full hex output from hexdump.
(hexdump -C)

In examining the hex output​:

00000000  69 3d 31 30 31 2c 20 63 31 3d 65 0a 69 3d 32 33  |i=101, c1=e.i=23|
00000010  33 2c 20 63 31 3d e9 0a 69 3d 34 30 32 2c 20 63  |3, c1=..i=402, c|
                            ^^^ output of E9 in binary (not in UTF-8)
00000020  31 3d c6 92 0a                                   |1=...|
                 ^^^ output of 0x192 -- not in binary, encoded in UTF-8

This is the **default** output with no encodings turned on -- it puts out binary for \u00E9, and UTF-8 for \u0192.

This is the entire point. In default mode, with no flags, it puts out garbage -- it is neither fully binary nor fully UTF-8.

It's half and half, which is what I am calling garbage -- a computer putting out "inconsistent output" = garbage = bug.

Perl is converting to UTF-8 when it can. The problem with 0xe9 appears to be backwards compatibility​: the idea that code that doesn't deal with unicode shouldn't need to change. The flaw here may simply be that at that time\, unicode terminals might have been rare\, whereas today they're common.


  That's fine... code that doesn't want to deal with unicode doesn't need to go back and edit their code and put in "use 5.20" statements.

I'm only proposing this for the future -- and in the future -- I want to see perl be consistent. Not put out garbage in the default case as it does now. (Garbage as defined above!)

But my terminal displays unicode. Not UTF-8. UTF-8 is not a character

I think there's a fundamental misunderstanding of the difference between "unicode" and "UTF-8" going on here.

Now you are playing word games.... how does your next paragraph differ from the following one?

Your terminal displays unicode. It accepts the unicode via the UTF-8 encoding\, because streams only hold bytes\, not characters.

set\, it is an "encoding". Characters are display -- not "encodings". "encodings" are used to transport non-byte information over a byte-streamed data path.

So 0x65 -> code point 0x65

Which is not the same as the example above which uses chr(65), which is the same as chr(0x41), not chr(0x65).

What?

This example​:

"(0x65\,0xe9\,0x192) " -- Those are all hex -- there is no 0x41.

(go back above -- I left it -- look at the code)

p5pRT commented 12 years ago

From @demerphq

Let me say this once again\, there is no bug.

And stop arguing with people that are trying to help you.

When three or four clever people on this list tell you that there is no bug then there is no bug. Repeating over and over your misconceptions does not help you or anyone else.

Attacking the people generously offering their private time to try to help you is completely unacceptable on this mailing list.

If you do not start showing some understanding of the social etiquette of this list then you will get blacklisted from it. That is not something we want to do. But we will if you do not make an effort to get along with the people that inhabit this list.

Yves


-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

demerphq wrote​:

Let me say this once again\, there is no bug.

By bug you mean there is nothing there that isn't "intentional". I'm not disputing that.

And stop arguing with people that are trying to help you.

By argue you mean try to get you to understand something that others have said they don't understand?

By help me\, what do you mean? Do you mean they are helping me to try to get perl to process characters uniformly on output by default?

The fact that perl does not do that by default is my problem. It doesn't generate UTF-8 for characters, and it doesn't generate binary for characters.

It does a little of both depending on the char -- and that's for things it thinks are 'chars' (i.e. it calls them wide chars on output, so it 'knows' they are unicode), yet, by design, it writes them to output inconsistently.

It doesn't write them in UTF-8 encoded format uniformly (it does for chars >= 0x100 and < 0x80, but not for 0x80-0xff). For those, it writes them in binary. This is inconsistent treatment -- and it knows enough to put out a warning, so it's not like there's "no indication" that perl knows it is doing wrong -- it warns you that it is about to generate potentially bogus output, because it knows that it treats 'wide chars' (chars above the ASCII range) inconsistently.

So it is VERY clear that generating corrupt output for wide-chars, in the default case, is *intentional*... in that regard, you might call it not a bug... However, I don't use the term bug to apply only to 'accidents', but also to intentional design constraints and deficiencies built into the product.

The problem is -- perl *knows* when it should be outputting a wide char. It flags it with a warning -- so someone can't use perl in 8 bit mode to process binary data and spit it out to an 8 bit data stream, unless they are careful to not do anything that might flag the data as unicode (otherwise they get warned).

At the same time, perl touts its unicode compatibility, and would be unicode-compatible by DEFAULT if it simply didn't put out a warning about 'wide chars', but instead always wrote out the UTF-8 value of the char -- instead of both generating a warning AND generating inconsistent output.

When three or four clever people on this list tell you ....

Yeah, I'm well aware of this side of the problem:

  http​://en.wikipedia.org/wiki/Illusory_superiority

 
http​://www.newyorker.com/online/blogs/frontal-cortex/2012/06/daniel-kahneman-bias-studies.html

Ability to learn and correct is inversely proportional to self perceived cleverness and knowledge.

Then there's the second problem -- people really hate being 'dis'-illusioned (being parted from the illusion that they were right).

Historically\, despots and tyrants have resorted to killing the messenger (usually proclaiming it's their only option).

But usually such people don't see their behavior as violent or oppressive -- so they don't even know when it applies to them -- more of that not knowing their own weak spots....(urls above)...

p5pRT commented 12 years ago

From @nwc10

On Thu\, Aug 30\, 2012 at 02​:44​:16AM -0700\, Linda W wrote​:

demerphq wrote​:

And stop arguing with people that are trying to help you.

By argue you mean try to get you to understand something that others have said they don't understand?

By help me\, what do you mean? Do you mean they are helping me to try to get perl to process characters uniformly on output by default?

As the fact that perl does not do that by default is my problem. It doesn't generate UTF-8 for characters\, it doesn't generate binary for character.

And this won't change by default.

You can attempt to argue against this until you are blue in the face\, but

So it is VERY clear that generating corrupt output for wide-chars\, in the default case\, is *intentional*... in that regards\, you might call it not a bug... However\, I don't use the the term bug to apply only to 'accidents'\, but also to intentional design constraints and deficiencies built into the product.

but it's a historical compatibility decision that we're not going to change. Yes, I'm happy to agree that this can be called a bug, but it's a wontfix/cantfix bug.

Or\, "fixed in six"

Now, probably, it should be a *fatal* error rather than a warning plus doing the wrong thing on output. I'm not sure if that's viable to change, but it might be.

If you want to be outputting UTF-8\, you need to tell Perl to do this. ie change from the default.

Nicholas Clark

p5pRT commented 12 years ago

From @Leont

When three or four clever people on this list tell you ....

yeah\, I'm well aware this side of the problem​:

http​://en.wikipedia.org/wiki/Illusory_superiority

http​://www.newyorker.com/online/blogs/frontal-cortex/2012/06/daniel-kahneman-bias-studies.html

Ability to learn and correct is inversely proportional to self perceived cleverness and knowledge.

Insulting people who are putting an effort into helping you is exceedingly aggravating. This habit of yours is downright abusive and simply not appropriate on this list, or any other place for that matter.

Then there's the second problem -- people really hate being 'dis'-illusioned (being parted from the illusion that they were right). Historically\, despots and tyrants have resorted to killing the messenger (usually proclaiming it's their only option). But usually such people don't see their behavior as violent or oppressive -- so they don't even know when it applies to them -- more of that not knowing their own weak spots....(urls above)...

Stop playing the victim of a conspiracy\, start taking responsibility for your own actions.

Leon

p5pRT commented 12 years ago

From @demerphq

On 30 August 2012 11​:44\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

demerphq wrote​:

Let me say this once again\, there is no bug.

By bug you mean there is nothing there that isn't "intentional". I'm not disputing that.

And stop arguing with people that are trying to help you.

By argue you mean try to get you to understand something that others have said they don't understand?

By help me\, what do you mean? Do you mean they are helping me to try to get perl to process characters uniformly on output by default?

As the fact that perl does not do that by default is my problem. It doesn't generate UTF-8 for characters\, it doesn't generate binary for character.

We have explained time and again what is going on here. You seem not to listen. I'll try to summarize as best I can here:

A) Perl is terminal agnostic and unaware. For all it knows you are a Klingon using a Klingon terminal with a character set and encoding that it has never in its life encountered.

B) Absent specific requests to do otherwise, as far as Perl is concerned IO is octet level and completely encoding unaware.

C) Unless you tell it otherwise, if you ask Perl to output a string which is flagged as "unicode" and that string contains "wide characters" which would require it to output octets whose values do not correspond 1 to 1 with the codepoints of the unicode string, it warns that it is doing so.

D) Absent requests to do otherwise, chr() outputs a binary string containing one octet for the range 0..255 and a unicode string for codepoints of 256 and above. The internal representation of such a codepoint will be in UTF8 and will be multi-octet.

E) If you concatenate a string containing bytes with a string containing unicode, the bytes are upgraded one by one to their utf8 equivalent.

F) If you wish to ensure that whatever string you output, unicode or binary, is represented as utf8/unicode, you can use IO layers or you can use Encode::encode_utf8 (see the sketch after this list).

G) Your terminal is responsible for rendering whatever output perl provides. It is up to you to ensure that perl provides the encoding your terminal expects. It won't and can't guess.

H) None of this is going to change.
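(A small sketch of the two routes mentioned in (F) -- not from the original mail, and assuming a UTF-8 terminal is on the other end of the handles.)

```
use strict;
use warnings;
use Encode qw(encode_utf8);

# Route 1: an IO layer -- every character string printed to STDOUT
# is encoded to UTF-8 octets for you.
binmode STDOUT, ':encoding(UTF-8)';
print chr 0x192, "\n";

# Route 2: encode explicitly and print the resulting octets to a
# handle that has no encoding layer (STDERR is left raw here).
print STDERR encode_utf8(chr 0x192), "\n";
```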

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @rjbs

* Linda W \perl\-diddler@&#8203;tlinx\.org [2012-08-30T05​:44​:16]

By help me\, what do you mean? Do you mean they are helping me to try to get perl to process characters uniformly on output by default?

He means trying to help you understand exactly what is happening\, and why\, and why it isn't going to change — which it isn't.

Historically\, despots and tyrants have resorted to killing the messenger (usually proclaiming it's their only option).

Oh\, just stop it\, Linda. You are not being killed\, the exasperated people trying to explain the situation are not despots and tyrants\, and the emperor has all his clothes on.

You have a report\, once again\, of some behavior that is annoying and not entirely straightforward. I don't think anybody is going to argue that this is the behavior we'd shoot for if we were designing an all new Perl. It's what we've got\, though\, and for reasons that can't be handwaved away.

They've been explained several times in several ways by several experts. When you print character strings to a file handle\, and the strings contain characters that do not map 1​:1 to bytes\, the file handle needs to know how to map them. That's what encoding layers are for. We are not going to make 'use 5.x' apply layers to STDOUT. For one thing\, it violates the "use VERSION effects must be lexical" rule\, because it would have to work by affecting STDOUT\, a global\, rather than a string operation within a lexical scope like =~. For another\, automatically encoding output only makes sense in some contexts (like dealing with text) and not others (like dealing with binary data\, where the warning really does indicate a bug).
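(Not from the thread -- just a sketch of applying such a layer yourself, since "use VERSION" won't do it for you. The open pragma's :std flag pushes the layer onto the global standard handles; binmode does the same for a single handle.)

```
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));   # STDIN/STDOUT/STDERR now encode/decode UTF-8

print "caf\x{e9}\n";                  # octets 63 61 66 C3 A9 0A, no warning
```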

I feel like addressing the technical issue isn't the point anymore\, though. Nobody really minds explaining why stupid things are stuck being stupid. It's very tiring to see you repeatedly harangue the very people who try to give an explanation. Plenty of the mail in this bug was exasperated-sounding on both sides\, but if you want to know where I think it went off the rails\, it was here​:

  https://rt-archive.perl.org/perl5/Ticket/Display.html?id=114602#txn-1149230

Please stop calling people trolls\, despots\, or deluded. It doesn't accomplish anything but make the idea of trying to respond to your reports less palatable.

-- rjbs

p5pRT commented 12 years ago

From @dmcbride

On Wednesday August 29 2012 10​:41​:46 PM Linda W wrote​:

So 0x65 -> code point 0x65

Which is not the same as the example above which uses chr(65)\, which is the same as chr(0x41)\, not chr(0x65).

What?

This example​:

"(0x65\,0xe9\,0x192) " -- Those are all hex -- there is no 0x41.

(go back above -- I left it -- look at the code)

You missed the example from the docs.

"For example\, "chr(65)" is "A" in either ASCII

Notice that the example is using 65\, not 0x65 (which is 101 in decimal). And you're using 0x65. Of course 0x65 doesn't show up as "A". Because A has a code point of 65 (or 0x41). Expecting 0x65 to show up as "A" when the example you're complaining about is using 65 is where the error is occurring\, not the docs.

The only thing that could possibly be changed in the docs here is to be consistent about using hex vs decimal code points. However, I'm not sure that this should be required - *most* of the time, in my experience, when someone is talking about standard English letters' code points it's in decimal, whereas anything above 255 is generally in hex, so it's consistent with normal methods of addressing these glyphs.

p5pRT commented 12 years ago

From @ikegami

On Wed\, Aug 29\, 2012 at 11​:50 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

You can't even understand your own syntax \-\- and every other point

addressed below is you playing word games -- as though you were deliberately being an obstructive troll.

I used E9 to mean byte E9, and I was consistent throughout. This stupid ad hominem attack is demonstrably false.

So again\, you told Perl to print E9\, and Perl printed E9. No bug there.

No... I told it to print chr(0xE9).... that's what the code says.

Yes\, the file handle takes bytes\, so you told Perl to print E9.

Perl displayed a warning because it didn't know how to convert

UTF-8

It warned because you asked to print bytes and you didn't provide a byte.

so it displayed UTF-8 and DID THE RIGHT THING -- if it was to display to UTF-8 -- IT did the wrong thing if it was to display it as bytes...

Two wrongs made a right\, so?

You and others have claimed the output is in bytes -- that's not what it did.

What do you think convert to UTF-8 means? It means taking text and serialising it into *bytes*.

You really should understand the basics of a topic before discussing it\, or at least try to understand them.

It encoded 0x192 as UTF-8 and displayed that.

Almost\, it encoded using utf8\, not UTF-8.

You can't have it both ways. Either don't convert the 0x192 to UTF-8 on output OR convert the 0xE9 -- right now it's broken no matter which way you look at it.

That's like saying C<<undef + 5>> and C<<4 + 5>> is inconsistent because undef gets converted and 4 doesn't. There's no inconsistency.

chr(0xE9)'s handling is correct, so that means you're saying chr(0x192)'s is wrong. What would you rather happen? A fatal error instead of a non-fatal one?

Neither of these can be changed.

You seem to think that because you don't want it\, it can't be done.

No\, it can't because it's perfectly acceptable for Perl to output stuff other than text\, and you would break that.

"For example\, "chr(65)" is "A" in either ASCII or Unicode\, and chr(0x263a) is a Unicode smiley face."

Note the "in ASCII" and "in Unicode". Elsewhere (e.g. to a UTF-8 terminal like yours)\, it means other things.

But my terminal displays unicode. Not UTF-8. UTF-8 is not a character set\, it is an "encoding". Characters are display -- not "encodings". "encodings" are used to transport non-byte information over a byte-streamed data path.

I'm baffled. If you understand this\, how can you make all the other mistakes?

So 0x65 -> code point 0x65

chr(0x65) is not code point 0x65. It's a character with value 0x65. It could mean U+0065\, but it could also mean "violet". Perl does not\, should not and cannot assign any meaning to it.

Sorry\, but the man page says it is "A" -- a character in ASCII or UNICODE

65 is "A" in ASCII 65 is "A" in Unicode 65 is not "A" in zip 65 is not "A" in fuel sensor reading

A file handle doesn't know which one of those it is. It sees 65, not "A".

Why don't you fix the manpage and stop playing word games.

Why? The man page isn't incorrect. It's your assumption that all strings are ASCII or Unicode that's wrong. Like I said -- twice -- note the "in ASCII" and "in Unicode". Those are far from the only two kinds of strings.

p5pRT commented 12 years ago

From @ikegami

On Thu\, Aug 30\, 2012 at 12​:32 AM\, Darin McBride \dmcbride@&#8203;cpan\.org wrote​:

You can't have it both ways. Either don't convert the 0x192 to UTF-8 on output OR DO convert the 0xE9 -- right now it's broken no matter which way you look at it.

Perl is converting to UTF-8 when it can. The problem with 0xe9 appears to be backwards compatibility

E9 being output as E9 is perfectly sane and useful, not something we're merely obliged to keep for backwards compatibility.

Or maybe you were referring to the lack of a default encoding as something we need to keep for backwards compatibility. True that.

Otherwise\, great post!

p5pRT commented 12 years ago

From @ikegami

On Thu\, Aug 30\, 2012 at 1​:41 AM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Yes -- and I am saying when it encounters the 0xE9 which it knows to be a wide character (a _unicode_ character)\, it should spit it out in the UTF-8 encoded text AS WELL!

The proper terminology is "Unicode codepoint\," not "Unicode character"; there's no such thing as a "Unicode character".

Aside from that\, there are two major errors in that passage.

1. E9 is not a wide character. A wide character is a character outside of 00-FF. Perl actually knows that it's NOT a wide character\, which is why it knows it can correctly output it\, which is why it doesn't issue a warning.

2. Perl has no idea it's a Unicode codepoint. You think strings can only contain Unicode text? That's ridiculous. Counter examples​:

  $not_unicode1 = encode_utf8("\x{9000}");                  # UTF-8
  $not_unicode2 = pack('C4', split /\./, '192.168.1.233');  # Packed IP address
  $not_unicode3 = pack('n*', 0x53E9, 0x3453);               # Sensor readings.
  $not_unicode4 = pack('C/a*', "x" x 233);                  # "Pascal" string

Everything else has already been said.

- Eric

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

Eric Brine via RT wrote​:

On Thu\, Aug 30\, 2012 at 1​:41 AM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

Yes -- and I am saying when it encounters the 0xE9 which it knows to be a wide character (a _unicode_ character)\, it should spit it out in the UTF-8 encoded text AS WELL!

1. E9 is not a wide character. A wide character is a character outside of 00-FF. Perl actually knows that it's NOT a wide character\, which is why it knows it can correctly output it\, which is why it doesn't issue a warning.


1) You are ignoring the FACT that previous examples showed the warning for chr(0xe9).

2) You ignore the perl source tree at your peril:

(fr. sv.c)

  if (!utf8_to_bytes(s, &len)) {
      if (fail_ok)
          return FALSE;
      else {
          if (PL_op)
              Perl_croak(aTHX_ "Wide character in %s",
                         OP_DESC(PL_op));
          else
              Perl_croak(aTHX_ "Wide character");
      }

Perl defines something to be wide if it can't fit in 1 byte after doing a UTF8-decode.

(see 'sv.c' -- Please don't assume that I am less capable of looking at the source code than you. After you made the claim of knowing perl internals VERY well\, you try to pull something like this?.... Did you just over-inflate your evaluation of your knowledge or was this a deliberate attempt at deception?...I'll presume it was an honest mistake. Knowing your style\, I'm sure we can both assume that was the case.).

  #define isASCII(c) ((WIDEST_UTYPE)(c) < 128)
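(An illustrative sketch, not part of the original mail: the same "does it fit in one octet?" test is available from Perl space via utf8::downgrade, which fails exactly for the characters the warning calls "wide".)

```
use strict;
use warnings;

for my $cp (0x65, 0xE9, 0x192) {
    my $s = chr $cp;
    # utf8::downgrade($s, $fail_ok) returns false when $s holds a
    # character above 0xFF, i.e. one that cannot be a single octet.
    printf "U+%04X %s\n", $cp,
        utf8::downgrade($s, 1) ? "fits in one octet" : "is a wide character";
}
```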

2. Perl has no idea it's a Unicode codepoint. You think strings can only contain Unicode text? That's ridiculous.

The above code indicates when perl 'thinks' it is a unicode code point.

Your examples are irrelevant.

Counter examples​:

$not_unicode1 = encode_utf8("\x{9000}"); # UTF-8 $not_unicode2 = pack('C4'\, split /\./\, '192.168.1.233'); # Packed IP address $not_unicode3 = pack('n*'\, 0x53E9\, 0x3453); # Sensor readings. $not_unicode4 = pack('C/a*'\, "x" x 233); # "Pascal" string

Everything else has already been said.

- Eric

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

Darin McBride via RT wrote​:

On Wednesday August 29 2012 10​:41​:46 PM Linda W wrote​:

So 0x65 -> code point 0x65

Which is not the same as the example above which uses chr(65)\, which is the same as chr(0x41)\, not chr(0x65).

What?

This example​:

"(0x65\,0xe9\,0x192) " -- Those are all hex -- there is no 0x41.

(go back above -- I left it -- look at the code)

You missed the example from the docs.

"For example\, "chr(65)" is "A" in either ASCII

Notice that the example is using 65, not 0x65 (which is 101 in decimal).

Error... indefinite reference @ "the example"...

I included an example (for reference​:)

  foreach my $i (0x65, 0xe9, 0x192) {
      my $c1 = chr($i);
      P "i=%s, c1=%s", $i, $c1;
  }

  i=101, c1=e
  i=233, c1=ide character in print at /usr/lib/perl5/5.14.2/x86_64-linux-thread-multi/IO/Handle.pm line 417.
  i=402, c1=ƒ

in the referenced note, AND I included the quote from the manpage. It would be clearer, when you refer to '65', to say whether you are talking about the manpage's use of 65 or the example in the manpage, to differentiate it from the example code I included.

(There is obviously no reason why anyone would become confused with such multiple examples floating around\, or indefinite references... ;-)).

p5pRT commented 12 years ago

From @ikegami

On Thu\, Aug 30\, 2012 at 9​:37 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

**

Perl defines something to be wide if it can't fit in 1 byte after doing a UTF8-decode.

First\, it's "decoding". At a high level\, it's downgrading. At a low level\, there's no such thing as decoding; it's changing the storage format (encoding) to UTF8=0 (bytes).

Which is exactly what I said. A wide char is a char that's not a byte.

2. Perl has no idea it's a Unicode codepoint. You think strings can

only contain Unicode text? That's ridiculous.

The above code indicates when perl 'thinks' it is a unicode code point.

No\, utf8 has nothing to do with Unicode. Don't confuse utf8 and UTF-8.

p5pRT commented 12 years ago

From @ikegami

On Thu\, Aug 30\, 2012 at 9​:53 PM\, Eric Brine \ikegami@&#8203;adaelis\.com wrote​:

On Thu\, Aug 30\, 2012 at 9​:37 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

**

Perl defines something to be wide if it can't fit in 1 byte after doing a UTF8-decode.

First\, it's "decoding". At a high level\, it's downgrading. At a low level\, there's no such thing as decoding; it's changing the storage format (encoding) to UTF8=0 (bytes).

Which is exactly what I said. A wide char is a char that's not a byte.

2. Perl has no idea it's a Unicode codepoint. You think strings can

only contain Unicode text? That's ridiculous.

The above code indicates when perl 'thinks' it is a unicode code point.

No\, utf8 has nothing to do with Unicode. Don't confuse utf8 and UTF-8.

utf8 is how Perl stores a string of UV. "UV" stands for Unsigned value\, not Unicode value.

p5pRT commented 12 years ago

From @ikegami

On Thu\, Aug 30\, 2012 at 9​:53 PM\, Eric Brine \ikegami@&#8203;adaelis\.com wrote​:

On Thu\, Aug 30\, 2012 at 9​:37 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

**

Perl defines something to be wide if it can't fit in 1 byte after doing a UTF8-decode.

First\, it's "decoding". At a high level\, it's downgrading. At a low level\, there's no such thing as decoding; it's changing the storage format (encoding) to UTF8=0 (bytes).

Ack\, that should read "it's NOT decoding".

Which is exactly what I said. A wide char is a char that's not a byte.

2. Perl has no idea it's a Unicode codepoint. You think strings can

only contain Unicode text? That's ridiculous.

The above code indicates when perl 'thinks' it is a unicode code point.

No\, utf8 has nothing to do with Unicode. Don't confuse utf8 and UTF-8.

p5pRT commented 12 years ago

From @ikegami

On Thu\, Aug 30\, 2012 at 9​:37 PM\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

**

1\) you are ignoring the FACT that previous examples showed the warning

for chr(0xe9).

No\, you never got the warning for 0xE9. You got the warning for 0x192.

2) You ignore the perl source tree at your peril​:

You're just complicating things a lot by bringing internal storage formats into the discussion. It's needless, and it's just going to introduce equivocation fallacies.

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

Leon Timmermans wrote​:

When three or four clever people on this list tell you ....
yeah\, I'm well aware this side of the problem​:

http​://en.wikipedia.org/wiki/Illusory_superiority

http​://www.newyorker.com/online/blogs/frontal-cortex/2012/06/daniel-kahneman-bias-studies.html

Ability to learn and correct is inversely proportional to self perceived cleverness and knowledge.

Insulting people who are putting an effort into helping you is exceedingly aggravating. This habit of yours is downright abusive and simply not appropriate on this list, or any other place for that matter.


It wasn't intended as an insult. It was presenting data backed by evidence. It really wasn't intended personally -- it describes a group of people who have certain characteristics -- most certainly those who have a high self-perception of their ability, talent and/or knowledge.
It describes a tendency of *EVEN* those that ARE bright to be blind to their blind spots -- and it gets much worse as the discrepancy increases between what the person actually knows and what they think they know.

I'm sorry if you took it personally - it wasn't intended as such.

For myself, I am aware I know only a little about multiple things; however, in my experience, the number of things and the amount is usually enough to put most people to sleep if I go into it too much. But I would definitely NOT think nor claim to know more than someone who is truly a master of their field (usually I find that they are the ones that don't claim such -- you find out in talking to them. Those that repeatedly tell you are more often trying to convince you for purposes of some point or argument).

Now anyone can take that as an insult if they want\, or not. It's not personally directed at anyone. But if it offends you\, I would suggest that perhaps it is touching on an uncomfortable truth.

Stop playing the victim of a conspiracy\, start taking responsibility for your own actions.


  So far Nicholas Clark has been the only person who I'd say has been on the level and truthful.

That doesn't mean I necessarily agree with his stance\, but he is someone with whom I **could** discuss the problem that I was trying to show/demonstrate/discuss for the past week or more (months if you count earlier discussions).

That he would 'get' what I was talking about in 1 response -- that's someone who is able to communicate (bidirectionally) without me feeling like someone is playing word games.

  You (Leon) at least got the essential function of 'P' without me feeling like I was talking to people who had no clue about programming or perl, or who were so bothered by it that they couldn't understand the point I was trying to make. I felt the details and exact implementation of P would simply be another side track for discussion about its internals, when they had nothing to do with the point I was making.

And folks\, given the directions this has gone off on -- when Nick summed up the issue in 1 note -- you know\, I'm right. Anything and everything was picked at about how I said this or not having crossed a 't' or dotted an 'i'.

Eric went off on E9 vs. 0xe9... and my point wasn't about my thinking I was writing "E9" vs. "\xe9", but that I was using 'chr(0xe9)'... which I would expect to produce different output than if I did a printf("%c", 0xe9) (cf. printf("%c", chr(0xe9))).

  I could have posted the module, but I felt it would detract focus from the essential issue: perl -- instead of doing something useful with the output -- throws up a warning (and maybe even an error someday).

Instead of throwing a warning on a wide char and then corrupting output, it could do what it does for every OTHER wide char not in the 0x80-0xff range -- and put out its unicode representation.

Whereas -- I knowingly, **for this example**, didn't set UTF-8 -- this isn't a problem in most of my programs... BUT it comes up frequently enough because there are many gotchas in perl related to this problem.

Having perl knowingly do the wrong thing (as it does now), or having it die altogether when it has a good idea of what the user wants, cannot be called "serving backwards compatibility".

It's a warning that you want to make an error? How can that be backwards compatible with any code?

I assert that people refer back to basic perl design philosophy​: DWIM.

  If this was cobol or fortran, I'd expect it to stay broken on principle / standard. But being 'hard assed' and deliberately throwing errors and warnings on output AND corrupting it to make sure people are screwed -- rather than following perl's internal design that would normally auto-convert to the right format -- is not DWIM.

Compare​:

  my $a="42"; $b="43"; my $c=$a+$b; print "c=$c"; c=85

Do you get a warning for string to integer to string conversion?

It happens automatically.

Why generate a warning when printing a wide char out to a terminal -- why not assume the user has a terminal that prints in unicode and just print it like you do with the string? You don't print "Warning integer encountered in string" or "strings encountered in addition".

"Perl is about helping you get from here to there with minimum fuss and maximum enjoyment." What about generating warnings and then converting output inconsistently is either?

"...One of the things that changes is how the community thinks Perl should behave by **default** [emphasis mine]. (This is in conflict with the desire for Perl to behae as it always did.)".... so added was strict\, threads came and morphed ... "Other things have come or gone.
Some experiments didn't work out and we took them out of Perl\, replacing them with other experiments. Pseudohashes\, for instance..."... (Camel)

The point is perl changes, and changing to default to Unicode would be a move toward the future that wouldn't hurt compatibility -- as it's already an "illegal case". I simply propose to make it put out UTF-8 output and be consistent ACROSS its character set -- because right now, it throws out a warning and only converts wide chars < 0x100 (and > 0x7f) to binary -- the rest IT IS ALREADY PUTTING OUT IN UTF-8. So why the "deadzone" in 0x7f-0xff? It doesn't work without warnings in any program today. If some chase their kneejerk reactions, it won't work at all -- so it CAN'T be for compatibility.

What is the point?

to be something that tried to "Do what you meant" -- it was a stated design philosophy.

This isn't about compatibility -- as it already warns anyone who would try to use the feature set the way I am describing it wouldn't be able to without suppressing warnings.

demerphq wrote:

On 30 August 2012 11​:44\, Linda W \perl\-diddler@&#8203;tlinx\.org wrote​:

demerphq wrote​:

Let me say this once again\, there is no bug.

By bug you mean there is nothing there that isn't "intentional". I'm not disputing that.

And stop arguing with people that are trying to help you.

By argue you mean try to get you to understand something that others have said they don't understand?

By help me\, what do you mean? Do you mean they are helping me to try to get perl to process characters uniformly on output by default?

As the fact that perl does not do that by default is my problem. It doesn't generate UTF-8 for characters, it doesn't generate binary for characters.

demerphq wrote​:

C) Unless you tell it otherwise If you ask Perl to output a string which is flagged as "unicode" and that string contains "wide characters" which would require it to output octets whose values do not correspond 1 to 1 with the codepoints of the unicode string it warns that it is doing so.


Exactly.

D) absent requests to do otherwise chr() outputs a binary string containing one octet for the range 0..255 and a unicode string for codepoints of 256 and above. The internal representation of this codepoint will be in UTF8 and will be multi-octet.

p5pRT commented 12 years ago

From @nwc10

On Fri\, Aug 31\, 2012 at 12​:26​:42AM -0700\, Linda W wrote​:

So far Nicholas Clark has been the only person who'd say has been on the level and truthful.

That doesn't mean I necessarily agree with his stance\, but he is someone with whom I **could** discuss the problem that I was trying to show/demonstrate/discuss for the past week or more (months if you count earlier discussions).

That he would 'get' what I was talking about in 1 response\, -- that's
someone who is is able to communicate (bidirectionally) me feeling like someone is playing word games.

Thanks for the compliment, *but* I fear that it may not be totally justified.

Many of your initial bug reports are very hard to understand\, in terms of what the actual problem is\, and often it has taken someone a series of e-mail exchanges to dig down to find it. I just happened to pounce on something that had been refined to the point of coming clear.

You \(Leon\) at least got what the essential function of 'P' with out 

me feeling like I was talking to people who had no clue of programming or perl or had it bother them so much they couldn't understand the point I was trying to make. I felt the details and exact implementation of P would simply be another side track for discussion about it's internals when they had nothing to do with the point I was making.

Leon does know what he is doing.

And folks\, given the directions this has gone off on -- when Nick summed up the issue in 1 note -- you know\, I'm right. Anything and everything was picked at about how I said this or not having crossed a 't' or dotted an 'i'.

There is something that we all agree on here. Once we find a description of it that we agree on. A big problem seems to be refining the mutual understanding of what the problem is\, and what is incidental.

Instead of throwing a warning\, on a wide char and then corrupting output\, it could do what it does for every OTHER wide char not in the 0x80-0xff range -- and put it it's unicode representation.

Whereas -- I knowingly\, **for this example*** didn't set UTF-8 -- this isn't a problem in most of my programs... BUT it comes up frequently enough because there are many gotches in perl related to this problem.

Having perl knowingly do the wrong thing (as it does now)\,or having it die altogether when it has a good idea of what the user wants -- cannot be called as something "serving backwards compatibility".

It's not *unambiguously* doing the wrong thing.

What sequence of octets should this output?

  perl -le 'print chr 233'

Because at the time the print op runs\, all it can know is that it has been offered up this scalar to print​:

SV = PV(0x100801c98) at 0x100814330
  REFCNT = 1
  FLAGS = (PADTMP,POK,READONLY,pPOK)
  PV = 0x100224ed0 "\351"\0
  CUR = 1
  LEN = 16
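(For reference -- an assumption on my part, since the mail doesn't say how the dump was produced, but Devel::Peek gives output in this form:)

```
use Devel::Peek;
Dump(chr 233);    # prints an SV dump like the one above to STDERR
```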

It's a warning that you want to make an error? How can that be backwards compatible with any code?

The change you suggest of outputting UTF-8 sequences is *also* not backwards compatible with all code.

$ ~/Sandpit/5005_03g/bin/perl -le 'print chr 233' | od -h
0000000 0ae9
0000002

It's impossible to get this right all the time. Knowing *that*\, then silently continuing with a mess is worse than alerting the system via an error.

I assert that people refer back to basic perl design philosophy​: DWIM.

If this was cobol or fortran\, I'd expect it to stay broken on 

principle / standard. But being 'hard assed' and deliberately throwing errors and warnings on output AND corrupting it to make sure they are screwed -- rather than following perl's internal design that would normally auto-convert to the right format.

It's impossible to know what the right format is\, when all you have is a string containing code points between 0 and 255. Was this part of some JPEG file? Was this text?

Why generate a warning when printing a wide char out to a terminal -- why not assume the user has a terminal that prints in unicode and just print it like you do with the string? You don't print "Warning integer encountered in string" or "strings encountered in addition".

Are you proposing that Perl should keep the old default of outputting octet sequences when outputting to "not a terminal"\, and outputting UTF-8 when outputting to a terminal?

So this should output 3 octets\, 0xc3\, 0xa9 and 0x0a?

  perl -le 'print chr 233'

In which case\, should this also output in UTF-8\, or as bytes?

  perl -le 'print chr 233' | sed -e s/P/p/

because that ends up on the terminal.

But this doesn't end up on a terminal​:

  perl -le 'print chr 233' | sed -e s/P/p/ > foo

so how is the perl binary to know when to switch?

The point is perl changes and changing to default to Unicode would be a move toward the future that wouldn't hurt compatibility -- as it's already an "illegal case". I simply propose to make it put out UTF-8 output and be consistent ACROSS it's characters set -- because right now\, it throws out a warning and only converts wide chars \<0x100 (& >0x7f) to binary -- the rest IT's ALREADY PUTTING OUT IN UTF-8. So why the "deadzone" in 0x7f-xff?

0x80-0xFF actually.

Because historically, in a fixed width 8 bit world, if your line terminator is a single "\n", it didn't matter whether your input data was ASCII, some 8 bit extension of ASCII, or binary data. So neither Perl programs nor the perl internals really cared which it *actually* was - it read it in, processed it, and wrote it out. "Characters" were only in the range 0x00-0xFF.

The problem came with 5.6.0\, which attempted to add Unicode semantics in without breaking anything else. Characters were now in the range 0x00-0xFFFFFFFF. And it's completely ambiguous when code writes​:

  chr $c;

where $c happens to have the value 233\, whether it meant a lower case e acute\, or a byte of binary data 0xE9

And 5.6.0 made two very bad assumptions

1) that it could "wing it" and generally ignore that ambiguity
2) that it didn't yet need to worry about I/O

(or possibly that the world was either all UTF-8\, or all what-it-was-before)

To keep the same behaviour as 5.005 and earlier, it *had* to output octet sequences for a chunk of data with things in the range 0x00-0xFF, with everything else left at the default, because the default was "some 8 bit superset of ASCII".

So to be compatible\, this *has* to print out two octets\, 0xE9 0x0A​:

  $ perl -le 'sub out { print chr shift }; out(233);'

because it has to map (internal Perl) characters directly to octets in the file.

Which leaves the problem of what to do when being asked to output something outside the range 0x00-0xFF.

Down a filehandle\, everything can only be an octet in the range 0x00-0xFF. Code points outside that range simply don't fit.

So now\, consider this program​:

  $ perl -le 'sub out { print chr shift }; out(233); out(1234);'

After the first call to out()\, it's ALREADY WRITTEN OUT the octets\, 0xE9 0x0A. It can't go back and undo that. What does it do now?

5.6.0 chose to cheat by outputting the UTF-8 representation of the code points\, and issuing a warning.

But really\, it should be an I/O error\, because at this point there's no way to correctly output the code point 1234 down a stream which is using octets 1​:1 to code points.
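(A sketch of how that ambiguity is avoided today, assuming the program declares its intent up front: once the handle has an :encoding layer, out(233) and out(1234) are encoded the same way, so there is nothing to retroactively undo and no warning.)

```
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

sub out { print chr shift, "\n" }

out(233);    # octets C3 A9 0A  (é)
out(1234);   # octets D3 92 0A  (U+04D2)
```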

It doesn't work without warnings in any program today. If some chase their kneejerk reactions\, it won't work at all -- so it CAN'T be for compatibility.

It is for compatibility. As explained above.

Also\, perl 5.8.0 changed the default so that if it was running in a UTF-8 locale\, it would default all file handles to UTF-8.

This turned out to break a *lot* of things. Badly. It was reverted for 5.8.1

There is no easy answer here.

Nicholas Clark

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

Nicholas Clark via RT wrote​:

Having perl knowingly do the wrong thing (as it does now)\,or having it die altogether when it has a good idea of what the user wants -- cannot be called as something "serving backwards compatibility".

It's not *unambiguously* doing the wrong thing.

What sequence of octets should this output?

perl \-le 'print chr 233'

  If the user has said "interpret this as a character" -- which is what the code appears to do -- it should print it out as code point \u00e9.

When perl recognizes it is still talking to an 8-bit pipe, it knows there is a type mismatch. At that point, it should do a conversion -- just like $a=2+5; print "a=$a" prints out '7' and not the string "2+5". Perl does the same thing when you mix numbers and strings -- it converts them without warning:

  perl -wle '$a="2.5"; print "b=${\(1+$a)}"'

string -> float -> string, all without a warning -- it just "DWIM"s...

Perl has taken things that would cause warnings and errors in other languages and tried to do the right thing to make our lives easy.

Why not the same default for text processing -- automatically marshal internal chars?

Because at the time the print op runs\, all it can know is that it has been offered up this scalar to print​:

SV = PV(0x100801c98) at 0x100814330
  REFCNT = 1
  FLAGS = (PADTMP,POK,READONLY,pPOK)
  PV = 0x100224ed0 "\351"\0
  CUR = 1
  LEN = 16

It's a warning that you want to make an error? How can that be backwards compatible with any code?

The change you suggest of outputting UTF-8 sequences is *also* not backwards compatible with all code.

$ ~/Sandpit/5005_03g/bin/perl -le 'print chr 233' | od -h
0000000 0ae9
0000002

It's impossible to get this right all the time. Knowing *that*\, then silently continuing with a mess is worse than alerting the system via an error.


  People used to argue that about mixing strings and integers as well. Then it was decided it was more of a pain than not -- let the computer sort it out.

If in doubt, look at the LC vars... that's why they are there. Perl runs off of libc, no? I would somewhat expect it to benefit from the LC_ vars and usage automatically, if I was a user -- and if I needed 'C' semantics, I'd use LC_ALL='C'. We have to do that now -- or minimally LC_COLLATE='C', so simple ranges will work correctly in POSIX.

But I'm only throwing that out as an option\, as it would also solve the problem.

I assert that people refer back to basic perl design philosophy​: DWIM.

If this was cobol or fortran\, I'd expect it to stay broken on 

principle / standard. But being 'hard assed' and deliberately throwing errors and warnings on output AND corrupting it to make sure they are screwed -- rather than following perl's internal design that would normally auto-convert to the right format.

It's impossible to know what the right format is\, when all you have is a string containing code points between 0 and 255. Was this part of some JPEG file? Was this text?


  If I used "chr" on it\, then it's pretty clear I'm wanting to treat it as as character data.

Why generate a warning when printing a wide char out to a terminal -- why not assume the user has a terminal that prints in unicode and just print it like you do with the string? You don't print "Warning integer encountered in string" or "strings encountered in addition".

Are you proposing that Perl should keep the old default of outputting octet sequences when outputting to "not a terminal", and outputting UTF-8 when outputting to a terminal?

When perl writes to a character device, it encodes wide output into characters.

When perl writes to a block device -- it doesn't.

Seems like a safe dividing line.
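
(That dividing line is at least something perl can already see -- a quick sketch using the file test operators:)

# -c: character special file, -b: block special file,
# -t: opened on a tty, -p: pipe or FIFO
printf "STDOUT: char=%s block=%s tty=%s pipe=%s\n",
    map { $_ ? "yes" : "no" } (-c STDOUT, -b STDOUT, -t STDOUT, -p STDOUT);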

(or follow the encoding in the user's environment -- i.e. use the LC_ var standards and not require a special case for the user using 'Perl'...)

I most commonly do things with perl processing files and filenames -- everything else that works with such uses standard localization vars.
Why not perl?

When I want to write binary -- I go out of my way to make sure it is binary -- I "use bytes". I don't rely on the default, because the default will not write 0x100 as \x00\x01 (as it would if it were doing binary), but UTF-8 encoded. I don't trust perl to process binary data unless I tell it to -- if I'm on some OS that turns "\n" into "\r\n" or the like, not specifying 'binary' when you really need binary is just not reliable -- and I know I'm not going to find out my binary is corrupted until too late... so to make sure I get it right, I will either use bytes, or turn off line-processing and use a slurp.
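
(Spelled out, that "go out of my way" route looks roughly like this sketch -- using binmode and pack rather than "use bytes", which is the more usual spelling these days, and with a made-up filename:)

open my $fh, '>', 'out.bin' or die "open: $!";   # 'out.bin' is just an example name
binmode $fh, ':raw';                             # no encoding layer, no newline translation
print {$fh} pack 'v', 0x100;                     # 0x100 marshalled as two octets: 0x00 0x01
close $fh or die "close: $!";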

I've used enough systems that corrupt binary data if read through a line-oriented interface, to be 'gun-shy' -- even when it might work on my current system. I don't know what OS I might end up on next. But if it is text, I'll have to trust that the OS knows how to display text on itself...

So this should output 3 octets, 0xc3, 0xa9 and 0x0a?

perl -le 'print chr 233'

In which case, should this also output in UTF-8, or as bytes?

perl -le 'print chr 233' | sed -e s/P/p/

because that ends up on the terminal.


char dev v. block dev.

If perl goes with using the LC_ vars -- which might be more work, but more flexible -- and the output encoding is 'C', then perl's current behavior of printing a char of 0x100 as UTF-8 bytes would be wrong. If it is simply printing 'bytes', it would do exactly what 0x100 would print out as in 'C' -- \x00\x01.

Because historically, in a fixed-width 8 bit world, if your line terminator is a single "\n", it didn't matter whether your input data was ASCII, some 8 bit extension of ASCII, or binary data. So neither Perl programs nor the perl internals really cared which it *actually* was - it read it in, processed it, and wrote it out. "Characters" were only in the range 0x00-0xFF.


characters were only in the range 0x00-0x7f, and the 0x80 bit was used when code pages came -- a transition period between ASCII and Unicode. They were a mistake, one that is no longer necessary and one that perl no longer needs to carry as a burden. This is over a decade into the 21st century. Unicode was born in the early 90's as code pages were dying. It would be bad form for perl 5.20 to default to >2 decade old standards. But if you are talking historically, 0x80 was used for parity bits or as a Meta or escape bit on some terminals. It wasn't part of the character -- even many utils on the internet were not 8-bit clean. HTML wasn't 8-bit clean -- though HTML-5's **default** charset is 8-bit.

perl 5.20 can default to today's standard.

The problem came with 5.6.0, which attempted to add Unicode semantics in without breaking anything else. Characters were now in the range 0x00-0xFFFFFFFF. And it's completely ambiguous when code writes:

chr $c;

where $c happens to have the value 233, whether it meant a lower case e acute, or a byte of binary data 0xE9.
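
(The ambiguity in miniature -- the resulting one-character string is identical whichever was meant:)

my $c = 233;
my $s = chr $c;                                   # e-acute as text, or the raw octet 0xE9?
printf "len=%d ord=0x%X\n", length $s, ord $s;    # len=1 ord=0xE9 either way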

And 5.6.0 made two very bad assumptions:

1) that it could "wing it" and generally ignore that ambiguity
2) that it didn't yet need to worry about I/O

(or possibly that the world was either all UTF-8, or all what-it-was-before)


  Well, I don't know how much better I would have done -- I was an early adopter of UTF-8, and I've been hitting teething pains in just about every piece of software out there for well over 10 years... painful. Vim, terminals, charsets, Windows, sendmail, perl, shell, et al, and I've been in discussions similar to this one -- though not usually as 'tedious' -- with at least half a dozen other products that all eventually went UTF-8, but I won't say that any of their decisions to do so had anything to do with my influence... maybe in spite of my 'presentation skills'... ;-)...

To keep the same behaviour as 5.005 and earlier, it *had* to output octet sequences for a chunk of data with things in the range 0x00-0xFF, with that being the default for everything else, because the default was "some 8 bit superset of ASCII".

So to be compatible, this *has* to print out two octets, 0xE9 0x0A:

$ perl -le 'sub out { print chr shift }; out(233);'

because it has to map (internal Perl) characters directly to octets in the file.
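
(Easy enough to confirm; assuming a GNU-ish od is on hand, this should show exactly those two octets:)

$ perl -le 'sub out { print chr shift }; out(233);' | od -An -tx1
 e9 0a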

Which leaves the problem of what to do when being asked to output something outside the range 0x00-0xFF.

Down a filehandle, everything can only be an octet in the range 0x00-0xFF. Code points outside that range simply don't fit.

So now, consider this program:

$ perl -le 'sub out { print chr shift }; out(233); out(1234);'

After the first call to out(), it's ALREADY WRITTEN OUT the octets, 0xE9 0x0A. It can't go back and undo that. What does it do now?


  See above -- either 1) treat all chars as needing UTF-8 encoding, or 2) look at LC and see if the locale is 'C' or has UTF-8 in it... and adjust output accordingly. If 'C', it would print out \xd2\x04 (1234 = 0x04d2). If it was in UTF-8, then the 233 would have printed out as 0xc3 0xa9.

---Either that, or fix it so it always prints out the binary value of the code point -- not UTF-8 encoded... It wouldn't be optimal, BUT it would produce consistent output... there would never be a case where perl switches output encoding mid-stream, as it does now.

5.6.0 chose to cheat by outputting the UTF-8 representation of the code points, and issuing a warning.

But really, it should be an I/O error, because at this point there's no way to correctly output the code point 1234 down a stream which is using octets 1:1 to code points.
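
(The cheat, and the mid-stream switch it causes, is easy to see; on a stock perl something like the following is expected -- one octet for the 233, a UTF-8 pair for the 1234, plus the warning:)

$ perl -le 'print chr 233; print chr 1234' >out 2>err
$ od -An -tx1 out
 e9 0a d3 92 0a
$ cat err
Wide character in print at -e line 1.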


  The correct way to put it out is to marshal it as a 2-byte string and put out the bytes.

  I wouldn't put the first 128 chars out one way, and all other binary values another -- as that gives you the worst of both worlds -- something that won't put out straight UTF-8, and something that won't put out the straight binary representation....

Also, perl 5.8.0 changed the default so that if it was running in a UTF-8 locale, it would default all file handles to UTF-8.

This turned out to break a *lot* of things. Badly. It was reverted for 5.8.1


  I think when people did that, they were not used to having their locales set correctly -- now, they'd better be, or their shell and other things will behave unexpectedly with standard programs. Perl is one of the few major text-processing programs that doesn't use the LOCALE by default. I think at the time it sadly jumped too fast, but being burnt by moving too fast should not prevent doing something similar now, when, due to those LOCALEs, perl is the odd prog out -- it no longer follows the ordering of sort or the shell in a standard distribution, as they don't come set up in the 'C' locale these days......

There is no easy answer here.


  Well, growth is often awkward.... ask any teenager. Old programs that are not touched can remain the way they are. I'm fine at this point with limiting it to only programs that opt in via "use 5.20.0"; maybe by 5.30.0 (if there is such), it would become the default regardless -- as other keywords intro'ed in 5.10 might eventually become default "in there"....

  But certainly, these decisions and discussions are not easy... no disagreement there! ;-)

p5pRT commented 12 years ago

From @ikegami

On Fri, Aug 31, 2012 at 3:55 PM, Linda W <perl-diddler@tlinx.org> wrote:

** When perl writes to a character device, it encodes wide output into characters. When perl writes to a block device -- it doesn't.

What about pipes, files and sockets?

p5pRT commented 12 years ago

From @csjewell

On Fri, Aug 31, 2012, at 14:58, Eric Brine wrote:

On Fri, Aug 31, 2012 at 3:55 PM, Linda W <perl-diddler@tlinx.org> wrote:

** When perl writes to a character device, it encodes wide output into characters. When perl writes to a block device -- it doesn't.

What about pipes, files and sockets?

I would think that the destination for a file or a socket would be either a character device, or a block device. But I could be wrong.

As for pipes... hmmm... character device?

--Curtis

--
Curtis Jewell
csjewell@cpan.org          http://csjewell.dreamwidth.org/
perl@csjewell.fastmail.us  http://csjewell.comyr.org/perl/

"Your random numbers are not that random" -- perl-5.10.1.tar.gz/util.c

Strawberry Perl for Windows betas: http://strawberryperl.com/beta/

p5pRT commented 12 years ago

From @dmcbride

On Friday August 31 2012 4:20:55 PM Curtis Jewell wrote:

On Fri, Aug 31, 2012, at 14:58, Eric Brine wrote:

On Fri, Aug 31, 2012 at 3:55 PM, Linda W <perl-diddler@tlinx.org> wrote:

** When perl writes to a character device, it encodes wide output into characters. When perl writes to a block device -- it doesn't.

What about pipes, files and sockets?

I would think that the destination for a file or a socket would be either a character device, or a block device. But I could be wrong.

As for pipes... hmmm... character device?

Bzzt! :-P

perl -MLWP::Simple -e 'getprint "http://cpan.metacpan.org/authors/id/R/RJ/RJBS/perl-5.16.1.tar.gz"' | \
    tar xvzf -

STDOUT is a pipe. But it isn't spitting out characters.

Same goes for using open "-|" or open "|-" - these could be opening streams for text or data, such as to or from tar.

So\, the same ambiguity applies here just as with files or sockets.

Basically, I'd love not to have to tell Perl how to encode my output on any file stream and have it just do the right thing all the time. I have no idea what it could base its decision on in a way that would be right 100% of the time. (And I just recently spent 20+ hours with a coworker trying to get our perl app to play nice in all encodings -- I think we got it, but there's definitely some domain-specific knowledge we had to use to get us there.)
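
(For comparison, the explicit, non-magic approach is per-stream declarations, roughly like this sketch -- the filename is made up:)

binmode STDOUT, ':encoding(UTF-8)';                            # text out: characters -> UTF-8 octets
my $path = 'input.txt';                                        # made-up example filename
open my $text, '<:encoding(UTF-8)', $path or die "open: $!";   # decode text on the way in
open my $blob, '<:raw',             $path or die "open: $!";   # same bytes, left strictly alone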

Unless, of course, that new "magic" stuff I keep hearing about from Chip can do it. Because, short of actual magic, I'm not convinced it can be done. (Of course, I don't HAVE to be convinced before someone does it.)