Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

decode_utf8 sets utf8 flag on plain ascii strings #8779

Closed p5pRT closed 12 years ago

p5pRT commented 17 years ago

Migrated from rt.perl.org#41527 (status was 'rejected')

Searchable as RT41527$

p5pRT commented 17 years ago

From jjberthels@gmail.com

Created by jjberthels@gmail.com

Hi.

The documentation for the 'decode' function in Encode.pm states:

    ...the utf8 flag for $string is on unless $octets entirely
    consists of ASCII data...

but it appears that decode turns on the flag even if the input string is plain ASCII. A test case demonstrating this is appended below.

I understand this doesn't make a difference from a correctness point of view, but it does change the performance characteristics, presumably due to the use of the unicode regex engine (the profile showed something like SWASHNEW taking a lot of time).

An older version of Encode (I believe v2.01) had the behaviour described in the docs, and would have passed the test case below.

In my case, the application is required to process utf8 data correctly, but the vast majority of data is plain ascii. This change in behaviour from 2.01 is causing a noticeable increase in CPU usage.

I'm currently working around this with a regexp test /[\x80-\xff]/ on the byte string and avoiding calling Encode::decode in that case, but a quick check on perlmonks led to a suggestion that I raise this as a perlbug: http://perlmonks.org/?node_id=600050 (although opinion was divided on whether this was a bug).
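The pre-check workaround described above can be sketched as a small wrapper. (The helper name `decode_utf8_if_needed` is my own for illustration; it is not part of Encode.)

```perl
use Encode ();

# Hypothetical wrapper illustrating the workaround: skip the decode
# entirely when the byte string contains no octets above \x7f, so a
# pure-ASCII string never acquires the UTF8 flag.
sub decode_utf8_if_needed {
    my ($octets) = @_;
    return $octets unless $octets =~ /[\x80-\xff]/;
    return Encode::decode_utf8($octets);
}

my $ascii = decode_utf8_if_needed("this is plain ascii");
my $wide  = decode_utf8_if_needed("caf\xc3\xa9");   # UTF-8 bytes for "café"

printf "ascii flagged: %d, utf8 flagged: %d\n",
    Encode::is_utf8($ascii) ? 1 : 0,
    Encode::is_utf8($wide)  ? 1 : 0;
```

The ASCII fast path returns the input untouched, so the expensive decode (and the flag) only kicks in for strings that actually contain high bytes.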

I've taken a quick look at the XS and can see an unconditional SvUTF8_on(dst) on line 453. I don't know whether a good fix would be to add an additional loop over the string to check the flag there or keep the 'only loop over the string once' behaviour by passing a "was the string plain ascii" flag back from process_utf8().

I'll happily try to whip up a patch for either solution if you agree this needs changing; let me know which approach you prefer.

regards,

jb

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Test::More (tests => 2);

use Encode;

my $ascii_bytes  = "l\xf8\xf8k - a latin1 string";
my $latin1_bytes = "this is plain ascii";

my $encoded_str = Encode::decode_utf8($latin1_bytes);
ok(Encode::is_utf8($encoded_str),
   "(check encode is working) non-ascii latin-1 byte string becomes char str");

$encoded_str = Encode::decode_utf8($ascii_bytes);
ok(! Encode::is_utf8($encoded_str),
   "but ascii byte string untagged after decode");
```

Perl Info ``` Flags: category=library severity=low Site configuration information for perl v5.8.8: Configured by Debian Project at Thu Dec 7 13:58:37 UTC 2006. Summary of my perl5 (revision 5 version 8 subversion 8) configuration: Platform: osname=linux, osvers=2.6.15.7, archname=i486-linux-gnu-thread-multi uname='linux vernadsky 2.6.15.7 #1 smp sat sep 30 10:21:42 utc 2006 i686 gnulinux ' config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.8 -Dsitearch=/usr/local/lib/perl/5.8.8 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.8 -Dd_dosuid -des' hint=recommended, useposix=true, d_sigaction=define usethreads=define use5005threads=undef useithreads=define usemultiplicity=define useperlio=define d_sfio=undef uselargefiles=define usesocks=undef use64bitint=undef use64bitall=undef uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2', cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include' ccversion='', gccversion='4.1.2 20061115 (prerelease) (Ubuntu 4.1.1-20ubuntu1)', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=4, prototype=define Linker and Libraries: ld='cc', ldflags 
=' -L/usr/local/lib' libpth=/usr/local/lib /lib /usr/lib libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt perllibs=-ldl -lm -lpthread -lc -lcrypt libc=/lib/libc-2.5.so, so=so, useshrplib=true, libperl=libperl.so.5.8.8 gnulibc_version='2.5' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E' cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib' Locally applied patches: @INC for perl v5.8.8: /home/john/install/perl/share/perl/5.8.8 /home/john/install/perl/share/perl/5.8.7 /home/john/install/perl/share/perl /home/john/install/perl/lib/perl/5.8.8 /home/john/install/perl/lib/perl/5.8.7 /home/john/install/perl/lib/perl /etc/perl /usr/local/lib/perl/5.8.8 /usr/local/share/perl/5.8.8 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl . Environment for perl v5.8.8: HOME=/home/john LANG=en_GB.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/john/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games PERL5LIB=/home/john/install/perl/share/perl:/home/john/install/perl/lib/perl PERL_BADLANG (unset) SHELL=/bin/bash ```
p5pRT commented 17 years ago

From @demerphq

On 2/17/07, via RT John Berthels <perlbug-followup@perl.org> wrote:

> # New Ticket Created by "John Berthels"
> # Please include the string: [perl #41527]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=41527 >
>
> This is a bug report for perl from jjberthels@gmail.com, generated with the help of perlbug 1.35 running under perl v5.8.8.
>
> -----------------------------------------------------------------
> [Please enter your report here]
>
> Hi.
>
> The documentation for the 'decode' function in Encode.pm states:
>
>     ...the utf8 flag for $string is on unless $octets entirely
>     consists of ASCII data...
>
> but it appears that decode turns on the flag even if the input string is plain ASCII. A test case demonstrating this is appended below.
>
> I understand this doesn't make a difference from a correctness point of view, but it does change the performance characteristics, presumably due to the use of the unicode regex engine (the profile showed something like SWASHNEW taking a lot of time).
>
> An older version of Encode (I believe v2.01) had the behaviour described in the docs, and would have passed the test case below.
>
> In my case, the application is required to process utf8 data correctly, but the vast majority of data is plain ascii. This change in behaviour from 2.01 is causing a noticeable increase in CPU usage.
>
> I'm currently working around this with a regexp test /[\x80-\xff]/ on the byte string and avoiding calling Encode::decode in that case, but a quick check on perlmonks led to a suggestion that I raise this as a perlbug: http://perlmonks.org/?node_id=600050 (although opinion was divided on whether this was a bug).
>
> I've taken a quick look at the XS and can see an unconditional SvUTF8_on(dst) on line 453. I don't know whether a good fix would be to add an additional loop over the string to check the flag there or keep the 'only loop over the string once' behaviour by passing a "was the string plain ascii" flag back from process_utf8().

I looked into more or less this strategy, but, well, I'm not sure it works out.

I have to say the code in Encode.* is kind of confusing to this ascii-type programmer.

> I'll happily try to whip up a patch of either solution if you agree this needs changing and let me know which approach you prefer.
>
> regards,
>
> jb
>
>     #!/usr/bin/perl
>     use warnings;
>     use strict;
>     use Test::More (tests => 2);
>
>     use Encode;
>
>     my $ascii_bytes  = "l\xf8\xf8k - a latin1 string";
>     my $latin1_bytes = "this is plain ascii";

Er, aren't these backwards? \xf8 isn't in ASCII, it's in latin1. ASCII is a 7-bit encoding.

>     my $encoded_str = Encode::decode_utf8($latin1_bytes);
>     ok(Encode::is_utf8($encoded_str),
>        "(check encode is working) non-ascii latin-1 byte string becomes char str");
>
>     $encoded_str = Encode::decode_utf8($ascii_bytes);
>     ok(! Encode::is_utf8($encoded_str),
>        "but ascii byte string untagged after decode");

I changed the code to the attached perl script, encode.pl, and I get the attached output with perl 5.8.6 and Encode version 2.09:

```
D:\dev\perl\ver\zoro\win32>perl encode.pl
1..2
SV = PV(0x15d5914) at 0x1a6c864
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x15dd674 "l\303\270\303\270k - a latin1 string"\0 [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
  CUR = 24
  LEN = 27
not ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str
# Failed test '(check encode is working) non-ascii latin-1 byte string becomes char str'
# in encode.pl at line 13.
SV = NULL(0x0) at 0x1a6c720
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY)
----------
SV = PV(0x1bdb7c4) at 0x1bde2f4
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x1bf38f4 "this is plain ascii"\0 [UTF8 "this is plain ascii"]
  CUR = 19
  LEN = 22
ok 2 - but ascii byte string untagged after decode
SV = PVMG(0x1bda7b4) at 0x1bd7d74
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x1bf095c "this is plain ascii"\0
  CUR = 19
  LEN = 20
# Looks like you failed 1 test of 2.
```

Note the null return for the unicode string with high byte chars in it.

Now here it is on blead with the attached patch; notice it has correct output for both strings and passes the tests:

```
D:\dev\perl\ver\zoro\win32>..\perl encode.pl
1..2
SV = PV(0x1a46cc4) at 0x1a6802c
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1a4fd5c "l\303\270\303\270k - a latin1 string"\0 [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
  CUR = 24
  LEN = 28
ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str
SV = PV(0x1b400bc) at 0x1b3bf94
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1b9cdfc "l\303\270\303\270k - a latin1 string"\0 [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
  CUR = 24
  LEN = 28
----------
SV = PV(0x1bb6f1c) at 0x1b68ccc
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1b6012c "this is plain ascii"\0 [UTF8 "this is plain ascii"]
  CUR = 19
  LEN = 24
ok 2 - but ascii byte string untagged after decode
SV = PV(0x1bb6f1c) at 0x1b68bdc
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x1b2985c "this is plain ascii"\0
  CUR = 19
  LEN = 20
```

Now here it is with an unpatched blead:

```
Everything is up to date. 'nmake test' to run test suite.
1..2
SV = PV(0x1a46cc4) at 0x1a6802c
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1a4fd5c "l\303\270\303\270k - a latin1 string"\0 [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
  CUR = 24
  LEN = 28
ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str
SV = PVMG(0x1b60b5c) at 0x1b3bf94
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x1b4aa14 "l\303\270\303\270k - a latin1 string"\0 [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
  CUR = 24
  LEN = 28
  MAGIC = 0x1b4b544
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = 22
----------
SV = PV(0x1bb7084) at 0x1b68ccc
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1bbf8bc "this is plain ascii"\0 [UTF8 "this is plain ascii"]
  CUR = 19
  LEN = 24
not ok 2 - but ascii byte string untagged after decode
# Failed test 'but ascii byte string untagged after decode'
# at encode.pl line 21.
SV = PVMG(0x1b60b9c) at 0x1b68bdc
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x1a8d404 "this is plain ascii"\0 [UTF8 "this is plain ascii"]
  CUR = 19
  LEN = 20
  MAGIC = 0x1bb9d94
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = 19
# Looks like you failed 1 test of 2.
```

Guess why we get this output? Because the current decode_utf8 no-ops when the input string is already utf8 (contrary to the docs). Remove that no-op line (line 196 in Encode.pm) and here is what happens:

```
Everything is up to date. 'nmake test' to run test suite.
1..2
SV = PV(0x1a46cc4) at 0x1a6802c
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1a4fd5c "l\303\270\303\270k - a latin1 string"\0 [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
  CUR = 24
  LEN = 28
ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str
SV = PV(0x1b400bc) at 0x1b3bf94
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1b9cdfc "l\357\277\275\357\277\275k - a latin1 string"\0 [UTF8 "l\x{fffd}\x{fffd}k - a latin1 string"]
  CUR = 26
  LEN = 28
----------
SV = PV(0x1bb6f1c) at 0x1b68ccc
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1b6012c "this is plain ascii"\0 [UTF8 "this is plain ascii"]
  CUR = 19
  LEN = 24
not ok 2 - but ascii byte string untagged after decode
# Failed test 'but ascii byte string untagged after decode'
# at encode.pl line 21.
SV = PV(0x1bb6f1c) at 0x1b68bdc
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1b2985c "this is plain ascii"\0 [UTF8 "this is plain ascii"]
  CUR = 19
  LEN = 20
# Looks like you failed 1 test of 2.
```

Notice the \x{fffd}\x{fffd}, which appear because of this code (line 431 in Encode.xs):

```c
if (SvUTF8(src)) {
    s = utf8_to_bytes(s, &slen);
    if (s) {
        SvCUR_set(src, slen);
        SvUTF8_off(src);
        e = s + slen;
    }
    else {
        croak("Cannot decode string with wide characters");
    }
}
```

This doesn't seem logical, and when tracing the code, it doesn't work: the valid utf8 sequence gets converted to its byte form and then passed to utf8n_to_uvuni(), which naturally fails to decode it.
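A minimal demonstration of that failure mode from pure Perl (a sketch of my own, assuming CHECK is left at its default of 0): once the buffer has been downgraded it holds raw latin-1 octets, which are not valid UTF-8, and the lax "utf8" decoder substitutes U+FFFD for each malformed sequence instead of croaking.

```perl
use Encode ();

# After the downgrade shown above, the buffer holds latin-1 octets
# such as "l\xf8\xf8k". \xf8 is not a valid UTF-8 sequence here, so
# the lenient "utf8" decoder (default CHECK of 0) replaces each
# malformed sequence with U+FFFD, the Unicode replacement character.
my $latin1_octets = "l\xf8\xf8k";
my $decoded = Encode::decode("utf8", $latin1_octets);

print "contains U+FFFD: ", ($decoded =~ /\x{fffd}/ ? "yes" : "no"), "\n";
```

The ASCII characters survive intact; only the high bytes are replaced, which matches the `l\x{fffd}\x{fffd}k` seen in the dump.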

I hope this analysis is useful to someone. It seems to me that the current behaviour is wrong, but I don't understand it well enough to say for sure.

Cheers, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 17 years ago

From @demerphq

encode.pl

p5pRT commented 17 years ago

From @demerphq

encode_provisional.patch

```diff
Index: Encode.pm
===================================================================
--- Encode.pm   (revision 968)
+++ Encode.pm   (working copy)
@@ -193,7 +193,7 @@
 sub decode_utf8($;$) {
     my ( $str, $check ) = @_;
-    return $str if is_utf8($str);
+    #return $str if is_utf8($str);
     if ($check) {
         return decode( "utf8", $str, $check );
     }
Index: Encode.xs
===================================================================
--- Encode.xs   (revision 968)
+++ Encode.xs   (working copy)
@@ -305,17 +305,22 @@
 {
     UV uv;
     STRLEN ulen;
+    bool is_ascii=1;
 
     SvPOK_only(dst);
     SvCUR_set(dst,0);
+    SvUTF8_off(dst);
     while (s < e) {
+        /* printf("s>%d\n",*s); */
         if (UTF8_IS_INVARIANT(*s)) {
             sv_catpvn(dst, (char *)s, 1);
             s++;
             continue;
         }
-
+
+        is_ascii = 0;
+
         if (UTF8_IS_START(*s)) {
             U8 skip = UTF8SKIP(s);
             if ((s + skip) > e) {
@@ -326,11 +331,15 @@
                 goto malformed_byte;
             }
-
+            /* printf("s=%d\n",*s); */
             uv = utf8n_to_uvuni(s, e - s, &ulen,
                                 UTF8_CHECK_ONLY | (strict ? UTF8_ALLOW_STRICT :
                                                             UTF8_ALLOW_NONSTRICT)
                                 );
+            /* printf("uv=%d ulen=%d\n",uv,ulen); */
+
 #if 1 /* perl-5.8.6 and older do not check UTF8_ALLOW_LONG */
             if (strict && uv > PERL_UNICODE_MAX)
                 ulen = (STRLEN) -1;
@@ -387,6 +396,7 @@
         }
         s += ulen;
     }
+    if (!is_ascii) SvUTF8_on(dst);
     *SvEND(dst) = '\0';
     return s;
@@ -428,6 +438,7 @@
     FREETMPS; LEAVE;
     /* end PerlIO check */
 
+    /*
     if (SvUTF8(src)) {
         s = utf8_to_bytes(s,&slen);
         if (s) {
@@ -439,6 +450,7 @@
             croak("Cannot decode string with wide characters");
         }
     }
+    */
     s = process_utf8(aTHX_ dst, s, e, check, 0,
                      strict_utf8(aTHX_ obj), renewed);
@@ -450,7 +462,7 @@
         }
         SvCUR_set(src, slen);
     }
-    SvUTF8_on(dst);
+    /*SvUTF8_on(dst);*/
     ST(0) = sv_2mortal(dst);
     XSRETURN(1);
 }
```
p5pRT commented 17 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 17 years ago

From @demerphq

On 2/17/07, demerphq <demerphq@gmail.com> wrote:

> I looked into more or less this strategy, but, well, I'm not sure it works out.

It works out (meaning all the Encode tests pass) with the exception of test 6 of ext/Encode/t/mime-header.t.

Which is:

    is(Encode::decode('MIME-Header', $qheader), $dheader, "decode Q");

But the test itself, and the test file, contain utf8 byte sequences which I'm not really set up to work with, so it's hard to tell.

I'm kinda wondering if putting utf8 sequences in a test file is the right thing to do, actually. Don't we have problems with binary data and source control?
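For reference, the kind of round trip that test exercises looks like this (the encoded-word below is my own illustrative example, not the $qheader from mime-header.t):

```perl
use Encode ();

# Decode an RFC 2047 Q-encoded header word. The header bytes stay
# pure ASCII on the wire; the decoded result is a character string.
# (This particular encoded-word is a made-up example.)
my $qheader = "=?ISO-8859-1?Q?caf=E9?=";
my $dheader = Encode::decode('MIME-Header', $qheader);

print "$dheader\n";
```

Encode loads the MIME-Header codec on demand, so no explicit `use Encode::MIME::Header` is needed.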

Cheers, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 17 years ago

From @demerphq

On 2/17/07, demerphq <demerphq@gmail.com> wrote:

> [...]
Warnocked?

cheers, yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 17 years ago

From @demerphq

Hello Gentlemen,

I was wondering if either of you had any comments or thoughts on the attached patches and test files. This matter seems to be warnocked until one or both of you utf8/unicode/encoding experts give your feedback...

Cheers, Yves

On 2/17/07\, demerphq \demerphq@&#8203;gmail\.com wrote​:

On 2/17/07\, via RT John Berthels \perlbug\-followup@&#8203;perl\.org wrote​:

# New Ticket Created by "John Berthels" # Please include the string​: [perl #41527] # in the subject line of all future correspondence about this issue. # \<URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=41527 >

This is a bug report for perl from jjberthels@​gmail.com\, generated with the help of perlbug 1.35 running under perl v5.8.8.

----------------------------------------------------------------- [Please enter your report here]

Hi.

The documentation for the 'decode' function in Encode.pm states​:

     \.\.\.the utf8 flag for $string is on unless $octets entirely
     consists of ASCII data\.\.\.

but it appears that decode turns on the flag even if the input string is plain ASCII. A test case demonstrating this is appended below.

I understand this doesn't make a difference from a correctness point of view\, but it does change the peformance characteristics\, presumably due to the use of the unicode regex engine (the profile showed something like SWASHNEW taking a lot of time).

An older version of Encode (I believe v 2.01) had the behaviour described in the docs\, and would have passed the test case below.

In my case\, the application is required to process utf8 data correctly\, but the vast majority of data is plain ascii. This change in behaviour from 2.01 is causing a noticeable increase in CPU usage.

I'm currently working around this with a regexp test /[\x80-\xff]/ on the byte string and avoiding calling Encode​::decode in this case\, but a quick check on perlmonks led to a suggestion that I raise this as a perlbug​: http​://perlmonks.org/?node_id=600050 (although opinion was divided on whether this was a bug).

I've taken a quick look at the XS and can see an unconditional SvUTF8_on(dst) on line 453. I don't know whether a good fix would be to add an additional loop over the string to check the flag there or keep the 'only loop over the string once' behaviour by passing a "was the string plain ascii" flag back from process_utf8().

I looked into more or less this strategy\, but well\, im not sure if it works out.

I have to say the code in Encode.* is kinda confusing to this ascii type programmer.

I'll happily try to whip up a patch of either solution if you agree this needs changing and let me know which approach you prefer.

regards\,

jb

#!/usr/bin/perl use warnings; use strict; use Test​::More (tests => 2);

use Encode;

my $ascii_bytes = "l\xf8\xf8k - a latin1 string"; my $latin1_bytes = "this is plain ascii";

Er\, arent these backwards? \xf8 isnt in ascii\, its in latin1. ascii is a 7 bit encoding.

my $encoded_str = Encode​::decode_utf8($latin1_bytes); ok(Encode​::is_utf8($encoded_str)\, "(check encode is working) non-ascii latin-1 byte string becomes char str");

$encoded_str = Encode​::decode_utf8($ascii_bytes); ok(! Encode​::is_utf8($encoded_str)\, "but ascii byte string untagged afeter decode");

I changed the code to the attached perl script\, encode.pl and I get the attached output with perl 5.8.6 encode version 2.09​:

D​:\dev\perl\ver\zoro\win32>perl encode.pl 1..2 SV = PV(0x15d5914) at 0x1a6c864 REFCNT = 1 FLAGS = (PADBUSY\,PADMY\,POK\,pPOK\,UTF8) PV = 0x15dd674 "l\303\270\303\270k - a latin1 string"\0 [UTF8 "l\x{f8}\x{f8}k - a latin1 string"] CUR = 24 LEN = 27 not ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str # Failed test '(check encode is working) non-ascii latin-1 byte string becomes char str' # in encode.pl at line 13. SV = NULL(0x0) at 0x1a6c720 REFCNT = 1 FLAGS = (PADBUSY\,PADMY) ---------- SV = PV(0x1bdb7c4) at 0x1bde2f4 REFCNT = 1 FLAGS = (PADBUSY\,PADMY\,POK\,pPOK\,UTF8) PV = 0x1bf38f4 "this is plain ascii"\0 [UTF8 "this is plain ascii"] CUR = 19 LEN = 22 ok 2 - but ascii byte string untagged after decode SV = PVMG(0x1bda7b4) at 0x1bd7d74 REFCNT = 1 FLAGS = (PADBUSY\,PADMY\,POK\,pPOK) IV = 0 NV = 0 PV = 0x1bf095c "this is plain ascii"\0 CUR = 19 LEN = 20 # Looks like you failed 1 test of 2.

Note the null return for the unicode string with high byte chars in it.

Now here it is with a blead patch with the attached patch\, notice it has correct output for both strings and passes the tests​:

D​:\dev\perl\ver\zoro\win32>..\perl encode.pl 1..2 SV = PV(0x1a46cc4) at 0x1a6802c REFCNT = 1 FLAGS = (PADMY\,POK\,pPOK\,UTF8) PV = 0x1a4fd5c "l\303\270\303\270k - a latin1 string"\0 [UTF8 "l\x{f8}\x{f8}k - a latin1 string"] CUR = 24 LEN = 28 ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str SV = PV(0x1b400bc) at 0x1b3bf94 REFCNT = 1 FLAGS = (PADMY\,POK\,pPOK\,UTF8) PV = 0x1b9cdfc "l\303\270\303\270k - a latin1 string"\0 [UTF8 "l\x{f8}\x{f8}k - a latin1 string"] CUR = 24 LEN = 28 ---------- SV = PV(0x1bb6f1c) at 0x1b68ccc REFCNT = 1 FLAGS = (PADMY\,POK\,pPOK\,UTF8) PV = 0x1b6012c "this is plain ascii"\0 [UTF8 "this is plain ascii"] CUR = 19 LEN = 24 ok 2 - but ascii byte string untagged after decode SV = PV(0x1bb6f1c) at 0x1b68bdc REFCNT = 1 FLAGS = (PADMY\,POK\,pPOK) PV = 0x1b2985c "this is plain ascii"\0 CUR = 19 LEN = 20

Now here it is with an unpatched blead​:

```
Everything is up to date. 'nmake test' to run test suite.
1..2
SV = PV(0x1a46cc4) at 0x1a6802c
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1a4fd5c "l\303\270\303\270k - a latin1 string"\0 [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
  CUR = 24
  LEN = 28
ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str
SV = PVMG(0x1b60b5c) at 0x1b3bf94
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x1b4aa14 "l\303\270\303\270k - a latin1 string"\0 [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
  CUR = 24
  LEN = 28
  MAGIC = 0x1b4b544
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = 22
----------
SV = PV(0x1bb7084) at 0x1b68ccc
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1bbf8bc "this is plain ascii"\0 [UTF8 "this is plain ascii"]
  CUR = 19
  LEN = 24
not ok 2 - but ascii byte string untagged after decode
#   Failed test 'but ascii byte string untagged after decode'
#   at encode.pl line 21.
SV = PVMG(0x1b60b9c) at 0x1b68bdc
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x1a8d404 "this is plain ascii"\0 [UTF8 "this is plain ascii"]
  CUR = 19
  LEN = 20
  MAGIC = 0x1bb9d94
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = 19
# Looks like you failed 1 test of 2.
```

Guess why we get this output? Because the current decode_utf8 no-ops when the input string is already utf8 (contrary to the docs). Remove that no-op line (line 196 in Encode.pm) and here is what happens:

```
Everything is up to date. 'nmake test' to run test suite.
1..2
SV = PV(0x1a46cc4) at 0x1a6802c
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1a4fd5c "l\303\270\303\270k - a latin1 string"\0 [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
  CUR = 24
  LEN = 28
ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str
SV = PV(0x1b400bc) at 0x1b3bf94
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1b9cdfc "l\357\277\275\357\277\275k - a latin1 string"\0 [UTF8 "l\x{fffd}\x{fffd}k - a latin1 string"]
  CUR = 26
  LEN = 28
----------
SV = PV(0x1bb6f1c) at 0x1b68ccc
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1b6012c "this is plain ascii"\0 [UTF8 "this is plain ascii"]
  CUR = 19
  LEN = 24
not ok 2 - but ascii byte string untagged after decode
#   Failed test 'but ascii byte string untagged after decode'
#   at encode.pl line 21.
SV = PV(0x1bb6f1c) at 0x1b68bdc
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1b2985c "this is plain ascii"\0 [UTF8 "this is plain ascii"]
  CUR = 19
  LEN = 20
# Looks like you failed 1 test of 2.
```

Notice the \x{fffd}\x{fffd}, which appear because of this code (line 431 in Encode.xs):

    if (SvUTF8(src)) {
        s = utf8_to_bytes(s,&slen);
        if (s) {
            SvCUR_set(src,slen);
            SvUTF8_off(src);
            e = s+slen;
        }
        else {
            croak("Cannot decode string with wide characters");
        }
    }

Which doesn't seem logical, and when traced, doesn't work: the valid utf8 sequence gets converted to its byte form and then passed to utf8n_to_uvuni(), which naturally fails to decode it.
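The failure mode traced above can be modelled outside Perl; here is an illustrative Python sketch (not the Encode code itself, just the same sequence of conversions):

```python
# Model of the double-decode bug: the scalar already holds the decoded
# characters "l\xf8\xf8k"; utf8_to_bytes() yields their latin-1 octets, and
# a second UTF-8 decode of those octets fails on 0xf8, producing U+FFFD
# replacement characters -- exactly the \x{fffd}\x{fffd} in the dump above.
chars = "l\u00f8\u00f8k - a latin1 string"
octets = chars.encode("latin-1")                  # what utf8_to_bytes() returns
redecoded = octets.decode("utf-8", errors="replace")
assert redecoded == "l\ufffd\ufffdk - a latin1 string"
```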

I hope this analysis is useful to someone. It seems to me that the current behaviour is wrong, but I don't understand it well enough to say for sure.

Cheers, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"


p5pRT commented 17 years ago

From @demerphq

encode.pl

p5pRT commented 17 years ago

From @demerphq

encode_provisional.patch

```diff
Index: Encode.pm
===================================================================
--- Encode.pm (revision 968)
+++ Encode.pm (working copy)
@@ -193,7 +193,7 @@
 sub decode_utf8($;$) {
     my ( $str, $check ) = @_;
-    return $str if is_utf8($str);
+    #return $str if is_utf8($str);
     if ($check) {
         return decode( "utf8", $str, $check );
     }
Index: Encode.xs
===================================================================
--- Encode.xs (revision 968)
+++ Encode.xs (working copy)
@@ -305,17 +305,22 @@
 {
     UV uv;
     STRLEN ulen;
+    bool is_ascii=1;

     SvPOK_only(dst);
     SvCUR_set(dst,0);
+    SvUTF8_off(dst);

     while (s < e) {
+        /* printf("s>%d\n",*s); */
         if (UTF8_IS_INVARIANT(*s)) {
             sv_catpvn(dst, (char *)s, 1);
             s++;
             continue;
         }
-
+
+        is_ascii = 0;
+
         if (UTF8_IS_START(*s)) {
             U8 skip = UTF8SKIP(s);
             if ((s + skip) > e) {
@@ -326,11 +331,15 @@
                 goto malformed_byte;
             }

-
+            /* printf("s=%d\n",*s); */
+
             uv = utf8n_to_uvuni(s, e - s, &ulen,
                                 UTF8_CHECK_ONLY |
                                 (strict ? UTF8_ALLOW_STRICT : UTF8_ALLOW_NONSTRICT));
+            /* printf("uv=%d ulen=%d\n",uv,ulen); */
+
+
 #if 1 /* perl-5.8.6 and older do not check UTF8_ALLOW_LONG */
             if (strict && uv > PERL_UNICODE_MAX)
                 ulen = (STRLEN) -1;
@@ -387,6 +396,7 @@
         }
         s += ulen;
     }
+    if (!is_ascii) SvUTF8_on(dst);
     *SvEND(dst) = '\0';
     return s;
@@ -428,6 +438,7 @@
     FREETMPS;
     LEAVE;
     /* end PerlIO check */
+    /*
     if (SvUTF8(src)) {
         s = utf8_to_bytes(s,&slen);
         if (s) {
@@ -439,6 +450,7 @@
             croak("Cannot decode string with wide characters");
         }
     }
+    */
     s = process_utf8(aTHX_ dst, s, e, check, 0,
                      strict_utf8(aTHX_ obj), renewed);
@@ -450,7 +462,7 @@
         }
         SvCUR_set(src, slen);
     }
-    SvUTF8_on(dst);
+    /*SvUTF8_on(dst);*/
     ST(0) = sv_2mortal(dst);
     XSRETURN(1);
 }
```
p5pRT commented 17 years ago

From BQW10602@nifty.com

    my $latin1_bytes = "l\xf8\xf8k - a latin1 string";
    my $ascii_bytes  = "this is plain ascii";

    my $encoded_str = Encode::decode_utf8($latin1_bytes);
    ok(Encode::is_utf8($encoded_str),
       "(check encode is working) non-ascii latin-1 byte string becomes char str");

    $encoded_str = Encode::decode_utf8($ascii_bytes);
    ok(! Encode::is_utf8($encoded_str),
       "but ascii byte string untagged after decode");

I think decoding means the conversion from some octets in a certain encoding to the corresponding string that is recognized as a unicode string by perl. Thus decode_utf8() should take an octet sequence in utf8 (that is, UTF-8 or UTF-EBCDIC to perl), while "l\xf8\xf8k" *is* ill-formed utf8.

To decode latin1 bytes, do decode('latin1', $latin1_bytes). Don't say decode_utf8($latin1_bytes) nor decode('utf8', $latin1_bytes).
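The same distinction exists in any language with explicit codecs; a Python sketch, for illustration only:

```python
# The same octets, two codecs: decoding latin-1 bytes as latin-1 works,
# while decoding them as UTF-8 fails, because "\xf8" cannot start a
# well-formed UTF-8 sequence.
latin1_bytes = b"l\xf8\xf8k - a latin1 string"

decoded = latin1_bytes.decode("latin-1")       # correct codec
assert decoded == "l\u00f8\u00f8k - a latin1 string"

decoded_as_utf8 = None
try:
    decoded_as_utf8 = latin1_bytes.decode("utf-8")   # wrong codec
except UnicodeDecodeError:
    pass                                       # ill-formed utf8, as stated above
```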

I think the following shows the typical usage of decode_utf8():

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Test::More (tests => 1);
    use Encode;

    my $utf8_octets = "l\xC3\xB8\xC3\xB8k - a utf-8 string";
    # U+00F8 is represented as "\xC3\xB8" in UTF-8.

    my $decoded_str = Encode::decode_utf8($utf8_octets);
    ok(Encode::is_utf8($decoded_str)); # should be ok

Regards, SADAHIRO Tomoyuki

p5pRT commented 17 years ago

From BQW10602@nifty.com

Certainly decode_utf8($ascii) had a change, though I'm not sure whether that was intended or not.

    use Encode;
    print "$Encode::VERSION\n";
    use strict;
    my $ascii_bytes = "plain ascii";
    my $decoded_by_Encode = Encode::decode_utf8($ascii_bytes);
    use Devel::Peek;
    Dump($decoded_by_Encode);
    # old Encode (..2.09):    UTF8 flag off
    # recent Encode (2.10..): UTF8 flag on

But this old behavior for plain ascii relied on utf8::decode().

Upgrade to Encode 2.10, cf. http://public.activestate.com/cgi-bin/perlbrowse/p/24490

    @@ -204,7 +203,7 @@
         if ($check){
             return decode("utf8", $str, $check);
         }else{
    -        return undef unless utf8::decode($str);
    +        return decode("utf8", $str);
             return $str;
         }
     }

See the utf8 manpage, which clearly mentions that utf8::decode() won't mark ascii strings with the utf8 flag:

    utf8::decode($string)
        [...]
        The UTF-8 flag is turned on only if the source string contains
        multiple-byte UTF-X characters.
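The rule the manpage describes hinges on a simple property of UTF-8: ASCII bytes are invariant. A sketch of that check in Python (a hypothetical helper for illustration, not Encode's actual code):

```python
# In UTF-8, every byte below 0x80 encodes the same ASCII character as in
# latin-1, so a string containing no byte >= 0x80 needs no UTF8 flag at all.
def is_plain_ascii(octets: bytes) -> bool:
    return all(b < 0x80 for b in octets)

assert is_plain_ascii(b"this is plain ascii")      # flag can stay off
assert not is_plain_ascii(b"l\xf8\xf8k")           # flag must be considered
```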

Regards, SADAHIRO Tomoyuki

p5pRT commented 17 years ago

From darren@DarrenDuncan.net

John Berthels said:

The documentation for the 'decode' function in Encode.pm states:

    ...the utf8 flag for $string is on unless $octets entirely
    consists of ASCII data...

but it appears that decode turns on the flag even if the input string is plain ASCII. A test case demonstrating this is appended below.

Considering that I like to write modern programs that simply use Unicode end-to-end as much as possible, and at least internally, which keeps everything simple and compatible, it would be easier for me if the meaning of the utf8 flag were updated to officially be the new behaviour.

I believe that a true utf8 flag should mean that the string contains data that is valid utf8, not just that it has utf8 characters outside the ASCII range.

As far as I know, the conceptual purpose of the utf8 flag is to indicate whether Perl considers a string to be unambiguous character data or binary data which could be ambiguous character data, and thus how Perl will treat it by default.

If I have a library that wants to work internally with unambiguous character data, and to keep things simple will require the user code to remove any ambiguity by doing any decoding itself and passing the library the result, then it would be simpler if the input checking code of the library could just do this:

    sub expects_text {
        my ($v) = @_;
        confess q{Bad arg; it is undefined.}
            if !defined $v;
        confess q{Bad arg; Perl 5 does not consider it to be a char str.}
            if !Encode::is_utf8( $v );
        # $v is okay, so do whatever ...
    }

Instead, the older documented utf8 flag behaviour would require this unnecessary extra work in order to accept all valid input:

    sub expects_text {
        my ($v) = @_;
        confess q{Bad arg; it is undefined.}
            if !defined $v;
        confess q{Bad arg; Perl 5 does not consider it to be a char str.}
            if !Encode::is_utf8( $v ) and $v =~ m/[^\x00-\x7F]/xs;
        # $v is okay, so do whatever ...
    }

I would expect the use of the regular expression, which would be called for any ASCII data, to be considerably slower than just checking the flag, especially since we already know the data is valid Unicode characters in order for decode() to possibly set the flag in the first place.

Now, if there is some concern that character-oriented regexes and such are considerably slower for ASCII data than alternatives, and this is a problem and it can't be otherwise dealt with, we could perhaps have an additional flag which has the meaning that I ascribed to utf8; e.g. is_chars() or is_text() etcetera; but in my mind it would be simpler to just leave the meaning of is_utf8 adjusted to mean "is unambiguous character data".

Thank you. -- Darren Duncan

P.S. On a tangent, it would be nice if there were a simple test to see if an SV currently considered its numerical or integer or string etc. component to be the authoritative one, so e.g. I could just check that rather than using looks_like_number or some such more complicated solution. Though maybe there is already, perhaps in a bundled debugging or some such module, and I haven't found it yet?

p5pRT commented 17 years ago

From @Juerd

Darren Duncan wrote on 2007-03-27 15:52 (-0700):

I believe that a true utf8 flag should mean that the string contains data that is valid utf8, not just that it has utf8 characters outside the ASCII range.

How often should Perl check for this? Directly after decoding only, or also after mutating operations like substr, or s///?

As far as I know, the conceptual purpose of the utf8 flag is to indicate whether Perl considers a string to be unambiguous character data or binary data which could be ambiguous character data, and thus how Perl will treat it by default.

The *conceptual* purpose of the UTF8 flag isn't there. Conceptually, every string can be a unicode string, and you're not supposed to look at, know, or set the UTF8 flag yourself. It's an internal bit, like IOK and NOK. [1]

    confess q{Bad arg; Perl 5 does not consider it to be a char str.}
        if !Encode::is_utf8( $v );

As said, this is not the purpose of the flag, and you're not supposed to use is_utf8 for this. It is documented with the "[INTERNAL]" flag, for a good reason.

Perl conceptually has a single numeric type, and a single string type. The distinction between integer and float, and between iso-8859-1 and utf-8, is internal.

This could be changed, but would introduce incompatibilities and a severe loss of performance for strings that fit in iso-8859-1.

What I want (and I think you want too) is a real type system, with two distinct types: byte strings and character strings. It would be bad to use a flag called "UTF8" for this, because a byte string can also be UTF8 encoded. Perl already suffers from this problem, but because the UTF8 flag is *INTERNAL*, it's not a big deal. It would be if it surfaced and was used by Perl coders.

A whole type system is a bit too much to implement in Perl 5, I think. Our current unicode string semantics are a great way to deal with not having types, in my opinion.
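For comparison, Python 3 is one language that took the two-type route Juerd describes; a minimal illustration (not a proposal for Perl's internals):

```python
# Distinct character-string and byte-string types: conversion between them
# is always an explicit encode/decode step, never an internal flag.
text = "l\u00f8\u00f8k"            # character string (sequence of code points)
raw = text.encode("utf-8")         # byte string (sequence of octets)

assert isinstance(text, str) and isinstance(raw, bytes)
assert len(text) == 4              # four characters...
assert len(raw) == 6               # ...but six UTF-8 octets
```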

Instead, the older documented utf8 flag behaviour would require this unnecessary extra work in order to accept all valid input:

No.

If your subroutine expects text, it can only assume that it gets text, and it should not (must not?) make any distinction based on the internal encoding.

The string it gets is a Unicode string. Not a UTF8 string, not a latin1 string.

    if !Encode::is_utf8( $v ) and $v =~ m/[^\x00-\x7F]/xs;

This check is wrong. If the flag is not set, that means only that the internal encoding is iso-8859-1 if the string is a text string, not that the string is a byte string.

The reverse is true, however: if the flag is set, the string will not be a byte string. But lack of the UTF8 flag is no indication of byte versus character.

I would expect the use of the regular expression, which would be called for any ASCII data

Note that Perl internally uses iso-8859-1 (8 bit) and utf-8 (variable whole-octet), not ascii (7 bit).

The character é (eacute) may be stored internally as the single octet 233 (decimal) and does not by itself cause an internal upgrade to UTF-8.
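A quick illustration (in Python, purely for neutrality) of why é needs no internal upgrade:

```python
# U+00E9 (eacute) fits in one octet in latin-1 but needs two in UTF-8, so
# an internal latin-1 representation can hold it as the single byte 233.
assert "\u00e9".encode("latin-1") == bytes([233])   # one octet
assert "\u00e9".encode("utf-8") == b"\xc3\xa9"      # two octets
```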

[1] Some parts of Perl break this concept. The regex engine is one of them, and has different semantics depending on the presence of the flag. This is a bug, but any fix would be incompatible.

-- cordial greetings,

    juerd waalboer: perl hacker <juerd@juerd.nl> <http://juerd.nl/sig>
    convolution: ict solutions and consultancy <sales@convolution.nl>

I don't trust voting computers. See <http://www.wijvertrouwenstemcomputersniet.nl/>.

p5pRT commented 17 years ago

From jjberthels@gmail.com

Considering that I like to write modern programs that simply use Unicode end-to-end as much as possible, and at least internally, which keeps everything simple and compatible, it would be easier for me if the meaning of the utf8 flag were updated to officially be the new behaviour.

Well, perl goes to some lengths (implicit conversion) to let you mix untagged-all-ascii string values and tagged-non-ascii ones transparently in your program. And you can happily write modern programs using Unicode end-to-end doing so. Both types of strings consist of character data.

I believe that a true utf8 flag should mean that the string contains data that is valid utf8, not just that it has utf8 characters outside the ASCII range.

Well, I think is_utf8 is poorly named either way (with several years of hindsight; I don't think I would have made a better choice at the time). I don't think that Perl's internal representation for unicode strings is guaranteed to be utf8. The flag more properly means "please treat this as character data, taking special care to realise that some of the character values may be > 255". And it's the 'special care' bit which can cost performance.

As far as I know, the conceptual purpose of the utf8 flag is to indicate whether Perl considers a string to be unambiguous character data or binary data which could be ambiguous character data, and thus how Perl will treat it by default.

Yes, agreed. And it's really a bit of perl's internals which application code shouldn't really want to examine or change directly.

[snip example of using is_utf8 to check that a perl value contains 'character data']

Why would your library routine care? It can manipulate the string as a sequence of characters in either case. It will produce the wrong results if passed the wrong data, but that will always be true, since it could be passed wrong data tagged as utf8. If your routine wants specific sequences of characters, it can check for those, regardless of the is_utf8ness of the string.

Now, if there is some concern that character-oriented regexes and such are considerably slower for ASCII data than alternatives, and this is a problem and it can't be otherwise dealt with

I think the unicode regex engine can never be as fast as the byte-oriented one. It has more to consider. There's some example code below (vaguely like the sort of templating where I noticed the problem), which shows the unicode engine running 2-3 times as slow (17s instead of 6s) as the byte engine.

we could perhaps have an additional flag which has the meaning that I ascribed to utf8; eg, is_chars() or is_text() etcetera; but in my mind it would be simpler to just leave the meaning of is_utf8 adjusted to mean is unambiguous character data.

I'm having trouble thinking of an example where application code might want to check this. It's part of perl's internals, surely?

P.S. On a tangent, it would be nice if there were a simple test to see if an SV currently considered its numerical or integer or string etc. component to be the authoritative one, so e.g. I could just check that rather than using looks_like_number or some such more complicated solution. Though maybe there is already, perhaps in a bundled debugging or some such module, and I haven't found it yet?

I'd rather is_utf8 disappeared from the public API, since it's really an internal flag and (I think) poorly named. Internally, it could then be renamed requires_unicode_engine or something.

But what I really care about is the ability to just tell perl "data from this source is in this encoding", "data going to this destination is in this encoding" and get all the nice automagic handling of conversions for me without paying the unicode engine cost on ascii data.
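That boundary-declaration model exists in other languages too; a Python sketch, illustrative only (not a Perl API):

```python
import io

# Declare the encoding once, at the stream boundary; everything read from
# the wrapper is already character data, with no per-string flag to manage.
raw = io.BytesIO("l\u00f8\u00f8k\n".encode("utf-8"))
stream = io.TextIOWrapper(raw, encoding="utf-8")
line = stream.read()
assert line == "l\u00f8\u00f8k\n"
```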

regards,

jb

Bench output:

             Rate udata  data
    udata   588/s    --  -63%
    data   1572/s  167%    --

Code:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Encode;
    use Benchmark;

    my $data = "";
    my $count = 10;
    while ($count-- > 0) {
        $data = "<%-$count tag with some text $data $count-%>";
    }
    my $udata = $data;
    Encode::_utf8_on($udata);

    my $do_what = shift || "bench";
    my $run_count = shift || 10000;

    if ($do_what eq 'bench') {
        Benchmark::cmpthese(-20, {
            data  => sub { stress($data); },
            udata => sub { stress($udata); },
        });
    } elsif ($do_what eq 'bytes') {
        stress($data) for (1..$run_count);
    } elsif ($do_what eq 'chars') {
        stress($udata) for (1..$run_count);
    } else {
        die "Don't understand what you wanted me to do: $do_what";
    }

    sub stress {
        my $data = shift;
        my $oldlen;
        while ($data =~ s/<%-(\d+)([^<]*?).*%-\1>/reverse($2)/e) {
            if ($oldlen) {
                die "didn't match [$data]" unless length $data < $oldlen;
            }
            $oldlen = length $data;
        }
    }

p5pRT commented 17 years ago

From jjberthels@gmail.com

I think decoding means the conversion from some octets in a certain encoding to the corresponding string that is recognized as a unicode string by perl.

You're absolutely right. I'm very sorry my original test case was so poor. I've updated it to the code below, which I think is correct and still shows the problem.

regards,

jb

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Test::More (tests => 2);

    use Encode;

    my $latin1_bytes = "l\xf8\xf8k - a latin1 string with 8bit bytes";
    my $ascii_bytes  = "this is latin1 and plain ascii ";

    my $encoded_str = Encode::decode('iso-8859-1', $latin1_bytes);
    ok(Encode::is_utf8($encoded_str),
       "(check encode is working) non-ascii latin-1 byte string becomes char str");

    $encoded_str = Encode::decode('iso-8859-1', $ascii_bytes);
    ok(! Encode::is_utf8($encoded_str),
       "but ascii byte string untagged after decode");

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Wed, Mar 28, 2007 at 11:12:15AM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:

As far as I know, the conceptual purpose of the utf8 flag is to indicate whether Perl considers a string to be unambiguous character data or binary data which could be ambiguous character data, and thus how Perl will treat it by default.

The *conceptual* purpose of the UTF8 flag isn't there. Conceptually, every string can be a unicode string, and you're not supposed to look at, know, or set the UTF8 flag yourself. It's an internal bit, like IOK and NOK. [1]

That's not how current perl works.

Perl conceptually has a single numeric type, and a single string type. The distinction between integer and float, and between iso-8859-1 and utf-8, is internal.

I would love it if that were the case, but the powers that be decided that every perl programmer has to know those internals, and needs to be able to deal with them.

Note that Perl internally uses iso-8859-1 (8 bit) and utf-8 (variable whole-octet), not ascii (7 bit).

No, Perl exposes this. For example, see the recent example of Compress::Zlib:

    unpack ('CCCCVCC', $$string);

that code is broken because the powers that be decided that "C" exposes the internal encoding, while "V" doesn't. That requires every perl programmer who decodes file headers etc. using unpack to know about those internals.

This is especially bad as not only has the meaning of "C" been shifted from decoding bytes to something else (instead of using a new modifier), but no alternative has been provided to get the old meaning of "C", so basically all code that doesn't utf8::downgrade is broken now by this change in meaning.

(Worse is the fact that it's wrongly documented to decode an octet even in the presence of Unicode, but it doesn't decode an octet, unless you define "octet" in Perl to mean that "\xa0" is either one or two octets.)

The same is true for many XS modules: in older versions of perl, SvPV gave you the 8-bit version of a scalar, but in current versions, it randomly gives you either 8-bit or utf-8 encoded. SvPV was renamed to SvPVbyte.

Both of those gratuitously backwards-incompatible changes break lots of existing code.

And the problem is that those bugs are not considered bugs but features.

[1] Some parts of Perl break this concept. The regex engine is one of them, and has different semantics depending on the presence of the flag. This is a bug, but any fix would be incompatible.

In fact, some parts of perl break this concept and make perfectly working code (in 5.005) not work anymore, or work randomly, and that's not considered a bug.

I wonder why it is ok to break large amounts of perl and xs code silently, without even documenting how to fix it[1], while at the same time 5.10 introduced "use feature" to shield against possible breakage with far less of an impact than the changes above.

[1] If it is documented, then will anybody please show me why this:

    utf8::downgrade $s;
    unpack "C", $s;

is documented to have different effects from:

    unpack "C", $s;

i.e., where is it documented that perl doesn't upgrade the scalar in between those lines? If you think it is obvious, how about this:

    my $s = chr 255; # to me, this is one octet. to perl, it might be one
                     # or two, or maybe more, who knows.
    warn unpack "C", $s;
    "$s\x{672c}";
    warn unpack "C", $s;
    $s .= "\x{672c}"; substr $s, 1, 1, "";
    warn unpack "C", $s;

Can a pure-Perl programmer tell what the output of this program is without trying it? Should he be able to? I would say the answer is no to both.

It is beyond me how people can introduce so much breakage to existing code so lightly, forcing many modules to be changed and forcing pure-Perl programmers to understand the perl interpreter sources to get their unicode right.

That's a broken unicode model, and as long as those kinds of bugs are considered features, perl programmers very well have to care about that internal utf-x, utf-8, whatever flag.

--
      The choice of a       -----==-     _GNU_
      ----==-- _                         generation
      ---==---(_)__  __ ____  __         Marc Lehmann
      --==---/ / _ \/ // /\ \/ /         pcg@goof.com
      -=====/_/_//_/\_,_/ /_/\_\         http://schmorp.de/  XX11-RIPE

p5pRT commented 17 years ago

From @nwc10

On Fri, Mar 30, 2007 at 02:02:32PM +0200, Marc Lehmann wrote:

The same is true for many XS modules: in older versions of perl, SvPV gave you the 8-bit version of a scalar, but in current versions, it randomly gives you either 8-bit or utf-8 encoded. SvPV was renamed to SvPVbyte.

Both of those gratuitously backwards-incompatible changes break lots of existing code.

And the problem is that those bugs are not considered bugs but features.

I certainly consider this one a bug.

I didn't create the release that messed this up, and didn't realise the implications of the change until some time after it happened. You might consider me slow for this.

I wonder why it is ok to break large amounts of perl and xs code silently, without even documenting how to fix it[1], while at the same time 5.10 introduced "use feature" to shield against possible breakage with far less of an impact than the changes above.

The problem now is that I can't see how to fix it without breaking other code that plays by the different, new, rules.

Nicholas Clark

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Fri, Mar 30, 2007 at 01:07:22PM +0100, Nicholas Clark <nick@ccl4.org> wrote:

And the problem is that those bugs are not considered bugs but features.

I certainly consider this one a bug.

So fix it. It is easy to do, and I documented it years ago (during 5.6).

I didn't create the release that messed this up, and didn't realise the implications of the change until some time after it happened. You might consider me slow for this.

I do not consider you slow for not creating the release that messed this up, no :) If anything, it's a pity you didn't.

I wonder why it is ok to break large amounts of perl and xs code silently, without even documenting how to fix it[1], while at the same time 5.10 introduced "use feature" to shield against possible breakage with far less of an impact than the changes above.

The problem now is that I can't see how to fix it without breaking other code that plays by the different, new, rules.

I have yet to see other code outside the testsuite that reliably relies on broken 5.6 unicode semantics and is considered worth keeping. I challenge you to show me, and I promise to show you another example from CPAN or elsewhere that breaks. Or maybe even two. Or three.

Besides, without any doubt, the code that relies on pseudo-random behaviour is certainly in the minority. The amount of code in the wild that relies on "C" having 5.5 semantics is much larger. I doubt _anybody_ except me (or at least not very many people) understands that he has to downgrade scalars before passing them into unpack to decode structures.

And the amount of breakage will only increase over time as unicode becomes more and more used in perl.

The solution to this bug is to fix it. The earlier, the better.

At the very least, it needs to be documented, and *hard* rules on when perl upgrades or downgrades would need to be established, as, right now, behaviour is pretty random across versions. Of course, down that path lies madness, and perl5 will forever stay a failed experiment in how to do unicode correctly (namely, abstracted away from the actual encoding).

Besides, don't you think an argument of the form "yes, it breaks lots of code, but some code might rely on it, so let's keep it" sounds pretty, sorry to be so honest, stupid?


p5pRT commented 17 years ago

From schmorp@schmorp.de

Forgot something, sorry :)

At the very least, it needs to be documented, and *hard* rules on when perl upgrades or downgrades would need to be established, as, right now, behaviour is pretty random across versions. Of course, down that path lies madness, and perl5 will forever stay a failed experiment in how to do unicode correctly (namely, abstracted away from the actual encoding).

In fact, I teach a lot of people about unicode in perl. And the problem is not that the unicode model doesn't work or isn't simple; the problem is those "features" that remind people that perl has no abstract unicode model, that they do have to understand the internals of the UTF-X bit.

That's really the problem: if perl had a pure 5.005_5x model, then it would be far less easy, but at least consistent. If perl had the abstract model juerd dreams of, then perl would have a very easy unicode model that boils down to what I talked about at the perl workshop: encode/decode when doing I/O; otherwise, enjoy.

As it is, perl has this abstract model that nobody understands, because they are constantly reminded that it isn't fully implemented and they do have to care about the UTF-X flag themselves, as perl seemingly randomly doesn't.


p5pRT commented 17 years ago

From @nwc10

On Fri, Mar 30, 2007 at 02:18:14PM +0200, Marc Lehmann wrote:

On Fri, Mar 30, 2007 at 01:07:22PM +0100, Nicholas Clark <nick@ccl4.org> wrote:

And the problem is that those bugs are not considered bugs but features.

I certainly consider this one a bug.

So fix it. It is easy to do, and I documented it years ago (during 5.6).

"This one" that I was confident is a bug is the change of meaning of SvPV(). And in turn what I'm not confident about is the fix.

Besides, without any doubt, the code that relies on pseudo-random behaviour is certainly in the minority. The amount of code in the wild that relies on "C" having 5.5 semantics is much larger. I doubt _anybody_ except me (or at least not very many people) understands that he has to downgrade scalars before passing them into unpack to decode structures.

I don't know enough about "C" in pack offhand to know what the right thing to do is.

I don't like anything in Perl space that lets the abstraction leak, and "C" is one of them.

The third thing that you didn't mention, which I consider distinct from the two behaviours you did, is that the encoding affects how regexps match, and lc/uc/lcfirst/ucfirst.

Nicholas Clark

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Fri, Mar 30, 2007 at 01:31:22PM +0100, Nicholas Clark <nick@ccl4.org> wrote:

So fix it. It is easy to do, and I documented it years ago (during 5.6).

"This one" that I was confident is a bug is the change of meaning of SvPV(). And in turn what I'm not confident about is the fix.

Sorry. I can understand that it might be difficult, as perl itself likely relies on the current meaning of SvPV.

However, one of the obvious fixes would be to change ExtUtils/typemap so that stuff such as "const char *" no longer boils down to random bytes. Example:

    SV *compress (const char *data);

the right thing here is to use SvPVbyte, at least in the majority of cases. The reason is that existing users otherwise either have to call downgrade explicitly themselves or suffer from random problems.

etc.

Besides, without any doubt, the code that relies on pseudo-random behaviour is certainly in the minority. The amount of code in the wild that relies on "C" having 5.5 semantics is much larger. I doubt _anybody_ except me (or at least not very many people) understands that he has to downgrade scalars before passing them into unpack to decode structures.

I don't know enough about "C" in pack offhand to know what the right thing to do is.

The right thing to do is to follow the documentation and existing code.

Could you tell me why almost every other 5.6 bug was fixed in 5.8, but gratuitous breakage of large parts of CPAN is accepted with this change? What's the rationale behind keeping this 5.6 bug, while fixing the rest?

For example, take a network protocol that sends packets prefixed with a 2-byte length header, a type, and data. There is currently no unpack format available to do this, as:

    unpack "Cn", $data

gives different results depending on the history of the string in $data.

If there were a pack type that gave me the 5.005 behaviour of returning a single character, I could use it:

    unpack "Wn", $data;

but there simply isn't. Besides, all existing code uses "C", so the right thing is to move the new pack type to a different modifier.

(In my personal opinion, of course, pack should not expose the internal encoding at all. Use Devel::Peek or so, or one of the functions in the utf8:: module. The first one who shows me code that needs the peculiar nondeterministic behaviour of unpack "C" gets a prize.)
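For contrast, here is a byte-oriented unpack with no history-dependence; a Python struct sketch, as an illustrative analogue of the "Cn" wire format discussed above (not Perl code):

```python
import struct

# struct always operates on raw octets, so unpacking a wire header cannot
# depend on how the buffer was previously stored or manipulated.
packet = bytes([0x07, 0x01, 0x02]) + b"payload"   # type octet + 2-byte length
ptype, length = struct.unpack("!BH", packet[:3])  # "C" + "n" analogue
assert ptype == 7
assert length == 0x0102
```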

I don't like anything Perl space that lets the abstraction leak\, and "C" is one of them.

So why not fix it? Nobody made such a fuss when they fixed the remaining bugs from 5.6. For example\, PApp\, one of my older modules using unicode\, is full of code such as this​:

  Convert​::Scalar​::utf8_on($_); # DEVEL7952 bug workaround #d# #FIXME#

For various values of DEVEL and workaround. Some of that code broke in 5.8 because 5.8 did the right thing (not 5.8.0\, mind you\, as this fixing went on during 5.8.x).

*Nobody* argued my case of "it breaks existing code", not even me, because it's clearly a bugfix that lets perl code just work, both old code and new code (which is the beauty of the perl unicode model).

The third thing that you didn't mention which I consider distinct from the two behaviours you did is that the encoding effects how regexps match\, and lc/uc/lcfirst/ucfirst.

The difference is that I haven't seen code break so badly because of that. I see lots of code break because of the incompatible change in the meaning of "C"\, though.

(In fact\, I haven't even seen a difference\, apart from when use locale is active\, which is a rare case).

The other difference to that case is that those bugs are getting fixed\, while in the case of "C"\, people just ignore the problem\, which increases over time\, saying they don't know why to fix this bug.

And as I said, there is no pack-type that gives me the old meaning of "C" that every structure-decoding program relies on. That's gratuitous undocumented breakage. (It really is undocumented because all of the perl documentation tells me that the internal encoding doesn't surface, and the small hint in the pack description for "C" seems to reinforce this as it tells me it works "even in the presence of Unicode"!).

In any case\, please could you answer to me why you accept obvious breakage of old code in this case? I really wanna know.

The only argument in favour I have heard so far is that the camelbook documents it in some obscure way. But that cannot be a reason to keep a bug. If the camelbook describes buggy behaviour, it needs a fix. It is insane to force every existing perl program that uses that feature to be changed in a way that contradicts the rest of the documentation, is unintuitive and generally useless (again, show me a useful application for unpack "C" with 5.8 semantics).

--
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcg@goof.com
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE

p5pRT commented 17 years ago

From @jhi

Could you tell me why almost every other 5.6 bug was fixed in 5.8, but gratuitous breakage of large parts of CPAN is accepted with this change? What's the rationale behind keeping this 5.6 bug, while fixing the rest?

So why not fix it? Nobody made such a fuss when they fixed the remaining bugs from 5.6.

Oh, for heaven's sake. I'm sorry but I have a VERY hard time listening to your wailing and sitting still. So I won't.

Perl 5.8 was in development for quite close to two years (5.7.0 in 2000-Sep, but work started already in July or so - 5.8.0 in 2002-Jul), and 5.8.1 (the "cleanup for oopses" for 5.8.0) took another year. So three years before we had a really usable 5.8.

Since then Nicholas picked up and has admirably and thanklessly released SEVEN maintenance releases of 5.8 over three years, meaning that about every six months there has been a chance of fixing something that is very broken.

How serious a breakage can be if in three years of development and three years of maintenance it hasn't gotten enough attention to be fixed? There is no hidden conspiracy of keeping things broken.

I'm the first one to admit that I wasn't brave enough to REALLY fix the Unicode brokenness of 5.6.

(1) The more strongly typed scheme\, where there would be really forcibly separate "byte strings" and "Unicode strings" *would* have been possible\, if I only had had the guts. But it was mostly the regex engine that scared me too much. For basic strings manipulation and I/O it would not have been a problem to implement.

(2) Another big mistake (due to lack of courage) was the decision to stick with "Latin-1" as the default 8-bit legacy "type". I should have broken that assumption\, too\, and stuck with pure ASCII (or EBCDIC). (As a side thing\, the 8-bit locale support should have been ejected\, too​: it is just not worth the trouble​: it should have been replaced with something pluggable so that people could have plugged in CLDR or Windows locales or whatever they want.)

I'm just getting really\, really tired of people whining about Perl's Unicode.

-- There is this special biologist word we use for 'stable'. It is 'dead'. -- Jack Cohen

p5pRT commented 17 years ago

From @nwc10

On Fri\, Mar 30\, 2007 at 08​:00​:36PM +0200\, Marc Lehmann wrote​:

On Fri\, Mar 30\, 2007 at 01​:31​:22PM +0100\, Nicholas Clark \nick@&#8203;ccl4\.org wrote​:

However, some of the obvious fixes would be to change ExtUtils/typemap so that stuff such as "const char *" no longer boils down to random bytes. Example:

SV *compress (const char *data);

the right thing here is to use SvPVbyte, at least in the majority of cases. The reason is that existing users either have to call downgrade explicitly themselves or suffer from random problems.

This seems a sane idea. However\, I'm not going to change it for 5.8.9

5.10 is a different matter\, but also not my call.

Could you tell me why almost every other 5.6 bug was fixed in 5.8, but gratuitous breakage of large parts of CPAN is accepted with this change? What's the rationale behind keeping this 5.6 bug, while fixing the rest?

No\, I can't. 5.8.0 and 5.8.1 were not my releases\, *and* I wasn't aware that 'C' was a problem at that time.

I *think* that the reason may have been because "it is documented in Programming Perl" that it behaves the 5.6.0 way.

*but*

I went looking\, and the closest I can find to an assertion about how it works is​:

* the pack/unpack letters "c" and "C" do /not/ change, since they're often   used for byte-orientated formats. (Again, think "char" in the C language.)   However, there is a new "U" specifier that will convert between UTF-8   characters and integers:

  pack("U*"\, 1\, 20 \,300\, 4000) eq v1.20.300.4000

* The chr and ord functions work on characters

  chr(1).chr(20).chr(300).chr(4000) eq v1.20.300.4000

  In other words\, chr and ord are like pack("U") and unpack("U")\, not like   pack("C") and unpack("C"). In fact\, the latter two are how you now emulate   byte-orientated chr and ord if you're too lazy to use bytes.

[3rd edition\, page 408]

I don't like anything Perl space that lets the abstraction leak\, and "C" is one of them.

So why not fix it? Nobody made such a fuss when they fixed the remaining bugs from 5.6. For example\, PApp\, one of my older modules using unicode\, is full

I'm not going to change anything this late in 5.8.x. Whether 5.10 changes is not something I have the final say on.

And as I said, there is no pack-type that gives me the old meaning of "C" that every structure-decoding program relies on. That's gratuitous undocumented breakage. (It really is undocumented because all of the perl documentation tells me that the internal encoding doesn't surface, and the small hint in the pack description for "C" seems to reinforce this as it tells me it works "even in the presence of Unicode"!).

In any case\, please could you answer to me why you accept obvious breakage of old code in this case? I really wanna know.

The only argument in favour I have heard so far is that the camelbook documents it in some obscure way. But that cannot be a reason to keep a bug. If the camelbook describes buggy behaviour, it needs a fix. It is insane to force every existing perl program that uses that feature to be changed in a way that contradicts the rest of the documentation, is unintuitive and generally useless (again, show me a useful application for unpack "C" with 5.8 semantics).

I agree with the obscure now.

Reading the wording of the Camel book carefully\, this behaviour

$ perl5.00503 -le 'print unpack "c", chr (256+78)'
78
$ perl5.00503 -le 'print unpack "C", chr (256+78)'
78

"unchanged" actually means to me that it would produce the same output.

The only thing that seems to define the current 5.6 behaviour is the comparison of unpack("C") with ord under use bytes in the paragraph on chr and ord.

Nicholas Clark

p5pRT commented 17 years ago

From @nwc10

On Fri\, Mar 30\, 2007 at 02​:33​:59PM -0400\, Jarkko Hietaniemi wrote​:

Since then Nicholas picked up and has admirably and thanklessly released SEVEN maintenance releases of 5.8 over three years, meaning that about every six months there has been a chance of fixing something that is very broken.

But to be fair to Marc\, I think that he reported issues with pack after 5.8.3 was released and at the time I wasn't sure how to fix it and punted. (He may have reported it earlier\, that's the earliest I remember saying roughly "I don't know how to fix this safely")

So I haven't dealt with it for about 2 years. It's one of those icky things where it's always easier to volunteer to deal with something else first.

The only point I partially dealt with it was during my TPF grant.

5.8.9 will be the first stable release after that\, due to the small delay of a job that tried to drive me insane.

Nicholas Clark

p5pRT commented 17 years ago

From @Juerd

John Berthels skribis 2007-03-28 9​:52 (+0100)​:

Well\, perl goes to some lengths (implicit conversion) for you to be able to mix untagged-all-ascii string values and tagged-non-ascii transparently in your program.

As Jarkko already mentioned\, Perl internally makes a distinction between latin1 and utf8. BOTH are fully ASCII-compatible\, but no special case exists for strings that are fully ASCII-only.

Well\, I think is_utf8 is poorly named either way (with several years of hindsight - I don't think I would have made a better choice at the time).

Agreed\, but for different reasons. I think it should be called Internals​::internal_encoding_is_utf8_not_latin1\, with a user friendly wrapper called Internals​::encoding that returns either "latin1" or "utf8".

I don't think that Perl's internal representation for unicode strings is guaranteed to be utf8.

Indeed. It can also be latin1. The flag indicates (negated) if this is the case.

The flag more properly means "please treat this as character data".

As far as Perl is concerned\, ALL strings consist of character data.

Internally-latin1 strings are special because there\, bytes and characters can safely be considered equal.

And it's the 'special care' bit which can cost performance.

My guess is that the performance costs are mostly associated with utf8 being variable width\, which means that you need to scan through the string to do just about anything.

[The UTF8 flag is] really a bit of perl's internals which application code shouldn't really want to examine or change directly.

Well said.

Now\, if there is some concern that character-oriented regexes and such are considerably slower for ASCII data than alternatives\, and this is a problem and it can't be otherwise dealt with I think the unicode regex engine can never be as fast as the byte-oriented one.

Unicode versus bytes is a weird comparison. Unicode strings are stored as bytes too!

But it's true that a naive octet matcher is faster. When you're particularly sure about the encoding and you're matching literal strings (no case insensitive stuff) and don't care about character offsets and don't care that the captures might not be correct character strings, you can sometimes gain some performance by encoding (e.g. utf8::encode for max performance) both the subject and the regex before matching and decoding afterwards. But be careful that this may also have an adverse effect, so always benchmark first.
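The encode-match-decode trick described above might look like this (a sketch; variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $subject = "some long text with \x{e9} in it";
my $needle  = "text";

# Match on octets instead of characters: encode both sides first.
my $oct_subject = encode_utf8($subject);
my $oct_needle  = encode_utf8($needle);

if ($oct_subject =~ /(\Q$oct_needle\E)/) {
    # Any capture is an octet string; decode before using it as text again.
    my $match = decode_utf8($1);
    print "matched: $match\n";
}
```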

It has more to consider. There's some example code (vaguely like the sort of templating where I noticed the problem)\, which shows unicode running 2-3 times as slow (17s instead of 6s) as the byte engine.

I'd like to see and examine that. Templating is a trade that sometimes allows for naive handling\, so there might be room for improvement.

I'd rather is_utf8 disappeared from the public API\, since it's really an internal flag and (I think) poorly named.

There's nothing wrong with an internal thing being part of a public API. Perl has that everywhere. There is\, however\, something wrong with people who access these internals not realising that they are\, in fact\, internals\, even though the documentation clearly indicated this.

Encode​::is_utf8 is very clearly labeled as "[INTERNAL]" in the documentation.

The function may certainly be useful sometimes\, like when you can output either latin1 or utf8 and just want to get the data out\, without the performance loss of re-encoding​:

  binmode $fh, ":raw";
  print {$fh}
      "Content-Type: text/plain; charset=",
      (Encode::is_utf8($body) ? "UTF-8" : "ISO-8859-1"),
      "\n\n", $body;

Internally\, it could then be renamed requires_unicode_engine or something.

Unicode semantics are also needed\, when the string is not encoded as utf-8 internally. Don't forget that Unicode and UTF-8 are different things.

Regardless of the internal encoding, "x" and "é" are unicode characters, with lots of unicode properties, like that they are lower case alphabetic characters.

But what I really care about is the ability to just tell perl "data from this source is in this encoding"

  binmode $source\, "​:encoding(...)";

Though when your source is different\, you may need to write your own wrappers.

"data going to this destination is in this encoding"

  binmode $destination\, "​:encoding(...)";

Same caveat.

and get all the nice automagic handling of conversions for me without paying the unicode engine cost on ascii data.

The conversions themselves may need "the unicode engine" to realise that no further action is required for your ASCII data.

while ($data =~ s/\<%-(\d+)([^\<]*?).*%-\1>/reverse($2)/e) {

If you reverse\, you need to know where characters end. If the internal encoding is utf-8\, knowing where individual characters end\, requires scanning through the string.

That's one of the reasons that Perl DOES NOT USE utf-8\, when latin1 suffices. Here\, you forced the issue with _utf8_on.

Perl already has this optimized. It's not in the regex engine\, but in the very implementation of strings themselves\, so that other operations may benefit too. -- korajn salutojn\,

  juerd waalboer: perl hacker <juerd@juerd.nl> <http://juerd.nl/sig>
  convolution: ict solutions and consultancy <sales@convolution.nl>

I don't trust voting computers. See <http://www.wijvertrouwenstemcomputersniet.nl/>.

p5pRT commented 17 years ago

From @Juerd

John Berthels skribis 2007-03-28 9​:59 (+0100)​:

my $encoded_str = Encode​::decode('iso-8859-1'\, $latin1_bytes);

It is unfortunate that decoding latin1 isn't optimized into a no-op. Well\, a copying no-op :)

Unfortunately\, because the regex engine has different semantics depending on the internal encoding of a string\, changing this may result in existing code no longer behaving like before.
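That non-optimized decode is exactly what the original report is about; a small demonstration of the behaviour (assuming a post-2.01 Encode, as described in the report):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "plain ascii";
my $text  = decode("iso-8859-1", $bytes);

# Per the report, even for all-ASCII input the decoded string comes back
# with the UTF8 flag set, so later regex matches take the slower
# Unicode path.
print utf8::is_utf8($text) ? "flagged\n" : "not flagged\n";
```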

Perl has a few issues with Unicode\, but the only issue that I consider a big problem is the regex engine. Its semantics for often used functionality (\w\, \s) are hard to predict.

I see three options\, but may be suffering from tunnel vision​:

1. Leave things the way they are; the only way to get predictable   semantics is to utf8​::upgrade both sides before matching.

2. Fix things by always assuming Unicode semantics\, or by always   assuming ASCII semantics\, for those expressions that are now relying   on the UTF8 flag for this decision. This will break existing code.

3. Fix things by adding syntax\, which indicates whether the regex engine   should use Unicode semantics\, or ASCII semantics. All future regex   writers should use this\, so the syntax must be super compact and   legible. (My guess is that adding /u and /a flags is the best option\,   because this is stringifyable in (?​:) and thus qr. A pragma (implied   by "use v5.10"\, please!) could pick a default for all regexes in its   scope.)

A short description of the problem​: there are characters within the latin1 range\, that may or may not match character classes like \w or \s\, depending on the internal encoding of the enclosing string.

It can be demonstrated easily:

  use Test::More tests => 3;

  my $eacute1 = chr 233;
  my $eacute2 = $eacute1;

  $eacute2 .= chr 256;
  chop $eacute2;

  is($eacute1, $eacute2, "Same string, conceptually");
  # but internally encoded differently: eacute1 is in latin1, eacute2 is
  # in utf-8.

  like($eacute1, qr/\w/, "eacute is a word character I");
  like($eacute2, qr/\w/, "eacute is a word character II");

This should report "ok" three times, but the middle test fails.

I suggest that $eacute =~ /\w/a never match and that $eacute =~ /\w/u always match\, given any $eacute containing only the single character 233 (decimal). I further suggest a use feature 'unire' that implies /u for all further non-/a regexes\, which is included in use feature '​:5.10' and thus in use v5.10.

I have no idea how to implement it. I get dizzy when I try to read regex engine source.
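For what it's worth, later perls did grow essentially this syntax: the /a and /u regex charset modifiers arrived in 5.14, making the choice of semantics explicit:

```perl
use strict;
use warnings;
use v5.14;   # /a and /u charset modifiers, plus say

my $eacute = chr 233;   # é; the internal encoding no longer matters here

say $eacute =~ /\w/u ? "unicode rules: word char" : "unicode rules: not a word char";
say $eacute =~ /\w/a ? "ascii rules: word char"   : "ascii rules: not a word char";
```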


p5pRT commented 17 years ago

From @Juerd

Marc Lehmann skribis 2007-03-30 14​:02 (+0200)​:

The *conceptual* purpose of the UTF8 flag isn't there. Conceptually, every string can be a unicode string, and you're not supposed to look at, know, or set the UTF8 flag yourself. It's an internal bit, like IOK and NOK. [1] That's not how current perl works.

We must have differing definitions\, somewhere.

Perl conceptually has a single numeric type, and a single string type. The distinction between integer and float, and between iso-8859-1 and utf-8, is internal. I would love it if that were the case, but the powers that be decided that every perl programmer has to know those internals, and needs to be able to deal with them.

The best approach to programming with unicode in mind\, in Perl\, is to (pretend to) be completely ignorant about Perl's internals with regards to encoding and the UTF8 flag.

The only exception is the regex engine\, which has a big bug. This can be worked around\, again without any knowledge of the internals\, by utf8​::upgrade'ing both sides of the regex before trying the match.

Your powers-that-be\, might be different. Also\, don't confuse "you can know what Perl does internally" with "you have to know what Perl does internally".

Just being able to access internal metadata doesn't mean you should actually do so on a daily basis.

It's entirely possible to make undef writable\, and have it equal 42. No-one is complaining about that\, and only very few people ever get the idea of changing the value of undef.

It's also entirely possible to set the internal flag "UTF8" on an existing string. But for some reason a lot of people are complaining about that\, and even more people have actually set UTF8 flags themselves...

Note that Perl internally uses iso-8859-1 (8 bit) and utf-8 (variable whole-octet), not ascii (7 bit). No, Perl exposes this. For example, see the recent example of Compress::Zlib: unpack ('CCCCVCC', $$string); that code is broken because the powers that be decided that "C" exposes the internal encoding, while "V" doesn't.

Yes\, any byte-specific operation on a text string (which I keep separate from character strings) will use the internal encoding. It has to use /some/ encoding\, because it cannot see whether the string was meant as a byte string or a text string. Perl does not have strong typing.

Personally, I think that unpack with a byte-specific signature should die, or at least warn, when its operand has the UTF8 flag set. That'll catch at least some of the cases, because the UTF8 flag always positively indicates that the string is a text string. (The reverse, however, is not true: a string without the UTF8 flag might be either a text string or a byte string.)
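The check proposed here is not something perl itself does, but it is easy to sketch in user code (unpack_bytes is a hypothetical helper, not a real API):

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical guard: refuse byte-oriented unpacking of a string that is
# positively marked as text by the UTF8 flag.
sub unpack_bytes {
    my ($template, $string) = @_;
    croak "byte-oriented unpack of a UTF8-flagged (text) string"
        if utf8::is_utf8($string);
    return unpack $template, $string;
}

my @fields = unpack_bytes("Cn", "\xE9\x00\x10");   # fine: a plain byte string
```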

That requires every perl programmer who decodes file headers etc. using unpack to know about those internals.

No\, it requires every Perl programmer to keep track of the function of every string.

Byte strings and text strings must never be combined\, and text strings must never undergo byte-specific operations.

This again requires no knowledge of the actual encoding that Perl uses internally\, whatsoever.

The same is true for many XS modules​: in older versions of perl\, SvPV gave you the 8-bit version of a scalar\, but in current versions\, it randomly gives you either 8-bit or utf-8 encoded. SvPV was renamed to SvPVbyte.

Unfortunately\, I lack knowledge of these internals\, so I cannot comment about this (yet).

Note that XS writers must have knowledge of Perl's internals. This has always been true\, and is not specific to this fancy new Unicode thing.

And the problem is that those bugs are not considered bugs but features.

Some bugs are acknowledged as bugs\, but won't be fixed anyway\, because there is already a lot of code in the wild that depends on the bugs.

[1] Some parts of Perl break this concept. The regex engine is one of them, and has different semantics depending on the presence of the flag. This is a bug, but any fix would be incompatible. In fact, some parts of perl break this concept and make perfectly working code (in 5.005) not working anymore, or working randomly, and that's not considered a bug.

Personally I'm only interested in 5.8.2 and later\, but I still would like to learn about this history.

unpack "C"\, $s;

The C template for unpack is specifically documented as byte-specific. It should never be used on text strings. If you properly keep text and byte strings separate\, that means that your byte string was never upgraded\, and that unpacking with "C" is reliable and predictable.

If upgrading happened even though the string was not mixed with text strings or used with unicode semantics\, that is a bug. I'm very interested in these silent upgrades that you are experiencing.

If you think it is obvious\, how about this​:

  my $s = chr 255; # to me, this is one octet. to perl, it might be one or
                   # two, or maybe more, who knows.
  warn unpack "C", $s;
  "$s\x{672c}";
  warn unpack "C", $s;
  $s .= "\x{672c}"; substr $s, 1, 1, "";
  warn unpack "C", $s;

Can a pure-Perl programmer tell what the output of this program is without trying it?

Not relevant.

Should he be able to?

No\, because the author of this program made a big mistake in the line "$s\x{672c}".

The casual reader can easily figure out that $s was meant as a byte string: it is used with unpack "C", which is known to be a byte operation. Because it is a byte string, the chr 255 is just a 0xFF octet, not the character ÿ conceptually.

The casual reader can also easily figure out that \x{672c} is meant as a text string​: any codepoint higher than \x{FF} is always a character\, never a single byte.

Then, the author of this snippet uses both the byte string $s and the text string "\x{672c}" joined in one string "$s\x{672c}". People not interested in fixing the code can stop reading there: the code is broken and its semantics not terribly relevant. People who wish to fix it will have to try and figure out what the author really wanted to do here.

Because it's a contrived case\, that's very hard to figure out. But I'm sure that given real world values and variable names\, there would be a clear and logical solution\, to be found somewhere along the lines of encoding and decoding explicitly.
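A fixed version of the contrived snippet, along the explicit encode/decode lines suggested here (the choice of UTF-8 for the combined octets is of course an assumption):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bytes = chr 255;        # meant as the single octet 0xFF
my $text  = "\x{672c}";     # meant as a character

# Decide explicitly what octets $text should become before mixing:
my $packet = $bytes . encode("UTF-8", $text);

my ($first) = unpack "C", $packet;   # reliably 255, whatever $text contained
```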

That's a broken unicode model

So far, I've only seen a broken understanding of the unicode model, and a broken regex engine.


p5pRT commented 17 years ago

From @Juerd

Marc Lehmann skribis 2007-03-30 14​:24 (+0200)​:

In fact\, I teach a lot of people about unicode in perl.

At the German Perl Workshop, I saw your unicode presentation. I don't know if this is a good representation of your teaching of unicode, but I noticed that you used utf8::encode and utf8::decode, not the similar functions from Encode.pm that are more commonly used and advised. These utf8:: in-place encode/decode functions are efficient, but using them means that the same SV changes from byte string to text string or vice versa, which makes the code hard to follow, and any attempt to use hungarian notation in code examples impossible.

Whenever I teach the Perl Unicode model\, I try to call my strings $byte_string and $text_string\, or similar. But utf8​::decode($byte_string) makes $byte_string a text string\, and utf8​::encode($text_string) makes $text_string a byte string\, so after these statements\, the names are no longer correct.
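The naming problem can be seen side by side (a sketch; $byte_string and $text_string follow the hungarian convention mentioned above):

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

my $byte_string = "caf\xC3\xA9";                # UTF-8 octets

# Copying style: both names keep telling the truth.
my $text_string = decode_utf8($byte_string);

# In-place style: efficient, but after this line $byte_string *is* a
# text string, so its name now lies.
utf8::decode($byte_string);
```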

(And of course\, I try not to teach people the Unicode model\, because that's something that's quite internal. I try to teach the difference between text strings and byte strings\, and how to use encodings (which are byte representations of text strings). I treat UTF-8 exactly the same way as KOI8-R. That helps a lot!)

If perl had the abstract model juerd dreams of

and uses in day-to-day coding\, without encountering ANY of the problems that you describe (only the regex engine still manages to surprise me\, but that's because I'm too stubborn to utf8​::upgrade explicitly).

It kind of makes one wonder if this dream might be reality (and your reality a dream?)

then perl would have a very easy unicode model that boils down to what I talked about on the perl workshop: encode/decode when doing I/O, otherwise, enjoy.

And keep text strings and byte strings separate!!!!!!!!!!!!!eleven

Whenever you must mix text strings and byte strings\, consider the byte strings I/O and encode/decode accordingly.

So, recap: encode/decode when doing I/O, keep text strings and byte strings separate, otherwise, enjoy.


p5pRT commented 17 years ago

From nospam-abuse@bloodgate.com


Moin\,

On Friday 30 March 2007 20​:09​:29 Juerd Waalboer wrote​:

Marc Lehmann skribis 2007-03-30 14​:24 (+0200)​:

In fact\, I teach a lot of people about unicode in perl.

At the German Perl Workshop, I saw your unicode presentation. I don't know if this is a good representation of your teaching of unicode, but I noticed that you used utf8::encode and utf8::decode, not the similar functions from Encode.pm that are more commonly used and advised. These utf8:: in-place encode/decode functions are efficient, but using them means that the same SV changes from byte string to text string or vice versa, which makes the code hard to follow, and any attempt to use hungarian notation in code examples impossible.

However\, if you have 200Mbyte of ASCII string\, it is more efficient to *not* copy the data around just to find out that\, yes\, all of it is 7bit :)
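The in-place functions avoid that copy; a minimal sketch:

```perl
use strict;
use warnings;

my $buf = "x" x (200 * 1024);   # stand-in for a large, mostly-ASCII buffer

# Decode in place: the payload is not copied, and the call returns false
# if the buffer is not valid UTF-8.
utf8::decode($buf) or die "buffer is not valid UTF-8";
```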

But otherwise\, I basically agree with you.

All the best\,

Tels

- -- Signed on Fri Mar 30 22​:31​:28 2007 with key 0x93B84C15. View my photo gallery​: http​://bloodgate.com/photos PGP key on http​://bloodgate.com/tels.asc or per email.

"Retsina?" - "Ja\, Papa?" - "Schach Matt." - "Is gut\, Papa."


p5pRT commented 17 years ago

From marvin@rectangular.com

On Mar 30\, 2007\, at 12​:53 PM\, Juerd Waalboer wrote​:

Perl does not have strong typing.

If it is so deadly to collide byte-oriented data with character data, it should not be so easy to do so accidentally.

That's a broken unicode model

So far, I've only seen a broken understanding of the unicode model, and a broken regex engine.

That so many users, including those as expert as Marc, possess a "broken" understanding of Perl's Unicode model suggests a flawed design. We have been set up to fail.

(My admiration for the Unicode integration effort remains undiminished by its flaws.)

Marvin Humphrey Rectangular Research http​://www.rectangular.com/

p5pRT commented 17 years ago

From nospam-abuse@bloodgate.com


Moin\,

On Friday 30 March 2007 21​:00​:37 Marvin Humphrey wrote​:

On Mar 30\, 2007\, at 12​:53 PM\, Juerd Waalboer wrote​:

Perl does not have strong typing.

If it is so deadly to collide byte-oriented data with character data\, it should not be so easy to do so accidentally.

It can happen every time you concatenate two strings. Maybe we could add a new warning?

  use warnings 'upgrade';

  my $a = 'a';
  $a .= "\x{100}"; # warns

In an application I am currently bringing up to speed in regard to Unicode I opted for a "string" struct\, that contains essentially​:

  * the length in bytes
  * the length in characters (not always set, e.g. can be unknown)
  * the storage buffer (containing the data, plus some optional padding)
  * the encoding

Every action between two strings thus becomes very clearly defined, as you can compare their encodings before doing anything (for instance, upgrading one or both strings before comparing them, etc.).

In Perl, you have only one bit to tell you the encoding (utf8), and it seems this is not enough, as strings without that bit set can be either ASCII, or ISO-8859-1, or the local locale (maybe?), or utf-8 which hasn't yet been tagged as UTF-8, etc. In short, it becomes a mess.

All the best\,

Tels


p5pRT commented 17 years ago

From @Juerd

Marvin Humphrey skribis 2007-03-30 14​:00 (-0700)​:

Perl does not have strong typing. If it is so deadly to collide byte-oriented data with character data, it should not be so easy to do so accidentally.

I agree. But Perl chose to have the same single data type for all strings\, and to maintain compatibility with older Perls by assuming that your byte string is a latin1 string if you start using it as a text string. After all\, in a strictly 8 bit world\, there's no need for a distinction\, so people were never careful about it.

(Well\, there was a need\, but ignorance being bliss ignoring that was better for anyone's sanity.)

It kind of bothers me that people constantly whine about this decision years after it was made. The time to influence the decision has passed. It just seems so counter-productive to keep bringing it up, while there are bugs to be discovered and fixed.

I wasn't active in p5p back then, and if I had been, I would probably not have foreseen the consequences, just like the porters then didn't. But wonderfully, a rather consistent, usable and useful model was invented, with better/easier Unicode/encodings support than any other programming language. Of course it's never good enough, but let's first focus on finding and fixing bugs.

That so many users\, including those as expert as Marc\, possess a "broken" understanding of Perl's Unicode model suggests a flawed design.

I think the design is solid\, but the implementation (see regex) slightly broken and documentation wildly misleading.

The documentation thing I'm trying to fix with perlunitut\, perlunifaq\, and a lot of changes to existing documentation\, all of which are now part of bleadperl and will probably be part of the next Perl release.

In addition, I'm maintaining a concise list of best practices at http://juerd.nl/perluniadvice, and spending tuits on teaching people (including module maintainers) about the One Way To Do It, because there is, in fact, just one way that really works well in this case. You just have to find it, and stick to it. TIMTOWTDI doesn't always apply.

We have been set up to fail.

Maybe so\, but you haven't given up yet\, and I hope you won't. Please join us in the effort to deal with the problems at hand. It's a hell of a lot more productive than praying for the opportunity to undo recent years of Perl.

Surely you must know a way in which Perl's unicode support can be improved\, or accidents avoided\, without trying to change all of Perl\, CPAN\, and a gazillion lines of code that we can't even reach. Let's hear it! :)

Thanks,

-- korajn salutojn,

  juerd waalboer: perl hacker <juerd@juerd.nl> <http://juerd.nl/sig>
  convolution: ict solutions and consultancy <sales@convolution.nl>

I don't trust voting computers. See <http://www.wijvertrouwenstemcomputersniet.nl/>.

p5pRT commented 17 years ago

From @Juerd

Tels skribis 2007-03-30 22​:32 (+0000)​:

However\, if you have 200Mbyte of ASCII string\, it is more efficient to *not* copy the data around just to find out that\, yes\, all of it is 7bit :)

Indeed\, but this is an optimization. Optimization isn't part of teaching how things work\, it always comes after.

Information overload is probably the single most problematic thing in Perl's unicode documentation. Constantly people are told all those internal implementation details that they don't have to know. It's no wonder that they start assuming that they actually need this information, and use manual setting of UTF8 flags as their first resort in case of trouble.


p5pRT commented 17 years ago

From @Juerd

Tels skribis 2007-03-30 23​:17 (+0000)​:

If it is so deadly to collide byte-oriented data with character data, it should not be so easy to do so accidentally. It can happen every time you concatenate two strings. Maybe we could add a new warning?

Eh\, no\, because Perl does not have any metadata telling you if this non-UTF8 string is a latin1 text string\, or just a random byte string.

There is no way to tell Perl how you intended your string to be used\, and there is no way for Perl to tell you the same thing about a string it returned.

use warnings 'upgrade';

This already exists on CPAN, authored by Audrey Tang, as encoding::warnings:

  use encoding::warnings;

But it will warn when Perl upgrades latin1 to utf-8\, without knowing if that is a bug or a feature\, because it doesn't know if the "latin1" string was meant as a text string or a byte string.

It's a useful debugging tool\, to find unintended upgrades\, but you shouldn't try to avoid upgrading altogether. That just hurts\, because upgrading is part of the way the Perl Unicode model was intended.
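A minimal sketch of how that debugging aid behaves (the module name comes from the text above; the exact trigger condition is an assumption based on the module's stated purpose of flagging implicit upgrades):

```perl
# Hypothetical demo: encoding::warnings (CPAN) warns when a byte string
# containing high-bit bytes is implicitly upgraded to UTF-8.
use strict;
use warnings;
use encoding::warnings;

my $bytes = "caf\xe9";           # byte string containing the octet 0xE9
my $text  = "snowman \x{2603}";  # text string, stored internally as UTF-8
my $mixed = $bytes . $text;      # the implicit upgrade of $bytes happens
                                 # here, and the pragma reports it
```

Concatenating a pure-ASCII byte string with a text string would not warn, since the upgrade is then lossless and unambiguous.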

* the length in bytes
* the length in characters (not always set, e.g. can be unknown)
* the storage buffer (containing the data, plus some optional padding)
* the encoding

Hey, cool, Perl has almost the same thing, only it supports just two encodings: latin1 and utf8. It uses a single bit to indicate the encoding, the UTF8 flag, which can be on or off. When it's off, the string is latin1; when it's on, the string is UTF-8.

Maybe you should try Perl; you'll like the way it's built\, because it very closely matches your own design!

The same type of string can be used for binary data\, because in the unicode encoding "latin1"\, all 256 codepoints map to the same byte values.

In short\, it becomes a mess.

Yes\, with strong typing\, especially with string subtypes for arbitrary encodings\, it would be cleaner. But it would also not look like Perl 5. -- korajn salutojn\,

  juerd waalboer​: perl hacker \juerd@&#8203;juerd\.nl \<http​://juerd.nl/sig>   convolution​: ict solutions and consultancy \sales@&#8203;convolution\.nl

Ik vertrouw stemcomputers niet. Zie \<http​://www.wijvertrouwenstemcomputersniet.nl/>.

p5pRT commented 17 years ago

From nospam-abuse@bloodgate.com


Moin\,

On Friday 30 March 2007 21​:28​:44 Juerd Waalboer wrote​:

Tels skribis 2007-03-30 22​:32 (+0000)​:

However\, if you have 200Mbyte of ASCII string\, it is more efficient to *not* copy the data around just to find out that\, yes\, all of it is 7bit :)

Indeed\, but this is an optimization. Optimization isn't part of teaching how things work\, it always comes after.

I almost agree. :)

Some decisions really need to be made early on, in the design phase. You cannot optimize when the design is broken. E.g. if your data needs to be copied around *per design*, the best you can achieve is O(N). When you do not have to copy the data, you suddenly can achieve O(1). This distinction is quite important, and not something you can fix afterwards apart from redesigning (aka let's break and re-assemble it :)

A recent (non-Perl) example of such a methodology/design change was zero-copy networking - I remember there being a lot of talk about this, especially in the Unix/Linux world. Basically, when you want to send data to the network it is wasteful to copy it around many times just to output it to the hardware - up to the point where the copying takes more time than all the rest of the work to be done. However, avoiding the copy isn't that easy :)

I know it is hard to design your code so that it works fine for small data ("A") and large data ("A" x 10000000) alike, but usually these things need to be considered early on, or you end up with a system that is only useful for demos and toying around, and breaks under real-world access :)

Just like security, a performant design usually can't just be bolted on later.

And how to design your program to be secure, fast, reliable etc. should be taught, too. Maybe not in the same hour, but close :-)

Just saying... :)

Information overload is probably the single most problematic thing in Perl's unicode documentation. Constantly people are told all those internal implementation details that they don't have to know. It's no wonder that they start assuming that they actually need this information\, and use manual setting of UTF8 flags as their first resort in case of trouble.

I think I agree. Luckily I managed to completely avoid this whole issue by ignoring unicode until very recently - and by then the docs and code had improved quite a lot, so that Unicode is really usable in Perl. (Thank you guys! Especially Jarkko!)

All the best\,

Tels

- -- Signed on Fri Mar 30 23​:55​:12 2007 with key 0x93B84C15. View my photo gallery​: http​://bloodgate.com/photos PGP key on http​://bloodgate.com/tels.asc or per email.

"Elliot, you idiot!"


p5pRT commented 17 years ago

From nospam-abuse@bloodgate.com


Moin\,

On Friday 30 March 2007 21​:44​:12 Juerd Waalboer wrote​:

Tels skribis 2007-03-30 23​:17 (+0000)​:

If it is so deadly to collide byte-oriented data with character data\, it should not be so easy to do so accidentally.

It can happen every time you concatenate two strings. Maybe we could add a new warning?

Eh\, no\, because Perl does not have any metadata telling you if this non-UTF8 string is a latin1 text string\, or just a random byte string.

There is no way to tell Perl how you intended your string to be used\, and there is no way for Perl to tell you the same thing about a string it returned.

use warnings 'upgrade';

This already exists on CPAN\, authored by Audrey Tang\, as encoding​::warnings​:

use encoding::warnings;

But it will warn when Perl upgrades latin1 to utf-8\, without knowing if that is a bug or a feature\, because it doesn't know if the "latin1" string was meant as a text string or a byte string.

It's a useful debugging tool\, to find unintended upgrades\, but you shouldn't try to avoid upgrading altogether. That just hurts\, because upgrading is part of the way the Perl Unicode model was intended.

* the length in bytes
* the length in characters (not always set, e.g. can be unknown)
* the storage buffer (containing the data, plus some optional padding)
* the encoding

Hey, cool, Perl has almost the same thing, only it supports just two encodings: latin1 and utf8. It uses a single bit to indicate the encoding, the UTF8 flag, which can be on or off. When it's off, the string is latin1; when it's on, the string is UTF-8.

Maybe you should try Perl; you'll like the way it's built\, because it very closely matches your own design!

First for the record​:

The application I am outfitting is written in C, for speed, and quite large. So there is NO way I would even consider rewriting it in Perl. I'm just using the right tool for the right job. That doesn't mean I do not like Perl, or the way Perl does things. Sorry if it sounded like that.

Anyway\, I wasn't aware that any non-utf8 data in Perl is *always* ISO-8859-1\, I thought that\, when not specified\, this depended on some other stuff. Guess I need to reread the tutorials. :)

However\, this also poses the question​: How does Perl know that your data is in KOI8-R?

(Yes\, that's a trick question\, but I would like to hear your answer to that\, in any case\, just to make it clear to me. No offence meant!)

One of the limitations of the "there can be only two encodings" of Perl seems to be that strings are permanently upgraded.

  $iso_8859_1 = '...';   $utf8 = '...';

  if ($iso_8859_1 eq $utf8) { ... }

Please correct me if I am wrong, but I do think it is not possible to keep both variables in their current encoding and only temporarily upgrade them to utf8 (the common encoding that contains both of them)?

After reading this discussion here, a lot of problems also seem to stem from the fact that the upgrade to utf8 is permanent, silent and done "behind the scenes". Just like 1 + 2.0 will result in 3.0 and not 3, and we all know how much confusion this creates :) (heh, I fell for it today, even though I should have known better :)

The same type of string can be used for binary data\, because in the unicode encoding "latin1"\, all 256 codepoints map to the same byte values.

This sounds like a circular definition, because in CP1250, all 256 codepoints also map to the same byte values. Except they are different byte values :)

In my application, I also considered having a "BINARY" encoding, but in the end I opted to make ISO-8859-1 the default encoding for BINARY stuff. (Ha, great minds sink alike or so.) And since, unlike in Perl, upgrades are never permanent, you can keep your BINARY string and compare it to UTF-8 whatever, and it never gets "corrupted".

I am not sure how one could achieve that in Perl. Making the SV read-only?

In short\, it becomes a mess.

Yes\, with strong typing\, especially with string subtypes for arbitrary encodings\, it would be cleaner. But it would also not look like Perl 5.

Over the years, I have come to the insight that I want to build reliable and fast programs. (easy to maintain, reliable, fast: pick two :-)

So maybe we really need "use strict 'encodings';" :-)

All the best\,

Tels

- -- Signed on Sat Mar 31 00​:04​:29 2007 with key 0x93B84C15. Get one of my photo posters​: http​://bloodgate.com/posters PGP key on http​://bloodgate.com/tels.asc or per email.

"Blogebrity​: Wow\, guess what this one stands for? Too easy. Hey\, anyone can do it​: take a blogger who's a chef\, and you get​: BLEF. A blogger who's a dentist? BENTIST. A female blogger with an itch? You guessed it​: a BITCH."

  -- maddox from xmission

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Fri\, Mar 30\, 2007 at 10​:09​:29PM +0200\, Juerd Waalboer \juerd@&#8203;convolution\.nl wrote​:

Marc Lehmann skribis 2007-03-30 14​:24 (+0200)​:

In fact\, I teach a lot of people about unicode in perl.

At the German Perl Workshop\, I saw your unicode presentation. I don't know if this is a good representation for your teaching of unicode\, but

It is\, if a bit short (and I consider it a matter of taste).

If perl had the abstract model juerd dreams of

and uses in day-to-day coding\, without encountering ANY of the problems that you describe

Frankly, that is not a very good sign. It means either you are extremely lucky or you don't use any of the many XS modules that silently break, or even the Perl modules (such as the example from Compress::Zlib) that break less silently, but more miraculously.

It kind of makes one wonder if this dream might be reality (and your reality a dream?)

The dream isn't reality. If it were, people would not report bugs against JSON::XS because it happens to create scalar values with the UTF-X bit set.

And they do so for some of my other modules doing that, too. And there are two options to me: either tell them perl is broken w.r.t. e.g. "C", or their code is broken because they do not call downgrade.

Obviously, I prefer the former over the latter, but last time I was told unpack "C" was documented to break the abstraction in the camel book, so it's correct.

Which suddenly invalidates a lot of code.

then perl would have a very easy unicode model that boils down to what I talked about at the perl workshop: encode/decode when doing I/O, otherwise, enjoy.

And keep text strings and byte strings separate!!!!!!!!!!!!!eleven

I find "text strings" and "byte strings" not adequate either, as Perl makes no difference between those two concepts (being typeless), and they do not map well to encoded/decoded text either. Perl only knows how to concatenate characters; it does not know anything about bytes or text, so utf8::encode does not necessarily create a byte string out of a text string. It could just as well create a text string out of a byte string (think JSON, which creates json _text_ out of e.g. byte strings by encoding them to UTF-8).

So\, recap​: encode/decode when doing I/O\, keep text strings and byte strings separate\, otherwise\, enjoy.

I do not think that maps clearly to Perl (or my programs either). It might be good and simplified advice to a beginner, though, although I prefer to never tell people simplified (but wrong) things. The perl unicode model is rather simple, but leaves you in control, and I found that teaching people how perl just allows more than 0..255 for a character index works best (although people differ).

--   The choice of a   -----==- _GNU_   ----==-- _ generation Marc Lehmann   ---==---(_)__ __ ____ __ pcg@​goof.com   --==---/ / _ \/ // /\ \/ / http​://schmorp.de/   -=====/_/_//_/\_\,_/ /_/\_\ XX11-RIPE

p5pRT commented 17 years ago

From @Juerd

Tels skribis 2007-03-31 0​:19 (+0000)​:

Anyway\, I wasn't aware that any non-utf8 data in Perl is *always* ISO-8859-1\, I thought that\, when not specified\, this depended on some other stuff. Guess I need to reread the tutorials. :)

Note that they are unicode strings\, and that Perl is theoretically free to change the internal representation at any time.

However\, this also poses the question​: How does Perl know that your data is in KOI8-R?

Because you tell it that it is with "decode". The resulting string is a unicode string\, which may have any encoding internally. (Practically\, this is limited to latin1 and utf8.)

  my $text_string = decode("koi8-r", $byte_string);

or\, if you prefer different terminology​:

  my $unicode_string = decode("koi8-r", $koi8r_string);

One of the limitations of the "there can be only two encodings" of Perl seems to be that strings are permanently upgraded:

  $iso_8859_1 = '...';
  $utf8 = '...';

  if ($iso_8859_1 eq $utf8) { ... }

$iso_8859_1 is temporarily upgraded to utf8 for this comparison.

(Yes, this copies data, and then throws it away. Again, optimization does require knowing internals. The easiest optimization here is to utf8::upgrade $iso_8859_1, after which the variable name no longer makes sense :))
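To illustrate the point that upgrading only changes the internal representation and never the characters, here is a small sketch (variable names are illustrative):

```perl
# Upgrading re-encodes a string internally as UTF-8; string semantics
# (eq, length, etc.) are unaffected, only the internal flag changes.
use strict;
use warnings;

my $iso_8859_1 = "caf\xe9";   # stored as latin1, UTF8 flag off
my $upgraded   = $iso_8859_1;
utf8::upgrade($upgraded);     # now stored as UTF-8, UTF8 flag on

print $iso_8859_1 eq $upgraded ? "equal\n" : "differ\n";  # equal
print utf8::is_utf8($iso_8859_1) ? "on\n" : "off\n";      # off
print utf8::is_utf8($upgraded)   ? "on\n" : "off\n";      # on
```

This is exactly why inspecting utf8::is_utf8 should stay a debugging tool: two strings that differ only in that flag are the same string as far as the language is concerned.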

Just like 1 + 2.0 will result in 3.0 and not 3 and we all know how much confusion this creates :) (heh\, I fell for it today\, even tho I should have know better :)

Doesn't really cause me any headaches\, to be honest.

The same type of string can be used for binary data, because in the unicode encoding "latin1", all 256 codepoints map to the same byte values.

This sounds like a circular definition, because in CP1250, all 256 codepoints also map to the same byte values. Except they are different byte values :)

I said "unicode encoding"\, but should have said "unicode codepoints".

Codepoints 0..255 in latin1 map to byte values 0..255. That makes it special.

In short, it becomes a mess.

Yes, with strong typing, especially with string subtypes for arbitrary encodings, it would be cleaner. But it would also not look like Perl 5.

Over the years, I have come to the insight that I want to build reliable and fast programs. (easy to maintain, reliable, fast: pick two :-)

I do that with Perl. Really, you should check that language out! You'll LOVE it! :)


p5pRT commented 17 years ago

From schmorp@schmorp.de

On Fri\, Mar 30\, 2007 at 07​:46​:41PM +0100\, Nicholas Clark \nick@&#8203;ccl4\.org wrote​:

This seems a sane idea. However\, I'm not going to change it for 5.8.9

Sure.

5.10 is a different matter\, but also not my call.

Sure.

I know all that...

Could you tell me why almost every other 5.6 bug was fixed in 5.8, but gratuitous breakage of large parts of CPAN is accepted with this change? What's the rationale behind keeping this 5.6 bug, while fixing the rest?

No\, I can't. 5.8.0 and 5.8.1 were not my releases\, *and* I wasn't aware that 'C' was a problem at that time.

Yes, you can. You control 5.8, and you said it's not going to happen. So either you have a reason and can tell me of it, or not.

The reason I want to know is because I want to know what to tell people. Either it is "your code is broken, unpack "C" without downgrade is a bug in your code" or "it is a bug in perl, you can work around it by enabling ->shrink for the time being".

I *think* that the reason may have been because "it is documented in Programming Perl" that it behaves the 5.6.0 way.

I would argue it doesn't behave the 5.6 way\, though​: 5.6 had a completely broken unicode implementation\, and lots of bugs. In 5.6 it would give me one "character"\, because 5.6 often exposed the utf-8 encoding explicitly\, so one character in the 5.6 model often was a single internal byte.

Also\, I still think it is a mistake to break working code without giving an alternative(!) for unpack that isn't "you have to downgrade and keep your fingers crossed".

I went looking\, and the closest I can find to an assertion about how it works is​:

* the pack/unpack letters "c" and "C" do /not/ change, since they're often used for byte-oriented formats. (Again, think "char" in the C language.) However, there is a new "U" specifier that will convert between UTF-8 characters and integers:

pack("U*", 1, 20, 300, 4000) eq v1.20.300.4000

Exactly. But "C" somehow works on UTF-8\, while it shouldn't. It should work on characters\, as documented (just like in C\, char array[]; array[i] is one character\, regardless of how many bits a character in C has\, or how it is encoded).

* The chr and ord functions work on characters

chr(1).chr(20).chr(300).chr(4000) eq v1.20.300.4000

In other words\, chr and ord are like pack("U") and unpack("U")\, not like pack("C") and unpack("C"). In fact\, the latter two are how you now emulate byte-orientated chr and ord if you're too lazy to use bytes.

So due to that documentation insanity it is now suggested that all code that used "C" before must use "U" now to get the same effect as in earlier perl versions?

Then why was "use feature" introduced in the first place? Just document existing programs to be broken. I am quite convinced (whatever that means to you :) that that would result in less silent breakage than renaming "C" to "U".

Besides\, perl 5.8 does not follow that description​:

  perl -e '$x = "\xc3\xbc"; die unpack "U*", $x'

This gives me 195 and 188, two characters, although it is a single UTF-8-encoded character, so why does it wrongly give me two? $x certainly is utf-8-encoded (try Encode::encode_utf8 chr 252; it results in the above string).
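One way to sidestep the ambiguity altogether is to never unpack a string whose internal representation is in doubt: make the byte string explicit first. A sketch of that workaround (not a claim about how unpack itself behaves in any particular perl version):

```perl
# Make the octets explicit with Encode before unpacking a byte-oriented
# format; then "C*" reads real octets regardless of perl's internals.
use strict;
use warnings;
use Encode qw(encode);

my $text   = chr(252);                # U+00FC, a single character
my $bytes  = encode("UTF-8", $text);  # explicit byte string "\xc3\xbc"
my @octets = unpack "C*", $bytes;     # (195, 188)
print "@octets\n";                    # 195 188
```

Since encode always returns a string of octets, this removes the dependency on whether the UTF8 flag happened to be set on the input.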

Whoever wrote that part, simply said, was completely confused about unicode. That's fine, Sarathy had to hammer it into me too, and then made a mistake himself after he did so. And it took me years to understand how it should be. It is hard to do from an implementor's standpoint because you are so near the actual code.

But that doesn't mean it is right. Fact is, the above documentation is simply wrong, both with regards to how it should be and with regards to how it is implemented.

[3rd edition\, page 408]

(Thanks for digging it out\, btw\, I haven't seen that yet).

So why not fix it? Nobody made such a fuss when they fixed the remaining bugs from 5.6. For example\, PApp\, one of my older modules using unicode\, is full

I'm not going to change anything this late in 5.8.x. Whether 5.10 changes is not something I have the final say on.

Ok, so I will tell people to replace "C" by "U" in their code then.

Thanks! (And go on with your good work\, btw.\, it seems that wasn't quite clear to some people\, so again​: you are doing tremendously good work! :).

"unchanged" actually means to me that it would produce the same output.

The only thing that seems to define the current 5.6 behaviour is the comparison of unpack("C") with ord under use bytes in the paragraph on chr and ord.

Right, while the documentation on unpack "U" disagrees with it, as it talks about UTF-8. The documentation clearly does not apply to current perls; it clearly applies to the 5.005_5x model where perl had no UTF-X flag.


p5pRT commented 17 years ago

From schmorp@schmorp.de

On Fri\, Mar 30\, 2007 at 11​:28​:44PM +0200\, Juerd Waalboer \juerd@&#8203;convolution\.nl wrote​:

Information overload is probably the single most problematic thing in Perl's unicode documentation. Constantly people are told all those internal implementation details that they don't have to know.

Exactly. If they didn't have to care about those internals it would be much simpler, abstracted away. But that's not reality.

wonder that they start assuming that they actually need this information\, and use manual setting of UTF8 flags as their first resort in case of trouble.

If you send a compressed string over the network using JSON and decompress it, you need to know that. Even if you do pure perl only. And, as I do a lot of network protocols, this never hurts me as I know how and when perl upgrades/downgrades and what's broken or not.

It does hurt other people constantly though\, and I do not understand why it has to be that way if the fix were conceptually simple and aligned with existing usage.

In my talk for example I only hinted at the implementation details and told people to ignore it\, but when they get weird bugs\, they might look into that.

My problem is not that there are bugs. My problem is that those bugs are not being fixed because of truly hilarious reasons such as that obscure reference in the camel book, so all have to suffer, while other, similar bugs have official bug status and get fixed.

I am really frustrated at that. It makes perl as a whole rather questionable for unicode use\, as you constantly have to think about the internals.

And yes\, that simply shouldn't be the case.


p5pRT commented 17 years ago

From @Juerd

Marc Lehmann skribis 2007-03-31 0​:20 (+0200)​:

If perl had the abstract model juerd dreams of and uses in day-to-day coding, without encountering ANY of the problems that you describe

Frankly, that is not a very good sign. It means either you are extremely lucky or you don't use any of the many XS modules that silently break, or even the Perl modules (such as the example from Compress::Zlib) that break less silently, but more miraculously.

Most of the time\, it's a question of realising that the module doesn't do the Perl unicode model\, and considering communication with the module I/O\, i.e. only feed it bytes\, and only get bytes back. Encode and decode as appropriate.

I maintain a short list of some modules at http​://juerd.nl/perluniadvice. If you encounter modules that I can test easily without setting up complete environments\, please let me know!

Compress::Zlib sounds like it uses zlib, which compresses byte streams, i.e. don't give it unicode strings, because unicode strings have no bytes (the bytes are internal only, and you don't know what encoding is used there). Encode explicitly.

And they do so for some of my other modules doing that\, too. And there are two options to me​: either tlel them perl is broken w.r.t. to e.g. "C"\, or their code is broken becasue they do not call downgrade.

Their code is probably broken because they mix text strings with byte strings. This can be solved most easily by explicitly encoding your text string as soon as you feel you must join it with a byte string. The joined string is a byte string. Decoding it to make a text string may or may not make sense, depending on the data format.

I find "text strings" and "byte strings" not adequate either\, as Perl makes no difference between those two concepts (being typeless)

Indeed. Programmers have to track this themselves. Sometimes that sucks\, but in my experience\, you need to know what kind of data your variable contains anyway.

If you ++ a reference, you're in for trouble too. How come that's never been a problem? Probably because programmers are pretty good at knowing what their variables contain.

It's just that this is something you haven't needed to know before\, so you're not /trained/ yet to think about it. But you can't go from 256 characters to several thousands without changing the way you think :)

they do not map well to encoded/decoded text either

Oh\, but they do. Please read perlunitut\, which tries to redefine the universe into four important definitions (and succeeds).

1. Byte strings (aka binary strings)

2. Text strings (aka unicode strings or "internal format" strings)

3. Decoding is byte --> text

4. Encoding is text --> byte

Perl only knows how to concatenate characters, it does not know anything about bytes or text, so utf8::encode does not necessarily create a byte string out of a text string.

I don't get the causal connection you're illustrating.

utf8​::encode takes any text string (or unicode string\, if you prefer that term) and turns it into a UTF-8 encoded byte string in place.

That is\,

  utf8​::encode($foo);

is the efficient equivalent of​:

  $foo = encode("utf8"\, $foo);

Note that whenever a string has an encoding attached to it, conceptually, it's automatically a byte string. Text strings don't have encodings, because encodings are a byte thing, and text strings don't have bytes; they have characters. (Text strings have encodings and bytes /internally/, just like numbers have bytes /internally/, encoded in one way or another, that allows values greater than 255 or less than 0.)
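The claimed equivalence of the in-place and copying forms can be checked directly (a small sketch; the string is illustrative):

```perl
# utf8::encode($s) converts $s in place; encode("utf8", $s) returns a
# new byte string. Both produce the same UTF-8 octets.
use strict;
use warnings;
use Encode qw(encode);

my $in_place = "caf\x{e9}";
my $copied   = encode("utf8", $in_place);  # new byte string "caf\xc3\xa9"
utf8::encode($in_place);                   # same conversion, in place

print $in_place eq $copied ? "same\n" : "differ\n";  # same
```

The in-place form avoids copying the buffer, which is why it is the "efficient equivalent" mentioned above.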

It could just as well create a text string out of a byte string (think JSON, which creates json _text_ out of e.g. byte strings by encoding them to UTF-8).

utf8​::encode is a text operation. It will assume that whatever you give it\, is a text string. Its characters are considered Unicode codepoints.

You shouldn't give it a byte string.

To understand what happens if you do give utf8​::encode a byte string\, you need to know some internals. But I stress that this is not required knowledge\, because it's so much easier to just remember not to do this weird thing. Why would you try to encode a byte string to UTF-8\, anyway? That makes no sense\, because UTF-8 is a means of representing characters. Byte strings consist of bytes\, not characters.

  Here's what happens internally​: Any byte string used as a text   string is considered to be encoded in latin1\, because Perl doesn't   know the difference.

(or my programs either). It might be a good and simplified advice to a beginner

The theory is very simple\, but not simplified. It just isn't any harder.

I'm sorry if you want a more complex programming tool. But apparently you have found ways to make it hard for yourself already :)

though\, although I prefer to never tell people simplified (but wrong) things.

I agree. Whenever I use a simplified view\, that will be obvious or mentioned. Metadata ("this information is wrong\, but useful anyway") is very important.

The perl unicode model is rather simple\, but leaves you in control\, and I found teaching people about how perl just allows more than 0..255 for a character index works best (although people differ).

That's a great explanation of how unicode strings work. But when people write programs, these programs typically accept input and also have some output. And then you're doing I/O, which is done with bytes, and requires character encodings in order to communicate characters. You used to be able to ignore this fact when everyone still used iso-8859-1, I mean CP437, I mean CP850, I mean koi8-r, I mean Windows-1252. Right, we never did all use exactly the same encoding. We've just chosen to remain ignorant all this time. Explicit re-encoding, or decoding and encoding, has been necessary all this time. It's just that with more than 256 codepoints, it became much more apparent :)

  juerd waalboer: perl hacker <juerd@juerd.nl> <http://juerd.nl/sig>
  convolution: ict solutions and consultancy <sales@convolution.nl>

I don't trust voting computers. See <http://www.wijvertrouwenstemcomputersniet.nl/>.

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Fri, Mar 30, 2007 at 09:53:52PM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:

at, know, or set the UTF8 flag yourself. It's an internal bit, like IOK and NOK. [1] That's not how current perl works.

We must have differing definitions\, somewhere.

No. I have explained elsewhere that we quite agree on how it should be. It is just that you make strange claims​:

The best approach to programming with unicode in mind\, in Perl\, is to (pretend to) be completely ignorant about Perl's internals with regards to encoding and the UTF8 flag.

It doesn't work even when not having unicode in mind. See unpack.

The only exception is the regex engine\, which has a big bug.

Uhm\, no.

Your powers-that-be\, might be different. Also\, don't confuse "you can know what Perl does internally" with "you have to know what Perl does internally".

In the example I gave\, you have to.

Just being able to access internal metadata doesn't mean you should actually do so on a daily basis.

What's the alternative? Replace all my uses of unpack with explicit calls to ord? Sorry, but that's completely unrealistic.

It's also entirely possible to set the internal flag "UTF8" on an existing string. But for some reason a lot of people are complaining about that\, and even more people have actually set UTF8 flags themselves...

Yes. Because you have to when interfacing with a gazillion of existing modules (or at the very least clear or downgrade).

If perl wouldn't force people to know the internals so often, one could certainly get away with telling them: do not touch downgrade/upgrade, and certainly never utf8_on or is_utf8, it is from the devil.

But that's far from reality.

    unpack('CCCCVCC', $$string);

that code is broken because the powers that be decided that "C" exposes the internal encoding, while "V" doesn't.

Yes\, any byte-specific operation on a text string (which I keep separate from character strings) will use the internal encoding. It has to use /some/ encoding\, because it cannot see whether the string was meant as a byte string or a text string. Perl does not have strong typing.

That's wrong. There is a perfectly good definition for character and byte: the one from C. It is a single element of a string. The same thing was true in perl: one byte is one character, and it should be true under the new model.

Nothing in pack or unpack requires a specific encoding, just as nothing in perl should require me to know the specific encoding of "chr 200". It is a single byte/character, regardless of how perl stores it internally.

Personally\, I think that unpack with a byte-specific signature should die\, or at least warn\, when its operand has the UTF8 flag set.

That's pure insanity. Then people would again be forced to know the internal encoding. How can you tell people not to worry about the internal encoding, and in the next paragraph force them to know it, because suddenly they are not allowed to call unpack unless some _internal_ flag has some specific value?

I severely doubt you understood perl's unicode model: It works by abstracting away the internal flag completely, not forcing the user to deal with it. Forcing her to deal with it is *wrong*.

catch at least some of the cases\, because the UTF8 flag always positively indicates that the string is a text string.

No, absolutely not. You are confused. The UTF-X flag only marks a specific encoding used by perl internally. It says nothing about text or not text. You can store binary just fine in a UTF-X marked string.

(The reverse\, however\, is not true​: a string without the UTF8 string might be either a text string or a byte string.)

As might a string with the UTF-X flag set. Perl is typeless\, it doesn't know anything about text vs. binary.

That requires every perl programmer who decodes file headers etc. using unpack to know about those internals.

No\, it requires every Perl programmer to keep track of the function of every string.

No. A binary string is a binary string because it contains no characters higher than 255. It is that simple.

Byte strings and text strings must never be combined\, and text strings must never undergo byte-specific operations.

That is certainly wrong.

This again requires no knowledge of the actual encoding that Perl uses internally\, whatsoever.

It does\, for unpack\, both in current perl as well as in your proposed change.

Note that XS writers must have knowledge of Perl's internals. This has always been true\, and is not specific to this fancy new Unicode thing.

Right. But why gratuitously break old code? In perl, it is broken by at least unpack; in XS, it is broken by changing the meaning of SvPV.

And the problem is that those bugs are not considered bugs but features.

Some bugs are acknowledged as bugs\, but won't be fixed anyway\, because there is already a lot of code in the wild that depends on the bugs.

Again\, I know a lot of code that is currently broken because of that bug. I asked\, but nobody found code "in the wild" that relies on that specific bug.

unpack "C"\, $s;

The C template for unpack is specifically documented as byte-specific.

No\, it is specifically documented as being character-specific. Read your manpage carefully​:

  c  A signed char value.
  C  An unsigned C char (octet) even under Unicode.

(Note that byte and character are the same thing in C). That leaves us with "octet". An octet is a number between 0 and 255 (you can give alternative definitions that are equivalent to mine, though).

In perl this is an octet​:

  $x = chr 200;

Yet unpack under some circumstances returns two values for this single octet\, and sometimes not. And the only way to know is to inspect the internal UTF-X flag.
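The version-independent part of this can be shown by pinning the internal representation down first (this sketch is mine; on the 5.8-era perls discussed in this thread, the *upgraded* variant is what leaked the two internal octets 195 and 136):

```perl
use strict;
use warnings;

my $s = chr 200;        # one character, codepoint 200, i.e. one octet

utf8::downgrade($s);    # force the internal bytes to match the octets
my @octets = unpack "C*", $s;
print "@octets\n";      # 200

# On the 5.8-era perls under discussion, utf8::upgrade($s) followed by
# unpack "C*" instead exposed the internal UTF-8 bytes (195, 136) --
# the behaviour this thread is arguing about.
```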

It should never be used on text strings.

Perl is typeless. There is no such thing as a text string in Perl. The problem, however, is not that it doesn't work on "text strings", whatever that might be; the problem is that unpack doesn't work on binary strings, or at least not all the time.

If you properly keep text and byte strings separate\, that means that your byte string was never upgraded\, and that unpacking with "C" is reliable and predictable.

Uhhh, who guarantees that? JSON::XS does no such thing, and cannot guarantee that, because Perl has no type for "text string" vs. "binary string". So how do you suggest JSON::XS keeps text and byte strings separate, if there is no way to detect the type of a string or make a useful difference between those two?

If upgrading happened even though the string was not mixed with text strings or used with unicode semantics\, that is a bug. I'm very interested in these silent upgrades that you are experiencing.

Concatenating strings might upgrade them (e.g. in debugging output). More so, JSON::XS currently can return either UTF-X encoded strings or non-UTF-X-encoded strings.

You call that buggy. So please tell me how to fix that bug. How do I, when decoding a JSON string, know whether it is one of your text or byte strings? What's the difference, if neither JSON nor Perl makes one?

If you think it is obvious\, how about this​:

  my $s = chr 255; # to me, this is one octet. to perl, it might be
                   # one or two, or maybe more, who knows.
  warn unpack "C", $s;
  "$s\x{672c}";
  warn unpack "C", $s;
  $s .= "\x{672c}";
  substr $s, 1, 1, "";
  warn unpack "C", $s;

Can a pure-Perl programmer tell what the output of this program is without trying it?

Not relevant.

Very relevant.

Should he be able to?

No\, because the author of this program made a big mistake in the line "$s\x{672c}".

Are you sure that upgraded? And why is it a mistake? I very much differ on that.

The casual reader can easily figure out that $s was meant as a byte string

I cannot\, from that short fragment. Neither can Perl.

it is used with unpack "C", which is known to be a byte operation. Because it is a byte string, the chr 255 is just a 0xFF octet, not a ÿ conceptually.

Exactly. But unpack does not return 255 for that byte string.

The casual reader can also easily figure out that \x{672c} is meant as a text string​: any codepoint higher than \x{FF} is always a character\, never a single byte.

Why? Lots of people use those higher codepoints. Perl certainly does not mandate anything like that, so why do you try to enforce it? People routinely do stuff like join "\x{100}", @png_images to separate them, and it works fine.

Perls unicode model does not enforce a meaning of the codepoints used in strings. It simply allows me to use more character indices than in 5.005.

Then, the author of this snippet uses both the byte string $s and the text string "\x{672c}" joined in one string "$s\x{672c}". People not interested in fixing the code can stop reading there: the code is broken and its semantics not terribly relevant.

Thanks for gratuitously calling my code broken. In any case, explain to me how to fix it in general; I only gave an example of silent upgrades.

  use JSON::XS;

  my $x = (from_json to_json [$y])[0];

is another silent upgrade users need to know about.

People who wish to fix it\, will have to try and figure out what the author really wanted to do here.

Exactly that.

Because it's a contrived case\, that's very hard to figure out.

Not at all. You are just guessing\, and getting it wrong.

sure that given real world values and variable names\, there would be a clear and logical solution\, to be found somewhere along the lines of encoding and decoding explicitly.

See above\, figure it out in the real world then.

That's a broken unicode model

So far\, I've only seen a broken understanding of the unicode model\, and a broken regex engine.

Same here. Your model requires people to know about the UTF-X flag (at least in unpack). Mine doesn't, and I think mine is much closer to what you want to achieve: not having to tell people about it. In your model you would have to tell people to downgrade before unpacking strings, or alternatively, you rule out a lot of perfectly fine Perl code on the assumption that it is easy to figure out that it is broken. Sorry, but I differ very much.

--   The choice of a   -----==- _GNU_   ----==-- _ generation Marc Lehmann   ---==---(_)__ __ ____ __ pcg@​goof.com   --==---/ / _ \/ // /\ \/ / http​://schmorp.de/   -=====/_/_//_/\_\,_/ /_/\_\ XX11-RIPE

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Fri, Mar 30, 2007 at 02:00:37PM -0700, Marvin Humphrey <marvin@rectangular.com> wrote:

That's a broken unicode model

So far\, I've only seen a broken understanding of the unicode model\,
and a broken regex engine.

That so many users\, including those as expert as Marc\, possess a
"broken" understanding of Perl's Unicode model suggests a flawed
design. We have been set up to fail.

My "broken" understanding boils down to users not having to know about the internal UTF-X flag. If that is indeed wrong, then going back to the explicit 5.005_5x model, where you indeed had to track your encoding manually, is the right thing.

However, I claim it would be a great loss. Back to the assembly programming of unicode.


p5pRT commented 17 years ago

From marvin@rectangular.com

On Mar 30\, 2007\, at 2​:25 PM\, Juerd Waalboer wrote​:

That so many users\, including those as expert as Marc\, possess a "broken" understanding of Perl's Unicode model suggests a flawed design.

I think the design is solid\, but the implementation (see regex)
slightly broken and documentation wildly misleading.

I strongly disagree with this assessment. In particular\, I think
insisting that the user be responsible for manually segregating
character and byte-oriented data without any help from Perl is
totally unreasonable.

Look at how easily Marc made the "mistake" of commingling the two
types of data. It's debatable whether the fact that Perl allowed him
to do that without complaint is a flaw with the design or the
implementation\, but it's one or the other and it's serious.

Additionally\, as Marc points out\, there are lots of broken XS modules
out there -- including one of mine. (KinoSearch 0.15 -- Unicode
support is fixed as of 0.20_01\, which breaks backwards
compatibility.) Few or none of them would be broken if Perl made it
more difficult to move between character data and byte-oriented data
-- errors would be flying right and left and the broken modules would
get fixed right away.

Of course I understand why that cannot be the case\, but it's
astonishing to me that you see this as a problem which can be solved
via documentation.

I hope that Perl 6 does not opt to replicate Perl 5's behavior in
this area (my understanding is that it will not\, but I'm not
following development closely). I hope that project is taking into
account the lessons we have learned in the wake of very difficult
compromises about how to balance the addition of Unicode with
preserving backwards compatibility.

Surely you must know a way in which Perl's unicode support can be improved\, or accidents avoided\, without trying to change all of Perl\, CPAN\, and a gazillion lines of code that we can't even reach. Let's
hear it! :)

How about encouraging the use of encoding​::warnings in perlunitut?

How about adding it to core and having 'use 5.10;' turn it on?

Marvin Humphrey Rectangular Research http​://www.rectangular.com/

p5pRT commented 17 years ago

From @Juerd

Marc Lehmann skribis 2007-03-31 0​:25 (+0200)​:

If you send a compressed string over the network using JSON and decompress it\, you need to know that.

Does JSON compress arbitrary data? If so\, then the user must do the decoding and encoding\, because arbitrary data only exists in byte form. Once you dictate any specific encoding\, it's no longer arbitrary.

On the other hand\, if JSON does text data only\, it can just use any UTF encoding on both sides\, and document it like that.

Unless both sides are exactly the same platform (e.g. both Perl)\, you need to establish a protocol for sending data anyway. And that protocol should also describe encoding. If sender and receiver don't agree\, you have a problem.

I am really frustrated at that. It makes perl as a whole rather questionable for unicode use\, as you constantly have to think about the internals. And yes\, that simply shouldn't be the case.

I maintain that it isn't the case, for almost any programming job, unless you're indeed doing things with internals.


p5pRT commented 17 years ago

From @Juerd

Marc Lehmann skribis 2007-03-31 0​:41 (+0200)​:

The reason I wanna know is because I want to know what to tell people. Either it is "your code is broken\, unpack "C" without downgrade is a bug in your code" or "it is a bug in perl\, you can work around by enabling ->shrink for the time being".

If a downgrade is "needed"\, it means that your byte string was accidentally upgraded. This should only happen if you mix it with a text string. If it happens without mixing it with a text string\, that is a bug. Please report.
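A minimal illustration of such an "accidental" upgrade through mixing (this sketch is mine; utf8::is_utf8 is used here only to observe the flag, which is exactly what application code is advised not to do):

```perl
use strict;
use warnings;

my $bytes = "\xff\xfe";           # byte string, UTF8 flag off
my $text  = "snowman \x{2603}";   # text string, must be UTF8-flagged

# Concatenating the two forces the result into the internal UTF-8
# representation; the original byte string itself is left untouched.
my $mixed = $bytes . $text;

print utf8::is_utf8($bytes) ? "bytes upgraded\n" : "bytes plain\n";
print utf8::is_utf8($mixed) ? "mixed upgraded\n" : "mixed plain\n";
# bytes plain
# mixed upgraded
```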

So\, neither "your code is broken\, unpack "C" without downgrade is a bug in your code" nor "it is a bug in perl".

Instead​: "your code is broken\, don't mix text strings with byte strings" or "it is a bug in perl that your string got upgraded in the first place."

Exactly. But "C" somehow works on UTF-8\, while it shouldn't.

Agreed!

Things that specifically handle bytes\, and bytes only\, should DIE (or at least warn) when used with a string that has the UTF-8 flag on. This still lets users get away with naively assuming that byte == character for latin1 strings\, as designed\, but at least catches the cases when you know that the user does something stupid.

It should work on characters\, as documented (just like in C\, char array[]; array[i] is one character\, regardless of how many bits a character in C has\, or how it is encoded).

A C "char" is a byte\, not a multibyte character\, ever.

Besides that\, the "C" in Perl's pack() is documented as a single byte.

I think that "char value" should be either removed from perlfunc\, or explained in more detail. It's NOT OBVIOUS to those who don't know C.

  * The chr and ord functions work on characters

      chr(1).chr(20).chr(300).chr(4000) eq v1.20.300.4000

    In other words, chr and ord are like pack("U") and unpack("U"), not like pack("C") and unpack("C"). In fact, the latter two are how you now emulate byte-orientated chr and ord if you're too lazy to use bytes.

So due to that documentation insanity it is now suggested that all code that used "C" before must use "U" now to get the same effect as in earlier perl versions?
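The character semantics the quoted documentation describes can be checked directly (a small sketch of mine, not from the thread):

```perl
use strict;
use warnings;

my $str = chr(1) . chr(20) . chr(300) . chr(4000);

# ord, like chr, works on characters, not on internal bytes:
my @codepoints = map ord, split //, $str;
print "@codepoints\n";   # 1 20 300 4000

# and the v-string comparison from the documentation holds:
print $str eq v1.20.300.4000 ? "equal\n" : "different\n";   # equal
```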

The earlier Perl versions didn't support character values greater than 255\, and if you never have those characters\, C still works perfectly.

But yes\, if you're dealing with characters and want your program to be able to handle those fancy new >255 characters\, you should change that C to a U.

Besides, perl 5.8 does not follow that description:

  perl -e '$x = "\xc3\xbc"; die unpack "U*", $x'

This gives me 195 and 188, two characters, although it is a single UTF-8 encoded character, so why does it wrongly give me two? $x certainly is utf-8-encoded (try Encode::encode_utf8 chr 252, it results in the above string).

You asked for the codepoints U+00C3 and U+00BC\, and got them.

It's a UTF-8 encoded byte string\, alright\, but "U" is for Unicode\, not UTF-8.

Ok, so I will tell people to replace "C" by "U" in their code then.

If they do Unicode text strings\, that's indeed very good advice.

But you still want C for byte strings\, simply because some protocols or formats expect a byte value. :)

Right\, while the documentation on unpack "U" disagrees with it\, as it talks about UTF-8.

That would be a bug, but I can't find it in my copy (5.8.8). It only says "Encodes to UTF-8 internally" for pack(), which as far as I can tell, is true.


p5pRT commented 17 years ago

From schmorp@schmorp.de

On Sat, Mar 31, 2007 at 01:03:35AM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:

I maintain a short list of some modules at http​://juerd.nl/perluniadvice. If you encounter modules that I can test easily without setting up complete environments\, please let me know!

Compress​::Zlib sounds like it uses zlib\, which compresses byte streams. i.e. don't give it unicode strings\, because unicode strings have no bytes (the bytes are internal only\, but you don't know what encoding is used there). Encode explicitly.

The difference between us, and that's what it boils down to, is that you give the internal UTF-X bit meaning. You equate UTF-X flag set == Unicode string.

To me, a unicode string is a concept outside of perl. I would consider any text string using the unicode codepoints a unicode string. For example: "hallo" is a unicode string. And any binary string is not a unicode string.

The problem with your approach is that you have to expose the UTF-X flag to users. Which comes with a lot of problems.

Please note that in the actual problem\, nobody is passing unicode to compress​::zlib. Instead\, a binary string is passed to Compress​::Zlib that happens to be UTF-X encoded internally because it was transferred using a protocol that encodes bytes as UTF-8 (namely JSON)\, and the decoder opted not to make another copy of the data for speed reasons.

Compress::Zlib is not buggy. Neither is the caller. The bug is that unpack treats the same string differently depending on an internal flag that might be set for a variety of reasons outside the programmer's control.

Initially I thought you\, too\, wanted a unicode model where the UTF-X bit is not exposed to the perl level. But in fact the opposite is true​: you force knowledge of the UTF-X bit on users\, even though it should be transparent.

That's the problem. As long as you call UTF-X-encoded strings Unicode strings and something else byte strings and try to give them meaning, the programmer has to know about it, as functions behave semantically differently depending on that flag.

All I want is a perl that behaves semantically consistently, regardless of some internal flag that is documented not to be of concern to a Perl programmer.

Their code is probably broken because they mix text strings with byte strings. This can be solved most easily by explicitly encoding your text string as soon as you feel you must join it with a byte string. The joined string is a byte string. Decoding it to make a text string may or may not make sense, depending on the data format.

  my $bytestring = "zlib-encoded string";
  my $transfer   = Encode::encode_utf8 $bytestring;
  my $bytes      = Encode::decode_utf8 $transfer;

$bytes is the same string, but depending on implementation details of Perl, it is treated differently in different contexts; sometimes it is treated like the binary string it is, sometimes it is treated as if it were utf-8 encoded, which it isn't, as I decoded it.

I find "text strings" and "byte strings" not adequate either\, as Perl makes no difference between those two concepts (being typeless)

Indeed. Programmers have to track this themselves. Sometimes that sucks\, but in my experience\, you need to know what kind of data your variable contains anyway.

the problem is you want them to track the UTF-X flag in addition to that. Because putting a "byte string" into unpack should not work if that bit happens to be set. So you force people who want to use unpack to learn about that flag\, when it is set\, when they have to downgrade etc. etc.

If you ++ a reference\, you're in for trouble too. How come that's never been a problem?

Because perl treats it consistently.

It's just that this is something you haven't needed to know before\, so you're not /trained/ yet to think about it. But you can't go from 256 characters to several thousands without changing the way you think :)

Yes. That's not a problem; I understand unicode quite well, and I understand quite well how Perl stores unicode.

What the problem is is that I separate internal encoding (unicode can be encoded both in UTF-X as well as in octets, as can byte strings) from the unicode model in Perl, while you mix them together, forcing the user to know the UTF-X bits on their scalars in addition to tracking whether they are binary or not.

they do not map well to encoded/decoded text either

Oh\, but they do. Please read perlunitut\, which tries to redefine the universe into four important definitions (and succeeds).

I do not have that manpage.

1. Byte strings (aka binary strings)

2. Text strings (aka unicode strings or "internal format" strings)

3. Decoding is byte --> text

4. Encoding is text --> byte

That doesn't reflect reality\, of course\, if it were so.

However, those four definitions, as I said, do not map well to encoded/decoded text. Because "internal format" strings can store binary data just as well, and often do.

I am talking purely about the perl level strings. If perlunitut confused the issue by talking about internal encoding it completely failed its mission\, imho.

I don't get the causal connection you're illustrating.

utf8​::encode takes any text string (or unicode string\, if you prefer that term) and turns it into a UTF-8 encoded byte string in place.

No. It converts characters to UTF-X encoded octets. Whether my characters are bytes or not is of no consequence.

Note that whenever a string has an encoding attach to it\, conceptually\, it's automatically a byte string.

Yes. And that encoding is completely independent of the internal UTF-X flag. Or should be\, but isn't\, in current perls.

Text strings don't have encodings\, because encodings are a byte thing\, and text strings don't have bytes; they have characters. (Text strings have encodings and bytes

Perl doesn't know about that. It only knows about characters. The problem is that some parts of perl make a difference between the very same string, depending on how it is encoded internally, _even if the encoding is the same on the Perl level_.

/internally/\, just like numbers do have bytes /internally/\, encoded in one way or another\, that allows values greater than 255 or less than 0.)

Exactly. But nothing in perl forces those indices to be unicode characters. Certainly not the indices 0..255. Yet still, the UTF-X flag might be set or cleared, resulting in changes in interpretation.

I want those to go away and make perl treat my binary data as binary data\, regardless of how the interpreter treats them.

utf8​::encode is a text operation. It will assume that whatever you give it\, is a text string. Its characters are considered Unicode codepoints.

Where does it say so?

You shouldn't give it a byte string.

Please leave it up to me what I should or should not do. This whole discussion of what I should or should not do is completely beside the point.

The point is that Perl treats my strings the same in utf8​::encode\, regardless of how the UTF-X flag is set\, because upgrading or downgrading does not change the semantics of my characters.

But in unpack, it does. That's the problem. Not what I should or should not do. The problem is that giving unpack a binary string makes it return garbage sometimes (if the binary string happens to be encoded internally in UTF-X).

This whole "force the user to track the UTF-X bit" is useless. If you really want that, then go back to 5.005_5x, which forces you to track your UTF-8 on your own. The whole point of the big change in 5.6 was that programmers should not care about how perl internally encodes stuff, and I certainly do not want to give this up. That's what makes perl so good.

To understand what happens if you do give utf8​::encode a byte string\,

A byte string is a string containing only octets\, that is\, values between 0 and 255.

Without knowing any internals, utf8::encode will encode it into a UTF-8 encoded sequence.

you need to know some internals.

Wrong. I need know no internals; the result is always well-defined: put characters into utf8::encode, and get utf-8-encoded characters. No need for internals knowledge, regardless of whether my characters are 0..255 or some of them happen to be larger. Perl doesn't care, nor does UTF-8 care, nor do I care.

The problem is\, perl cares in unpack\, and when handing strings over to XS modules.

That makes no sense\, because UTF-8 is a means of representing characters. Byte strings consist of bytes\, not characters.

Not in C, which is what the documentation constantly refers to, mind you. And no, a byte always has been a character. It is the very definition of byte in C, regardless of how many bits it has. And the same is true in perl: a single byte is represented by a single character, having an index no higher than 255.

(or my programs either). It might be a good and simplified advice to a beginner

The theory is very simple\, but not simplified. It just isn't any harder.

It doesn't map to reality.

I'm sorry if you want a more complex programming tool. But apparently you have found ways to make it hard for yourself already :)

Just stop your ad-hominem, please. I told you before that I find it rather easy, but users of my module find it rather hard, for example. I worked around a lot of bugs in 5.6 easily, and can slap an occasional utf8::up/downgrade into my code. But I think it's simply wrong to force every programmer to know as much about the internals as I do.

The perl unicode model is rather simple\, but leaves you in control\, and I found teaching people about how perl just allows more than 0..255 for a character index works best (although people differ).

That's a great explanation of how unicode strings work.

You think so? Then why do you want to force people to know about how 128..255 is encoded internally, then? Because you do when you say that UTF-X always means text (which is not true in reality, mind you), and when you want unpack to fail on binary strings that happen to be UTF-X encoded?

we never did all use exactly the same encoding. We've just chosen to remain ignorant all this time. Explicit re-encoding\, or decoding and encoding has been necessary all this time. It's just that with more than 256 codepoints\, it became much more apparent :)

Right. But at least when dealing with decoded stuff (such as binary data), Perl should behave consistently and correctly, but it doesn't.


p5pRT commented 17 years ago

From @Juerd

Marc Lehmann skribis 2007-03-31 1​:04 (+0200)​:

one byte is one character\, and it should be true under the new model.

You must be kidding.


p5pRT commented 17 years ago

From nospam-abuse@bloodgate.com

-----BEGIN PGP SIGNED MESSAGE----- Hash​: SHA1

Moin\,

On Friday 30 March 2007 22​:38​:19 Juerd Waalboer wrote​:

Tels skribis 2007-03-31 0​:19 (+0000)​:

Anyway\, I wasn't aware that any non-utf8 data in Perl is *always* ISO-8859-1\, I thought that\, when not specified\, this depended on some other stuff. Guess I need to reread the tutorials. :)

Note that they are unicode strings\, and that Perl is theoretically free to change the internal representation at any time.

However\, this also poses the question​: How does Perl know that your data is in KOI8-R?

Because you tell it that it is with "decode". The resulting string is a unicode string\, which may have any encoding internally. (Practically\, this is limited to latin1 and utf8.)

  my $text_string = decode("koi8-r", $byte_string);

or\, if you prefer different terminology​:

  my $unicode_string = decode("koi8-r", $koi8r_string);

I thought you would say this :)

My question was posed because I wanted to know how to *keep* a KOI8 (or any other random binary) string in Perl without converting it to Unicode. It seems to me this is not easily possible, because there are literally dozens of places where your KOI8 string might get suddenly upgraded to UTF-8 (and thus get corrupted, because Perl treats it as ISO-8859-1). Or did I get this wrong?

In an ideal world\, you could either just keep everything in utf-8 (that's too slow for some things and not fool-proof either)\, or rely on no other code to corrupt your data - especially this random third party module you pulled from CPAN last night. :)

IMHO the problem arises from the fact that Perl makes no distinction between a byte string like "a" and a text string like "a", and furthermore, manipulating a byte string (for instance appending a byte) is done with typical string operators. So:

  $byte_string = 'something random bytes';

  # works if $y is 7bit and no utf8 flag
  # but fails if $y is 7bit with utf8 flag
  $byte_string .= $y;

As you said, all is well as long as you can keep these two beasts separate, but the slightest problem might mangle your data. Such as a decode_utf8 setting the UTF8 bit on a 7bit ASCII string, therefore changing the 7bit byte string to a text string.

Hm, maybe one could write a module that always tacks the encoding onto an SV via magic. And then you could have a special encoding called "BINARY" (or the absence of an encoding means it is treated as binary), so that if you ever try to fuse two strings together where one of them is tagged binary, you get an exception (but only then!).

As you said, the current warnings::encode can't decide between the case of "BINARY + UTF_8" and "ISO-8859-1 + UTF_8", as Perl makes no distinction between binary data and ISO-8859-1. And this missing distinction is certainly a bother :)

One of the limitations of the "there can be only two encodings" of Perl seems to be that strings are permanently upgraded:

  $iso_8859_1 = '...';
  $utf8 = '...';
  if ($iso_8859_1 eq $utf8) { ... }

$iso_8859_1 is temporarily upgraded to utf8 for this comparison.

(Yes, this copies data, and then throws it away. Again, optimization does require knowing internals. The easiest optimization here is to utf8::upgrade $iso_8859_1, after which the variable name no longer makes sense :))

Nah, in this case I wanted the temporary upgrade :)

Just like 1 + 2.0 will result in 3.0 and not 3, and we all know how much confusion this creates :) (heh, I fell for it today, even though I should have known better :)

Doesn't really cause me any headaches, to be honest.

Yeah, I am not a genius :/ (Sometimes I wish I could upgrade my brain :)

The same type of string can be used for binary data, because in the unicode encoding "latin1", all 256 codepoints map to the same byte values.

This sounds like a circular definition, because in CP1250 all 256 codepoints also map to the same byte values. Except they are different byte values :)

I said "unicode encoding", but should have said "unicode codepoints".

Codepoints 0..255 in latin1 map to byte values 0..255. That makes it special.

Erm, I don't buy this, because:

Codepoints 0..255 in KOI8-R (to pick one) map to byte values 0..255. That would make it special, too.

(I don't necessarily disagree with you, I just don't understand what you mean).

In short\, it becomes a mess.

Yes, with strong typing, especially with string subtypes for arbitrary encodings, it would be cleaner. But it would also not look like Perl 5.

Over the years, I have come to the insight that I want to build reliable and fast programs. (easy to maintain, reliable, fast - pick two :-)

I do that with Perl. Really\, you should check that language out! You'll LOVE it! :)

Yeah\, maybe one day I actually start real programming work in Perl. ;)

All the best,

Tels

PS: I think this discussion has become a bit off-topic, so we should probably keep it off-list. Just for the original topic and the record: when you have pure 7bit ASCII data, Perl (decode etc.) should not set the utf8 flag on the data, as that makes things go slower and is just a waste. In fact, it shouldn't even copy the data around etc.; it should only make exactly one run through the data to count the high-bit bytes.

PPS: Thanx for the discussion, this really helps me to understand things better.

P³S: Unrelated to this thread, I was working on benchmarking Encode and the ISO-8859-1 to UTF-8 upgrade code. Stay tuned :)

--
Signed on Sat Mar 31 01:18:34 2007 with key 0x93B84C15.
View my photo gallery: http://bloodgate.com/photos
PGP key on http://bloodgate.com/tels.asc or per email.

". . . my work, which I've done for a long time, was not pursued in order to gain the praise I now enjoy, but chiefly from a craving after knowledge, which I notice resides in me more than in most other men. And therewithal, whenever I found out anything remarkable, I have thought it my duty to put down my discovery on paper, so that all ingenious people might be informed thereof."

  -- Antony van Leeuwenhoek. Letter of June 12, 1716