
Inconsistent handling of characters with value > 0x7FFF_FFFF and other issues #9260

Closed: p5pRT closed this issue 13 years ago

p5pRT commented 16 years ago

Migrated from rt.perl.org#51936 (status was 'resolved')

Searchable as RT51936$

p5pRT commented 16 years ago

From chris.hall@highwayman.com

Created by chris.hall@highwayman.com

Amongst the issues:

  * Character values > 0x7FFF_FFFF are not consistently handled.

  IMO: the handling is so broken that it would be much better to draw the line at 0x7FFF_FFFF.

  * chr and pack respond differently to large and out-of-range values.

  * pack can generate strings that unpack will not process.

  * warnings about 'illegal' non-characters are arguably spurious. Certainly there are many cases which are more illegal where no warnings are issued.

  Treating 0xFFFF_FFFF as a non-character is interesting.

  * IMO: chr(-1) is complete nonsense == undef, not "a character I cannot handle" == U+FFFD.

Perl strings containing characters > 0x7FFF_FFFF use a non-standard extension to UTF-8. Strictly speaking, UTF-8 stops at U+10FFFF. However, sequences up to 0x7FFF_FFFF are well defined.

Some parts of Perl are happier with these non-standard sequences than others.
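A quick probe of that extension (a minimal sketch; it assumes a reasonably recent 64-bit perl, and what it prints -- the octet counts of Perl's private 0xFE/0xFF-prefixed forms -- will vary by build):

  use strict ;
  use warnings ;
  no warnings 'utf8' ;    # quieten the 'illegal' warnings discussed below

  # byte length of the internal, extended 'utf8' form at and beyond the limits
  for my $cp (0x10_FFFF, 0x7FFF_FFFF, 0x8000_0000) {
      my $s = chr($cp) ;
      utf8::encode($s) ;    # downgrade in place: $s now holds the raw octets
      printf "0x%X encodes to %d bytes\n", $cp, length($s) ;
  }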

Consider:

   1: use strict ;
   2: use warnings ;
   3:
   4: warn "__Runtime__" ;
   5:
   6: my $q = chr(0x7FFF_FFFF).chr(0xE0).chr(0x8000_0000).chr(0xFFFF_FFFD) ;
   7: my $v = utf8::valid($q) ? 'Valid' : 'Invalid' ;
   8: my $l = length($q) ;
   9: my $r = $q.$q ;
  10: $q =~ s/\x{E0}/ / ;
  11: $q =~ s/\x{7FFF_FFFF}/Hello/ ;
  12: $q =~ s/\x{8000_0000}/World/ ;
  13: $q =~ s/\x{FFFF_FFFD}/ !/ ;
  14: print "$v($l): '$q'\n" ;
  15:
  16: $r = substr($r, 3, 4) ;
  17: print "\$r=", hx(sc($r)), "\n" ;
  18: my @w = unpack('U*', $r) ;
  19: print "\@w=", hx(@w), "\n" ;
  20:
  21: $r = pack('U*', sc($r), 0x1_1234_5678) ;
  22: print "\$r=", hx(sc($r)), "\n" ;
  23: @w = unpack('U*', $r) ;
  24: print "\@w=", hx(@w), "\n" ;
  25:
  26: sub sc { map ord, split(//, $_[0]) ; } ;
  27: sub hx { map sprintf('\\x{%X}', $_), @_ ; } ;

which generates:

  A: Unicode character 0x7fffffff is illegal at tb.pl line 11.
  B: Malformed UTF-8 character (byte 0xfe) at tb.pl line 12.
  C: Malformed UTF-8 character (byte 0xfe) at tb.pl line 13.
  D: Integer overflow in hexadecimal number at tb.pl line 21.
  E: Hexadecimal number > 0xffffffff non-portable at tb.pl line 21.
  --: __Runtime__ at tb.pl line 4.
  a: Unicode character 0x7fffffff is illegal at tb.pl line 6.
  b: Invalid(4): 'Hello World !'
  c: $r=\x{FFFFFFFD}\x{7FFFFFFF}\x{E0}\x{80000000}
  d: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 18.
  e: Malformed UTF-8 character (unexpected continuation byte 0x83, with no
   : preceding start byte) in unpack at tb.pl line 18.

  ... repeated for 0xbf, 0xbf, 0xbf, 0xbf, 0xbd

  f: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 18.
  g: Malformed UTF-8 character (unexpected continuation byte 0x82, with no
   : preceding start byte) in unpack at tb.pl line 18.

  ... repeated for 0x80, 0x80, 0x80, 0x80, 0x80

  h: @w=\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{7FFFFFFF}\x{E0}
   : \x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}
  i: Unicode character 0x7fffffff is illegal at tb.pl line 21.
  j: Unicode character 0xffffffff is illegal at tb.pl line 21.
  k: $r=\x{FFFFFFFD}\x{7FFFFFFF}\x{E0}\x{80000000}\x{FFFFFFFF}
  l: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 23.
  m: Malformed UTF-8 character (unexpected continuation byte 0x83, with no
   : preceding start byte) in unpack at tb.pl line 23.

  ... repeated for 0xbf, 0xbf, 0xbf, 0xbf, 0xbd

  n: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 23.
  o: Malformed UTF-8 character (unexpected continuation byte 0x82, with no
   : preceding start byte) in unpack at tb.pl line 23.

  ... repeated for 0x80, 0x80, 0x80, 0x80, 0x80

  p: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 23.
  q: Malformed UTF-8 character (unexpected continuation byte 0x83, with no
   : preceding start byte) in unpack at tb.pl line 23.

  ... repeated for 0xbf, 0xbf, 0xbf, 0xbf, 0xbf

  r: @w=\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{7FFFFFFF}\x{E0}
   : \x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}

NOTES:

  1. chr(n) is happy with characters > 0x7FFF_FFFF

  BUT: note the runtime warning about 0x7FFF_FFFF itself -- output line a.

  Unicode defines the characters U+xxFFFE and U+xxFFFF as non-characters, for all xx from 0x00 to 0x10 -- the (current) Unicode range.

  These characters are NOT illegal. Unicode states:

  "Noncharacter code points are reserved for internal use, such as for sentinel values. They should never be interchanged. They do, however, have well-formed representations in Unicode encoding forms and survive conversions between encoding forms. This allows sentinel values to be preserved internally across Unicode encoding forms, even though they are not designed to be used in open interchange."

  Characters > 0x10_FFFF are not known to Unicode.

  IMO, chr(n) should not be issuing warnings about non-characters at all.

  IMO, to project non-characters beyond the Unicode range is doubly perverse.

  FURTHER: although characters > 0x10_FFFF are beyond Unicode, and characters > 0x7FFF_FFFF are beyond UTF-8, chr(n) is only warning about actual and invented non-characters (and surrogates).

  2. Similarly "\x{8000_0000}" and "\x{7FFF_FFFF}" -- output line A.

  3. HOWEVER: utf8::valid() considers a string containing characters which are > 0x7FFF_FFFF to be *invalid* -- see code lines 7 & 14 and output line b.

  IMO allowing for characters > 0x7FFF_FFFF in the first place is a mistake.

  But having allowed them, why flag the string as invalid ?

  4. However: length() is happy, and issues no warning.

  Either length() is accepting the non-standard encoding, or some other mechanism means that it's not scanning the string.

  5. Lines 12 & 13 generate warnings about malformed UTF-8, at compile time.

  However, the run-time copes with these super-large characters.

  6. substr is happy with the super-large characters -- line 16.

  7. split is happy with the super-large characters -- line 26.

  8. ord is happy with the super-large characters -- line 26.

  9. unpack 'U' throws up all over super-large characters !

  See lines 18 & 23, and output d-h and l-r.

  unpack has no idea about the non-standard encoding of characters greater than 0x7FFF_FFFF, and unpacks each 'invalid' byte as 0x00.

10. pack 'U' complains about character values in much the same way as chr does -- output i & j.

  However, pack and chr are by no means consistent with each other, see below.

11. pack 'U' is generating stuff that unpack 'U' cannot cope with !

  See lines 21-24 and output k-r. A compact reproduction follows.
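That asymmetry in miniature (a sketch of the behaviour reported above, on the pre-fix perls; the exact warnings vary by version):

  my $s = pack('U', 0x8000_0000) ;    # pack accepts the above-31-bit value...
  my @w = unpack('U*', $s) ;          # ...but unpack warns 'Malformed UTF-8
                                      # character' and yields 0x00 per byte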

___________________________________________________________

Looking further at chr and pack:

   1: use strict ;
   2: use warnings ;
   3:
   4: warn "__Runtime__" ;
   5:
   6: my $q = chr(0xD800).chr(0xFFFF).chr(0x7FFF_FFFF) ;
   7: my $v = utf8::valid($q) ? 'Valid' : 'Invalid' ;
   8: print "\$q = ", hx(sc($q)), " -- $v\n" ;
   9:
  10: my @t = (0x1_2345_6789, -1, -10, 0xD800) ;
  11: my $r = join '', map(chr, @t) ;
  12: print "\$r=", hx(sc($r)), "\n" ;
  13:
  14: my $s = pack('U*', @t) ;
  15: print "\$s=", hx(sc($s)), "\n" ;
  16:
  17: sub sc { map ord, split(//, $_[0]) ; } ;
  18: sub hx { map sprintf('\\x{%X}', $_), @_ ; } ;

On a 64-bit v5.8.8:

  A: UTF-16 surrogate 0xd800 at tb2.pl line 6.
  B: Unicode character 0xffff is illegal at tb2.pl line 6.
  C: Unicode character 0x7fffffff is illegal at tb2.pl line 6.
  D: Hexadecimal number > 0xffffffff non-portable at tb2.pl line 10.
  -- __Runtime__ at tb2.pl line 4.
  a: $q = \x{D800}\x{FFFF}\x{7FFFFFFF} -- Valid
  b: Unicode character 0xffffffffffffffff is illegal at tb2.pl line 11.
  c: UTF-16 surrogate 0xd800 at tb2.pl line 11.
  d: $r=\x{123456789}\x{FFFFFFFFFFFFFFFF}\x{FFFFFFFFFFFFFFF6}\x{D800}
  e: Unicode character 0xffffffff is illegal at tb2.pl line 14.
  f: UTF-16 surrogate 0xd800 at tb2.pl line 14.
  g: $s=\x{23456789}\x{FFFFFFFF}\x{FFFFFFF6}\x{D800}

  * chr(-1) generates a warning, not because it's complete rubbish, but because 0xffffffffffffffff is a non-character !!!

  chr(-3) doesn't merit a warning.

  * note that surrogates and non-characters are OK as far as utf8::valid is concerned -- no warnings, even.

  * pack is masking stuff to 32 bit unsigned !!

  * both chr and pack are throwing warnings about surrogates

On a 32-bit v5.10.0:

  A: Integer overflow in hexadecimal number at tb2.pl line 10.
  B: Hexadecimal number > 0xffffffff non-portable at tb2.pl line 10.
  -- __Runtime__ at tb2.pl line 4.
  a: UTF-16 surrogate 0xd800 at tb2.pl line 6.
  b: Unicode character 0xffff is illegal at tb2.pl line 6.
  c: Unicode character 0x7fffffff is illegal at tb2.pl line 6.
  d: $q = \x{D800}\x{FFFF}\x{7FFFFFFF} -- Valid
  e: Unicode character 0xffffffff is illegal at tb2.pl line 11.
  f: UTF-16 surrogate 0xd800 at tb2.pl line 11.
  g: $r=\x{FFFFFFFF}\x{FFFD}\x{FFFD}\x{D800}
  h: Unicode character 0xffffffff is illegal at tb2.pl line 14.
  i: Unicode character 0xffffffff is illegal at tb2.pl line 14.
  j: UTF-16 surrogate 0xd800 at tb2.pl line 14.
  k: $s=\x{FFFFFFFF}\x{FFFFFFFF}\x{FFFFFFF6}\x{D800}

  * chr is mapping -ve values to U+FFFD -- without warning.

  This is as per documentation.

  However, character 0xFFFF_FFFF merits a warning, but does NOT get translated to U+FFFD !!

  IMO: this is a dog's dinner. I think:

  - non-characters and surrogates should not trouble chr (any more than they trouble utf8::valid)

  - values that are invalid should generate undef, not U+FFFD replacement characters:

  a) cannot distinguish chr(0xFFFD) and chr(-10)

  b) U+FFFD is a replacement for a character that we don't know -- it's not a replacement for something that just isn't a character in the first place !

  [-1 is a banana. U+FFFD is an orange, which we may substitute for another form of orange.]

  - limiting characters to 0x7FFF_FFFF is no great loss, and avoids a ton of portability and non-standard-ness issues.

  * pack 'U' is NOT mapping -ve values to U+FFFD !!

Perl Info

```
Flags:
    category=core
    severity=medium

Site configuration information for perl 5.10.0:

Configured by SYSTEM at Thu Jan 10 11:00:30 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=MSWin32, osvers=5.00, archname=MSWin32-x86-multi-thread
    uname=''
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32 -D_CONSOLE -DNO_STRICT
      -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_INC -DPERL_IMPLICIT_CONTEXT
      -DPERL_IMPLICIT_SYS -DUSE_PERLIO -DPERL_MSVCRT_READFIX',
    optimize='-MD -Zi -DNDEBUG -O1',
    cppflags='-DWIN32'
    ccversion='12.00.8804', gccversion='', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf
      -libpath:"C:\Program Files\Perl\lib\CORE" -machine:x86'
    libpth=\lib
    libs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib
      shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib
      version.lib odbc32.lib odbccp32.lib msvcrt.lib
    perllibs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib
      shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib
      version.lib odbc32.lib odbccp32.lib msvcrt.lib
    libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl510.lib
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf
      -libpath:"C:\Program Files\Perl\lib\CORE" -machine:x86'

Locally applied patches:
    ACTIVEPERL_LOCAL_PATCHES_ENTRY
    32809 Load 'loadable object' with non-default file extension
    32728 64-bit fix for Time::Local

@INC for perl 5.10.0:
    d:\gmch_root\gmch perl lib
    d:\gmch_root\gmch perl lib\windows
    C:/Program Files/Perl/site/lib
    C:/Program Files/Perl/lib
    .

Environment for perl 5.10.0:
    HOME (unset)
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=C:\Program Files\Perl\site\bin;C:\Program Files\Perl\bin;C:\PROGRAM FILES\_BATCH;C:\PROGRAM FILES\_BIN;C:\PROGRAM FILES\ARM\BIN\WIN_32-PENTIUM;C:\PROGRAM FILES\PERL\BIN\;C:\WINDOWS\SYSTEM32;C:\WINDOWS;C:\WINDOWS\SYSTEM32\WBEM;C:\PROGRAM FILES\ATI TECHNOLOGIES\ATI CONTROL PANEL;C:\PROGRAM FILES\MICROSOFT SQL SERVER\80\TOOLS\BINN\;C:\PROGRAM FILES\ARM\UTILITIES\FLEXLM\10.8.0\12\WIN_32-PENTIUM;C:\PROGRAM FILES\ARM\RVCT\PROGRAMS\3.0\441\EVAL2-SC\WIN_32-PENTIUM;C:\PROGRAM FILES\ARM\RVD\CORE\3.0\675\EVAL2-SC\WIN_32-PENTIUM\BIN;C:\PROGRAM FILES\SUPPORT TOOLS\;C:\Program Files\QuickTime\QTSystem\
    PERLLIB=d:\gmch_root\gmch perl lib;d:\gmch_root\gmch perl lib\windows
    PERL_BADLANG (unset)
    SHELL (unset)

--
Chris Hall highwayman.com
```
p5pRT commented 16 years ago

From jgmyers@proofpoint.com

This is similar to bug #43294.

I agree that allowing characters above the Unicode maximum of U+10FFFF is a mistake. It serves no useful purpose and just causes trouble for those of us who are trying to process externally-provided UTF-8 data. To safely process untrusted UTF-8 data, we poor implementors need to learn all of the dark corners of Perl's nonstandard UTF-8 processing and somehow deal with the fact that Perl doesn't even agree with itself as to what is valid UTF-8. (see also bug 38722).

Allowing surrogates and the non-character U+FFFE in UTF-8 is a security problem in much the same way that allowing non-shortest form sequences (such as C0 80) is. For that reason, chr() should not be permitted to create a surrogate or noncharacter--such a character cannot be represented in a well-formed UTF-8 sequence.
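For illustration, Encode's strict decoder does reject such a non-shortest form (a sketch; decode and FB_CROAK are Encode's documented interface, and the printed result is the expected behaviour rather than verified output):

  use Encode () ;

  my $bytes = "\xC0\x80" ;    # overlong, non-shortest-form encoding of U+0000
  my $chars = eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK) } ;
  print defined $chars ? "accepted\n" : "rejected: $@" ;    # expect 'rejected'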

p5pRT commented 16 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 16 years ago

From chris.hall@highwayman.com

On Thu Mar 20 15:25:57 2008, jgmyers@proofpoint.com wrote:

This is similar to bug #43294.

I agree that allowing characters above the Unicode maximum of U+10FFFF is a mistake. It serves no useful purpose and just causes trouble for those of us who are trying to process externally-provided UTF-8 data. To safely process untrusted UTF-8 data, we poor implementors need to learn all of the dark corners of Perl's nonstandard UTF-8 processing and somehow deal with the fact that Perl doesn't even agree with itself as to what is valid UTF-8. (see also bug 38722).

Oh dear. I was actually trying to argue for decoupling general characters from Unicode and strict UTF-8.

I think Perl's general character handling has been mixed up with the handling of Unicode exchange formats. Partly because the internal form is utf8-like and stuff is called utf8 !

IMO Perl should, internally, handle characters with values 0x0..0x7FFF_FFFF -- interpreting the subset which is Unicode as Unicode when required, and only when required.

I would dispense with all the broken and incomplete handling of "illegal" Unicode values, and the OTT values > 0x7FFF_FFFF, which I imagine would simplify things.

Separately there is clearly the need to filter strict UTF-8 for exchange. Encode's strict-UTF handling isn't complete, but I don't think the requirements are either simple or consistent across applications.

Seems to me that the current code falls between two stools and is not fully satisfying either the needs of general character string handling or the needs of strict interchange.

Allowing surrogates and the non-character U+FFFE in UTF-8 is a security problem in much the same way that allowing non-shortest form sequences (such as C0 80) is. For that reason, chr() should not be permitted to create a surrogate or noncharacter--such a character cannot be represented in a well-formed UTF-8 sequence.

I don't think it's chr's job to police anything. The current inconsistencies etc IMO indicate that once you start trying to police these things you hit conflicting requirements, e.g.:

  * non-characters are OK for internal use, but not for external interchange.

  * use of strings for things other than Unicode

  [I note that printf '%vX' is suggested for IPv6. This implies holding IPv6 addresses in 8 characters, each 0x0..0xFFFF. Which would be impossible if strings refused to allow values that aren't kosher UTF-8 ! See the sketch after this list.]

  * processing UTF-16 as strings of characters 0..0xFFFF.

and trying to do two things at once -- e.g. allowing chr() to generate surrogates but throwing warnings about them -- I doubt satisfies anyone !
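The %vX point is easy to sketch (the address below is from the 2001:db8 documentation range; the v-string literal is just one way to build such a string):

  # an IPv6 address held as eight 'characters', each 0x0..0xFFFF
  my $ip = v8193.3512.34211.0.0.35374.880.29492 ;
  printf "%vX\n", $ip ;    # prints 2001.DB8.85A3.0.0.8A2E.370.7344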

Applications that need strict UTF-8 (and possibly subsets thereof) need a layer of support for filtering and dealing with stuff that is application-invalid.

But I don't think the needs of strict UTF-8 should get in the way of simple\, general string handling.

-- Chris Hall

p5pRT commented 16 years ago

From jgmyers@proofpoint.com

On Thu Mar 20 17:56:04 2008, chris_hall wrote:

I think Perl's general character handling has been mixed up with the handling of Unicode exchange formats. Partly because the internal form is utf8-like and stuff is called utf8 !

IMO Perl should, internally, handle characters with values 0x0..0x7FFF_FFFF -- interpreting the subset which is Unicode as Unicode when required, and only when required.

I disagree--Perl should adopt and conform to the Unicode standard. Implementing something which is similar but nonconforming is just laying traps for unwary developers.

Particularly heinous is the concept of calling something "utf8" that violates the conformance requirements that Unicode places on UTF-8. Some of the conformance requirements that Perl violates, those against decoding surrogates or U+FFFE, are necessary for security.

(Actually, Unicode does not define the UTF-8 byte sequence for U+FFFE as being ill-formed, even though doing so is necessary for security. If you have a security syntax check followed by encoding and decoding in UTF-16, then an attacker could use U+FFFE to trick the UTF-16 decoder into byteswapping the data and having it interpreted differently than what was checked. I have not been able to get the Unicode Consortium to acknowledge this error.)

I would dispense with all the broken and incomplete handling of "illegal" Unicode values, and the OTT values > 0x7FFF_FFFF, which I imagine would simplify things.

By allowing values that are not permitted by Unicode, you are laying a trap for developers not wary of getting such illegal input.

Separately there is clearly the need to filter strict UTF-8 for exchange. Encode's strict-UTF handling isn't complete, but I don't think the requirements are either simple or consistent across applications.

The requirements with respect to ill-formed sequences, including surrogates and values above 10FFFF, are clearly specified in the Unicode standard.

The requirements with respect to noncharacters are admittedly complex and obscure. Because of the U+FFFE issue, my experience has been that it is best to simply disallow them all.

Seems to me that the current code falls between two stools and is not fully satisfying either the needs of general character string handling or the needs of strict interchange.

I would agree with this, though I disagree about the need for non-interchange strings. How are developers to keep straight which interfaces are to allow or disallow non-interchange strings?

I don't think it's chr's job to police anything.

I disagree. It is chr's job to police chr('orange'). Similarly, it should police chr(0x7FFF_FFFF).

The current inconsistencies etc IMO indicate that once you start trying to police these things you hit conflicting requirements, e.g.:

* non-characters are OK for internal use, but not for external interchange.

Which then begs the question of what is "internal use" versus "external interchange" and what a module must do when given "internal" data and must "interchange" it.

* use of strings for things other than Unicode

[I note that printf '%vX' is suggested for IPv6. This implies holding IPv6 addresses in 8 characters, each 0x0..0xFFFF. Which would be impossible if strings refused to allow values that aren't kosher UTF-8 !]

To use strings for things other than Unicode, one should use byte sequences. To use Unicode characters for things that are not Unicode characters is a mistake.

* processing UTF-16 as strings of characters 0..0xFFFF.

UTF-16 is a character encoding scheme and should be processed as a byte sequence just like every other character encoding scheme is. Surrogates are not characters as defined in chapter 2 of Unicode 5.0.

p5pRT commented 16 years ago

From chris.hall@highwayman.com

On Fri, 21 Mar 2008 you wrote

On Thu Mar 20 17:56:04 2008, chris_hall wrote:

...

(Actually, Unicode does not define the UTF-8 byte sequence for U+FFFE as being ill-formed, even though doing so is necessary for security. If you have a security syntax check followed by encoding and decoding in UTF-16, then an attacker could use U+FFFE to trick the UTF-16 decoder into byteswapping the data and having it interpreted differently than what was checked. I have not been able to get the Unicode Consortium to acknowledge this error.)

The standard already says that non-characters should not be exchanged externally -- so a careful UTF-8 decoder would intercept U+FFFE (what it would do with it might vary from application to application).

I don't quite understand why you'd want to apply a UTF-16 decoder after a UTF-8 one. Or why a UTF-16 decoder would worry about byte-swapping after the BOM. What am I missing, here ?

I would dispense with all the broken and incomplete handling of "illegal" Unicode values, and the OTT values > 0x7FFF_FFFF, which I imagine would simplify things.

By allowing values that are not permitted by Unicode, you are laying a trap for developers not wary of getting such illegal input.

No, I'm suggesting removing all the clutter from simple character handling, which gets in the way of some applications.

Applications that don't trust their input have their own issues, which I think need to be treated separately, with facilities for different applications to specify (a) what is invalid for them and (b) how to deal with invalid input.

Separately there is clearly the need to filter strict UTF-8 for exchange. Encode's strict-UTF handling isn't complete, but I don't think the requirements are either simple or consistent across applications.

The requirements with respect to ill-formed sequences, including surrogates and values above 10FFFF, are clearly specified in the Unicode standard.

It says they are ill-formed. It doesn't mandate what your application might do with them when they appear.

A quick and dirty application might just throw rubbish away, and might get away with it.

Another application might convert rubbish to U+FFFD and later whinge about unspecified errors in the input.

Yet another application might wish to give more specific diagnostics, either at the time the data is first received, or at some later time.

Sequences between 0x10_FFFF and 0x7FFF_FFFF are well understood, though UTF-8 declares them ill-formed (while noting that ISO-10646 accepts them). Should an entire 4, 5 or 6 byte sequence beyond 0x10_FFFF be treated as ill-formed, or each individual byte as being part of an ill-formed sequence ?

Similarly, the redundant longer forms, which UTF-8 says are ill-formed, different applications may wish to handle differently.

The requirements with respect to noncharacters are admittedly complex and obscure. Because of the U+FFFE issue, my experience has been that it is best to simply disallow them all.

Except that non-characters are entirely legal, and may be essential to some applications.

Then there's what to do with (a) unassigned characters, (b) private use characters when exchanging data between unconnected parties, (c) characters not known to the recipient, (d) control characters, etc. etc.

...

The current inconsistencies etc IMO indicate that once you start trying to police these things you hit conflicting requirements, e.g.:

* non-characters are OK for internal use, but not for external interchange.

Which then begs the question of what is "internal use" versus "external interchange" and what a module must do when given "internal" data and must "interchange" it.

Indeed, and that may vary from application to application.

So, not only is it (a) more general and (b) conceptually simpler to treat strings as sequences of abstract entities, but we can see that as soon as we try to do more, we run into (i) definition issues and (ii) application-dependent issues.

* use of strings for things other than Unicode

[I note that printf '%vX' is suggested for IPv6. This implies holding IPv6 addresses in 8 characters, each 0x0..0xFFFF. Which would be impossible if strings refused to allow values that aren't kosher UTF-8 !]

To use strings for things other than Unicode, one should use byte sequences. To use Unicode characters for things that are not Unicode characters is a mistake.

I don't see why handling an IPv6 address as a short sequence of 16 bit "characters" (that is, things that go in strings) is any less reasonable than handling IPv4 addresses as short sequences of 8 bit "characters".

In the old 8-bit character world what one did with characters was not limited by any given character set interpretation. The new world of 31 (or more) bit characters should not be limited either.

* processing UTF-16 as strings of characters 0..0xFFFF.

UTF-16 is a character encoding scheme and should be processed as a byte sequence just like every other character encoding scheme is.

Not really. UTF-16 is defined in terms of 16 bit values. I can use strings for byte sequences. Why not word (16 bit) sequences ?

Surrogates are not characters as defined in chapter 2 of Unicode 5.0.

Well, this is it in a nut-shell.

I don't think that Perl characters (that is, things that are components of Perl strings) are, or should be, defined to be Unicode characters. I think they should be abstract values -- with a 1-to-1 mapping to/from 31- or 32-bit unsigned integers.

[OK, the size here is arbitrary... 31-bits fits with the well understood encoding, 32-bits would be trivially portable, 64-bits seems OTT, but you could argue for that too.]

[Even if Perl characters were exactly Unicode characters, there would still be the application specific issues of what to do with non-characters, private use characters, undefined characters, etc. etc.]

On top of a generic string data structure there should clearly be extensive support for Unicode. On top of that comes the need for controlled exchange of the various Unicode encoding formats -- for which different applications may have different requirements.

Chris -- Chris Hall highwayman.com

p5pRT commented 16 years ago

From jgmyers@proofpoint.com

Chris Hall via RT wrote:

It says they are ill-formed. It doesn't mandate what your application might do with them when they appear.

Unicode 5.0 conformance requirement C10 does mandate a restriction on what an application might do with ill-formed sequences. It states "When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters."

A quick and dirty application might just throw rubbish away, and might get away with it.

Another application might convert rubbish to U+FFFD and later whinge about unspecified errors in the input.

Unicode permits such behavior.

Sequences between 0x10_FFFF and 0x7FFF_FFFF are well understood, though UTF-8 declares them ill-formed (while noting that ISO-10646 accepts them).

ISO-10646 Amendment 2 no longer accepts characters above U+10FFFF.

Should an entire 4, 5 or 6 byte sequence beyond 0x10_FFFF be treated as ill-formed, or each individual byte as being part of an ill-formed sequence ?

Unicode permits either behavior.

Similarly, the redundant longer forms, which UTF-8 says are ill-formed, different applications may wish to handle differently.

Again, Unicode conformance requirement C10 prohibits applications from interpreting such sequences as characters. To interpret such sequences as characters leaves applications vulnerable to serious security holes. See Unicode 5.0 section 5.19, Unicode Security, which addresses this very issue.

(Actually, Unicode does not define the UTF-8 byte sequence for U+FFFE as being ill-formed, even though doing so is necessary for security. If you have a security syntax check followed by encoding and decoding in UTF-16, then an attacker could use U+FFFE to trick the UTF-16 decoder into byteswapping the data and having it interpreted differently than what was checked. I have not been able to get the Unicode Consortium to acknowledge this error.)

I don't quite understand why you'd want to apply a UTF-16 decoder after a UTF-8 one. Or why a UTF-16 decoder would worry about byte-swapping after the BOM. What am I missing, here ?

Software is sometimes constructed by connecting together modules that were previously made elsewhere. So an application might, after passing data through the security syntax check (or "security watchdog module" in Unicode section 5.19), process it through a separate module that writes the data out in UTF-16BE. That UTF-16BE data might in turn be processed by a third module that interprets its input as UTF-16.

The potential attack does require a UTF-16 decoder that is sloppier than the one in Encode--either willing to switch endianness mid-stream or willing to treat an initial BOM as optional.

By allowing values that are not permitted by Unicode, you are laying a trap for developers not wary of getting such illegal input.

No, I'm suggesting removing all the clutter from simple character handling, which gets in the way of some applications.

Applications that don't trust their input have their own issues, which I think need to be treated separately, with facilities for different applications to specify (a) what is invalid for them and (b) how to deal with invalid input.

Ill-formed sequences are invalid for everybody. By pushing the responsibility for handling such non-obvious character handling issues from the Perl core to individual applications, you would be significantly increasing the number of applications that fail to handle such issues as needed. This is laying traps.

Even you seem to have been unaware of the seriously adverse security impact of handling the redundant longer forms as characters. How do you expect a run-of-the-mill Perl script writer to even know that they might have to run extra Unicode-specific validity checks? The current distinction that Perl makes between "utf8" and "utf-8" is quite obscure.

The requirements with respect to noncharacters are admittedly complex and obscure. Because of the U+FFFE issue, my experience has been that it is best to simply disallow them all.

Except that non-characters are entirely legal, and may be essential to some applications.

Please provide an example of a reasonable application to which a non-character is essential. There is no shortage of private use characters--I find it hard to believe that the loss of 66 potential characters is quite so catastrophic.

Then there's what to do with (a) unassigned characters, (b) private use characters when exchanging data between unconnected parties, (c) characters not known to the recipient, (d) control characters, etc. etc.

There are no such problems with any of these categories.

So, not only is it (a) more general and (b) conceptually simpler to treat strings as sequences of abstract entities, but we can see that as soon as we try to do more, we run into (i) definition issues and (ii) application-dependent issues.

No, it's the converse. When you fail to provide a consistent definition across the language, you run into issues with mismatched and inconsistent definitions within and across applications.

I don't see why handling an IPv6 address as a short sequence of 16 bit "characters" (that is, things that go in strings) is any less reasonable than handling IPv4 addresses as short sequences of 8 bit "characters".

In neither case are they characters.

In the old 8-bit character world what one did with characters was not limited by any given character set interpretation. The new world of 31 (or more) bit characters should not be limited either.

The old 8-bit character world is hardly a model of reasonableness. One didn't necessarily know what the character encoding scheme was, so one was quite likely to give the data the wrong interpretation. Some schemes, such as ISO 2022, were an absolute nightmare.

The ability to store arbitrary 16 bit quantities in UTF-8 strings is hardly an overriding concern. The overriding concern is to handle text correctly.

UTF-16 is a character encoding scheme and should be processed as a byte sequence just like every other character encoding scheme is.

Not really. UTF-16 is defined in terms of 16 bit values. I can use strings for byte sequences. Why not word (16 bit) sequences ?

Perl strings are not word (16 bit) sequences.

UTF-16 was a bad idea to begin with. Let it die a natural death, just like UTF-7.

Surrogates are not characters as defined in chapter 2 of Unicode 5.0.

Well\, this is it in a nut-shell.

I don't think that Perl characters (that is, things that are components of Perl strings) are, or should be, defined to be Unicode characters. I think they should be abstract values -- with a 1-to-1 mapping to/from 31- or 32-bit unsigned integers.

This is indeed it in a nut-shell. Perl has a choice: On one hand, it could adopt and conform to Unicode, taking advantage of all the work and expertise put into the foremost international standard for character encoding. On the other hand, Perl could decide that it somehow knows more about character encoding than the Unicode Consortium (and the subject experts that contributed to their standard) and go off and invent something new and inconsistent with the constraints the Unicode Consortium found it necessary to impose.

p5pRT commented 16 years ago

From perl@nevcal.com

On approximately 3/25/2008 3:27 PM, came the following characters from the keyboard of John Gardiner Myers:

Chris Hall via RT wrote:

Well, this is it in a nut-shell.

I don't think that Perl characters (that is, things that are components of Perl strings) are, or should be, defined to be Unicode characters. I think they should be abstract values -- with a 1-to-1 mapping to/from 31- or 32-bit unsigned integers.

This is indeed it in a nut-shell. Perl has a choice: On one hand, it could adopt and conform to Unicode, taking advantage of all the work and expertise put into the foremost international standard for character encoding. On the other hand, Perl could decide that it somehow knows more about character encoding than the Unicode Consortium (and the subject experts that contributed to their standard) and go off and invent something new and inconsistent with the constraints the Unicode Consortium found it necessary to impose.

Perl seems to have already made that choice... and chose TMTOWTDI.

The language implements an extension of UTF-8 encoding rules for 70** bit integers (which is very space inefficient above 31 bit integers, even more so than UTF-8 itself) which it calls utf8.

The language has certain string operations* that implement certain Unicode semantics for strings stored in utf8 format.

Module Encode implements (as best as Dan and kibitzers can) UTF-8 encoding and decoding and validity checking.

So people that want to use utf8 strings as containers for 16-bit integers are welcome to, as Chris suggests. And people that want to conform to strict Unicode interpretations have the tools to do so. And people that choose to use utf8 strings only for Unicode codepoints can restrict themselves to doing so.

It appears that reported bugs get fixed, as time permits. It appears that the goal is to conform to Unicode semantics in Module Encode, and certain string operations* within the language.

* This list is fairly well known, including "\l\L\u\U" uc ucfirst lc lcfirst and certain regexp operations, all of which have different semantics when applied to utf8 strings vs non-utf8 strings. This is considered a bug by some, and a deficiency by most of the rest.

** maybe it is 72? Larger than 64, apparently, and such values higher than the platform's native integer size (usually 32 or 64) are hard to access... chr and ord can't deal with them.

-- Glenn -- http://nevcal.com/

A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

p5pRT commented 16 years ago

From chris.hall@highwayman.com

On Tue, 25 Mar 2008 you wrote

On approximately 3/25/2008 3:27 PM, came the following characters from the keyboard of John Gardiner Myers:

Chris Hall via RT wrote:

Well, this is it in a nut-shell.

I don't think that Perl characters (that is, things that are components of Perl strings) are, or should be, defined to be Unicode characters. I think they should be abstract values -- with a 1-to-1 mapping to/from 31- or 32-bit unsigned integers.

This is indeed it in a nut-shell. Perl has a choice: On one hand, it could adopt and conform to Unicode, taking advantage of all the work and expertise put into the foremost international standard for character encoding. On the other hand, Perl could decide that it somehow knows more about character encoding than the Unicode Consortium (and the subject experts that contributed to their standard) and go off and invent something new and inconsistent with the constraints the Unicode Consortium found it necessary to impose.

Perl seems to have already made that choice... and chose TMTOWTDI.

The language implements an extension of UTF-8 encoding rules for 70** bit integers (which is very space inefficient above 31 bit integers, even more so than UTF-8 itself) which it calls utf8.

As reported: the 7 and 13 byte extended sequences are not properly handled everywhere. The documentation is coy about characters greater than 31 bits.

IMO things are so broken (not even utf8::valid() likes the 7- and 13-byte sequences !) that it's not too late to row back from this....

  - stopping at 31 bit integers is at least consistent with the well-known 4, 5 and 6 byte "UTF-8" sequences.

  - 32 bit integers could be supported in 6 byte sequences (by treating 0xFC..0xFF prefixes as containing the MS 2 bits) and would be portable (and remains reasonably practical, space-wise).

....the extent of brokenness recalls the guiding mantra: KISS.

The language has certain string operations* that implement certain Unicode semantics for strings stored in utf8 format.

Module Encode implements (as best as Dan and kibitzers can) UTF-8 encoding and decoding and validity checking.

So people that want to use utf8 strings as containers for 16-bit integers are welcome to, as Chris suggests. And people that want to conform to strict Unicode interpretations have the tools to do so. And people that choose to use utf8 strings only for Unicode codepoints can restrict themselves to doing so.

The separation between the content of strings and Unicode is unclear.

The name utf8 doesn't help !

A good example of this is chr(n) which (summarized in the sketch after this list):

  - issues warnings if 'n' is a Unicode surrogate or non-character.

  These warnings are a nuisance for people using strings as containers for n-bit integers.

  Those wanting help with strict Unicode aren't materially helped by this behaviour.

  - accepts characters beyond the Unicode range without warning.

  So isn't consistent in its "Unicode support".

  - generates a chr(0xFFFD) in response to chr(-1).

  Which makes no sense where strings are used as containers for n-bit integers !
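A sketch of that behaviour as reported above (messages and results differ between the 64-bit 5.8.8 and 32-bit 5.10.0 builds discussed):

  use warnings ;
  my $a = chr(0xD800) ;      # warns: UTF-16 surrogate 0xd800
  my $b = chr(0xFFFF) ;      # warns: Unicode character 0xffff is illegal
  my $c = chr(0x11_0000) ;   # beyond Unicode: accepted, no warning
  my $d = chr(-1) ;          # U+FFFD on 32-bit 5.10.0; a warning elsewhere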

It appears that reported bugs get fixed, as time permits. It appears that the goal is to conform to Unicode semantics in Module Encode, and certain string operations* within the language.

There are plenty of bugs to go round :-}

I hope that has not been lost in the discussion.

I'm not sure that the Encode Module is the right place for all support for Unicode.

It seems to me that Encode is to do with mapping between Perl characters (interpreted as Unicode code-points) and a variety of Character Encoding Schemes, including UTF-8. On input this concerns itself with ill-formed stuff. On output it concerns itself with things that cannot be output. In both cases there is mapping between different Coded Character Sets, and coping with impossible mappings.
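In code terms the mapping looks like this (a minimal sketch; encode, decode and the FB_DEFAULT fallback are Encode's real interface):

  use Encode () ;

  my $octets = Encode::encode('UTF-8', "caf\x{E9}") ;    # characters -> bytes
  my $chars  = Encode::decode('UTF-8', $octets,
                              Encode::FB_DEFAULT) ;      # bytes -> characters;
                                                         # ill-formed input
                                                         # becomes U+FFFD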

With Unicode there are additional, specific options required either to allow or do something else with:

  * the non-characters -- all of them.

  * the Replacement Character -- perhaps should not send these, or perhaps do not wish to receive these.

  * private use characters -- which may or may not be suitable for exchange.

  * perhaps more general support for sub-sets of Unicode.

  * dealing with canonical equivalences.

Now I suppose that a lot can be done by regular expressions and other such processing. This looks like hard work. And might not be terribly efficient ? (Certainly John Gardiner Myers wants utf8::valid to do very strict UTF-8 checking as an efficient test for whether other processing is required.)

* This list is fairly well known, including "\l\L\u\U" uc ucfirst lc lcfirst and certain regexp operations, all of which have different semantics when applied to utf8 strings vs non-utf8 strings. This is considered a bug by some, and a deficiency by most of the rest.

** maybe it is 72? Larger than 64, apparently, and such values higher than the platform's native integer size (usually 32 or 64) are hard to access... chr and ord can't deal with them.

It's 72: a 13 byte sequence, starting 0xFF, followed by 12 bytes carrying 6 bits of the value each (12 × 6 = 72).

FWIW, on a 64-bit integer machine:

  $v = 0xFFFF_FFFF_FFFF_FFFD ;
  if ($v != ord(chr($v))) { die ; } ;

works just fine, though Perl whimpers:

  Hexadecimal number > 0xffffffff non-portable at .... (compile time)

While:

  $v = 0xFFFF_FFFF_FFFF_FFFF ;
  if ($v != ord(chr($v))) { die ; } ;

whimpers:

  Hexadecimal number > 0xffffffff non-portable at .... (compile time)
  Unicode character 0xffffffffffffffff is illegal at .... (run time)

where the second whinge is baroque.

Chris -- Chris Hall highwayman.com +44 7970 277 383

p5pRT commented 16 years ago

From chris.hall@highwayman.com

On Tue, 25 Mar 2008 John Gardiner Myers wrote

Chris Hall via RT wrote:

It says they are ill-formed. It doesn't mandate what your application might do with them when they appear.

Unicode 5.0 conformance requirement C10 does mandate a restriction on what an application might do with ill-formed sequences. It states "When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters."

Sure. But the real point is that this doesn't specify how the error condition must be dealt with.

And, as previously discussed, the issue of ill-formed UTF-8 is only part of the problem.

A quick and dirty application might just throw rubbish away, and might get away with it.

Another application might convert rubbish to U+FFFD and later whinge about unspecified errors in the input.

Unicode permits such behavior.

Sure. But the point is that there isn't a single correct approach; it depends on the application.

Sequences between 0x10_FFFF and 0x7FFF_FFFF are well understood, though UTF-8 declares them ill-formed (while noting that ISO-10646 accepts them).

ISO-10646 Amendment 2 no longer accepts characters above U+10FFFF.

OK. I was going by what Unicode 5.0 says:

  "The definition of UTF-8 in Annex D of ISO/IEC 10646:2003 also allows for the use of five and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as an encoding form of Unicode characters."

Should an entire 4, 5 or 6 byte sequence beyond 0x10_FFFF be treated as ill-formed, or each individual byte as being part of an ill-formed sequence ?

Unicode permits either behavior.

And any given application may wish to do one or the other.

Similarly, the redundant longer forms, which UTF-8 says are ill-formed, different applications may wish to handle differently.

Again, Unicode conformance requirement C10 prohibits applications from interpreting such sequences as characters. To interpret such sequences as characters leaves applications vulnerable to serious security holes. See Unicode 5.0 section 5.19, Unicode Security, which addresses this very issue.

It takes a narrow view of this. Obviously it is good to encourage strict encoding ! If one wanted to be "generous in what one accepts" one might accept and recode the redundant but longer forms -- which deals with the security issue. But this is not a big requirement.

....

By allowing values that are not permitted by Unicode, you are laying a trap for developers not wary of getting such illegal input.

No, I'm suggesting removing all the clutter from simple character handling, which gets in the way of some applications.

Applications that don't trust their input have their own issues, which I think need to be treated separately, with facilities for different applications to specify (a) what is invalid for them and (b) how to deal with invalid input.

Ill-formed sequences are invalid for everybody. By pushing the responsibility for handling such non-obvious character handling issues from the Perl core to individual applications, you would be significantly increasing the number of applications that fail to handle such issues as needed. This is laying traps.

The existing handling is in a mess. I suggest that this is partly because the problem is not straightforward, and there is no single, universal solution. The problem is not simply ill-formed sequences.

IMO the solution is (a) to simplify the base string and character data structures -- so that they are not confused by the conflicting requirements, and (b) to beef up support for strict Unicode character and UTF handling, with sufficient flexibility to allow for different applications to do different things, and with a sensible set of defaults for straightforward use.

Even you seem to have been unaware of the seriously adverse security impact of handling the redundant longer forms as characters.

As above. I grant that handling the redundant longer forms is not a big requirement, but if handled correctly the security issue is dealt with.

How do you expect a run-of-the-mill Perl script writer to even know that they might have to run extra Unicode-specific validity checks? The current distinction that Perl makes between "utf8" and "utf-8" is quite obscure.

Yes, it doesn't help clarify things.

The requirements with respect to noncharacters are admittedly complex and obscure. Because of the U+FFFE issue, my experience has been that it is best to simply disallow them all.

Except that non-characters are entirely legal, and may be essential to some applications.

Please provide an example of a reasonable application to which a non-character is essential. There is no shortage of private use characters--I find it hard to believe that the loss of 66 potential characters is quite so catastrophic.

Except that it would no longer be Unicode conformant.

If you want to argue that non-characters are a Bad Thing, that's a separate topic.

Using private use characters instead simply moves the problem. If I use non-characters as delimiters in my application, I should remove them before sending the text to somebody who does not expect them. If I were to use some private-use characters for the same thing, I should still remove them, shouldn't I ?

Then there's what to do with (a) unassigned characters, (b) private use characters when exchanging data between unconnected parties, (c) characters not known to the recipient, (d) control characters, etc. etc.

There are no such problems with any of these categories.

Well... if you're troubled by the exchange of 66 non-character values I'm surprised you're not troubled by the huge number of private use characters ! If my system were to place some internal significance on some private use characters, it might be a security issue if these were not filtered out on exchange with third parties -- much like the non-characters.

An application that was *really* worried about what it was being sent might wish to filter any or all of these things. It might wish to filter down to some supported sub-set. Not to mention the reduction to canonical form(s).

So, not only is it (a) more general and (b) conceptually simpler to treat strings as sequences of abstract entities, but we can see that as soon as we try to do more, we run into (i) definition issues and (ii) application-dependent issues.

No, it's the converse. When you fail to provide a consistent definition across the language, you run into issues with mismatched and inconsistent definitions within and across applications.

I agree that without a consistent definition you get a mess.

I don't see why handling an IPv6 address as a short sequence of 16 bit "characters" (that is, things that go in strings) is any less reasonable than handling IPv4 addresses as short sequences of 8 bit "characters".

In neither case are they characters.

Looking at the "Character Encoding Model", where I said "characters" a little loosely, the jargon suggests 'code units'. But in any case, if a thing that is an element of a string is not a "character" what would you recommend I call it ?

In the old 8-bit character world what one did with characters was not limited by any given character set interpretation. The new world of 31 (or more) bit characters should not be limited either.

The old 8-bit character world is hardly a model of reasonableness. One didn't necessarily know what the character encoding scheme was, so one was quite likely to give the data the wrong interpretation. Some schemes, such as ISO 2022, were an absolute nightmare.

Granted that character encodings in the 8-bit world were tricky.

But chr() didn't get upset about, for example, DEL (0x7F) or DLE (0x10) despite the obvious issues of interpretation. Core Perl does not attempt to intervene here. I realise that this may appear trivial, but it illustrates the difference between treating strings as sequences of generic 'code units' and treating them as characters according to some specific 'coded character set'.

.....

Well, this is it in a nut-shell.

I don't think that Perl characters (that is, things that are components of Perl strings) are, or should be, defined to be Unicode characters. I think they should be abstract values -- with a 1-to-1 mapping to/from 31- or 32-bit unsigned integers.

This is indeed it in a nut-shell. Perl has a choice: On one hand, it could adopt and conform to Unicode, taking advantage of all the work and expertise put into the foremost international standard for character encoding.

Leaving to one side any questions about ill-formed sequences, what should be done with:

  * non-characters -- allow, filter out, replace, ... ?

  * private-use characters -- allow, filter out, replace, ... ?

  * unassigned characters -- allow, filter out, replace, ... ?

  * canonical equivalences -- allow, filter out, replace, ... ?

  The standard acknowledges a security issue here, but punts it:

  "However, another level of alternate representation has raised other security questions: the canonical equivalences between precomposed characters and combining character sequences that represent the same abstract characters. .... The conformance requirement, however, is that conforming implementations cannot be required to make an interpretation distinction between canonically equivalent representations. The way for a security-conscious application to guarantee this is to carefully observe the normalization specifications (see Unicode Standard Annex #15, “Unicode Normalization Forms”) so that data is handled consistently in a normalized form."

  * requirements to handle only sub-sets of characters.

  * other things, perhaps ?

Even surrogates are potentially tricky...

... in UTF-8 surrogate values are explicitly ill-formed.

... in UTF-16 they should travel in pairs, but I guess decoders need to do something with poorly formed or possibly incomplete input.

... but it appears that some code will combine surrogate code points even after decoding the UTF -- I suspect that this is a hangover from older systems where a 16-bit internal character form looked like a reasonable compromise.

... so banning these values from Perl strings is problematic.

With ill-formed sequences the question is how to deal with the error condition(s).

The point here is that the requirements are not simple and not universal.

There is, absolutely, a crying need for clear and effective support for handling Unicode and the UTFs -- especially UTF-8, given its increasing dominance.

On the other hand, Perl could decide that it somehow knows more about character encoding than the Unicode Consortium (and the subject experts that contributed to their standard) and go off and invent something new and inconsistent with the constraints the Unicode Consortium found it necessary to impose.

This is a false alternative.

Supporting generic "character" and string primitives does not preclude layering strong and flexible UTF and Unicode handling on top, allowing different applications to take more or less control over the various options/ambiguities.

At present Perl is achieving neither.

Chris

PS: big-endian integers are sinful. -- Chris Hall highwayman.com +44 7970 277 383

p5pRT commented 13 years ago

From @khwilliamson

After much further discussion and gnashing of teeth, this has been resolved. The output of the original program in this ticket on current blead is:

  Hexadecimal number > 0xffffffff non-portable at 51936.pl line 21.
  __Runtime__ at 51936.pl line 4.
  Valid(4): 'Hello World !'
  $r=\x{FFFFFFFD}\x{7FFFFFFF}\x{E0}\x{80000000}
  @w=\x{FFFFFFFD}\x{7FFFFFFF}\x{E0}\x{80000000}
  $r=\x{FFFFFFFD}\x{7FFFFFFF}\x{E0}\x{80000000}\x{112345678}
  @w=\x{FFFFFFFD}\x{7FFFFFFF}\x{E0}\x{80000000}\x{112345678}

The decision was made to allow any unsigned value to be stored as a character in strings internally in Perl. The non-character code points are all recognized, and allowed. When printing any of the surrogates, non-char code points, or above-legal-Unicode code points, a warning is raised. All are prohibited under strict UTF-8 input. There has been discussion and some work on making the :utf8 layer more strict. I believe this will happen.
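So on a fixed perl the round-trip from the original program now holds (a sketch, assuming a 64-bit build of blead or later):

  my @in  = (0xFFFF_FFFD, 0x7FFF_FFFF, 0xE0, 0x8000_0000) ;
  my @out = unpack('U*', pack('U*', @in)) ;
  print "round-trips\n" if "@in" eq "@out" ;    # expect: round-trips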

When doing an operation that requires Unicode semantics on an above-Unicode code point, a warning is raised. An example is changing the case, and this is a no-op.

Unicode doesn't actually forbid the use of isolated surrogates in strings inside languages, even though a non-lawyer, such as myself, reading the standard would think that it did. There is some text that allows it. I posted to p5p a portion of an email from the president of Unicode that reiterated this (sent to someone on another project). The clincher is that ICU, the semi-official Unicode implementation, does allow isolated surrogates in strings. And, Unicode as of version 5.1 does give property values for every property for the surrogates. At this time, we are warning on surrogates if a case change (including /i regular expression matching) is done on them. I'm not sure that this is correct, as Unicode does furnish casing semantics for them, but it is easier to remove a warning later than to add one.
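A sketch of that warning behaviour (as described for blead at the time; later perls may drop the warning):

  use warnings ;
  my $s = do { no warnings ; chr(0xD800) } ;    # isolated surrogate, allowed
  my $t = uc $s ;    # case change on a surrogate: warns, per the above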

The portion of the original ticket involving chr(-1) has not been resolved. I submitted a bug report for just that, but have not gotten a reply back as to the number assigned to it.

In any event, I believe much of the inglorious handling of this whole situation is now fixed.

--Karl Williamson

p5pRT commented 13 years ago

@khwilliamson - Status changed from 'open' to 'resolved'