Perl / perl5


Please implement Unicode Corrigendum #9 (noncharacters) #13594

Closed p5pRT closed 9 years ago

p5pRT commented 10 years ago

Migrated from rt.perl.org#121226 (status was 'rejected')

Searchable as RT121226$

p5pRT commented 10 years ago

From gpiero@rm-rf.it

Created by gpiero@rm-rf.it

Currently perl issues a serious warning when trying to output (or input) Unicode noncharacters [0]:

  $ perl -CS -le 'print "noncharacter: \x{FDEF}"'
  Unicode non-character U+FDEF is illegal for open interchange at -e line 1.
  noncharacter: �

This is due to a common interpretation of the Unicode standard. Anyway, the Unicode Technical Committee has issued Corrigendum #9 [1] on 2013-Jan-30. Quoting from it:

  Noncharacters in the Unicode Standard are intended for internal use and have no standard interpretation when exchanged outside the context of internal use. However, they are not illegal in interchange nor do they cause ill-formed Unicode text. This has always been the intent of the standard, as expressed by the Unicode Technical Committee. This is necessary for the effective use of noncharacters, because anytime a Unicode string crosses an API boundary, it is in effect being "interchanged".

As this is labeled as a clarification, I don't think we have to wait for the next Unicode version to adhere to it (I mean: adhering to Corrigendum #9 does not break compliance with previous versions of Unicode, IMO).

I admit that at this point it isn't clear to me the distinction between private-use[2] and noncharacters, but, as for what perl is concerned, I think they should be managed in the same way. I.e.:

  $ perl -CS -le 'print "private-use character: \x{F8FF}"'
  private-use character: 

(no warning issued).

At the very least, the severity of the 'nonchar' warning should be lowered.

Thanks,
Gian Piero.

[0] http://www.unicode.org/faq/private_use.html#noncharacters
[1] http://www.unicode.org/versions/corrigendum9.html
[2] http://www.unicode.org/faq/private_use.html#private_use

Perl Info

```
Flags:
    category=core
    severity=wishlist

Site configuration information for perl 5.19.8:

Configured by gpiero at Mon Feb 10 20:17:47 CET 2014.

Summary of my perl5 (revision 5 version 19 subversion 8) configuration:

  Platform:
    osname=linux, osvers=3.12-1-amd64, archname=x86_64-linux
    uname='linux caimano 3.12-1-amd64 #1 smp debian 3.12.8-1 (2014-01-19) x86_64 gnulinux '
    config_args='-de -Dprefix=/home/gpiero/perl5/perlbrew/perls/perl-5.19.8 -Dusedevel -Aeval:scriptdir=/home/gpiero/perl5/perlbrew/perls/perl-5.19.8/bin'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.8.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.8/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.17'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'

@INC for perl 5.19.8:
    /home/gpiero/perl5/perlbrew/perls/perl-5.19.8/lib/site_perl/5.19.8/x86_64-linux
    /home/gpiero/perl5/perlbrew/perls/perl-5.19.8/lib/site_perl/5.19.8
    /home/gpiero/perl5/perlbrew/perls/perl-5.19.8/lib/5.19.8/x86_64-linux
    /home/gpiero/perl5/perlbrew/perls/perl-5.19.8/lib/5.19.8
    .

Environment for perl 5.19.8:
    HOME=/home/gpiero
    LANG=en_US.UTF-8
    LANGUAGE=en_US:en
    LC_COLLATE=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/gpiero/perl5/perlbrew/bin:/home/gpiero/perl5/perlbrew/perls/perl-5.19.8/bin:/home/gpiero/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
    PERLBREW=command perlbrew
    PERLBREW_BASHRC_VERSION=0.67
    PERLBREW_HOME=/home/gpiero/.perlbrew
    PERLBREW_MANPATH=/home/gpiero/perl5/perlbrew/perls/perl-5.19.8/man
    PERLBREW_PATH=/home/gpiero/perl5/perlbrew/bin:/home/gpiero/perl5/perlbrew/perls/perl-5.19.8/bin
    PERLBREW_PERL=perl-5.19.8
    PERLBREW_ROOT=/home/gpiero/perl5/perlbrew
    PERLBREW_VERSION=0.67
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 10 years ago

From @ap

* Gian Piero <gpiero@rm-rf.it> [2014-02-11 17:10]:

> The Unicode Technical Committee has issued Corrigendum #9 [1] on 2013-Jan-30.

I found most helpful the sections from <http://www.unicode.org/faq/private_use.html#nonchar7> on down.

They clearly describe the intent that these noncharacters should be usable without further ado from any infrastructure.

> I admit that at this point it isn't clear to me the distinction between private-use[2] and noncharacters, but, as for what perl is concerned, I think they should be managed in the same way. I.e.:
>
>   $ perl -CS -le 'print "private-use character: \x{F8FF}"'
>   private-use character: 
>
> (no warning issued).

That is very much my interpretation of the FAQ.

Quoting from <http://www.unicode.org/faq/private_use.html#nonchar2>:

  Noncharacters are in a sense a kind of private-use character,
  because they are reserved for internal (private) use. However, that
  internal use is intended as a “super” private use, not normally
  interchanged with other users.

Prior to that, the FAQ expends some verbiage to convey that private-use characters were intended specifically for interchange among parties who have agreed on some interpretation for those characters amongst each other.

So in answer to your question, it appears that the UTC conceives the difference between non- and private-use characters to be that the meaning of noncharacters should always be considered unknown whenever they cross the boundaries of a particular system, while private-use characters may meaningfully pass the boundaries between systems that share an agreed-upon interpretation for them.

It inescapably follows that if even the use of noncharacters must not cause warnings, then much more so neither must the use of private-use characters.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT commented 10 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 10 years ago

From gpiero@rm-rf.it

Hi Aristotle,

thank you for your reply.

* [Tue, Feb 11, 2014 at 11:47:16AM -0800] Aristotle Pagaltzis via RT:

> > I admit that at this point it isn't clear to me the distinction between private-use[2] and noncharacters, but, as for what perl is concerned, I think they should be managed in the same way. I.e.:
>
> So in answer to your question, it appears that the UTC conceives the difference between non- and private-use characters to be that the meaning of noncharacters should always be considered unknown whenever they cross the boundaries of a particular system, while private-use characters may meaningfully pass the boundaries between systems that share an agreed-upon interpretation for them.

Yes, but the definition of 'system' in this context is not so clear to me, and it probably isn't commonly agreed upon in general (the UTC also expresses some concerns when talking about distributed software). Personally I don't see much difference between the two sets in practical cases, but I'm probably wrong. Anyway, this is just a personal consideration. The point here is that perl should not warn when using noncharacters, and I think we agree on this.

> It inescapably follows that if even the use of noncharacters must not cause warnings, then much more so neither must the use of private-use characters.

I'm afraid I wasn't clear in the initial report: AFAIK perl already correctly manages private-use characters and does not seem to issue warnings when using them. It does however cause a (serious) warning when using noncharacters, and this is the only warning I was asking to dismiss.

Ciao,
Gian Piero.

p5pRT commented 10 years ago

From @khwilliamson

I have mixed feelings about this request.

First, some clarifications. Private-use characters have always been intended to be freely interchangeable, but their meanings are not specified by the Standard. Typically, corporations or other groups decide they want to use certain ones for certain purposes, and their code is written knowing this. But there is nothing preventing another group from using the same code points for something else. As long as the two groups don't ever exchange files which use these code points, there is no problem. As an example, the Apple Corporation has chosen a particular code point to represent their logo. All their software recognizes this code point and treats it accordingly. If you are writing software that might run on one of their devices, and you need private-use code points, it would be best if you avoided using that particular one. Another example is a registry of private-use code points run by an individual, IIRC. He publishes the list so that people can avoid conflicts. It includes characters from the Klingon script and similar ones that Unicode refuses to encode, but which have communities who want them. Some scripts started out in this registry, but Unicode was eventually persuaded to encode them, and code that used the old values could be changed to simply add a constant number to any code point to get the Unicode value.

Non-character code points have a different genesis altogether. Originally Unicode was conceived as having just 2**16 code points. If one wants to loop over all of them using a 16-bit word size, one can't use the typical "while (x < MAX) {}" loop without overflowing. They solved this by just saying U+FFFF isn't ever going to be a real code point, so you could say "while (x < 0xFFFF)" and cover everything of interest. They also wanted to reserve FFFE, since the Byte-Order Mark (FEFF) looks like that value when the byte ordering is wrong. You don't want a legal character to be confused with the BOM.

When Unicode was expanded beyond 16 bits, they created the plane concept, with Plane0 being 0-65535, Plane1 being the next set, etc. It was envisioned that software would work on a given plane, switching at times, so they reserved FFFF and FFFE on each of the 16 planes.

Eventually it became clear that there is a need for text-processing software to be able to have sentinel code points that it knows won't be in the middle of a stream of text that it is processing. Thus, they added the other non-character code points. (There may be a reason for these particular ones to be not-desirable to use for other purposes, but if so, I'm unaware of what it is.)
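
The set described above can be written down as a small predicate. This is only a sketch: `is_noncharacter` is a hypothetical helper name, not something perl provides, and it assumes the usual definition of the set as U+FDD0..U+FDEF plus the last two code points of every plane from 0 through 16.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $cp is a Unicode noncharacter code point.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp < 0 || $cp > 0x10FFFF;        # outside the Unicode range
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;   # the 32 sentinels added later
    return 1 if ($cp & 0xFFFE) == 0xFFFE;         # U+nFFFE and U+nFFFF on each plane
    return 0;
}

# Count them across the whole code space: 32 + 2 per plane * 17 planes.
my $count = 0;
for my $cp (0 .. 0x10FFFF) {
    $count++ if is_noncharacter($cp);
}
print "noncharacters: $count\n";
```

Note that U+F8FF (the private-use code point from the original report) is not in this set, while U+FDEF is.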

Non-character code points should not be foisted off on an unsuspecting application, unlike private-use code points. Software has been written expecting that it can use these code points for its own purposes and not have to worry about them being in an input stream. One should have a gate keeper that rejects these by default. An example is a text editor that is intended to edit any Unicode-conformant text. It doesn't know what any private-use character is intended for, nor does it need to know. What it knows is that such a character should be treated like any other. But a text editor may use an algorithm that intersperses characters that have special meaning to it with the ones that are being edited. That's what non-characters are for. A conformant text editor does not have to accept text with non-characters. A conformant text editor does have to accept text with private-use characters. The Corrigendum says "Noncharacter: A code point that is permanently reserved for internal use"

Now to the request. I agree that the warning is not severe; however we wanted it to be on by default, and the only way to do that currently in Perl is to make it "severe". The question is should you be warned if you are outputting a code point that is "permanently reserved for internal use". It sure sounds like it to me, but I can see the other side too. But that's why we made a new and separate warning category for just the input and output of these code points. If your application does this, it is a simple matter to say

"no warnings 'nonchar'"

to silence just them.
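
A minimal sketch of that opt-out, assuming a perl recent enough to have the 'nonchar' category. An in-memory handle with the `:utf8` layer stands in for STDOUT under `-CS`, and a `__WARN__` handler collects the messages so they can be counted; the exact wording of the message differs between perl versions, so the sketch only counts warnings.

```perl
use strict;
use warnings;

my @seen;                                    # collected warning messages
local $SIG{__WARN__} = sub { push @seen, $_[0] };

# In-memory output handle; the :utf8 layer is what makes print check the data.
my $buffer = '';
open my $out, '>:utf8', \$buffer or die "cannot open in-memory handle: $!";

print {$out} "\x{FDEF}";                     # default state: warning is emitted

{
    no warnings 'nonchar';                   # lexically scoped opt-out
    print {$out} "\x{FDEF}";                 # same output, no warning
}

close $out;
printf "%d warning(s) seen\n", scalar @seen;
```

Because the pragma is lexical, the opt-out can be limited to the one block that deliberately writes noncharacters.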

p5pRT commented 10 years ago

From @xdg

perldiag says this:

  Unicode non-character U+%X is illegal for open interchange

  (S utf8, nonchar) Certain codepoints, such as U+FFFE and U+FFFF,
  are defined by the Unicode standard to be non-characters. Those
  are legal codepoints, but are reserved for internal use; so,
  applications shouldn't attempt to exchange them. If you know what
  you are doing you can turn off this warning by "no warnings
  'nonchar';".

That seems to explain it solely as "non-characters" not "private characters".

From Karl's explanation and corrigendum #9, I think it's clear that "interchange" is allowed, even if it's an odd case. Certainly, they are not "illegal".

A new 'privatechar' warning category should be added to cover those distinct from 'nonchar', and I think the wording needs to be softer:

E.g.

  Unicode private character U+%x in %x, may not be portable

and

  Unicode non-character U+%x in %x, may not be portable

In those, the second "%x" would be the op that triggered the warning, akin to the "wide character" warnings.

Unlike the wide character warning, though, where the IO handle is wholly unprepared for character data, I'm not convinced that nonchar and privatechar need to be on by default, however. They should be enabled by "use warnings".

Of course, an IO layer should be able to decide if those are acceptable. E.g.

  binmode(STDOUT, ":utf8_private_strict");

Should something like that be created, it should allow private characters but warn on non characters.

David

-- 
David Golden <xdg@xdg.me>
Take back your inbox! → http://www.bunchmail.com/
Twitter/IRC: @xdg
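
Until something like the proposed 'privatechar' category exists, an application can do a comparable check itself. The following is a sketch under stated assumptions: `private_use_code_points` is a hypothetical helper, not part of perl, and the message text simply borrows the softer wording suggested above.

```perl
use strict;
use warnings;

# Hypothetical helper: list the private-use code points found in a string,
# using the General_Category property rather than hard-coded ranges.
sub private_use_code_points {
    my ($text) = @_;
    return map { sprintf('U+%04X', ord($_)) }
           $text =~ /(\p{General_Category=Private_Use})/g;
}

# U+F8FF and U+100000 are private-use; U+FFFF is a noncharacter and is
# therefore not reported by this check.
my $sample = "logo: \x{F8FF}, plane 16: \x{100000}, nonchar: \x{FFFF}";
my @found  = private_use_code_points($sample);

print "Unicode private character $_, may not be portable\n" for @found;
```

Running the check before writing to a handle keeps the decision in the application, which is where the "prior agreement" between parties is known.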

p5pRT commented 10 years ago

From tchrist@perl.com

I don't understand why you would ever want to issue a warning for emitting a PUA code point.

  use charnames ":alias" => {
      "APPLE CORPORATE LOGO" => 0xF8FF,
  };

  print "\N{APPLE CORPORATE LOGO}\n";

Let alone all the fun I have with my Tengwar module.

  ### This one matches the assignments of the Free Tengwar Font Project
  ###       @ http://freetengwar.sourceforge.net/
  use constant TENGWAR_BASE => _CONSCRIPT_UNICODE_REGISTRY;

  ### Whereas This one matches the official roadmap:
  # use constant TENGWAR_BASE => _UNICODE_CONSORTIIUM;

  ## if In file, can do this:
  ## use charnames ":full", ":alias" => "tengwar";

  use charnames ":full", ":alias" => { reverse (

      (TENGWAR_BASE + 0x00) => "TENGWAR LETTER TINCO",
      (TENGWAR_BASE + 0x01) => "TENGWAR LETTER PARMA",
      (TENGWAR_BASE + 0x02) => "TENGWAR LETTER CALMA",
      (TENGWAR_BASE + 0x03) => "TENGWAR LETTER QUESSE",
      (TENGWAR_BASE + 0x04) => "TENGWAR LETTER ANDO",
      (TENGWAR_BASE + 0x05) => "TENGWAR LETTER UMBAR",

  ....

--tom

p5pRT commented 10 years ago

From @tux

On Wed, 12 Feb 2014 21:56:33 -0700, Tom Christiansen <tchrist@perl.com> wrote:

> I don't understand why you would ever want to issue a warning for emitting a PUA code point.

I am really happy to see :alias being used like this. It clearly proves I am not the only one who uses it on a daily basis since 2002 :)

-- 
H.Merijn Brand  http://tux.nl   Perl Monger  http://amsterdam.pm.org/
using perl5.00307 .. 5.19   porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/        http://www.test-smoke.org/
http://qa.perl.org   http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented 10 years ago

From gpiero@rm-rf.it

Hello Karl,

thank you, very interesting explanation.

* [Wed, Feb 12, 2014 at 11:23:08AM -0800] karl williamson via RT:

> But a text editor may use an algorithm that intersperses characters that have special meaning to it with the ones that are being edited. That's what non-characters are for.

Oh well, I thought the UTC recommended the use of out-of-range sentinels in those cases (forward compatibility as the Unicode range expands is left as an open exercise). So now I guess I haven't understood the use cases for sentinels (except maybe the end-of-... cases).

On the other hand, both noncharacters and sentinels might not be the best choice here. I assume you're saving those "formatting codes" in your data files, as they have a meaning you don't want to lose. Now you (or someone else, for that matter) want to build a full office suite and start coding an email client or presentation software. You probably also want to be able to include or import the data files created with your text editor (hopefully preserving the formatting).
Does this mean that your noncharacters suddenly become illegal and you have to replace them all with private-use chars to be able to exchange them between the programs involved? Or do you consider the office suite to be a single 'system', so that you're allowed to use them?
In the latter case the two programs obviously have to agree upon their meaning, so... well, isn't that very similar to the definition of private-use characters?

I guess I would use private-use chars in most cases to avoid problems in the future... mmh, and I will probably end up with some private-use characters clashing... damn it...

> A conformant text editor does not have to accept text with non-characters. A conformant text editor does have to accept text with private-use characters.

Ok, I'm following you here. Nevertheless I'm wondering how practical it is... Hopefully I'm not the only one who in 2014 still doesn't have a clear understanding of Unicode. Hmm, on second thought, I do hope I am...

> Now to the request. I agree that the warning is not severe; however we wanted it to be on by default, and the only way to do that currently in Perl is to make it "severe". The question is should you be warned if you are outputting a code point that is "permanently reserved for internal use". It sure sounds like it to me, but I can see the other side too. But that's why we made a new and separate warning category for just the input and output of these code points. If your application does this, it is a simple matter to say
>
> "no warnings 'nonchar'"
>
> to silence just them.

Ok, so I misinterpreted the reason for the warning. I thought it was Perl telling me: "hey, you cannot in/out-put that char", and having to explicitly tell it: "no, no, I assure you I can... cf. Corrigendum #9" seemed somewhat wrong. But I see your point now.

Ciao,
Gian Piero.

p5pRT commented 10 years ago

From @xdg

On Wed, Feb 12, 2014 at 11:56 PM, Tom Christiansen <tchrist@perl.com> wrote:

> I don't understand why you would ever want to issue a warning for emitting a PUA code point.

Because they require prior agreement between parties, I think it sensible to (optionally) warn when private characters go into an IO stream.

This is particularly important for characters read from one source and output to another destination. Without knowing that the source and destination agree to the same semantics, unexpected results could occur and those are exactly the sort of things that Perl tends to warn users about.

Absolutely, the warning should not be on by default. When warnings are enabled, turning off the private character warning or bypassing the warning by putting on an IO layer that knows that private characters are OK would be the way to explicitly indicate "prior agreement".

I don't think ":utf8" should warn about private or non-characters. I do think :encoding(UTF-8) should (as it is currently defined as "strict"). I would love to have someone implement a middle ground that is strict about ill-formed data but allows private/non-characters. E.g. :encoding(UTF-8-any).

Then one could say C<< binmode($fh, ":encoding(UTF-8-any)") >> and merrily use private characters without issue, plus the code would be self-documenting that private/non character use is allowed.

If at some point in the future, we move to make ":utf8" itself "strict", then I would favor it having "UTF-8-any" semantics.

David

-- 
David Golden <xdg@xdg.me>
Take back your inbox! → http://www.bunchmail.com/
Twitter/IRC: @xdg
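
For readers unfamiliar with the "strict" distinction mentioned above, this sketch shows how the Encode module names its two UTF-8 flavours, going by Encode's own documentation. It does not implement the proposed `UTF-8-any` encoding, which remains hypothetical, and the bare `:utf8` layer is a separate mechanism that does not go through Encode at all.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# The hyphenated name selects the strict encoding object; the unhyphenated
# name selects Perl's more permissive variant.
my $strict = find_encoding('UTF-8');
my $lax    = find_encoding('utf8');

printf "UTF-8 resolves to '%s'\n", $strict->name;
printf "utf8  resolves to '%s'\n", $lax->name;
```

A middle-ground encoding of the kind described would presumably sit between these two, but how it would be registered is left open here.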

p5pRT commented 10 years ago

From gpiero@rm-rf.it

* [Wed\, Feb 12\, 2014 at 01​:37​:08PM -0800] David Golden via RT​:

A new 'privatechar' warning category should be added to cover those distinct from 'nonchar'\, and I think the wording needs to be softer​:

E.g.

Unicode private character U+%x in %x\, may not be portable

and

Unicode non-character U+%x in %x\, may not be portable

In those\, the second "%x" would be the op that triggered the warning\, akin to the "wide character" warnings.

* [Thu\, Feb 13\, 2014 at 06​:47​:07AM -0800] David Golden via RT​:

On Wed\, Feb 12\, 2014 at 11​:56 PM\, Tom Christiansen \tchrist@&#8203;perl\.com wrote​:

I don't understand why you would ever want to issue a warning for emitting a PUA code point.

Because they require prior agreement between parties\, I think it sensible to (optionally) warn when private characters go into a IO stream.

This is particularly important for characters read from one source and output to another destination. Without knowing that the source and destination agree to the same semantics\, unexpected results could occur and those are exactly the sort of things that Perl tends to warn users about.

Not yet sure\, but I think I agree with you about a warning related to private-use chars because of the reasons you exposed. Put it this way​: if you were on the receiving end of a streaming containing PU chars you didn't expect\, you probably wished that the developer of the offending code had been warned about the problem. Still\, I wonder if there should be a way to enable (extra-)optional warnings. Something like​:

use warnings; use warnings 'private_chars'; # not enabled by previous statement.

Unlike the wide character warning\, though\, where the IO handle is wholly unprepared for character data\, I'm not convinced that nonchar and privatechar need to be on by default\, however. They should be enabled by "use warnings".

Strongly uncertain about this matter. My first reaction was​: "absolutely they should be explicitly enabled via a 'use warnings'". But after thinking a bit about Karl's explanation\, I can see the point for having them always enabled. Quoting myself​: "if you were on the receiving end of a streaming containing... etc." (and you want the developer on the other side to _always_ see the warning). But then again\, setting it to be severe is unfriendly to one-liners\, when you have to type extra chars only to tell the interpreter that you don't care about those warnings. Well\, in the end I probably agree with you about this point too\, but I'm reserving the right to change idea in almost no time if I feel so.

Of course\, an IO layer should be able to decide if those are acceptable. E.g.

binmode(STDOUT\, "​:utf8_private_strict");

Should something like that be created\, it should allow private characters but warn on non characters.

Interesting idea. I also see a use for a layer that would accept Unicode non-characters but would continue to warn about non-Unicode characters. Please note that currently the 'nonchar' warning tag is confusing and probably its scope is too wide.

$ perl-5.19.8 -CS -le 'print "\x{FFFF_1234}"' >/dev/null
Code point 0xFFFF1234 is not Unicode, may not be portable at -e line 1.
$ perl-5.19.8 -CS -le 'no warnings "nonchar"; print "\x{FFFF_1234}"' >/dev/null

So, not only does it turn off warnings about non-characters, it also silences warnings about non-Unicode code points entirely. I think it makes sense to split it into two separate tags, 'nonchar' and 'nonunicode', probably lowering the severity of the former.
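For comparison, here is a minimal sketch of how the two categories behave when warnings are explicitly enabled first. It assumes Perl 5.14 or later, where 'nonchar' and 'non_unicode' are separate sub-categories of 'utf8'. The helper name warnings_when_printing is made up for illustration, and an in-memory handle stands in for STDOUT.

```perl
use strict;
use warnings;

# Count the warnings raised while printing $char to an in-memory :utf8 handle.
# $silence_nonchar selects whether the 'nonchar' category is disabled.
sub warnings_when_printing {
    my ($char, $silence_nonchar) = @_;
    my $count = 0;
    local $SIG{__WARN__} = sub { $count++ };
    open my $fh, '>:utf8', \my $buffer
        or die "cannot open in-memory handle: $!";
    if ($silence_nonchar) {
        no warnings 'nonchar';    # only this category is switched off
        print {$fh} $char;
    }
    else {
        print {$fh} $char;
    }
    close $fh;
    return $count;
}

printf "nonchar, default:            %d\n", warnings_when_printing(chr(0xFDEF),   0);
printf "nonchar, 'nonchar' off:      %d\n", warnings_when_printing(chr(0xFDEF),   1);
printf "non-Unicode, 'nonchar' off:  %d\n", warnings_when_printing(chr(0x110000), 1);
```

With "use warnings" in effect, disabling 'nonchar' should leave the non-Unicode warning in place, which suggests the behaviour shown above comes from how "no warnings" interacts with the default set rather than from the categories being merged.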

Back to the subject of layers\, I probably would also love a couple of layers that would silently strip off private-use and/or nonchars\, so that you could 'sanitize' your input without worrying about characters that you would discard anyway (but probably in a less efficient way).
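Until such layers exist, the same effect can be approximated after decoding. This is only a sketch: strip_private_and_nonchars is a hypothetical helper, not an existing PerlIO layer, and it relies on the \p{Co} (private use) and \p{Noncharacter_Code_Point} properties that Perl's regex engine provides.

```perl
use strict;
use warnings;

# Hypothetical helper (not an existing PerlIO layer): remove private-use
# code points (General_Category=Co) and Unicode noncharacters from a string.
sub strip_private_and_nonchars {
    my ($text) = @_;
    $text =~ s/[\p{Co}\p{Noncharacter_Code_Point}]//g;
    return $text;
}

# U+F8FF is private-use; U+FDEF and U+FFFF are noncharacters.
my $dirty = "a" . chr(0xF8FF) . "b" . chr(0xFDEF) . "c" . chr(0xFFFF);
my $clean = strip_private_and_nonchars($dirty);
printf "%d code points left\n", length $clean;
```

A real layer would do this during decoding instead of in a second pass over the string, which is the efficiency concern mentioned above.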

Ciao\, Gian Piero.

p5pRT commented 10 years ago

From gpiero@rm-rf.it

* [Mon\, Feb 17\, 2014 at 08​:23​:14PM +0100] Gian Piero Carrubba​:

Please note that currently the 'nonchar' warning tag is confusing and probably its scope is too wide.

$ perl-5.19.8 -CS -le 'print "\x{FFFF_1234}"' >/dev/null
Code point 0xFFFF1234 is not Unicode, may not be portable at -e line 1.
$ perl-5.19.8 -CS -le 'no warnings "nonchar"; print "\x{FFFF_1234}"' >/dev/null

So, not only does it turn off warnings about non-characters, it also silences warnings about non-Unicode code points entirely. I think it makes sense to split it into two separate tags, 'nonchar' and 'nonunicode', probably lowering the severity of the former.

Well\, not exactly. The 'non_unicode' tag already exists and is severe too\, but the 'no warnings "tag"' syntax acts ...mmh.. strangely.

$ perlbrew exec --with perl-5.19.8 perl -CS -l >/dev/null
  print "\x{FDEF}";
  binmode(STDOUT, ':encoding(notexist)');
Unicode non-character U+FDEF is illegal for open interchange at - line 1.
Cannot find encoding "notexist" at - line 2.

$ perlbrew exec --with perl-5.19.8 perl -CS -l >/dev/null
  no warnings 'io';
  print "\x{FDEF}";
  binmode(STDOUT, ':encoding(notexist)');
(no warnings)

wtf? Disabling a 'severe' warning results in _all_ warnings being silenced if you don't also explicitly say 'use warnings'. Definitely not what I would expect.

I cannot find it reported, nor does this behaviour seem to be documented. Should it be reported separately?

Ciao\, Gian Piero.

p5pRT commented 10 years ago

From @xdg

On Mon, Feb 17, 2014 at 2:23 PM, Gian Piero Carrubba <gpiero@rm-rf.it> wrote:

Unlike the wide character warning\, though\, where the IO handle is wholly unprepared for character data\, I'm not convinced that nonchar and privatechar need to be on by default\, however. They should be enabled by "use warnings".

Strongly uncertain about this matter. My first reaction was​: "absolutely they should be explicitly enabled via a 'use warnings'".

That's actually what I meant. I think nonchar and privatechar should not be "severe" warnings (that fire regardless of "use warnings"). I think they should be regular warnings that are enabled with "use warnings" like any other warning. They should not be "optional" warnings that need to be explicitly turned on -- because no one will bother to do so and thus there's little point.

David

-- David Golden <xdg@xdg.me> Take back your inbox! → http://www.bunchmail.com/ Twitter/IRC: @xdg

p5pRT commented 10 years ago

From @rjbs

949cf498 introduced "a new set of flags to disallow those code points." For example\, UNICODE_WARN_NONCHAR. Encode​::Unicode seems to always pass UNICODE_WARN_ILLEGAL_INTERCHANGE as its flags. Would exposing a means to tweak this be plausible?

In general\, I think the problem is that there are cases for wanting these warnings or not\, not only on a program or scope basis\, but per-handle. If I've opened a handle to some generic input file\, I may want to be alerted to any non-characters\, while a connection to another part of my internal infrastructure may be quite prepared to truck in them.

Making the warning non-severe seems reasonable, although I'm not worked up about it. To me, severe warnings are for things that are going to change in the future or that are almost certainly a mistake or ambiguity. Although non-Unicode (code points U+110000 and up) warnings seem like candidates for severe warnings, I'm not sure Unicode non-character warnings fit.
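Since the thread keeps switching between noncharacters, private-use characters, surrogates and non-Unicode code points, here is a small sketch that spells out the ranges as the Unicode standard defines them. The function name classify_code_point is invented for illustration; Perl itself does this classification in C.

```perl
use strict;
use warnings;

# Classify a numeric code point into the groups discussed in this ticket.
sub classify_code_point {
    my ($cp) = @_;
    return 'non-Unicode' if $cp > 0x10FFFF;
    return 'surrogate'   if $cp >= 0xD800 && $cp <= 0xDFFF;
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    return 'noncharacter'
        if ($cp >= 0xFDD0 && $cp <= 0xFDEF) || ($cp & 0xFFFE) == 0xFFFE;
    # Private use: the BMP area plus planes 15 and 16 (minus their noncharacters).
    return 'private-use'
        if ($cp >= 0xE000   && $cp <= 0xF8FF)
        || ($cp >= 0xF0000  && $cp <= 0xFFFFD)
        || ($cp >= 0x100000 && $cp <= 0x10FFFD);
    return 'other';
}

for my $cp (0x41, 0xF8FF, 0xFDEF, 0xFFFF, 0xD800, 0x110000) {
    printf "U+%04X: %s\n", $cp, classify_code_point($cp);
}
```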

I also don't know how commonly code is being run with no warnings enabled\, so I'm not sure how significant this distinction is in practice.

-- rjbs

p5pRT commented 10 years ago

From @xdg

On Tue, Feb 18, 2014 at 8:27 AM, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

In general\, I think the problem is that there are cases for wanting these warnings or not\, not only on a program or scope basis\, but per-handle. If I've opened a handle to some generic input file\, I may want to be alerted to any non-characters\, while a connection to another part of my internal infrastructure may be quite prepared to truck in them.

That's my rationale for having strict UTF-8 layers do warnif() to a category and letting users enable or disable those warnings as usual.

  use warnings;

  say $nonchar; # warns about wide char

  binmode(STDOUT, ':utf8');
  say $nonchar; # lax: doesn't warn

  binmode(STDOUT, ':encoding(UTF-8)');
  say $nonchar; # strict: warns about nonchar

  {
    no warnings 'nonchar';
    say $nonchar; # doesn't warn
  }

  binmode(STDOUT, ':encoding(UTF-8-any)');
  say $nonchar; # permissive: doesn't warn
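The warnif() part of that idea can be sketched in pure Perl. This is an illustration only: My::StrictWriter and check_text are hypothetical names, and a real strict layer would do the check in the encoding code itself. The point is that warnings::warnif() consults the caller's lexical warnings, so the usual "no warnings 'nonchar'" works.

```perl
use strict;
use warnings;

package My::StrictWriter;    # hypothetical module name

# Warn in the 'nonchar' category, but only if the *caller* has it enabled.
sub check_text {
    my ($text) = @_;
    if ($text =~ /\p{Noncharacter_Code_Point}/) {
        warnings::warnif('nonchar', 'text contains a Unicode noncharacter');
    }
    return $text;
}

package main;

# Run $code and return how many warnings it raised.
sub count_warnings {
    my ($code) = @_;
    my $count = 0;
    local $SIG{__WARN__} = sub { $count++ };
    $code->();
    return $count;
}

my $nonchar  = chr(0xFDEF);
my $enabled  = count_warnings(sub { My::StrictWriter::check_text($nonchar) });
my $disabled = count_warnings(sub {
    no warnings 'nonchar';
    My::StrictWriter::check_text($nonchar);
});
print "warnings with category enabled: $enabled, disabled: $disabled\n";
```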

-- David Golden <xdg@xdg.me> Take back your inbox! → http://www.bunchmail.com/ Twitter/IRC: @xdg

p5pRT commented 10 years ago

From @ikegami

On Mon, Feb 17, 2014 at 2:23 PM, Gian Piero Carrubba <gpiero@rm-rf.it> wrote:

Still\, I wonder if there should be a way to enable (extra-)optional warnings. Something like​:

use warnings;
use warnings 'private_chars'; # not enabled by previous statement.

C<< use warnings; >> is documented to enable all warnings. Don't break this promise.

p5pRT commented 10 years ago

From @pjcj

On Tue\, Feb 18\, 2014 at 12​:49​:26PM -0500\, Eric Brine wrote​:

On Mon, Feb 17, 2014 at 2:23 PM, Gian Piero Carrubba <gpiero@rm-rf.it> wrote:

Still\, I wonder if there should be a way to enable (extra-)optional warnings. Something like​:

use warnings;
use warnings 'private_chars'; # not enabled by previous statement.

C<< use warnings; >> is documented to enable all warnings. Don't break this promise.

I have no comments on this specific proposal\, but I do have thoughts about the idea of warnings which are not enabled by default. warnings.pm says​:

  If no import list is supplied\, all possible warnings are either enabled   or disabled.

First\, I'm not even sure whether this is totally accurate\, but even if it is\, I do not see it as a promise\, but rather as documenting the current situation. strict.pm says something very similar. In neither case do I see a problem with new categories being added which are not enabled by default.

However\, I do see a problem with adding new categories which are enabled by default and start complaining about constructs which may not be particularly problematic. And it would be a shame if there were no way to add such categories\, in both warnings and strict.

-- Paul Johnson - paul@​pjcj.net http​://www.pjcj.net

p5pRT commented 10 years ago

From @druud62

On 2014-02-18 20​:28\, Paul Johnson wrote​:

On Tue\, Feb 18\, 2014 at 12​:49​:26PM -0500\, Eric Brine wrote​:

On Mon, Feb 17, 2014 at 2:23 PM, Gian Piero Carrubba <gpiero@rm-rf.it> wrote:

Still\, I wonder if there should be a way to enable (extra-)optional warnings. Something like​:

use warnings;
use warnings 'private_chars'; # not enabled by previous statement.

C<< use warnings; >> is documented to enable all warnings. Don't break this promise.

I have no comments on this specific proposal\, but I do have thoughts about the idea of warnings which are not enabled by default. warnings.pm says​:

If no import list is supplied\, all possible warnings are either enabled or disabled.

First\, I'm not even sure whether this is totally accurate\, but even if it is\, I do not see it as a promise\, but rather as documenting the current situation. strict.pm says something very similar. In neither case do I see a problem with new categories being added which are not enabled by default.

However\, I do see a problem with adding new categories which are enabled by default and start complaining about constructs which may not be particularly problematic. And it would be a shame if there were no way to add such categories\, in both warnings and strict.

use warnings '​:most'; ;)

-- Ruud

p5pRT commented 10 years ago

From @ikegami

On Tue, Feb 18, 2014 at 2:28 PM, Paul Johnson <paul@pjcj.net> wrote:

On Tue\, Feb 18\, 2014 at 12​:49​:26PM -0500\, Eric Brine wrote​:

On Mon, Feb 17, 2014 at 2:23 PM, Gian Piero Carrubba <gpiero@rm-rf.it> wrote:

Still\, I wonder if there should be a way to enable (extra-)optional warnings. Something like​:

use warnings;
use warnings 'private_chars'; # not enabled by previous statement.

C<< use warnings; >> is documented to enable all warnings. Don't break this promise.

I have no comments on this specific proposal\, but I do have thoughts about the idea of warnings which are not enabled by default. warnings.pm says​:

If no import list is supplied\, all possible warnings are either enabled or disabled.

First\, I'm not even sure whether this is totally accurate\, but even if it is\, I do not see it as a promise\, but rather as documenting the current situation.

If this is the case, I haven't been doing and recommending what I think I have been. What should I use to enable all warnings if not C<< use warnings; >> (which is documented to be C<< use warnings ':all'; >>)?

However, I do see a problem with adding new categories which are enabled by default and start complaining about constructs which may not be particularly problematic. And it would be a shame if there were no way to add such categories, in both warnings and strict.

I didn't say they should be enabled by default. I said they should be enabled by C<< use warnings ':all'; >> aka C<< use warnings; >>.

p5pRT commented 10 years ago

From @khwilliamson

Regardless of what the ultimate disposition of this is\, I have attached a patch that would clarify the current situation for at least 5.20. Any objections to it?

p5pRT commented 10 years ago

From @khwilliamson

0002-Proposed-5.20-wording-for-non-char-code-point-usage.patch
```diff
From 6b1134ef7e53209fcf4f197707a95e4b5b330f86 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Mon, 21 Apr 2014 08:49:00 -0600
Subject: [PATCH 2/2] Proposed 5.20 wording for non-char code point usage

This clarifies how things work in 5.20.
---
 pod/perldiag.pod | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 5482684..64a1bff 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -5739,9 +5739,15 @@ with the characters in the Lao and Thai scripts.
 (S nonchar) Certain codepoints, such as U+FFFE and U+FFFF, are
 defined by the Unicode standard to be non-characters. Those are
 legal codepoints, but are reserved for internal use; so, applications
-shouldn't attempt to exchange them. If you know what you are doing
+shouldn't attempt to exchange them. An application may not be
+expecting any of these characters at all, and receiving them
+may lead to bugs. If you know what you are doing
 you can turn off this warning by C.
 
+This is not really a "serious" error, but it is supposed to be raised
+by default even if warnings are not enabled, and currently the only
+way to do that in Perl is to mark it as serious.
+
 =item Unicode surrogate U+%X is illegal in UTF-8
 
 (S surrogate) You had a UTF-16 surrogate in a context where they are
-- 
1.8.3.2
```
p5pRT commented 10 years ago

From @rjbs

* Karl Williamson via RT <perlbug-followup@perl.org> [2014-04-21T10:57:26]

Regardless of what the ultimate disposition of this is\, I have attached a patch that would clarify the current situation for at least 5.20. Any objections to it?

None.

-- rjbs

p5pRT commented 10 years ago

From @jhi

It seems that Perl is lagging on the handling for Unicode "non-characters" [1]​: they are these days valid for interchange​:

http​://www.unicode.org/versions/corrigendum9.html

In other words\, they should be handled much like PUA (private use area) characters [2]​: passed through as-is.

How we are currently doing it wrong​:

(a) ./perl -CO -we 'print chr(0xFFFF)'

Unicode non-character U+FFFF is illegal for open interchange at -e line 1. �%

(Somewhat strangely\, the -CO is required for the warning to appear.)

We shouldn't warn.

It is possible we still could warn somehow\, to alert users about the special nature of "non-characters" (a very unfortunate name)\, but they are definitely legal characters\, and they can be interchanged. (They are not *intended* for interchange\, but that is quite different from "forbidden".)

(b) In Encode\, the "utf8" lets the non-chars through\, but the strict "UTF-8" mangles them to the Unicode REPLACEMENT CHARACTER U+FFFD​:

./perl -Ilib -MEncode=decode -MDevel::Peek -we 'Dump(decode("utf8", "\xEF\xBF\xBF"))'
SV = PV(0x7ffba18041f0) at 0x7ffba1803438
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x7ffba143e6c0 "\357\277\277"\0 [UTF8 "\x{ffff}"]
  CUR = 3
  LEN = 16

./perl -Ilib -MEncode=decode -MDevel::Peek -we 'Dump(decode("UTF-8", "\xEF\xBF\xBF"))'
SV = PV(0x7ff34104aa50) at 0x7ff341031f28
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x7ff340d022e0 "\357\277\275"\0 [UTF8 "\x{fffd}"]
  CUR = 3
  LEN = 16

We shouldn't mangle.
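The same comparison can be made without Devel::Peek. A minimal sketch, with decoded_code_points as a made-up helper name: the lax "utf8" encoding passes U+FFFF through, while the result for strict "UTF-8" depends on the installed Encode version, so run it against the perl in question instead of relying on a fixed answer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode $bytes with the named encoding and return the resulting code points
# formatted as U+XXXX strings.
sub decoded_code_points {
    my ($encoding, $bytes) = @_;
    my $text = decode($encoding, $bytes);
    return map { sprintf 'U+%04X', ord } split //, $text;
}

my $bytes = "\xEF\xBF\xBF";    # UTF-8 byte sequence for U+FFFF
for my $encoding ('utf8', 'UTF-8') {
    print "$encoding: ", join(' ', decoded_code_points($encoding, $bytes)), "\n";
}
```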


[1] http://www.unicode.org/faq/private_use.html#nonchar1
[2] http://www.unicode.org/faq/private_use.html

p5pRT commented 10 years ago

From @jhi

On Wednesday-201405-21\, 9​:55\, Jarkko Hietaniemi (via RT) wrote​:

How we are currently doing it wrong​:

Should've said​:

"Currently known wrongnesses include\, but are probably not limited to"

p5pRT commented 10 years ago

From @tonycoz

On Wed May 21 06​:55​:23 2014\, jhi wrote​:

It seems that Perl is lagging on the handling for Unicode "non-characters" [1]​: they are these days valid for interchange​:

http​://www.unicode.org/versions/corrigendum9.html

In other words\, they should be handled much like PUA (private use area) characters [2]​: passed through as-is.

How we are currently doing it wrong​:

(a) ./perl -CO -we 'print chr(0xFFFF)'

Unicode non-character U+FFFF is illegal for open interchange at -e line 1. �%

(Somewhat strangely\, the -CO is required for the warning to appear.)

We shouldn't warn.

It is possible we still could warn somehow\, to alert users about the special nature of "non-characters" (a very unfortunate name)\, but they are definitely legal characters\, and they can be interchanged. (They are not *intended* for interchange\, but that is quite different from "forbidden".)

(b) In Encode\, the "utf8" lets the non-chars through\, but the strict "UTF-8" mangles them to the Unicode REPLACEMENT CHARACTER U+FFFD​:

./perl -Ilib -MEncode=decode -MDevel::Peek -we 'Dump(decode("utf8", "\xEF\xBF\xBF"))'
SV = PV(0x7ffba18041f0) at 0x7ffba1803438
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x7ffba143e6c0 "\357\277\277"\0 [UTF8 "\x{ffff}"]
  CUR = 3
  LEN = 16

./perl -Ilib -MEncode=decode -MDevel::Peek -we 'Dump(decode("UTF-8", "\xEF\xBF\xBF"))'
SV = PV(0x7ff34104aa50) at 0x7ff341031f28
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x7ff340d022e0 "\357\277\275"\0 [UTF8 "\x{fffd}"]
  CUR = 3
  LEN = 16

We shouldn't mangle.

This looks like a duplicate of

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121226

Tony

p5pRT commented 10 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 10 years ago

From @jhi

On Wednesday-201405-21\, 23​:42\, Tony Cook via RT wrote​:

This looks like a duplicate of

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121226

Yup\, the same issue.

FWIW\, I started poking at this.

p5pRT commented 10 years ago

From @khwilliamson

On Thu May 22 05​:32​:49 2014\, jhi wrote​:

On Wednesday-201405-21\, 23​:42\, Tony Cook via RT wrote​:

This looks like a duplicate of

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121226

Yup\, the same issue.

FWIW\, I started poking at this.

I have now merged these two tickets. I've been thinking about and doing some research in the Unicode standard about this\, and am having trouble with the idea that we should now just change to accept non-characters without warning.

Non-characters are still "permanently reserved for internal use"\, quoting from Corrigendum #9. I want to emphasize that word "internal". An application should be able to presume that data it receives from an external source does not contain non-characters\, so it is free to use them in any way it wishes. This is the whole point of non-characters\, to have some code points available for you that you are assured won't be coming from somewhere else.

And how do things come from somewhere else? Through I/O. Hence, the presumption by Perl should be that I/O is related to an external interface. It may be that an application is composed of cooperating processes that communicate via I/O, but Perl's presumption must be, unless indicated otherwise, that I/O is for external interfaces.

An application that uses non-characters will want its inputs to not have any of them coming in to it. It wants them filtered out; the best choice is to have them turned into REPLACEMENT CHARACTERS. My claim is that Perl should do this by default. Corrigendum #9 doesn't change this. And there should be a way to change the default. That is what Corrigendum #9 makes clear, and which Perl already does in (too) many cases. That Corrigendum was not aimed at Perl, but at other Unicode implementations. My point is that Perl already implements this Corrigendum, and neither needs to nor should change because of it.
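As an illustration of the filtering described here, a rough sketch of what "turned into REPLACEMENT CHARACTERS" amounts to on an already-decoded string. replace_nonchars is a hypothetical helper, not something Perl provides; in Perl proper this happens inside the strict decoder.

```perl
use strict;
use warnings;

# Hypothetical input filter: replace every noncharacter with U+FFFD and leave
# everything else, including private-use characters, untouched.
sub replace_nonchars {
    my ($text) = @_;
    (my $copy = $text) =~ s/\p{Noncharacter_Code_Point}/\x{FFFD}/g;
    return $copy;
}

# An application using U+FFFE as an internal marker would not want an
# externally supplied U+FFFE to survive input.
my $input    = "id=" . chr(0xFFFE) . "42";
my $filtered = replace_nonchars($input);
printf "U+%04X\n", ord substr($filtered, 3, 1);
```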

We have long ago agreed that the default input for Perl should be strict\, and that explicit action should be taken to override that. strict input should continue to exclude non-characters. If we were to change that\, existing applications would be suddenly and silently exposed to security holes\, where an attacker who knows the internal structure of the application inserts non-characters to fool it.

Let me reiterate my main point. We already implement Corrigendum #9. We should not make changes because of it.

Private-use characters are not the same as non-characters. An application has no right to presume that external inputs don't include private-use characters. But it is free to ascribe its own meanings to them. In practice\, most applications will just treat them as some generic code points.

I think David Golden's ideas would be a useful addition\, but it's not my itch. I would be happy to consult with someone who wishes to scratch it though -- Karl Williamson

p5pRT commented 10 years ago

From @jhi

There's input and there's output.

I agree that default input should be strict​: but I think stricter than what we have now\, e.g. not accept U+200000. And not accept non-chars.

(There's also more spectrum than just spewing warnings​: currently we generate U+FFFD but then *continue* reading. We could e.g. truncate and stop reading\, and/or croak...)

I am not entirely certain about the definition of "internal" here\, though. Internal to what? One "process"? What if Perl is just a "library" and not an "application"? A set of Perl applications? A set of mixed applications?

But on output if I output U+FFFF I don't want to output U+FFFD. (This doesn't happen now\, either​: we just warn. But being strict the wrong way\, this could happen.) This is no different from chr(0xFFFF)\, really\, if I write that I don't want magic making it chr(0xFFFD).

Again\, quoting the C9​: "However\, they are not illegal in interchange nor do they cause ill-formed Unicode text. This has always been the intent of the standard\, as expressed by the Unicode Technical Committee." So us currently warning non-chars being illegal for interchange is wrong. They are not.

p5pRT commented 10 years ago

From @jhi

On Thursday-201405-29\, 15​:49\, Jarkko Hietaniemi wrote​:

This is no different from chr(0xFFFF)\, really\, if I write that I don't want magic making it chr(0xFFFD).

Or this​:

perl -MEncode=decode -MDevel​::Peek -we 'Dump(decode("UTF-8"\, "\xEF\xBF\xBF"))'

giving me the bytes \xEF\xBF\xBD\, aka U+FFFD.

p5pRT commented 10 years ago

From @khwilliamson

On 05/29/2014 01​:49 PM\, Jarkko Hietaniemi wrote​:

There's input and there's output.

I agree that default input should be strict​: but I think stricter than what we have now\, e.g. not accept U+200000. And not accept non-chars.

This has been hashed around a lot before\, and I think every one now agrees with you here.

(There's also more spectrum than just spewing warnings​: currently we generate U+FFFD but then *continue* reading. We could e.g. truncate and stop reading\, and/or croak...)

Perhaps options.

I am not entirely certain about the definition of "internal" here\, though. Internal to what? One "process"? What if Perl is just a "library" and not an "application"? A set of Perl applications? A set of mixed applications?

That's why there has to be flexibility. We have to make the default the sanest and safest\, but allow the programmer(s) to override it for their needs.

But on output if I output U+FFFF I don't want to output U+FFFD. (This doesn't happen now\, either​: we just warn. But being strict the wrong way\, this could happen.) This is no different from chr(0xFFFF)\, really\, if I write that I don't want magic making it chr(0xFFFD).

Agreed. The reason we warn is so you know you're outputting something somebody else likely won't be able to handle. The only time something should be translated into FFFD is on input. I don't know about ENV or ARGV.

Again, quoting the C9: "However, they are not illegal in interchange nor do they cause ill-formed Unicode text. This has always been the intent of the standard, as expressed by the Unicode Technical Committee." So us currently warning non-chars being illegal for interchange is wrong. They are not.

The wording should change, but I do believe there should be a warning nonetheless. I do wish Unicode had phrased the original and the Corrigendum better. They do seem to me to have an aversion to straightforward language.

p5pRT commented 9 years ago

From @jhi

There was some discussion but it was all over the place\, and this ticket as such is pretty useless. Rejecting.

p5pRT commented 9 years ago

@jhi - Status changed from 'open' to 'rejected'