Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

$! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+ #13208

Closed p5pRT closed 10 years ago

p5pRT commented 10 years ago

Migrated from rt.perl.org#119499 (status was 'resolved')

Searchable as RT119499$

p5pRT commented 10 years ago

From victor@vsespb.ru

$! is returned as a character string under 5.19.2+ and UTF-8 locales, but as a byte string under single-byte encoding locales.

I believe this is useless and just makes it harder to decode the value of $! properly.

Also, I am not sure whether it will be possible to decode it when a language with Latin-1-only characters is in use.

LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl -MPOSIX -MDevel::Peek -e '$!=EACCES; Dump "$!"'

SV = PV(0x144dd80) at 0x14702a0
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK,UTF8)
  PV = 0x1468e30 "\320\236\321\202\320\272\320\260\320\267\320\260\320\275\320\276 \320\262 \320\264\320\276\321\201\321\202\321\203\320\277\320\265"\0 [UTF8 "\x{41e}\x{442}\x{43a}\x{430}\x{437}\x{430}\x{43d}\x{43e} \x{432} \x{434}\x{43e}\x{441}\x{442}\x{443}\x{43f}\x{435}"]
  CUR = 34
  LEN = 40

LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.CP1251 LC_MESSAGES=ru_RU.CP1251 perl -MPOSIX -MDevel::Peek -e '$!=EACCES; Dump "$!"'

SV = PV(0x1db8d80) at 0x1ddf7e0
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK)
  PV = 0x1f680d0 "\316\362\352\340\347\340\355\356 \342 \344\356\361\362\363\357\345"\0
  CUR = 18
  LEN = 24
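A minimal sketch that shows the same difference without Devel::Peek, assuming the same locales are installed:

    use strict;
    use warnings;
    use POSIX qw(EACCES);

    # Assign a known errno and look at the string form of $!.
    $! = EACCES;
    my $msg = "$!";

    # utf8::is_utf8() reports whether the scalar carries the UTF-8 flag;
    # under 5.19.2+ it is set in UTF-8 locales and not set otherwise.
    # Printing $msg raw may itself trigger a "Wide character" warning,
    # which is part of the complaint later in this thread.
    printf "LC_ALL=%s utf8_flag=%d [%s]\n",
        $ENV{LC_ALL} // '(unset)', utf8::is_utf8($msg) ? 1 : 0, $msg;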

p5pRT commented 10 years ago

From victor@vsespb.ru

This seems to be the result of the fix for https://rt-archive.perl.org/perl5/Ticket/Display.html?id=112208.

However, I think the fix is wrong.

1) It breaks old code which:

a) tries to decode $! using Encode::decode and I18N::Langinfo::langinfo(I18N::Langinfo::CODESET())

b) prints error messages to the screen as-is (without "binmode STDOUT :encoding")

2) Sometimes it returns a byte string (under non-UTF-8 locales, or when the message is ASCII-only), sometimes a character string (when the locale is UTF-8).

It's hard to distinguish one from the other. A possible solution is utf8::is_utf8(), but use of utf8::is_utf8 is advertised as dangerous.

Another solution is to use Encode::decode_utf8 when the locale is UTF-8 (but not Encode::decode("UTF-8", ...)).

The problem is that this method's documentation is wrong; several people have reported this:

https://rt.cpan.org/Public/Bug/Display.html?id=87267 https://rt.cpan.org/Public/Bug/Display.html?id=61671 https://github.com/dankogai/p5-encode/pull/11 https://github.com/dankogai/p5-encode/pull/10

3) It's not documented in perllocale, perlunicode, or perlvar.

4) It's not clear how it works in the case of Latin-1 characters in a UTF-8 locale.

p5pRT commented 10 years ago

From @khwilliamson

On 08/28/2013 02:52 AM, Victor Efimov (via RT) wrote:

# New Ticket Created by Victor Efimov
# Please include the string: [perl #119499]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=119499 >

I am trying to understand your issues with this change. I believe it is working correctly now.

$! is returned as a character string under 5.19.2+ and UTF-8 locales, but as a byte string under single-byte encoding locales.

I don't understand your use of the word 'binary' here. In both cases, it returns characters in contexts where strings are appropriate, and the numeric value in contexts where numbers are appropriate.

In string contexts, it returns the appropriate encoding. In UTF-8 locales, it returns the UTF-8 encoded character string. In non-UTF-8 locales, it returns the single-byte string in the correct encoding.

I believe this is useless and just makes it harder to decode the value of $! properly.

I don't have a clue as to why you think this is useless. This change was to fix https://rt-archive.perl.org/perl5/Ticket/Display.html?id=112208 (reported also as perl #117429, so more than one person found this to be a bug). The patch merely examines the string text of $!, and if it is UTF-8, sets the flag indicating that.

Code that is trying to decode $! should be using the (constant) numeric value rather than trying to parse the (locale-dependent) string.

Also, I am not sure whether it will be possible to decode it when a language with Latin-1-only characters is in use.

Again, use the numeric value when trying to parse the error.

LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl -MPOSIX -MDevel::Peek -e '$!=EACCES; Dump "$!"'

SV = PV(0x144dd80) at 0x14702a0
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK,UTF8)
  PV = 0x1468e30 "\320\236\321\202\320\272\320\260\320\267\320\260\320\275\320\276 \320\262 \320\264\320\276\321\201\321\202\321\203\320\277\320\265"\0 [UTF8 "\x{41e}\x{442}\x{43a}\x{430}\x{437}\x{430}\x{43d}\x{43e} \x{432} \x{434}\x{43e}\x{441}\x{442}\x{443}\x{43f}\x{435}"]
  CUR = 34
  LEN = 40

I ran this, substituting 'say $!' for the Dump, and got this output: Отказано в доступе

which is the correct Cyrillic text. Prior to the patch, this would have printed garbage.

LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.CP1251 LC_MESSAGES=ru_RU.CP1251 perl -MPOSIX -MDevel::Peek -e '$!=EACCES; Dump "$!"'

SV = PV(0x1db8d80) at 0x1ddf7e0
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK)
  PV = 0x1f680d0 "\316\362\352\340\347\340\355\356 \342 \344\356\361\362\363\357\345"\0
  CUR = 18
  LEN = 24

I do not have a Windows machine with CP1251, but I examined this dump by hand, and the characters are Отказано в доступе in that code page. So this looks proper.

p5pRT commented 10 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 10 years ago

From victor@vsespb.ru

Code that is trying to decode $! should be using the (constant) numeric value rather than trying to parse the (locale-dependent) string.

I am not trying to parse $!. I am trying to print the original error message to the screen for the user.

In string contexts, it returns the appropriate encoding. In UTF-8 locales, it returns the UTF-8 encoded character string. In non-UTF-8 locales, it returns the single-byte string in the correct encoding.

It is just wrong to sometimes return bytes and sometimes characters.

The following example worked fine before this change​:

use strict;
use warnings;
use I18N::Langinfo;
use Encode;

my $enc = I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
binmode STDOUT, ":encoding($enc)";

my $filename = "not a file ".chr(0x444);

open my $f, "<", $filename or do {
    my $error = "$!";
    $error = decode($enc, "$error");
    print "Error accessing file $filename: $error\n";
};

but with this change​:

- under non-Unicode locales it works fine.
- under UTF-8 locales it fails with "Cannot decode string with wide characters".

Possible fix for this example is​:

Replace

    $error = decode($enc, "$error");

with

    $error = utf8::is_utf8($error) ? $error : decode($enc, "$error");

Another place where it breaks old code is​:

perl -e 'open my $f, "<", "notafile" or die $!'

This now prints the warning "Wide character in die" when the locale is UTF-8 and the message contains wide characters.

I ran this, substituting 'say $!' for the Dump, and got this output: Отказано в доступе which is the correct Cyrillic text. Prior to the patch, this would have printed garbage.

No, prior to this patch it printed the correct (same) text, but without "Wide character" warnings.

p5pRT commented 10 years ago

From @Leont

On Wed, Aug 28, 2013 at 10:52 AM, Victor Efimov <perlbug-followup@perl.org> wrote:

$! is returned as a character string under 5.19.2+ and UTF-8 locales, but as a byte string under single-byte encoding locales.

I believe this is useless and just makes it harder to decode the value of $! properly.

Automatic decoding is definitely the more useful behavior. I agree the inconsistency is a bad thing, though. Not sure it's easy to fix. Patches welcome.

Also, I am not sure whether it will be possible to decode it when a language with Latin-1-only characters is in use.

AFAIK that should work perfectly fine.

Leon

p5pRT commented 10 years ago

From victor@vsespb.ru

Automatic decoding is definitely the more useful behavior

Yes, when: a) it's documented (perllocale, perlunicode, or perlvar);

b) it doesn't break existing code; OR c) it is turned on with 'use feature' or something.

I agree the inconsistency is a bad thing, though.

Yes, especially when it's sometimes bytes and sometimes characters, and you have to check the UTF-8 flag.

Not sure it's easy to fix.

I think in Perl you can get the encoding with I18N::Langinfo::langinfo(I18N::Langinfo::CODESET()) and then decode using the Encode module (both are core modules).

Perhaps that can be fixed in Perl code, in Errno (we already load Errno when %! is accessed), which would auto-load I18N::Langinfo and Encode?
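A rough sketch of the Perl-level decoding suggested here, assuming a platform where I18N::Langinfo and nl_langinfo(CODESET) work (this is not what Errno actually does):

    use strict;
    use warnings;
    use Encode ();
    use I18N::Langinfo qw(langinfo CODESET);

    # Decode the current $! using the locale's codeset.  On 5.19.2+ the
    # string may already be decoded (UTF-8 flag set), so the utf8::is_utf8
    # guard shown earlier in this ticket would still be needed there.
    sub errno_message {
        my $codeset = langinfo(CODESET());
        my $raw     = "$!";
        return Encode::decode($codeset, $raw, Encode::FB_CROAK);
    }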

And I am totally unsure about the perl C internals.

Patches welcome.

I cannot do C coding.

Also, I think that old code relying on the old behaviour was not relying on something undocumented.

It was partly documented:

http://perldoc.perl.org/perllocale.html

Note especially that the string value of $! and the error messages given by external utilities may be changed by LC_MESSAGES

(Also, perllocale now has updates related to $! in blead.)

http://perldoc.perl.org/perlunicode.html

there are still many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both, but it is not.

So the ideal fix would be, IMHO: 1. document it (perllocale, perlunicode, or perlvar); 2. decode $! under non-UTF-8 locales too, always returning character strings; 3. turn on the new behaviour only with 'use feature'.

p5pRT commented 10 years ago

From victor@vsespb.ru

There is a distribution which decodes POSIX::strerror with I18N::Langinfo:

http://search.cpan.org/~kryde/I18N-Langinfo-Wide-7/

Also, another possible problem: all the examples in the perl documentation with "die $!" now raise warnings:

http://perldoc.perl.org/perlopentut.html

  open(INFO, "datafile") || die("can't open datafile: $!");
  open(INFO, "< datafile") || die("can't open datafile: $!");
  open(RESULTS,"> runstats") || die("can't open runstats: $!");
  open(LOG, ">> logfile ") || die("can't open logfile: $!");

========
$ LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl -e 'open(INFO, "datafile") || die $!;'
Wide character in die at -e line 1.
Нет такого файла или каталога at -e line 1.

p5pRT commented 10 years ago

From sog@msg.mx

Time to set PERL_UNICODE=SL ?


Salvador Ortiz.

p5pRT commented 10 years ago

From @cpansprout

On Wed Aug 28 10:19:47 2013, public@khwilliamson.com wrote:

On 08/28/2013 02:52 AM, Victor Efimov (via RT) wrote:

I believe this is useless and just makes it harder to decode the value of $! properly.

I don't have a clue as to why you think this is useless. This change was to fix https://rt-archive.perl.org/perl5/Ticket/Display.html?id=112208 (reported also as perl #117429, so more than one person found this to be a bug). The patch merely examines the string text of $!, and if it is UTF-8, sets the flag indicating that.

You are describing this from the point of view of the internals. From the user’s standpoint, this means you are decoding $! if the character set is UTF-8, but leaving it encoded otherwise.

This means even #112208 is not fixed, because the test case was ‘use open <:std :encoding(utf-8)>’ followed by $!. If $! is not utf-8 and you try to feed it through STDOUT, you still get garbage on the screen.

The ultimate problem is that perl has no way of guaranteeing that $! can be fed to STDOUT and come out correctly. Even if it could do that, there is no way for it to tell that STDOUT/STDERR is where $! is going to go.

So now, $! may or may not be encoded, and you have no way of telling reliably without doing the same environment checks that perl itself did internally before deciding to decode $! itself.

--

Father Chrysostomos

p5pRT commented 10 years ago

From victor@vsespb.ru

On Wed Aug 28 23:40:08 2013, sprout wrote:

So now, $! may or may not be encoded, and you have no way of telling reliably without doing the same environment checks that perl itself did internally before deciding to decode $! itself.

Small corrections:

a) Actually, there is a way: check the is_utf8($!) flag (which is not good, because is_utf8 is marked as dangerous, and it's documented that you can't distinguish characters from bytes with this flag).

b) The current fix does not do environment checks; it just tries to do a UTF-8 validity check: http://perl5.git.perl.org/perl.git/commitdiff/1500bd919ffeae0f3252f8d1bb28b03b043d328e

p5pRT commented 10 years ago

From @khwilliamson

On 08/29/2013 02:15 AM, Victor Efimov via RT wrote:

On Wed Aug 28 23:40:08 2013, sprout wrote:

So now, $! may or may not be encoded, and you have no way of telling reliably without doing the same environment checks that perl itself did internally before deciding to decode $! itself.

I don't follow these arguments. What that commit did is only to look at the string returned by the operating system, and if it is encoded in UTF-8, to set that flag in the scalar. That's it (*). If the OS didn't return UTF-8, it leaves the flag alone. I find it hard to comprehend that this isn't the right thing to do. For the first time, $! in string context is no different than any other string scalar in Perl. They have a utf-8 bit set which means that the encoding is in UTF-8, or they don't have it set, which means that the encoding is unknown to Perl. This commit did not change the latter part one iota. We have conventions as to what the bytes in that scalar mean depending on the context it is used in, the pragmas that are in effect in those contexts, and the operations that are being performed on it. But they are just conventions. This commit did not change that.

What is different about $! is that we have made the decision to respect locale when accessing it even when not in the scope of 'use locale'. In light of these issues, perhaps this should be discussed again. I'll let the people who argued for that decision argue for it again.

The change fixed two bug reports for the common case where the locales for messages and the I/O matched and where people had not taken pains to deal with locale. I think that should trump the less frequent cases, given the conflicts.

If code wants $! to be expressed in a certain language, it should set the locale to that language while accessing $! and then restore the old locale.
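A sketch of that suggestion, assuming POSIX::setlocale is available and the platform supports LC_MESSAGES (the helper name is made up for illustration):

    use strict;
    use warnings;
    use POSIX qw(setlocale LC_MESSAGES);

    # Fetch the message for an errno value in a specific locale, then
    # restore whatever LC_MESSAGES was in effect before.
    sub strerror_in_locale {
        my ($errno, $locale) = @_;     # e.g. (POSIX::EACCES, "ru_RU.UTF-8")
        my $old = setlocale(LC_MESSAGES);
        setlocale(LC_MESSAGES, $locale);
        local $! = $errno;
        my $msg = "$!";
        setlocale(LC_MESSAGES, $old);
        return $msg;
    }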

Small corrections:

a) Actually, there is a way: check the is_utf8($!) flag (which is not good, because is_utf8 is marked as dangerous, and it's documented that you can't distinguish characters from bytes with this flag).

I don't see that danger marked currently in the pod for utf8.pm. Where do you see that?

b) The current fix does not do environment checks; it just tries to do a UTF-8 validity check: http://perl5.git.perl.org/perl.git/commitdiff/1500bd919ffeae0f3252f8d1bb28b03b043d328e

(*) To be precise:

1) If the string returned by the OS is entirely ASCII, it does not set the UTF-8 flag. This is because ASCII UTF-8 and non-UTF-8 are identical, so the flag is irrelevant. And yes, this is buggy if operating under a non-ASCII 7-bit locale, as in ISO 646. These locales have all been superseded so should be rare today, but a bug report could be written on this.

2) As Victor notes, the commit does a UTF-8 validity check, so it is possible that that could give false positives. But as Wikipedia says, "One of the few cases where charset detection works reliably is detecting UTF-8. This is due to the large percentage of invalid byte sequences in UTF-8, so that text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test." (The original emphasized "extremely".) I checked this out with the CP1251 character set, and the only modern Russian character that could be a continuation byte is ё. All other vowels and consonants must be start bytes. That means that to generate a false positive, an OS message in CP1251 must only contain words whose 2nd, 4th, ... bytes are that vowel. That just isn't going to happen, though the common Russian word Её (her, hers, ...) could be confusable if there were no other words in the message.
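The heuristic can be illustrated with the CP1251 bytes from the dump earlier in this ticket; a strict UTF-8 decode rejects them (a sketch, not the code the commit actually uses):

    use strict;
    use warnings;
    use Encode ();

    # CP1251 bytes for "Отказано в доступе", as shown in the Devel::Peek dump.
    my $bytes = "\xCE\xF2\xEA\xE0\xE7\xE0\xED\xEE \xE2 \xE4\xEE\xF1\xF2\xF3\xEF\xE5";

    # A strict decode of a copy fails, so this message would not be
    # mistaken for UTF-8 by a validity check.
    my $copy = $bytes;
    my $is_valid_utf8 = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
    print "passes a UTF-8 validity test: $is_valid_utf8\n";   # prints 0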

p5pRT commented 10 years ago

From victor@vsespb.ru

On Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:

I don't see that danger marked currently in the pod for utf8.pm. Where do you see that?

http://perldoc.perl.org/perlunifaq.html#What-is-%22the-UTF8-flag%22?

Please, unless you're hacking the internals, or debugging weirdness, don't think about the UTF8 flag at all. That means that you very probably shouldn't use is_utf8, _utf8_on or _utf8_off at all.

2) As Victor notes\, the commit does a UTF-8 validity check\, so it is possible that that could give false positives. But as Wikipedia says\, "One of the few cases where charset detection works reliably is detecting UTF-8. This is due to the large percentage of invalid byte sequences in UTF-8\, so that text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test." (The original emphasized "extremely".) I checked this out with the CP1251 character set\, and the only modern Russian character that could be a continuation byte is ё. All other vowels and consonants must be start bytes. That means that to generate a false positive\, an OS message in CP1251 must only contain words whose 2nd\, 4th\, ... bytes are that vowel. That just isn't going to happen\, though the common Russian word Её (her\, hers\, ...) could be confusable if there were no other words in the message.

I agree that it's pretty reliable. However, different languages and different encodings can show different misdetection rates. For example, the rate for CP866 (probably an ancient encoding) is higher than for CP1251. Also, the Russian alphabet does not contain A-Z characters, unlike German or French, so a French error message can contain just a couple of non-ASCII characters, unlike a Russian one.

I would not be surprised if this detection does *not* introduce a single bug for any combination of encoding and language.

However, I would also not be surprised if this detection is broken for some language-encoding pair (perhaps for non-Western, non-Cyrillic languages).

p5pRT commented 10 years ago

From victor@vsespb.ru

On Thu Aug 29 14:06:57 2013, vsespb wrote:

On Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:

2) As Victor notes, the commit does a UTF-8 validity check, so it is ...

I agree that it's pretty reliable. However, different languages and ...

A generator of byte sequences that are valid in UTF-8 and in another encoding, and which represent letters (\w) in that other encoding:

#!/usr/bin/env perl

use strict;
use warnings;
use Encode;
use utf8;

binmode STDOUT, ":encoding(UTF-8)";

my @A = grep { /\w/ } map { chr($_) } (128..1024);

for my $z1 (@A) {
for my $z2 ('', @A) {
for my $z3 ('', @A) {
    for my $encoding (qw/ISO-8859-1 ISO-8859-2 ISO-8859-3 ISO-8859-4 ISO-8859-7 ISO-8859-8 ISO-8859-9 ISO-8859-10/) {
        my $S = $z1.$z2.$z3;
        my $e = eval { encode($encoding, "$S", Encode::FB_CROAK); };
        next unless $e;
        my $xx = $e;
        $xx =~ s/(.)/sprintf("\\x%02X",ord($1))/eg;
        Encode::_utf8_on($e);
        if (utf8::valid($e)) {
            print "# $encoding [$S]".(length($S))." [$e] [$xx]\n";
            print <<"END";
perl -e 'use Encode; binmode STDOUT, ":encoding(UTF-8)"; my \$z = "$xx"; print "[", decode("UTF-8", "\$z", Encode::FB_CROAK), "]\\t[", decode("$encoding", "\$z", Encode::FB_CROAK), "]\\n"'
END
        }
    }
}}}
__END__

Example output:

perl -e 'use Encode; binmode STDOUT, ":encoding(UTF-8)"; my $z = "\xC3\xBE"; print "[", decode("UTF-8", "$z", Encode::FB_CROAK), "]\t[", decode("ISO-8859-2", "$z", Encode::FB_CROAK), "]\n"'
perl -e 'use Encode; binmode STDOUT, ":encoding(UTF-8)"; my $z = "\xC3\xBC"; print "[", decode("UTF-8", "$z", Encode::FB_CROAK), "]\t[", decode("ISO-8859-2", "$z", Encode::FB_CROAK), "]\n"'
perl -e 'use Encode; binmode STDOUT, ":encoding(UTF-8)"; my $z = "\xC3\xA1"; print "[", decode("UTF-8", "$z", Encode::FB_CROAK), "]\t[", decode("ISO-8859-2", "$z", Encode::FB_CROAK), "]\n"'

Example output of one of those commands:

$ perl -e 'use Encode; binmode STDOUT, ":encoding(UTF-8)"; my $z = "\xC3\xBC"; print "[", decode("UTF-8", "$z", Encode::FB_CROAK), "]\t[", decode("ISO-8859-2", "$z", Encode::FB_CROAK), "]\n"'
[ü] [Ăź]

p5pRT commented 10 years ago

From victor@vsespb.ru

On Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:

have all been superseded so should be rare today, but a bug report could be written on this.

Rare under Linux. AFAIK, FreeBSD 9 (the latest stable) users still have a single-byte encoding locale by default (at least they try hard to get UTF-8 working and it's only partly supported): http://forums.freebsd.org/showthread.php?t=34682

There are real users with non-UTF-8 locales; I saw one. We spent several hours trying to find out why my perl script hangs sometimes, and in the end found a bug in perlio related to non-UTF-8: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=117537

A real application which is broken by this change is 'ack' (ack 1 and ack 2).

Russian error messages are now printed with a warning (under a UTF-8 locale!). French error messages are now corrupted (because they are Latin-1) under a UTF-8 locale too: https://github.com/petdance/ack2/issues/367

p5pRT commented 10 years ago

From @cpansprout

On Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:

On 08/29/2013 02:15 AM, Victor Efimov via RT wrote:

On Wed Aug 28 23:40:08 2013, sprout wrote:

So now, $! may or may not be encoded, and you have no way of telling reliably without doing the same environment checks that perl itself did internally before deciding to decode $! itself.

I don't follow these arguments. What that commit did is only to look at the string returned by the operating system\, and if it is encoded in UTF-8\, to set that flag in the scalar. That's it (*). If the OS didn't return UTF-8\, it leaves the flag alone. I find it hard to comprehend that this isn't the right thing to do. For the first time\, $! in string context is no different than any other string scalar in Perl. They have a utf-8 bit set which means that the encoding is in UTF-8\,

You are still describing this from the point of view of the internals.

From the user's point of view, the utf8 flag does not mean it is encoded in utf8. It means it is *de*coded; just a sequence of Unicode characters.

or they don't have it set\, which means that the encoding is unknown to Perl.

I.e.\, still encoded.

This commit did not change the latter part one iota.

The former is the problem, not the latter. If a program can find out what encoding the OS is using for errno messages, it should be able to apply that encoding to $! via decode($os_encoding, $!, Encode::FB_CROAK). But that fails now when perl thought it saw utf8.
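For illustration, this is roughly the failure being described: once perl has already marked the string as characters, a strict decode of it dies (a sketch with a hard-coded wide string standing in for $!):

    use strict;
    use warnings;
    use Encode ();

    # Stand-in for a $! that perl has already decoded (wide characters present).
    my $already_characters = "\x{41e}\x{442}\x{43a}\x{430}\x{437}\x{430}\x{43d}\x{43e}";

    my $ok = eval { Encode::decode('UTF-8', $already_characters, Encode::FB_CROAK); 1 };
    print $ok ? "decoded\n" : "decode failed: $@";
    # Typically dies with "Cannot decode string with wide characters",
    # the same error quoted earlier in this thread.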

We have conventions as to what the bytes in that scalar mean depending on the context it is used\, the pragmas that are in effect in those contexts\, and the operations that are being performed on it. But they are just conventions. This commit did not change that.

I don’t follow. The bytes inside the scalar are not visible to the Perl program without resorting to introspection that should never be used for dispatch.

Your commit changed the content of the scalar as returned by ord and substr, but only sometimes. It’s the ‘only sometimes’ that is problematic.

What is different about $! is that we have made the decision to respect locale when accessing it even when not in the scope of 'use locale'.

The problem here is that the locale is only sometimes being respected.

In light of these issues\, perhaps this should be discussed again. I'll let the people who argued for that decision to again argue for it.

The change fixed two bug reports for the common case where the locales for messages and the I/O matched and where people had not taken pains to deal with locale. I think that should trump the less frequent cases\, given the conflicts.

But the less frequent cases now require one to introspect internal scalar flags that should make no difference.

Also, is that really more frequent? What about scripts that pass $! straight to STDOUT without layers, knowing that $! is already in the character set the terminal expects?

If code wants $! to be expressed in a certain language\, it should set the locale to that language while accessing $! and then restore the old locale.

Are you suggesting that perl itself start defaulting to the C locale for $!?

Small corrections​:

a) Actually there is a way​: check is_utf8($!) flag (which is not good because is_utf8 marked as danger\, and it's documented you cant distinct characters from bytes with this flag)

I don't see that danger marked currently in the pod for utf8.pm. Where do you see that?

  (Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in UTF-8. Functionally the same as Encode::is_utf8().

I think he is referring to ‘internally’ here, which indicates that you shouldn’t rely on it.

b) Current fix does not do environment checks\, it just tries to do UTF-8 validity check

http​://perl5.git.perl.org/perl.git/commitdiff/1500bd919ffeae0f3252f8d1bb28b03b043d328e

(*) To be precise

1) if the string returned by the OS is entirely ASCII\, it does not set the UTF-8 flag. This is because ASCII UTF-8 and non-UTF-8 are identical\, so the flag is irrelevant. And yes\, this is buggy if operating under a non-ASCII 7-bit locale\, as in ISO 646. These locales have all been superseded so should be rare today\, but a bug report could be written on this.

2) As Victor notes\, the commit does a UTF-8 validity check\, so it is possible that that could give false positives. But as Wikipedia says\, "One of the few cases where charset detection works reliably is detecting UTF-8. This is due to the large percentage of invalid byte sequences in UTF-8\, so that text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test." (The original emphasized "extremely".) I checked this out with the CP1251 character set\, and the only modern Russian character that could be a continuation byte is ё. All other vowels and consonants must be start bytes. That means that to generate a false positive\, an OS message in CP1251 must only contain words whose 2nd\, 4th\, ... bytes are that vowel. That just isn't going to happen\, though the common Russian word Её (her\, hers\, ...) could be confusable if there were no other words in the message.

That is all very nice, but how would you rewrite this code to work in 5.19.2 and up?

if (!open fh, $filename) {
    # add_to_log expects a string of characters, so decode it
    add_to_log($filename, 0+$!, Encode::decode(
        I18N::Langinfo::langinfo(I18N::Langinfo::CODESET()),
        $!
    ));
    return;
}

--

Father Chrysostomos

p5pRT commented 10 years ago

From @khwilliamson

On 08/31/2013 07​:27 AM\, Father Chrysostomos via RT wrote​:

You are still describing this from the point of view of the internals.

I persist in this because I believe your point is a red herring. I believe that it is a valid and strong argument that bringing outlier behavior into conformity with the rest of how Perl operates may very well trump other concerns. I was attempting to show that that is what this commit did.

Rather than address most of the rest of your email, some of which I believe is specious or false, let's cut to the chase:

how would you rewrite this code to work in 5.19.2 and up?

if (!open fh, $filename) {
    # add_to_log expects a string of characters, so decode it
    add_to_log($filename, 0+$!, Encode::decode(
        I18N::Langinfo::langinfo(I18N::Langinfo::CODESET()),
        $!
    ));
    return;
}

I feel compelled to point out that this code is buggy. I18N::Langinfo is not portable to all platforms that Perl runs on, and CODESET gives the locale of LC_CTYPE, which may not be the same locale that $! is returned in: LC_MESSAGES. (Note that the code could be modified to change LC_CTYPE to the locale of LC_MESSAGES temporarily around the langinfo call to address this bug.) Also, some vendors' nl_langinfo() was, at the time, so buggy that the core .t for this doesn't do any "real" testing: http://perl5.git.perl.org/perl.git/blame/HEAD:/ext/I18N-Langinfo/t/Langinfo.t

But on platforms where it works reliably, and in the typical case where LC_CTYPE matches LC_MESSAGES, my commit does break this code. If it were my code here, I'd 'use bytes' (I don't believe bytes.pm should be removed from core; this area is one of the few valid uses for it, and this is not the thread to discuss it), or utf8::is_utf8() (I think we should soften somewhat the admonition against using that).
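Along those lines, one way the add_to_log example quoted above could be adapted for 5.19.2+ might look like this (a sketch only; add_to_log and $filename come from that example, and the guard is the utf8::is_utf8 approach mentioned here):

    if (!open my $fh, '<', $filename) {
        my $errno = 0 + $!;
        my $err   = "$!";

        # Only decode when perl has not already flagged the string as characters.
        $err = Encode::decode(
            I18N::Langinfo::langinfo(I18N::Langinfo::CODESET()),
            $err,
        ) unless utf8::is_utf8($err);

        add_to_log($filename, $errno, $err);
        return;
    }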

I think all of us would agree that deference should be paid to (apparently) working code when making changes. And it may be that this commit is so egregious\, or not really helpful in enough places that its cost benefit ratio is not high enough to keep.

And $! remains an outlier in the sense that it is AFAIK\, and I've looked hard (perhaps not hard enough)\, now the only place (except for some POSIX​:: routines) where the program's underlying current locale leaks outside the scope of 'use locale'. The main argument that I've heard for doing that is that $! is often for the end-user and not the programmer. But it isn't for the end user if what gets displayed is gibberish\, which includes being in some language the user doesn't know\, though the latter is better than garbage bytes. So what I'm advocating is re-examining whether we wish $! to respect 'use locale' or not. If we chose to respect 'use locale'\, outside that\, it would return messages in the system default locale\, typically "C".

I'm pretty confident that the problem can't be solved so that no code has to change and things just start working correctly for everybody. Currently\, using $! in production code that can be operated by users who might have their own locales is much more complicated than people imagine. "die $!" could print gibberish. Maybe a partial answer is to create a wrapper that does the best it can on the platform it is running on\, and suggest people change to use it.

If this commit is reverted\, we do need to decide how we will address the bugs it fixed and the new ones that are sure to come in (barring some better answer). Do we reject them and say you need to handle $! yourself?

p5pRT commented 10 years ago

From victor@vsespb.ru

2013/9/1 Karl Williamson <public@khwilliamson.com>:

And $! remains an outlier in the sense that it is AFAIK, and I've looked hard (perhaps not hard enough), now the only place (except for some POSIX:: routines) where the program's underlying current locale leaks outside the scope of 'use locale'.

But that is not the only place where non-ASCII characters can appear.

The following is documented in perlunicode:

"While Perl does have extensive ways to input and output in Unicode, and a few other "entry points" like the @ARGV array (which can sometimes be interpreted as UTF-8), there are still many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both, but it is not."

I believe that note can mean that an encoded $! is not a bug, but a feature. If it's considered a bug, then all other places where non-ASCII appears encoded, and it's not explicitly documented, can be considered bugs (examples are $0, %INC values, @INC, anything else?).

Thus it's impossible for people to rely on those variables now, as they may change anytime in the future.

p5pRT commented 10 years ago

From @Leont

On Sun, Sep 1, 2013 at 4:36 PM, Victor Efimov <victor@vsespb.ru> wrote:

$! is inherently a piece of text, not a piece of binary data. As such, it makes perfect sense to treat it as such and automatically decode it. The same is not necessarily true for your other examples.

Leon

p5pRT commented 10 years ago

From @Leont

On Sun, Sep 1, 2013 at 6:36 AM, Karl Williamson <public@khwilliamson.com> wrote:

And $! remains an outlier in the sense that it is AFAIK\, and I've looked hard (perhaps not hard enough)\, now the only place (except for some POSIX​:: routines) where the program's underlying current locale leaks outside the scope of 'use locale'.

Yeah, in POSIX, strftime and the is* functions are also affected.

The main argument that I've heard for doing that is that $! is often for the end-user and not the programmer. But it isn't for the end user if what gets displayed is gibberish\, which includes being in some language the user doesn't know\, though the latter is better than garbage bytes. So what I'm advocating is re-examining whether we wish $! to respect 'use locale' or not. If we chose to respect 'use locale'\, outside that\, it would return messages in the system default locale\, typically "C".

That does sound like consistency to me.

I'm pretty confident that the problem can't be solved so that no code has to change and things just start working correctly for everybody.

That is my feeling too. The new situation feels rather unfinished to me, but the old situation was clearly not the most useful behavior we can offer.

Currently, using $! in production code that can be operated by users who might have their own locales is much more complicated than people imagine. "die $!" could print gibberish.

Indeed.

Leon

p5pRT commented 10 years ago

From @khwilliamson

On 08/31/2013 10​:36 PM\, Karl Williamson wrote​:

I feel compelled to point out that this code is buggy. I18N​::Langinfo is not portable to all platforms that Perl runs on\, and CODESET gives the locale of LC_CTYPE\, which may not be the same locale that $! is returned in​: LC_MESSAGES. (Note that the code could be modified to change LC_CTYPE to the locale of LC_MESSAGES temporarily around the langinfo call to addess this bug.) Also\, some vendors' nl_langinfo() was\, at the time\, so buggy that the core .t for this doesn't do any "real" testing. http​://perl5.git.perl.org/perl.git/blame/HEAD​:/ext/I18N-Langinfo/t/Langinfo.

I now feel compelled to point out that I should have been clearer that this code is fine, not buggy, if used in the environment for which it was likely designed. On a platform with a working nl_langinfo(), where the programmer knows that LC_MESSAGES and LC_CTYPE are always in sync, this worked well, until I broke it.

p5pRT commented 10 years ago

From victor@vsespb.ru

2013/9/1 Leon Timmermans <fawaka@gmail.com>:

$! is inherently a piece of text, not a piece of binary data. As such, it makes perfect sense to treat it as such and automatically decode it. The same is not necessarily true for your other examples.

BTW, it is interesting that $^E is not affected by this change; i.e., when $! is the same as $^E (I tested on Linux only), $^E does not have the UTF-8 flag, while $! does.

p5pRT commented 10 years ago

From @khwilliamson

On 09/01/2013 10:47 AM, Victor Efimov wrote:

2013/9/1 Leon Timmermans <fawaka@gmail.com>:

$! is inherently a piece of text, not a piece of binary data. As such, it makes perfect sense to treat it as such and automatically decode it. The same is not necessarily true for your other examples.

BTW, it is interesting that $^E is not affected by this change; i.e., when $! is the same as $^E (I tested on Linux only), $^E does not have the UTF-8 flag, while $! does.

I've been wondering myself what should happen with $^E\, and I believe the two should be made consistent.

Some other thoughts I've had about this issue.

The commit did not break the ISO 646 7-bit codings\, as the behavior is unchanged for those.

Those encodings must not be very important nor have been for quite some time\, as it does not appear that Encode supports them.

We could have a feature automatically turned on in v5.20. I'll call it 'errno' for now ('mauve' having been taken ;) ).

Without it being on, $! works as it did in <= v5.18.

Within its scope Perl attempts to decode $! as best it can\, autoloading Encode and trying to determine the locale using nl_langinfo() if available.

This may be a crazy idea; but I thought I'd put it out there to stimulate discussion

p5pRT commented 10 years ago

From zefram@fysh.org

Karl Williamson wrote​:

Within its scope Perl attempts to decode $! as best it can\,

Scoping doesn't work well for this sort of thing. The decoding happens in get magic\, when the variable is being read. If that behaviour is affected by the lexical scope in which the reading happens\, this means that different readers will see different values in the same variable\, which is awfully confusing if a reference to the variable gets passed around. Worse\, XS code gets the behaviour of *its caller's* lexical scope.

Amusingly\, $[ used to influence $#foo magic variables in this manner. It's one of the reasons I'm glad we got rid of $[.

-zefram

p5pRT commented 10 years ago

From victor@vsespb.ru

One problem with lexical scope is also POSIX::strerror. Currently it's implemented using:

    strerror => 'errno => local $! = $_[0]; "$!"',

thus it would have to be fixed too if we implement lexical featurization.

p5pRT commented 10 years ago

From @cpansprout

On Sun Sep 01 09​:24​:19 2013\, public@​khwilliamson.com wrote​:

I now feel compelled to point out that I should have been more clear that this code is fine\, not buggy\, if used in the environment in which it was likely designed for. On a platform with a working nl_langinfo() and the programmer knows that LC_MESSAGES and LC_CTYPE are always in sync\, this worked well\, until I broke it.

More importantly, as Victor pointed out, it breaks programs that are not trying to do anything with character sets or locales, such as ack, when they are running on a utf8 terminal in a utf8 locale. I dare say those are the most common.

On dromedary I get this with the system perl (5.12.3):

$ LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!'
Nincs ilyen fájl vagy könyvtár at -e line 1.

When I build my own (blead)perl, I get this:

$ LC_ALL=hu_HU.utf8 ./perl -e 'open "oentuheon" or die $!'
Nincs ilyen f?jl vagy k?nyvt?r at -e line 1.

--

Father Chrysostomos

p5pRT commented 10 years ago

From @cpansprout

On Sun Sep 01 11​:09​:41 2013\, sprout wrote​:

More importantly\, as Victor pointed out\, it breaks programs that are not trying to do anything with character sets or locales\, such as ack\, when they are running on a utf8 terminal in a utf8 locale. I dare to bet those are the most common.

On dromedary I get this with the system perl (5.12.3):

$ LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!'
Nincs ilyen f�jl vagy k�nyvt�r at -e line 1.

RT screwed it up. That appeared perfectly fine.

When I build my own (blead)perl, I get this:

$ LC_ALL=hu_HU.utf8 ./perl -e 'open "oentuheon" or die $!'
Nincs ilyen f?jl vagy k?nyvt?r at -e line 1.

That is as it appeared\, with question marks.

--

Father Chrysostomos

p5pRT commented 10 years ago

From @cpansprout

On Sun Sep 01 10​:46​:13 2013\, zefram@​fysh.org wrote​:

Karl Williamson wrote​:

Within its scope Perl attempts to decode $! as best it can\,

Scoping doesn't work well for this sort of thing. The decoding happens in get magic\, when the variable is being read. If that behaviour is affected by the lexical scope in which the reading happens\, this means that different readers will see different values in the same variable\, which is awfully confusing if a reference to the variable gets passed around.

A new global variable is another option.

--

Father Chrysostomos

p5pRT commented 10 years ago

From victor@vsespb.ru

2013/9/1 Father Chrysostomos via RT <perlbug-followup@perl.org>:

A new global variable is another option.

perhaps ${^DECODED_ERROR} ?

p5pRT commented 10 years ago

From @khwilliamson

On 09/02/2013 05:10 PM, Victor Efimov wrote:

2013/9/1 Father Chrysostomos via RT <perlbug-followup@perl.org>:

A new global variable is another option.

perhaps ${^DECODED_ERROR} ?

I have come to believe that this is probably the best way forward. That is\, revert the $! change\, and tell people who need it to use the new global variable which will decode as best it can on the given platform based on the locale in effect.

In any event\, there should be uniform treatment of $! and $^E. That means that a parallel variable should be provided for $^E.

Does anyone know if the strings for the platforms that have separate $^E strings return those in the current locale or not?

These include vms\, win32\, dos\, and os/2.

p5pRT commented 10 years ago

From victor@vsespb.ru

Also, a Unicode version of POSIX::strerror should probably be implemented.
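One possible shape for such a wrapper (purely a sketch: the name strerror_decoded is invented, and it assumes nl_langinfo(CODESET) is usable on the platform):

    use strict;
    use warnings;
    use POSIX ();
    use Encode ();
    use I18N::Langinfo qw(langinfo CODESET);

    # Return the message for an errno value as a decoded character string.
    sub strerror_decoded {
        my ($errno) = @_;
        my $msg = POSIX::strerror($errno);
        return utf8::is_utf8($msg)     # already decoded by perl?
            ? $msg
            : Encode::decode(langinfo(CODESET()), $msg);
    }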

2013/9/10 Karl Williamson <public@khwilliamson.com>:

In any event, there should be uniform treatment of $! and $^E. That means that a parallel variable should be provided for $^E.

p5pRT commented 10 years ago

From victor@vsespb.ru

On Win32 (Strawberry Perl):

$ chcp
Текущая кодовая страница: 866

$ perl -MEncode -e "binmode STDOUT, ':encoding(CP866)'; open my $f, '<', 'notafile' or print decode('WINDOWS-1251', qq{error is: [$^E]})"
error is: [Не удается найти указанный файл]

(The first command outputs the codepage; the last prints a sane Russian error message.)

2013/9/10 Karl Williamson <public@khwilliamson.com>:

Does anyone know if the strings for the platforms that have separate $^E strings return those in the current locale or not?

These include vms, win32, dos, and os/2.

p5pRT commented 10 years ago

From @khwilliamson

On 09/09/2013 07​:06 PM\, Karl Williamson wrote​:

I have come to believe that this is probably the best way forward. That is\, revert the $! change\, and tell people who need it to use the new global variable which will decode as best it can on the given platform based on the locale in effect.

In looking at this, I thought of something else. I do believe that the current behavior is correct for such a variable within the lexical scope of "use locale". But outside such scope the behavior would be to decode fully, as best as practicable on the platform being run on.

Then it occurred to me: would merely changing $! (and $^E) to behave this way address your issues? It is a change in behavior from the way things have always been, but outside "use locale" it would fully decode, which someone in the thread said was the issue with the current fix.

p5pRT commented 10 years ago

From victor@vsespb.ru

So, you propose: 1. in the scope of 'use locale', implement the old behaviour (5.18 and earlier); 2. outside of that scope, "decode fully, as best as practicable on the platform being run on".

I don't think this solves the problem.

Existing programs will still break (and to fix them you'll need to add 'use locale', which can introduce other bugs to the program).

Existing programs might work with modern Unicode and, AFAIK, adding 'use locale' is just not recommended for this case. The fact that they use $! does not necessarily mean it's legacy code which doesn't work with Unicode. It can be brand new code written for 5.18.

Also, @Zefram mentioned here https://rt-archive.perl.org/perl5/Ticket/Display.html?id=119499#txn-1250019 that lexical scope for such things isn't a good idea.

decode fully, as best as practicable on the platform being run on

A function which sometimes returns a character string with the UTF-8 bit set, and sometimes returns a byte string in an unknown encoding, is useless IMHO. So if you decode $!, decoding should always be done. If decoding fails, IMHO it is better to return undef or something.


p5pRT commented 10 years ago

From @cpansprout

On Mon Sep 16 09:05:17 2013, public@khwilliamson.com wrote:


Then it occurred to me: would merely changing $! (and $^E) to behave this way address your issues? It is a change in behavior from the way things have always been, but outside "use locale" it would fully decode, which is what someone in the thread said was the issue with the current fix.

I was the one who implied that. What I meant was that, if decoding happens unconditionally, at least one can check the Perl version to determine how to handle $!. It is still backward-incompatible. I was then going to suggest lexically scoping the new behaviour, but Zefram has already pointed out why that is not a good idea. A new global variable is the best choice at this point.

--

Father Chrysostomos

p5pRT commented 10 years ago

From @khwilliamson

On 09/20/2013 09:11 PM, Father Chrysostomos via RT wrote:


I was the one who implied that. What I meant was that, if decoding happens unconditionally, at least one can check the Perl version to determine how to handle $!. It is still backward-incompatible. I was then going to suggest lexically scoping the new behaviour, but Zefram has already pointed out why that is not a good idea. A new global variable is the best choice at this point.

tl;dr

0) A brief overview of how locales work with Perl is presented.
1) $! used to work as if it were always in the scope of both 'use locale' and 'use bytes'.
2) The blamed commit removed the 'use bytes' component, breaking code that relied on that and fixing some code that didn't.
3) Many people think that 'use bytes' should be outlawed. Thus we should take a good hard look before reverting the commit and restoring 'use bytes' behavior.
4) $! now acts (with regard to encoding) as any other scalar does within the scope of 'use locale'. My proposal is to leave it that way when in that scope. Thus, it doesn't become an outlier that has to be treated specially.
5) Outside such scope: on systems that have nl_langinfo(), $! would automatically be decoded to UTF-8; otherwise to English (C locale), which the end user could Google-translate if necessary.
6) An objection has been raised that this creates problems when references to $! are passed, and in XS code where it gets its caller's scope. But this is no different than any variable that deals with locales.
7) An alternative is to revert this commit (bringing back 'use bytes' behavior) and to create a new variable that always fully decodes. But that doesn't help code that is in 'use locale'. There would be no variable that gives correct behavior for that situation (the behavior of the current commit is that correct behavior). Perhaps another new variable would be created that does what the current commit does, regardless of scope, making 3 variables. $^E also has this problem, and should have the same solution applied to it as we do to $!. That would mean 4 new variables would have to be created, making 6 variables. That seems overly ugly and confusing.

===================================

I'd like to start with a brief refresher on Perl and locales. Every C program always is running in a particular locale. Absent a setlocale() to the contrary, that locale is the "C" locale, which gives the behavior described in K&R. But a setlocale() call to something else will cause many libc functions to behave differently. Under those, theoretically: 1) any particular byte in a string could mean nearly any character (or portion of a character); 2) the language for the text of $! could be anything; 3) etc. There can be single-byte locales, wide character (U16 or U32 usually) locales, and varying character length locales (which UTF-8 is). Perl has never officially supported anything other than single-byte locales. In practice, almost all published locales have every ASCII-range code point mean the corresponding ASCII character, hence differing only in non-ASCII bytes. Perl avoids assuming this ASCII correspondence pretty much as best it can.

One of the first things that Perl does when it starts up (with a minor exception for embedded Perl, added in the 5.19 series) is to call setlocale(), thus causing the libc functions to change behavior. The locale that is set is determined from the caller's environment, typically using the LANG or other environment variables. Increasingly, on Linux systems anyway, this is some UTF-8 locale.

But Perl isn't supposed to expose the underlying locale outside the scope of 'use locale'. Various patches in the 5.19 series have fixed all known such leaks except for various POSIX:: functions where it doesn't make sense to hide, and $!. The rationale for the latter is that $! is for the user of the program, not the programmer, and so should be output in the user's language, as gleaned from his/her locale.

What happens if a string scalar is in some locale, and a code point that requires UTF-8 is added to it? The answer is that this is generally not a good idea to do, but Perl copes by converting the scalar to UTF-8, with the code points below 256 assumed to be what they mean in the (single-byte) locale, even if they require 2 UTF-8 bytes to represent. This means that operations that cross the 255/256 boundary in a UTF-8 locale are undefined. For example, the uppercase of \xFF is \x{178} normally (as in Unicode they are the SMALL and CAPITAL y with diaeresis respectively), but within the scope of 'use locale' uc("\xFF") remains \xFF, because we don't know what character \xFF really represents in that locale. In just the ISO-8859 series of locales, it can be U+FF, or U+040F, U+0138, U+2019, or unassigned. (Note that if we knew that a locale is UTF-8, we would know what \xFF really is, and so could treat things just like non-locale Perl does.)
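
A tiny self-contained illustration of that 255/256 boundary (the result inside 'use locale' depends, as just explained, on the locale the program happens to run under):

use feature 'unicode_strings';             # give \xFF full Unicode semantics outside "use locale"

my $uni = uc("\xFF");                      # "\x{178}", LATIN CAPITAL LETTER Y WITH DIAERESIS
my $loc = do { use locale; uc("\xFF") };   # typically stays "\xFF": perl cannot know what
                                           # 0xFF represents in the current locale
printf "unicode: U+%04X   locale: U+%04X\n", ord $uni, ord $loc;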

That the meaning of characters is context dependent means that, when using locale, it generally is not a good idea to pass references to variables. Correct me if I'm wrong, but I believe this means that XS code gets its caller's lexical scope with regard to this.

Until the commit that generated this ticket, $! returned the bytes that comprise the message regardless of whether the message was in UTF-8 or not. Thus it behaved as if it were in the scope of both 'use locale' and 'use bytes'. What the commit effectively did was to remove the 'use bytes' behavior, causing $! to behave as any other string scalar does under 'use locale'. Many people on this list think that we should get rid of 'use bytes'; that its behavior is never desired. (I'm not one of them BTW, but I think it should be used only very rarely.) Thus, on the face of it, it is suspect that $! should behave as if it is in 'use bytes', and I'm having a hard time grokking the argument that we should revert back to that.

To clarify my proposal (since Victor misunderstood it), I propose, within 'use locale' scope, leaving the behavior as the commit changed it to. $! now behaves as other variables in such scope behave; it no longer is an outlier that has to be treated specially. Outside that scope, I propose to fully decode $! into Perl's internal coding (essentially UTF-8). The latter would automatically load the needed modules. If the system did not have nl_langinfo(), I now think that the best thing to do is to output the message in the C locale, yielding it in English, which the user could machine translate. We are not going to return undef, as Victor suggested, as that would be throwing away potentially crucial information.

As I mentioned above, it's not a good idea to pass references to locale-encoded variables. I don't see how $! is different from other locale variables in its orneriness. It just comes with the territory.

The idea of reverting this commit and having another global variable that does the full decoding harms code within 'use locale' scope. Instead of this variable being a typical scalar there, it becomes an outlier, which has to have special treatment. We could add a third variable which behaves as the current commit now does to accommodate such code. This is getting unwieldy. Whatever behavior we decide to do has to also be applied to $^E. Now we would then have 6 variables instead of 2.

I think my proposal is the least bad of those presented so far.

p5pRT commented 10 years ago

From @craigberry

On Wed, Oct 9, 2013 at 8:46 PM, Karl Williamson <public@khwilliamson.com> wrote:

Whatever behavior we decide to do has to also be applied to $^E.

Probably not, actually. perlvar.pod says that $^E is "Error information specific to the current operating system." As I've indicated to you off-list, how you get different languages for system messages on any given operating system is as likely as not going to be completely orthogonal to the notion of a POSIX locale.

p5pRT commented 10 years ago

From victor@vsespb.ru

And how should one fix the code below (both example1 and example2) so that it works the same way in 5.18 and 5.20?

===== example1.pl
use strict;
use warnings;
use Encode;
my %Config = ( default_locale_encoding => 'UTF-8' ); # user supplied
my $locale_encoding = eval {
  require I18N::Langinfo;
  my $enc = I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
  defined (find_encoding($enc)) ? $enc : undef;
};

$locale_encoding ||= $Config{default_locale_encoding};
binmode STDERR, ":encoding($locale_encoding)";

open (my $f, "<", "not_a_file") or do {
  die decode($locale_encoding, "$!", Encode::DIE_ON_ERR|Encode::LEAVE_SRC);
};

$ perl example1.pl
No such file or directory at example1.pl line 15.

$ LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl example1.pl
Нет такого файла или каталога at example1.pl line 15.

===== example2.pl
use strict;
use warnings;
open (my $f, "<", "not_a_file") or do {
  die "$!";
};

$ perl example2.pl
No such file or directory at example2.pl line 4.

$ LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl example2.pl
Нет такого файла или каталога at example2.pl line 4.


p5pRT commented 10 years ago

From @khwilliamson

On 10/10/2013 06:12 AM, Victor Efimov via RT wrote:

And how should one fix the code below (both example1 and example2) so that it works the same way in 5.18 and 5.20?


What you want is for $! to work like it's in 'use bytes'. I can change the patch so that it checks for 'use bytes' and, if within that scope, returns the value without the UTF8 flag set. You would then just need to add a 'use bytes' to get it to work the same way it always has.

There are people who would disapprove of ever using bytes, which means they think the behavior you want is wrong. I'm not one of them. I think that 'use bytes' should be rare, mostly used in testing, but it sometimes is the easiest, clearest way of getting at the bytes that comprise a UTF-8-encoded character. utf8::encode() can be used for that, but it destroys its argument, and I think its name is much less clear than 'use bytes'.

I have tested doing this\, and it works.
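
Under that proposal, example1.pl above could opt back into the historical byte-oriented $! roughly like this (a sketch of the behaviour being discussed, reusing example1.pl's $locale_encoding and Encode imports):

open (my $f, "<", "not_a_file") or do {
  # proposed: inside "use bytes", "$!" keeps the old form -- the raw bytes from
  # strerror() with no UTF8 flag -- so it is safe to decode it explicitly
  my $raw = do { use bytes; "$!" };
  die decode($locale_encoding, $raw, Encode::DIE_ON_ERR|Encode::LEAVE_SRC);
};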


p5pRT commented 10 years ago

From victor@vsespb.ru


The new behaviour looks sane to me. It's probably the way it was supposed to work from the beginning. The main problem (that $! sometimes returned characters and sometimes bytes) is solved.

There were comments that enabling the new behaviour in a lexical scope is not good and is dangerous (but you stated that it's probably OK). We enabled it by default, and users can now switch to the *old* behaviour in a *lexical* scope (with use bytes or use locale). I think the arguments that lexical scope is not good can apply here too.

The big problem that I see now is backward compatibility. Any existing code that uses $! is probably broken.

Users will have to fix it with use locale/use bytes.

A few examples that I found (where filenames are concatenated with $!):

====
File::Temp
  unless ($!{EEXIST}) {
      ${$options{ErrStr}} = "Could not create temp file $path: $!";
      return ();
  }

File::Find
  unless (defined $topnlink) {
      warnings::warnif "Can't stat $top_item: $!\n";
      next Proc_Top_Item;
  }

LWP::UserAgent
  my @stat = stat($tmpfile) or die "Could not stat tmpfile '$tmpfile': $!";
  ... or die "Cannot rename '$tmpfile' to '$file': $!\n";

====

Note that if the filename here contains non-ASCII characters and is a binary (byte) string, merging it with the character string $! produces a broken result.

Even if the filename is ASCII, it would break the old behaviour when the die exception is printed to STDERR.

If the filename is a character string, that code did not work correctly previously either.
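
The breakage can be shown without involving $! at all: concatenating an undecoded UTF-8 byte string (a filename, say) with a UTF-8-flagged character string upgrades the bytes as if they were Latin-1, and the filename comes out double-encoded once printed (illustrative strings only):

binmode STDOUT, ":encoding(UTF-8)";

my $filename = "\xD1\x84\xD0\xB0\xD0\xB9\xD0\xBB";   # "файл" as raw UTF-8 bytes
my $error    = "\x{41D}\x{435}\x{442} \x{434}\x{43E}\x{441}\x{442}\x{443}\x{43F}\x{430}";   # "Нет доступа" as characters

print "Can't open $filename: $error\n";   # the filename half is mojibake, the $error half is fine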

Another issue is that there is POSIX::strerror, and IMHO it should behave just like $! for consistency (i.e. produce different things depending on lexical scope). POSIX::strerror is pure Perl.
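
To make the consistency point concrete: a strerror built on $! (one plausible pure-Perl formulation, shown only as an illustration, not the actual POSIX.pm source) necessarily inherits whatever stringification rules $! follows in the surrounding scope:

use POSIX qw(EACCES);

sub my_strerror {
    my ($errno) = @_;
    local $! = $errno;   # stringifying $! yields the message for that errno,
    return "$!";         # in whatever form $! produces it here
}

print my_strerror(EACCES()), "\n";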

p5pRT commented 10 years ago

From @cpansprout

On Wed Oct 09 18:46:47 2013, public@khwilliamson.com wrote:

Until the commit that generated this ticket, $! returned the bytes that comprise the message regardless of whether the message was in UTF-8 or not. Thus it behaved as if it were in the scope of both 'use locale' and 'use bytes'. What the commit effectively did was to remove the 'use bytes' behavior, causing $! to behave as any other string scalar does under 'use locale'. Many people on this list think that we should get rid of 'use bytes'; that its behavior is never desired. (I'm not one of them BTW, but I think it should be used only very rarely.) Thus, on the face of it, it is suspect that $! should behave as if it is in 'use bytes', and I'm having a hard time grokking the argument that we should revert back to that.

The problem with the bytes pragma is that two scalars may compare equal ($a eq $b) outside its scope, but be different ($a ne $b) within its scope. It changes the contents of scalars, but only some scalars.

$! does not do that. In fact, it is more akin to the default input and output streams, which do not do any automatic decoding or encoding until one asks for it.
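
A short demonstration of that eq point, using only the core utf8 functions:

my $x = "caf\xE9";          # character string, stored internally as latin-1
my $y = $x;
utf8::upgrade($y);          # same characters, stored internally as UTF-8

print $x eq $y ? "equal\n" : "different\n";       # "equal"     -- normal string semantics
{
    use bytes;
    print $x eq $y ? "equal\n" : "different\n";   # "different" -- compares the internal bytes
}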

I don’t have enough room in my brain to fit all the issues that are currently going on, so I can’t really comment on what makes sense under ‘use locale’. But I would ask that you consider things at a more practical level.

Simple programs like ack that do not take encodings into account should work without any change. The one-liner that I posted is still broken in bleadperl. Try running

LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!'

on dromedary with and without bleadperl. That type of code should continue to work, regardless of what we come up with.

Maybe what you are really after is a *function* that returns a decoded $!.

--

Father Chrysostomos

p5pRT commented 10 years ago

From @mauke

On 22.10.2013 22:48, Father Chrysostomos via RT wrote:

Simple programs like ack that do not take encodings into account should work without any change. The one-liner that I posted is still broken in bleadperl. Try running

LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!'

on dromedary with and without bleadperl. That type of code should continue to work, regardless of what we come up with.

With or without PERL_UNICODE=SL?

Because that's on by default in my environment.

Maybe what you are really after is a *function* that returns a decoded $!.

Doesn't interpolate nicely in error messages.

-- Lukas Mai <plokinom@gmail.com>

p5pRT commented 10 years ago

From @cpansprout

On Tue Oct 22 14:41:09 2013, plokinom@gmail.com wrote:


With or without PERL_UNICODE=SL?

Because that's on by default in my environment.

All I can say is, ouch! I have always found use of PERL_UNICODE to be suspicious. The problem with PERL_UNICODE is that it enforces things on a program that might have its own STDIN/STDERR handling.

--

Father Chrysostomos

p5pRT commented 10 years ago

From @cpansprout

On Tue Oct 22 14:46:12 2013, sprout wrote:


All I can say is, ouch! I have always found use of PERL_UNICODE to be suspicious.

I think I meant suspect, or whatever.

The problem with PERL_UNICODE is that it enforces things on a program that might have its own STDIN/STDERR handling.

In particular, PERL_UNICODE=SL breaks any simple Perl implementation of cat.
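
For reference, the "simple cat" in question is nothing more than the following (a sketch of the failure mode, not something tested for this ticket). PERL_UNICODE=SL is equivalent to the -CSL switch, which pushes UTF-8 layers onto STDIN/STDOUT/STDERR whenever the locale is a UTF-8 one, so the byte-for-byte assumption below stops holding for input that isn't well-formed UTF-8:

#!/usr/bin/perl
# cat.pl -- copy STDIN to STDOUT byte for byte ... unless something (such as
# PERL_UNICODE=SL under a UTF-8 locale) has pushed a :utf8 layer onto the
# handles behind our back, in which case malformed input is no longer passed
# through untouched
while (my $chunk = <STDIN>) {
    print $chunk;
}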

--

Father Chrysostomos

p5pRT commented 10 years ago

From victor@vsespb.ru

2013/10/23 Lukas Mai <plokinom@gmail.com>


With or without PERL_UNICODE=SL?

Because that's on by default in my environment.

I think things like 'ack' won't work this way. They also read data from @ARGV and config files, and they work with the filesystem's filenames. Actually, use of PERL_UNICODE=SL is pretty limited, IMHO.

p5pRT commented 10 years ago

From @mauke

On 22.10.2013 23:52, Father Chrysostomos via RT wrote:


In particular, PERL_UNICODE=SL breaks any simple Perl implementation of cat.

Isn't such a "simple" implementation already broken on systems like Windows?

-- Lukas Mai <plokinom@gmail.com>

p5pRT commented 10 years ago

From @khwilliamson

I have now pushed a series of patches that make the handling of this uniform for $^E and $! on Win32 and OS/2. That means changing a single place will automatically propagate to all areas, once we decide what that is. I hope to soon have some time to look further.

p5pRT commented 10 years ago

From @tonycoz


This is a 5.20 blocker.

Did you have time to look further?

Though I'll admit the conversation has gone back and forth so much I'm not sure what remaining issues there are.

Tony

p5pRT commented 10 years ago

From @khwilliamson


This is correctly listed as a blocker. I have thought further about this, but am not ready to pursue it; I am trying to get all the user-visible changes in before I finish up my research on this.

p5pRT commented 10 years ago

From @khwilliamson

This is my attempt to bring some clarity to this issue and stake out my position regarding it. I haven't re-read the thread thoroughly just now, so I may miss some issues, but I have been very aware of the central problem regarding this for months now, and have been thinking about it for the same amount of time, so I believe that what follows is an adequate summary of that.

First the background. This ticket is about a commit that fixed two tickets with the same underlying cause, https://rt-archive.perl.org/perl5/Ticket/Display.html?id=112208 "printing $! when open.pm sets utf8 default on filehandles yields garbage", and #117429, merged with the earlier ticket.

The problem is that $! was returning UTF-8 encoded text, but the UTF-8 flag was not set, so it displayed as garbage.

The fix was simply to set the UTF-8 flag if the text is valid UTF-8.
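
At the Perl level the effect of that fix can be mimicked with core utf8::decode(), which marks a string as characters only when its bytes are well-formed UTF-8 and leaves it untouched otherwise (an illustration of the idea only; the real fix lives in the C code that creates $!):

my $msg = "$!";        # before the fix: always the raw bytes from strerror()
utf8::decode($msg);    # sets the UTF8 flag only if those bytes are valid UTF-8;
                       # otherwise $msg is left as it was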

The problem with this is that it breaks code that just outputs the returned bytes to a filehandle that doesn't have the utf8 default on. Now what gets displayed there looks like garbage.

If the broken code uses $! within the scope of the hated-by-some 'use bytes', then the UTF-8 flag doesn't get set, and the return is precisely what it used to be.

Thus a potential solution is to force such code to change to do a 'use bytes'. FC is concerned that programs like ack will have to change if we choose this scenario.

Otherwise we are in a quandary. If we revert the commit, code that "does the right thing" by setting its filehandle appropriately gets garbage; whereas if we keep it, code that is unprepared to handle UTF-8 can get garbage. There's probably far more of the latter than the former, but do we wish to punish code that DTRT?

Before proceeding, I want to make an assertion: I think that it is better for someone to get output in a language foreign to them than it is to get garbage bytes. This is because they can put the former into something like Google Translate to get a reasonable translation back into their own language; and I believe it is much harder to figure out what was intended from what appears to be garbage bytes.

Do you accept or reject this assertion?

If you don't accept it, then you need to persuade me and the others who do accept it why not, and there's not much point in you reading the rest of this message.

If you do accept it, one solution is to always output $! in English, which we would do by always using the C locale when generating its text.

This could be relaxed by using the POSIX locale instead. On most platforms this will be identical to the C locale, but on VMS, at least, it can include Western European languages as well, though I think that VMS only returns $! in English.
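
Expressed in user-level Perl, "always using the C locale when generating its text" amounts to something like the following (a sketch only; LC_MESSAGES may not exist on every platform, and the real change would live inside the interpreter, not in a helper function):

use POSIX qw(setlocale LC_MESSAGES strerror);

sub english_errno_message {
    my ($errno) = @_;
    my $saved = setlocale(LC_MESSAGES);   # remember the user's message locale
    setlocale(LC_MESSAGES, 'C');          # C locale => English text from strerror()
    my $msg = strerror($errno);
    setlocale(LC_MESSAGES, $saved);       # put the user's locale back
    return $msg;
}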

A more general solution would be to output it in the native locale unless it is UTF-8 encoded, in which case it would be converted to English. This would then cause code like (apparently) ack to see no change in behavior, except that some errors would now come out in English; and the code that was affected by #119499 would get English instead of garbage.

The reason that this issue comes up for programs that don't handle UTF-8 is because $! does not respect 'use locale'. The reason for this is that $! typically gives the user an OS error that is outside Perl's purview, and it's best that these messages be displayed in the user's preferred language. But since what we have now causes garbage to be displayed for one class of user, it seems to me to be a higher priority, given my assertion, to output something sane for everybody, rather than something ideal for some and garbage for others.

That leads to yet another possibility, one that rjbs has previously vetoed, but which I'm bringing up again here alongside this background that he may not have considered: and that is to have $! respect 'use locale'. Outside of 'use locale' it would be C or POSIX, which would mean English. Within the scope of 'use locale', it would be the user's language. Programs that do a 'use locale' can be assumed to be written to be able to handle locales, including the increasingly common UTF-8 ones.

It seems to me wrong to deliver $! locale-encoded to programs that aren't prepared to accept it. We may have gotten away with this for non-UTF-8 locales because most code won't try to parse the stringified $! (and it's probably foolish to try to parse it), but the UTF-8 flag throws a wrench into this uneasy truce.

To state my position explicitly: I don't think it's a good idea to return a UTF-8 encoded string to code that isn't expecting that possibility. And I don't think it's OK to have users see garbage bytes. To avoid doing these, we have to return English whenever that could happen. 'use locale' in code should be enough to signal it's prepared to handle UTF-8; otherwise it's buggy.

So still another possibility is to deliver $! in the current locale if $! isn't UTF-8; otherwise to use English outside of 'use locale' and the UTF-8 inside. That leaves code that sets things up properly getting $! returned in the user's language; and code that doesn't will also get the user's language, unless what is returned would be in UTF-8, in which case it will come out in English instead of garbage. This seems to me to be the best solution.

Another possibility, suggested by FC, is to leave $! as-is but create a new variable that behaves differently. I think it's far better to get $! to work reasonably than to come up with an alternative variable.