Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

$! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+ #13208

Closed p5pRT closed 10 years ago

p5pRT commented 10 years ago

Migrated from rt.perl.org#119499 (status was 'resolved')

Searchable as RT119499$

p5pRT commented 10 years ago

From victor@vsespb.ru

2014-03-02 9:43 GMT+04:00 Karl Williamson <public@khwilliamson.com>:

Before proceeding, I want to make an assertion: I think that it is better for someone to get output in a language foreign to them, than it is to get garbage bytes. This is because they can put the former into something like Google Translate to get a reasonable translation back into their own language; and I believe that it is much harder to figure out what was intended from what appears to be garbage bytes.

Do you accept or reject this assertion?

Of course English is better than garbage. BUT this is correct only for "broken" programs: "It's better if a broken program outputs English than garbage." You want new programs (which follow the Perl documentation and are without bugs) to output sometimes English, sometimes other languages, depending on the locale charset.

For end users it will look like this: "on one machine everything is fine, on another machine Perl doesn't respect the locale; all GNU tools and Python scripts work fine and print messages in my language, but Perl scripts don't seem to respect the locale in random circumstances".

To me it looks like English is better than garbage, but the 5.18 behaviour is even better anyway (we can add $! to the list at http://perldoc.perl.org/perlunicode.html#When-Unicode-Does-Not-Happen together with @ARGV, %ENV, etc.)

p5pRT commented 10 years ago

From @khwilliamson

tl;dr summary of this

I assert it is better to have an error message come out in a foreign language (probably English) than to have apparent garbage bytes emitted.

If we output UTF-8 bytes without the UTF-8 flag being on to code that handles UTF-8, they will appear to be garbage bytes. But if we set the flag, this breaks code that isn't expecting to handle UTF-8. We break one class or the other. The only way around it is to output bytes that are the same in both UTF-8 and non-UTF-8, unless we are confident that the code can handle UTF-8. That means outputting ASCII when we don't have that confidence.

My bottom line proposal is to look at the $! text, and if it contains only ASCII, output it as-is.

We can be reasonably confident that the program can handle UTF-8 if we are within the scope of 'use locale'. ($! should not be UTF-8 unless the current locale is UTF-8.) Within that scope we also output $! as-is, setting the UTF-8 flag if it is UTF-8.

But if we are not within such scope we can't be confident at all about how the I/O is set up, etc. In that case, for non-ASCII $! text, we switch momentarily to the C locale, and re-get $!, which we then output. This text will be in ASCII and (almost certainly) English, which can be placed in something like Google Translate.

This is not ideal but it pretty much assures that no one is going to get garbage bytes that Google translate won't likely be able to figure out.
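The proposal above can be approximated in user-level code. What follows is only a sketch under stated assumptions, not how the core would implement it: `ascii_safe_errno_text` is a hypothetical helper name, and the sketch assumes a platform where the POSIX module provides `LC_MESSAGES` (so not MS Windows).

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_MESSAGES);
use Errno qw(ENOENT);

# Hypothetical helper (not part of Perl): return the message text for an
# errno value, falling back to the C locale when the localized text
# contains anything other than ASCII.
sub ascii_safe_errno_text {
    my ($errno) = @_;
    local $! = $errno;

    # Ask for the locale-aware text; on perls where $! ignores
    # 'use locale' this is simply the usual stringification.
    my $text = do { use locale; "$!" };
    return $text unless $text =~ /[^\x00-\x7F]/;

    my $saved = setlocale(LC_MESSAGES);      # query the current setting
    setlocale(LC_MESSAGES, 'C');             # switch momentarily to C
    my $fallback = "$!";                     # re-get the message text
    setlocale(LC_MESSAGES, $saved) if defined $saved;
    return $fallback;
}

my $message = ascii_safe_errno_text(ENOENT);
print "$message\n";
```

Messages that are already pure ASCII (which includes many Western European ones) pass through unchanged; only non-ASCII text triggers the switch to the C locale.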

On 03/01/2014 10:43 PM, Karl Williamson wrote:

This is my attempt to bring some clarity to this issue and stake out my position regarding it. I haven't re-read the thread thoroughly just now, so I may miss some issues, but I have been very aware of the central problem regarding this for months now, and have been thinking about it for the same amount of time, so I believe that what follows is an adequate summary of that.

First the background. This ticket is about a commit that fixed two tickets with the same underlying cause, https://rt-archive.perl.org/perl5/Ticket/Display.html?id=112208 "printing $! when open.pm sets utf8 default on filehandles yields garbage", and #117429, merged with the earlier ticket.

The problem is that $! was returning UTF-8 encoded text\, but the UTF-8 flag was not set\, so it displayed as garbage.

The fix was simply to set the UTF-8 flag if the text is valid UTF-8.

The problem with this is that it breaks code that just outputs the returned bytes to a filehandle that doesn't have the utf8 default on. Now what gets displayed there looks like garbage.

If the broken code uses $! within the scope of the hated-by-some 'use bytes', then the UTF-8 flag doesn't get set, and the return is precisely what it used to be.

Thus a potential solution is to force such code to change to do a 'use bytes'. FC is concerned that programs like ack will have to change if we choose this scenario.

Otherwise we are in a quandary. If we revert the commit, code that "does the right thing" by setting its filehandle appropriately gets garbage; whereas if we keep it, code that is unprepared to handle UTF-8 can get garbage. There's probably far more of the latter than the former, but do we wish to punish code that DTRT?

Before proceeding, I want to make an assertion: I think that it is better for someone to get output in a language foreign to them, than it is to get garbage bytes. This is because they can put the former into something like Google Translate to get a reasonable translation back into their own language; and I believe that it is much harder to figure out what was intended from what appears to be garbage bytes.

Do you accept or reject this assertion?

If you don't accept it, then you need to persuade me and others who do accept it why not, and there's not much point in you reading the rest of this message.

If you do accept it, one solution is to always output $! in English, which we would do by always using the C locale when generating its text.

This could be relaxed by using the POSIX locale instead. On most platforms this will be identical to the C locale, but on VMS, at least, it can include Western European languages as well, though I think that VMS only returns $! in English.

A more general solution would be to output it in the native locale unless it is UTF-8 encoded, in which case it would be converted to English. This would then cause code like (apparently) ack to see no change in behavior, except that some errors would now come out in English; and the code that was affected by #119499 would get English, instead of garbage.

The reason that this issue comes up for programs that don't handle UTF-8 is because $! does not respect 'use locale'. The reason for this is that $! typically gives the user an OS error that is outside Perl's purview, and it's best that these messages be displayed in the user's preferred language. But since what we have now causes garbage to be displayed for one class of user, it seems to me to be a higher priority, given my assertion, to output something sane for everybody, rather than something ideal for some, and garbage for others.

That leads to yet another possibility, one that rjbs has previously vetoed, but which I'm bringing up again here alongside this background that he may not have considered: that is to have $! respect 'use locale'. Outside of 'use locale' it would be C or POSIX, which would mean English. Within the scope of 'use locale', it would be the user's language. Programs that do a 'use locale' can be assumed to be written to be able to handle locales, including the increasingly common UTF-8 locales.

It seems to me wrong to deliver $! locale-encoded to programs that aren't prepared to accept it. We may have gotten away with this for non-UTF-8 locales because most code won't try to parse the stringified $! (and it's probably foolish to try to parse it), but the UTF-8 flag throws a wrench into this uneasy truce.

To state my position explicitly: I don't think it's a good idea to return a UTF-8 encoded string to code that isn't expecting that possibility. And I don't think it's OK to have users see garbage bytes. To avoid doing these, we have to return English whenever that could happen. 'use locale' in code should be enough to signal it's prepared to handle UTF-8; otherwise it's buggy.

So still another possibility is to deliver $! in the current locale if $! isn't UTF-8; otherwise to use English outside of 'use locale' and the UTF-8 inside. That leaves code that sets things up properly to get $! returned in the user's language; and code that doesn't will also get the user's language unless what is returned would be in UTF-8, in which case it will come out in English, instead of garbage. This seems to me to be the best solution.

Another possibility, suggested by FC, is to leave $! as-is, but create a new variable that behaves differently. I think it's far better to get $! to work reasonably than to come up with an alternative variable.

p5pRT commented 10 years ago

From @khwilliamson

I looked at https://github.com/petdance/ack2/issues/367 which shows that ack is broken by the 5.19.2 change.

If you look at that link, you'll see that the Russian comes out fine, but with a warning that didn't use to be there; the French is broken.

What is happening is that ack treats everything as bytes, and so everything just worked. STDERR is opened as a byte-oriented file, and if $! actually did contain UTF-8, it wasn't marked as such, and its component bytes were output as-is, so that if in fact the terminal is expecting UTF-8, they come out looking like UTF-8 to it, and everything held together. (Garbage would ensue if the terminal wasn't expecting the encoding that $! is in. I haven't checked, but my guess is that the grep output is also output as-is, so if the file encodings differ from the terminal expectation, garbage could be printed; but in practice I doubt that this is a problem.)

What the 5.19.2 change did effectively is to make the stringification of "$!" obey "use bytes". Most code isn't in bytes' scope, so the UTF-8 flag gets turned on if appropriate.

Perl's do_print() function checks if the stream is listed as UTF-8 or not. The string being output is converted to the stream's encoding if necessary and possible. If not possible, things are just output as-is, possibly with warnings. In ack's case the stream never is (AFAIK) UTF-8. Starting in 5.19.2, the message can be marked as UTF-8, and so Perl tries to convert it to the non-UTF-8 stream. This is impossible in Russian, so the bytes are output as-is, with a warning. Since the terminal really is UTF-8, they display correctly. But it is possible to convert the French text, as all the characters in the message in the bug report are Latin-1. So do_print() does this, but since the terminal's encoding doesn't match what ack thinks it is, the non-ASCII characters come out as garbage.
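The two outcomes described above can be reproduced without any locale setup. This is a minimal sketch, assuming hand-written strings as stand-ins for localized $! text (they are not taken from the bug report) and an in-memory filehandle standing in for ack's byte-oriented STDERR; `bytes_written` is a hypothetical helper name.

```perl
use strict;
use warnings;
use utf8;                      # the literals below are written in UTF-8
use Encode qw(encode);

# Hypothetical stand-ins for localized error text.
my $french  = "Permission non accordée";        # fits entirely in Latin-1
my $russian = "Нет такого файла или каталога";  # characters above U+00FF

# Print a character string to a handle with no encoding layer and
# return the bytes that were written plus the number of warnings.
sub bytes_written {
    my ($string) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $buffer = '';
    open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return ($buffer, scalar @warnings);
}

my ($fr_bytes, $fr_warn) = bytes_written($french);
my ($ru_bytes, $ru_warn) = bytes_written($russian);

# French can be downgraded, so Latin-1 bytes are written silently.
printf "french:  %s, %d warning(s)\n",
    ($fr_bytes eq encode('UTF-8', $french) ? 'UTF-8 bytes' : 'downgraded bytes'),
    $fr_warn;

# Russian cannot be downgraded, so UTF-8 bytes go out with a warning.
printf "russian: %s, %d warning(s)\n",
    ($ru_bytes eq encode('UTF-8', $russian) ? 'UTF-8 bytes' : 'downgraded bytes'),
    $ru_warn;
```

On a UTF-8 terminal the downgraded French bytes are what show up as garbage, while the Russian bytes display correctly despite the "Wide character" warning.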

Note that ack has some of its messages hard-coded in English. For example, it does a -e on the file name, and outputs English-only if it doesn't exist. rjbs has pointed out to me privately that typical uses of $! are of the form

die "my message in English: $!"

I am not an ack user, but it appears to me that ack is like a filter which doesn't care about encodings. It is byte rather than character oriented. This seems to me to be an appropriate use of 'use bytes', and if ack did this, this bug would not arise.

My proposal to only use ASCII characters in error messages unless within 'use locale' would also fix this problem. All messages that print in Russian and some messages in French would now appear in English, adding to the several that already print in English no matter what.

p5pRT commented 10 years ago

From victor@vsespb.ru

2014-03-27 1:41 GMT+04:00 Karl Williamson via RT <perlbug-followup@perl.org>:

I looked at https://github.com/petdance/ack2/issues/367 which shows that ack is broken by the 5.19.2 change.

If you look at that link, you'll see that the Russian comes out fine, but with a warning that didn't use to be there; the French is broken.

What is happening is that ack treats everything as bytes, and so everything just worked. STDERR is opened as a byte-oriented file, and if $! actually did contain UTF-8, it wasn't marked as such, and its component bytes were output as-is, so that if in fact the terminal is expecting UTF-8, they come out looking like UTF-8 to it, and everything held together. (Garbage would ensue if the terminal wasn't expecting the encoding that $! is in. I haven't checked, but my guess is that the grep output is also output as-is, so if the file encodings differ from the terminal expectation, garbage could be printed; but in practice I doubt that this is a problem.)

What the 5.19.2 change did effectively is to make the stringification of "$!" obey "use bytes". Most code isn't in bytes' scope, so the UTF-8 flag gets turned on if appropriate.

Perl's do_print() function checks if the stream is listed as UTF-8 or not. The string being output is converted to the stream's encoding if necessary and possible. If not possible, things are just output as-is, possibly with warnings. In ack's case the stream never is (AFAIK) UTF-8. Starting in 5.19.2, the message can be marked as UTF-8, and so Perl tries to convert it to the non-UTF-8 stream. This is impossible in Russian, so the bytes are output as-is, with a warning. Since the terminal really is UTF-8, they display correctly. But it is possible to convert the French text, as all the characters in the message in the bug report are Latin-1. So do_print() does this, but since the terminal's encoding doesn't match what ack thinks it is, the non-ASCII characters come out as garbage.

Yes, agreed. Anyway, warnings are bad, and broken Latin-1 is bad too.

Note that ack has some of its messages hard-coded in English. For example, it does a -e on the file name, and outputs English-only if it doesn't exist. rjbs has pointed out to me privately that typical uses of $! are of the form

die "my message in English: $!"

Right, usually "my message in English" is indeed in English because authors don't bother with full localization and translations into all languages, but for consistency it's better to see $! in the locale's language. Other programs usually show it in the user's language.

I am not an ack user, but it appears to me that ack is like a filter which doesn't care about encodings. It is byte rather than character oriented. This seems to me to be an appropriate use of 'use bytes', and if ack did this, this bug would not arise.

I would disagree; they are trying to migrate to Unicode:

https://github.com/petdance/ack2/issues/120 https://github.com/petdance/ack2/issues/344 https://github.com/petdance/ack2/issues/350 https://github.com/petdance/ack2/issues/355

ack searches _text_ using _Perl regexps_ in text files. It even ignores files detected as binary (by default, at least in my installation).

My proposal to only use ASCII characters in error messages unless within 'use locale' would also fix this problem. All messages that print in Russian and some messages in French would now appear in English, adding to the several that already print in English no matter what.

I am writing programs with correct use of modern Perl Unicode now, but I have never used 'use locale'; it seems to add additional side effects to code? Can there be a special option for 'use locale' to not change anything at all except $! behaviour (in lexical scope)?

Also, can code without 'use locale' behave like 5.18 (i.e. not always in English; bytes), and with 'use locale :errno_only' change $! to return a Unicode character string?

--- via perlbug: queue: perl5 status: open https://rt-archive.perl.org/perl5/Ticket/Display.html?id=119499

p5pRT commented 10 years ago

From @khwilliamson

On 03/26/2014 04:06 PM, Victor Efimov wrote:

2014-03-27 1:41 GMT+04:00 Karl Williamson via RT <perlbug-followup@perl.org>:

I looked at https://github.com/petdance/ack2/issues/367 which shows that ack is broken by the 5.19.2 change.

If you look at that link, you'll see that the Russian comes out fine, but with a warning that didn't use to be there; the French is broken.

What is happening is that ack treats everything as bytes, and so everything just worked. STDERR is opened as a byte-oriented file, and if $! actually did contain UTF-8, it wasn't marked as such, and its component bytes were output as-is, so that if in fact the terminal is expecting UTF-8, they come out looking like UTF-8 to it, and everything held together. (Garbage would ensue if the terminal wasn't expecting the encoding that $! is in. I haven't checked, but my guess is that the grep output is also output as-is, so if the file encodings differ from the terminal expectation, garbage could be printed; but in practice I doubt that this is a problem.)

What the 5.19.2 change did effectively is to make the stringification of "$!" obey "use bytes". Most code isn't in bytes' scope, so the UTF-8 flag gets turned on if appropriate.

Perl's do_print() function checks if the stream is listed as UTF-8 or not. The string being output is converted to the stream's encoding if necessary and possible. If not possible, things are just output as-is, possibly with warnings. In ack's case the stream never is (AFAIK) UTF-8. Starting in 5.19.2, the message can be marked as UTF-8, and so Perl tries to convert it to the non-UTF-8 stream. This is impossible in Russian, so the bytes are output as-is, with a warning. Since the terminal really is UTF-8, they display correctly. But it is possible to convert the French text, as all the characters in the message in the bug report are Latin-1. So do_print() does this, but since the terminal's encoding doesn't match what ack thinks it is, the non-ASCII characters come out as garbage.

Yes, agreed. Anyway, warnings are bad, and broken Latin-1 is bad too.

It's arguable that the warnings should have been output all along, since really it is UTF-8 being output to a terminal that perl thinks can't handle it.

Note that ack has some of its messages hard-coded in English. For example, it does a -e on the file name, and outputs English-only if it doesn't exist. rjbs has pointed out to me privately that typical uses of $! are of the form

die "my message in English: $!"

Right, usually "my message in English" is indeed in English because authors don't bother with full localization and translations into all languages, but for consistency it's better to see $! in the locale's language. Other programs usually show it in the user's language.

I am not an ack user, but it appears to me that ack is like a filter which doesn't care about encodings. It is byte rather than character oriented. This seems to me to be an appropriate use of 'use bytes', and if ack did this, this bug would not arise.

I would disagree; they are trying to migrate to Unicode:

https://github.com/petdance/ack2/issues/120 https://github.com/petdance/ack2/issues/344 https://github.com/petdance/ack2/issues/350 https://github.com/petdance/ack2/issues/355

ack searches _text_ using _Perl regexps_ in text files. It even ignores files detected as binary (by default, at least in my installation).

I stand corrected.

My proposal to only use ASCII characters in error messages unless within 'use locale' would also fix this problem. All messages that print in Russian and some messages in French would now appear in English, adding to the several that already print in English no matter what.

I am writing programs with correct use of modern Perl Unicode now, but I have never used 'use locale'; it seems to add additional side effects to code? Can there be a special option for 'use locale' to not change anything at all except $! behaviour (in lexical scope)?

locale works a lot better (I anticipate) in 5.20 than before. I think it should finally be possible to 'use locale' as a matter of habit.

I was already thinking that 'use locale' in 5.22 should have the ability to select LC_CTYPE and LC_COLLATE individually. It seems logical to make this general, so you could say

use locale ':messages, numeric';

to get just the effects you want. Some of this could conceivably be added in 5.20 if it helps to resolve this blocker.

Also, can code without 'use locale' behave like 5.18 (i.e. not always in English; bytes)

The problem is that the commit fixed real bugs in code that didn't "use locale". Thus the quandary. If we go back to 5.18 behavior, those bugs come back. I believe that my proposal that only ASCII messages get displayed outside of 'use locale' is the only "sure" method that doesn't display garbage to someone. (Note that ASCII doesn't necessarily mean English. Many error messages in Western European languages consist only of ASCII characters. I realize that doesn't help Russian or Chinese, etc.)

Also, I hadn't realized this before, but sometimes the message's characters aren't just garbage that someone with the motivation and skill could figure out, but the UNICODE REPLACEMENT CHARACTER can be displayed instead, so information is lost and can't be recovered.

? and with 'use locale :errno_only' change $! to return a Unicode character string.

I don't see how this differs from your suggestion above for an option to 'use locale' to just affect $! (which is BTW LC_MESSAGES).

And that reminds me, MS Windows doesn't have LC_MESSAGES, AFAIK. Can someone explain what languages error messages are displayed in under varied locales?

p5pRT commented 10 years ago

From @khwilliamson

Another possibility to get programs like ack to work unchanged is to add a non-printing above-Latin-1 character to the stringification of $! when it is UTF-8 and there are only Latin-1 characters in it. A possibility is a ZERO WIDTH SPACE. Then do_print() wouldn't try to downgrade. The drawback is that code that analyzes $! could be thrown off. But code generally should be analyzing the numeric value anyway, and not the string representation.
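For reference, analyzing the numeric value rather than the text looks roughly like this. It is a small sketch, assuming a path that does not exist on the machine running it; the path itself is made up for illustration.

```perl
use strict;
use warnings;
use Errno qw(ENOENT EACCES);

my $path = '/nonexistent/hypothetical/file';   # assumed not to exist

my $opened = open my $fh, '<', $path;

# Capture the numeric errno right away; it does not depend on the locale.
my $errno = $opened ? 0 : $! + 0;

# %! offers the same test by symbolic name.
my $is_missing = (!$opened && $!{ENOENT}) ? 1 : 0;

my $reason = !$errno           ? 'opened'
           : $errno == ENOENT  ? 'missing'
           : $errno == EACCES  ? 'permission denied'
           :                     'other';

print "open result: $reason (errno $errno)\n";
```

Code written this way is unaffected by whichever language or encoding the string form of $! ends up using.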

p5pRT commented 10 years ago

From victor@vsespb.ru

2014-03-27 3:12 GMT+04:00 Karl Williamson <public@khwilliamson.com>:

locale works a lot better (I anticipate) in 5.20 than before.

So it worked badly before? Then it will be hard to write code compatible with 5.20 and, say, 5.8.8 at the same time (that again relates to 'ack'-like programs: a command-line program that should work with the system perl installed by end users, not a web application where the programmer can choose the perl version).

The problem is that the commit fixed real bugs in code that didn't "use locale". Thus the quandary. If we go back to 5.18 behavior, those bugs come back.

Who said it was a bug? I saw this behaviour but never thought it was a bug, because there is a note in the documentation:

While Perl does have extensive ways to input and output in Unicode, and a few other "entry points" like the @ARGV array (which can sometimes be interpreted as UTF-8), there are still many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both, but it is not.

A user reported this as a bug because he did not read this. For me it's documented behaviour.

p5pRT commented 10 years ago

From @ap

* Karl Williamson <public@khwilliamson.com> [2014-03-27 03:10]:

Another possibility to get programs like ack to work unchanged is to add a non-printing above-Latin-1 character to the stringification of $! when it is UTF-8 and there are only Latin-1 characters in it. A possibility is a ZERO WIDTH SPACE. Then do_print() wouldn't try to downgrade. The drawback is that code that analyzes $! could be thrown off. But code generally should be analyzing the numeric value anyway, and not the string representation.

Maybe you can attach magic that prevents a downgrade?

p5pRT commented 10 years ago

From @khwilliamson

On 03/27/2014 04:57 AM, Aristotle Pagaltzis wrote:

* Karl Williamson <public@khwilliamson.com> [2014-03-27 03:10]:

Another possibility to get programs like ack to work unchanged is to add a non-printing above-Latin-1 character to the stringification of $! when it is UTF-8 and there are only Latin-1 characters in it. A possibility is a ZERO WIDTH SPACE. Then do_print() wouldn't try to downgrade. The drawback is that code that analyzes $! could be thrown off. But code generally should be analyzing the numeric value anyway, and not the string representation.

Maybe you can attach magic that prevents a downgrade?

That sounds like a better approach, but it is an area that I know essentially nothing about. If I were to do it, it seems not so likely that I could get it right by 5.20; I don't know how hard it would be for someone experienced in the magical arts of Perl™.

Likewise, adding the ZERO WIDTH SPACE would need to be done early in the development cycle to see what might break, not late, so it shouldn't be considered as a 5.20 solution.

p5pRT commented 10 years ago

From @khwilliamson

On 03/27/2014 02​:01 AM\, Victor Efimov wrote​:

2014-03-27 3​:12 GMT+04​:00 Karl Williamson \public@​khwilliamson\.com​:

locale works a lot better (I anticipate) in 5.20 than before.

So\, it worked bad before? Than it will be hard to write code compatible with 5.20 and\, say\, 5.8.8 at same time (that again related to 'ack'-like programs - it's command line program that should work in system perl installed by end users. it's not a web application where programmer can choose perl version)

I don't follow your logic. 5.20 will contain a bunch of bug fixes related to locale handling. Earlier versions will continue to work as before. Perhaps what you meant is that it will be hard to write code that takes advantage of whatever 5.20 has\, but still works in older releases. That could be true\, but it's not something that there is anything that can be done about\, except possibly some things in PPPort.h\, if we end up adding new macros.

It's a given that we can't break things like ack unless there is an easy workaround that is backwards compatible.

The problem is that the commit fixed real bugs in code that didn't "use locale" Thus the quandary. If we go back to 5.18 behavior\, those bugs come back.

Who told that it was bug? I saw this behaviour but never thought it is a bug\, because there is note in documentation​:

While Perl does have extensive ways to input and output in Unicode\, and a few other "entry points" like the @​ARGV array (which can sometimes be interpreted as UTF-8)\, there are still many places where Unicode (in some encoding or another) could be given as arguments or received as results\, or both\, but it is not.

A user reported this as a bug because he did not read this. For me it's documented behaviour.

I disagree that documenting bad behavior means it should not eventually be fixed. The commit that led to this ticket fixed two other tickets\, now merged as https://rt-archive.perl.org/perl5/Ticket/Display.html?id=112208.  Those tickets seem to me to be perfectly legitimate as being bugs deserving of being fixed.

If we revert this commit\, those bugs come back.

p5pRT commented 10 years ago

From victor@vsespb.ru

2014-03-27 22​:14 GMT+04​:00 Karl Williamson \public@​khwilliamson\.com​:

I don't follow your logic. 5.20 will contain a bunch of bug fixes related to locale handling. Earlier versions will continue to work as before. Perhaps what you meant is that it will be hard to write code that takes advantage of whatever 5.20 has\, but still works in older releases.

It is hard to write code which works in 5.8 and 5.20 at the same time (_without_ taking advantage of 5.20), because now I need to 'use locale', and I assume that in old versions of perl 'use locale' works badly and introduces additional complexities.

I disagree that documenting bad behavior means it should not eventually be fixed. The commit that led to this ticket fixed two other tickets\, now merged as https://rt-archive.perl.org/perl5/Ticket/Display.html?id=112208.  Those tickets seem to me to be perfectly legitimate as being bugs deserving of being fixed.

But those are not bugs compared to the real trouble now. And if the behaviour was documented, those are feature requests. The real trouble now: old code such as open(my $f, ">", $filename) or die $!; will issue warnings. There are a lot of "or die $!" in the perl documentation, and now everything is broken.

Why is it so complex to just introduce $DECODED_ERRNO, or a pragma (working in lexical scope) to turn the utf8 flag on? That's much better than breaking so much old code and inserting "zero width spaces" into messages.

p5pRT commented 10 years ago

From @demerphq

On 2 March 2014 06​:43\, Karl Williamson \public@​khwilliamson\.com wrote​:

This is my attempt to bring some clarity to this issue and stake out my position regarding it. I haven't re-read the thread thoroughly just now\, so I may miss some issues\, but I have been very aware of the central problem regarding this for months now\, and have been thinking about it for the same amount of time\, so I believe that what follows is an adequate summary of that.

First the background. This ticket is about a commit that fixed two tickets with the same underlying cause\, https://rt-archive.perl.org/perl5/Ticket/Display.html?id=112208 "printing $! when open.pm sets utf8 default on filehandles yields garbage"\, and #117429\, merged with the earlier ticket.

The problem is that $! was returning UTF-8 encoded text\, but the UTF-8 flag was not set\, so it displayed as garbage.

The fix was simply to set the UTF-8 flag if the text is valid UTF-8.

The problem with this is that it breaks code that just output the returned bytes and the filehandle doesn't have the utf8 default on. Now what gets displayed there looks like garbage.

If the broken code uses $! within the scope of the hated-by-some 'use bytes'\, then the UTF-8 flag doesn't get set\, and the return is precisely what it used to be.

Thus a potential solution is to force such code to change to do a 'use bytes'. FC is concerned that programs like ack will have to change if we choose this scenario.

Unless I have misunderstood, it is not just ack, but pretty much every Perl program I ever wrote or saw.

This type of pattern is extremely pervasive​:

open my $fh, ">", $file   or die "Failed to open '$file' for writing: $!";

I am under the impression you are saying they all have to change to:

open my $fh, ">", $file   or do { use bytes; die "Failed to open '$file' for writing: $!" };

Which I find almost astounding. Please tell me I have misunderstood.
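For illustration only, here is a minimal sketch of an alternative to wrapping every die in 'use bytes'. It was not proposed in the thread: errno_as_bytes is a hypothetical helper name, and the path is a placeholder. It assumes that turning a flagged message back into UTF-8 bytes reproduces the pre-5.19.2 byte string, which is what the description of the flag-setting fix implies.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return a message as a byte
# string whether or not perl set the UTF-8 flag on it. With no argument
# it stringifies the current $!.
sub errno_as_bytes {
    my ($msg) = @_;
    $msg = "$!" unless defined $msg;
    utf8::encode($msg) if utf8::is_utf8($msg);   # flagged chars -> UTF-8 bytes
    return $msg;
}

my $file = '/nonexistent-dir/out.txt';           # placeholder path
open(my $fh, '>', $file)
    or warn "Failed to open '$file' for writing: " . errno_as_bytes() . "\n";
```

The helper only touches strings that carry the flag, so on perls that never set it the message passes through unchanged.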

Otherwise we are in a quandary. If we revert the commit\, code that "does the right thing" by setting their filehandle appropriately gets garbage; whereas if we keep it\, code that is unprepared to handle UTF-8 can get garbage. There's probably far more of the latter than the former\, but do we wish to punish code that DTRT?

Before proceeding\, I want to make an assertion​: I think that it is better for someone to get output in a language foreign to them\, than it is to get garbage bytes. This is because they can put the former into something like Google translate to get a reasonable translation back into their own language; and I believe that what appears to be garbage bytes is much more problematical to figure out what was intended.

Do you accept or reject this assertion?

I accept it. However I think it is secondary to the question of requiring pretty much every script that uses filehandles to change. Maybe I am wrong that this is what you are suggesting, but if it is then IMO it cannot be the right answer.

If you don't accept it\, then you need to persuade me and others who do accept it\, why not\, and there's not much point in you reading the rest of this message.

If you do accept it\, one solution is to always output $! in English\, which we would do by always using the C locale when generating its text.

This could be relaxed by using the POSIX locale instead. On most platforms this will be identical to the C locale\, but on VMS\, at least\, it can include Western European languages as well\, though I think that VMS only returns $! in English.

A more general solution would be to output it in the native locale unless it is UTF-8 encoded\, in which case it would be converted to English. This would then cause the code like (apparently) ack to see no change in behavior\, except that some errors would now come out in English; and the code that was affected by #119499 would get English\, instead of garbage.

The reason that this issue comes up for programs that don't handle UTF-8 is because $! does not respect 'use locale'. The reason for this is that $! typically gives the user an OS error that is outside Perl's purview\, and it's best that these messages be displayed in the user's preferred language. But since what we have now causes garbage to be displayed for one class of user\, it seems to me to be a higher priority\, given my assertion\, to output something sane for everybody\, rather than something ideal for some\, and garbage for others.

For me prioritising "use locale" over every other script is inappropriate. IMO relatively few scripts use it. IMO for years the general recommendation about "use locale" has been to avoid it. I personally would get rid of it completely.

That leads to yet another possibility\, one that rjbs has previously vetoed\, but which I'm bringing up again here alongside this background that he may not have considered​: And that is to have $! respect 'use locale'. Outside of 'use locale' it would be C or POSIX\, which would mean English. Within the scope of 'use locale'\, it would be the user's language. Programs that do a 'use locale' can be assumed to be written to be able to handle them\, including the increasingly common UTF-8 locales.

It seems to me wrong to deliver $! locale-encoded to programs that aren't prepared to accept it. We may have gotten away with this for non-UTF-8 locales because most code won't try to parse the stringified $! (and it's probably foolish to try to parse it)\, but the UTF-8 flag throws a wrench into this uneasy truce.

To state my position explicitly: I don't think it's a good idea to return a UTF-8 encoded string to code that isn't expecting that possibility. And I don't think it's OK to have users see garbage bytes. To avoid doing these, we have to return English whenever that could happen. 'use locale' in code should be enough to signal it's prepared to handle UTF-8; otherwise it's buggy.

So still another possibility is to deliver $! in the current locale if $! isn't UTF-8; otherwise to use English outside of 'use locale' and the UTF-8 inside. That leaves code that sets things up properly to get $! returned in the user's language; and code that doesn't will also get the user's language unless what is returned would be in UTF-8\, in which case it will come out in English\, instead of garbage. This seems to me to be the best solution.

Another possibility\, suggested by FC\, is to leave $! as-is\, but create a new variable that behaves differently. I think it's far better to get $! to work reasonably than to come up with an alternative variable.

I personally think that $! should be left alone\, and you should introduce a new pragma to control the decoding behavior of $!. Those people with bugs related to it can use the pragma.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 10 years ago

From @khwilliamson

I was wrong in several things when I wrote this; please skip to later posts on the thread.

On 03/27/2014 04​:07 PM\, demerphq wrote​:

p5pRT commented 10 years ago

From @khwilliamson

In this post\, I will just give some new insights I had today.

There are real bugs (even if the others previously mentioned aren't regarded as such) when "$!" isn't returned with the UTF-8 flag on\, and when $! is stringified to its locale string outside of "use locale" scope.

Consider this one liner​:

LC_ALL=zh_CN.utf8 ./perl -Ilib -le 'use utf8; $!=1; die "致命錯誤: $!"'

In blead\, it prints\, as it should\, Wide character in die at -e line 1 致命錯誤​: 不允许的操作 at -e line 1

In 5.18.2 it prints this garbage instead Wide character in die at -e line 1 致命錯誤​: 不允许的操作 at -e line 1

The reason is that the program is encoded in utf8, and $! has returned utf8 (only in the 5.18 case) without setting the utf8 flag, and so Perl takes the bytes that form $! and upgrades those bytes into utf8 (again). In other words, it's encoding twice.

(I chose Chinese because its script could not be confused with Western European characters, and I used Google translate, so the constant portion of the text may not make sense; I apologize to the Chinese speakers reading this.)

"use utf8" is not necessary for this. It could be die "$prefix: $!" where $prefix has its utf8 flag on.

These examples show\, once again\, the perils of having a scalar that's in UTF-8\, but pretending it's not\, even if it's just in a die(). I claim they conclusively show the brokenness of the 5.18 code.
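The double encoding can be reproduced without any particular locale installed. The sketch below is an illustration, not code from the ticket: it uses Encode to build a stand-in for the 5.18-style "$!" (UTF-8 bytes with the flag off) and concatenates it with a character-string prefix. The variable names are made up for the example.

```perl
use strict;
use warnings;
use utf8;                        # string literals below are character strings
use Encode qw(encode decode);

# Stand-in for a 5.18-style "$!": UTF-8 encoded bytes, UTF-8 flag off.
my $text        = '不允许的操作';
my $errno_bytes = encode('UTF-8', $text);

my $prefix = '致命錯誤';         # character string, UTF-8 flag on

# Concatenation upgrades the bytes as if each one were a Latin-1 character,
# so every byte becomes a character of its own: the text is double-encoded.
my $double = "$prefix: $errno_bytes";

# Decoding the bytes first keeps one character per original character.
my $clean = "$prefix: " . decode('UTF-8', $errno_bytes);

printf "double-encoded: %d characters\n", length $double;   # 24
printf "decoded first:  %d characters\n", length $clean;    # 12
```

The 18 bytes of the message count as 18 characters in the first string and as 6 in the second, which is the difference between the garbage and the readable output.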

Another problem with all existing versions is if the $prefix is written in Latin1. Recall that the default character sets of Perl are ASCII, Latin1, and full Unicode, each a superset of the previous. So someone writing in Hungarian might write

./perl -Ilib -le '$!=1; die "fatális hibát: $!"'

(apologies to the Hungarian speakers)

If this is however run in a non-Latin1 locale\, like say

LC_ALL=el_GR.iso88597 ./perl -Ilib -le '$!=1; die "fatális hibát: $!"'

The first part of the string is in Latin1\, and the 2nd part is in Latin7. These are not compatible (except for their common ASCII range and a few punctuation characters). If the terminal is set to display Latin1\, the first part looks ok\, the second is garbage\, and vice versa (except the common characters will look ok in both)
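A small sketch of that mismatch, assuming nothing beyond the Encode module: the Hungarian prefix is encoded as Latin1, a Greek word is encoded as Latin7, and the combined bytes are then read back under each encoding in turn. The Greek word is an illustrative stand-in, not the actual system message.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

my $prefix_text = 'fatális hibát: ';
my $errno_text  = 'σφάλμα';    # illustrative stand-in, not real strerror text

# What reaches the terminal: two different 8-bit encodings in one line.
my $line = encode('iso-8859-1', $prefix_text)
         . encode('iso-8859-7', $errno_text);

# A terminal interprets the whole line in a single encoding.
my $as_latin1 = decode('iso-8859-1', $line);
my $as_latin7 = decode('iso-8859-7', $line);

binmode(STDOUT, ':encoding(UTF-8)');
print "Latin1 terminal: $as_latin1\n";   # prefix intact, Greek part wrong
print "Latin7 terminal: $as_latin7\n";   # Greek intact, accented letters wrong
```

Only the ASCII characters survive under both interpretations, which matches the observation that English prefixes hide the problem.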

There is no current way for an application to guard against this; it is a sitting duck. $! always comes out in the underlying locale. (The reason this doesn't show up more often is apparently that people write their prefix messages in English, hence ASCII, and all the locales, like 88597, are supersets of ASCII.)

I claim this shows the perils of having stuff appear in the underlying locale outside the scope of 'use locale'. An unsuspecting application that doesn't even know that locales exist can be hit by the user's environment passing in a locale\, or by any module somewhere in the tool chain doing a setlocale().

I believe the solution is to make $! return the C locale messages outside the scope of 'use locale'\, just like the other categories. By being in such scope\, the caller is indicating its willingness to handle and be smart about locale issues. Otherwise it shouldn't have to be exposed to them.

My recent proposal also works. That is to use the $! locale value provided it is all ASCII. That means that a fair number of system messages in various European languages will come out natively\, but not those that might adversely affect things like ack. The problem with this is that the application still doesn't have control.
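The ASCII-only rule can be sketched as a selection between two candidate strings. This is only an illustration of the policy, not the implementation in perl's C source; choose_errno_text is a hypothetical name and the sample message texts are made up rather than taken from a real message catalog.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the locale's message only when it is pure ASCII,
# otherwise fall back to the English (C locale) text.
sub choose_errno_text {
    my ($native, $english) = @_;
    return $native =~ /[^\x00-\x7F]/ ? $english : $native;
}

# Sample texts are illustrative only.
print choose_errno_text('Keine Berechtigung', 'Permission denied'), "\n";
print choose_errno_text("Op\xE9ration non permise", 'Operation not permitted'), "\n";
```

The first call keeps the native text because it is all ASCII; the second falls back to English because of the accented character.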

Note that in the messages above\, that Perl itself outputs its warnings and messages like "at -e line 1". Nobody has any control over that\, and I can't believe this fact hasn't discouraged some applications from using Perl in non-English settings.

What part of CPAN is expecting native-language $! ? I don't know\, but given the vagaries\, including some things always being in English\, and being at the mercy of the user's locale environment\, I suspect not much.

p5pRT commented 10 years ago

From @khwilliamson

Fixed for v5.20 by b17e32ea3ba5ef7362d2a3d1a433661afb897786

The plan for v5.21 is to make $! return locale messages only from within the scope of 'use locale'. In other words\, locale has to be opt-in. -- Karl Williamson
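As a sketch of what that opt-in would look like from the caller's side, assuming the plan is carried out as described (the actual output depends on the perl version and on which locales the system provides):

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_ALL);

# Adopt the locale from the user's environment.
setlocale(LC_ALL, '');

$! = 1;                       # errno 1, as in the one-liners in this ticket
my $outside = "$!";           # under the plan: C-locale (English) text

my $inside;
{
    use locale;               # lexically scoped opt-in
    $! = 1;
    $inside = "$!";           # under the plan: text in the user's locale
}

# Only add an encoding layer if perl handed back a character string.
binmode(STDOUT, ':encoding(UTF-8)') if utf8::is_utf8($inside);
print "outside 'use locale': $outside\n";
print "inside  'use locale': $inside\n";
```

In an English or C locale both lines print the same text; the difference only shows up when the environment selects another language.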

p5pRT commented 10 years ago

@khwilliamson - Status changed from 'open' to 'resolved'

p5pRT commented 10 years ago

From victor@vsespb.ru

I never received this message; I only received a notice that the bug is resolved.

On Thu Mar 27 22​:09​:05 2014\, public@​khwilliamson.com wrote​:

In this post\, I will just give some new insights I had today.

There are real bugs (even if the others previously mentioned aren't regarded as such) when "$!" isn't returned with the UTF-8 flag on\, and when $! is stringified to its locale string outside of "use locale" scope.

Consider this one liner​:

LC_ALL=zh_CN.utf8 ./perl -Ilib -le 'use utf8; $!=1; die "致命錯誤: $!"'

In blead\, it prints\, as it should\, Wide character in die at -e line 1 致命錯誤​: 不允许的操作 at -e line 1

In 5.18.2 it prints this garbage instead Wide character in die at -e line 1 致命錯誤​: 不允许的操作 at -e line 1

It's a general limitation of perl: one should not merge character strings with binary strings. Not a bug, but expected behaviour.

Another problem with all existing versions is if the $prefix is written in Latin1. Recall that the default character sets of Perl are ASCII, Latin1, and full Unicode, each a superset of the previous. So someone writing in Hungarian might write

./perl -Ilib -le '$!=1; die "fatális hibát: $!"'

(apologies to the Hungarian speakers)

If this is however run in a non-Latin1 locale\, like say

LC_ALL=el_GR.iso88597 ./perl -Ilib -le '$!=1; die "fatális hibát: $!"'

The first part of the string is in Latin1\, and the 2nd part is in Latin7. These are not compatible (except for their common ASCII range and a few punctuation characters). If the terminal is set to display Latin1\, the first part looks ok\, the second is garbage\, and vice versa (except the common characters will look ok in both)

The locale is iso88597, so the terminal should be set to iso88597 (otherwise everything is garbage). And if it is, it's no surprise that Latin1 is garbage.

What part of CPAN is expecting native-language $! ? I don't know\, but given the vagaries\, including some things always being in English\, and being at the mercy of the user's locale environment\, I suspect not much.

So you are worrying more about broken tests on CPAN, and not much about real bugs in users' code (which are not caught by tests). Users will be surprised that perl stopped giving $! in the locale's language, but they cannot catch this in tests because they would never suspect that such breakage could be introduced (unit tests are white-box testing: you can test only for bugs you expect).