Properly handle UTF8-flagged strings when assigning to $0.

FGasper commented 2 years ago

Issue #19331: Use of SvPV_const and SvPV_force in S_set_dollarzero() wrote the PV internals directly to argv, which causes an improper UTF-8 encode if the SV is UTF8-flagged/upgraded.

This fixes that by doing a downgrade prior to those SvPV* calls. If the string contains wide characters (and thus cannot be downgraded), a warning is thrown; this mirrors preexisting behavior with %ENV, print, and other output channels that convert Perl SVs to bytes.
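As a rough Perl-level illustration of that rule (the actual change is in C, inside S_set_dollarzero(); the helper name bytes_for_os below is hypothetical and not part of this PR), downgrading first and falling back to UTF-8 bytes with a warning might look like this:

```perl
use strict;
use warnings;

# Hypothetical helper: compute the bytes an output channel should receive.
sub bytes_for_os {
    my ($str) = @_;
    my $copy = $str;
    # The true second argument makes utf8::downgrade return false
    # instead of dying when the string holds a wide character.
    if (!utf8::downgrade($copy, 1)) {
        warn "Wide character in bytes_for_os\n";
        utf8::encode($copy);    # fall back to the UTF-8 bytes
    }
    return $copy;
}

my $upgraded = "caf\xe9";
utf8::upgrade($upgraded);       # same characters, different internal storage

my $out  = bytes_for_os($upgraded);
my $wide = do { local $SIG{__WARN__} = sub { }; bytes_for_os("\x{100}") };

print join(' ', map { sprintf '%02x', ord } split //, $out),  "\n";   # 63 61 66 e9
print join(' ', map { sprintf '%02x', ord } split //, $wide), "\n";   # c4 80
```

The upgraded string produces the same four bytes as its downgraded twin; only the wide character takes the fallback path.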

xenu commented 2 years ago

This is a breaking change. If we're OK with that we might just as well fix all the other syscalls that have a similar problem.

FGasper commented 2 years ago

@xenu It's about on the level of %ENV, IMO less so than mkdir, open, et al.

FGasper commented 2 years ago

I’ve updated this so that it UTF8-encodes $0 itself when given a wide character. This (correctly, ISTM) breaks round-tripping in favour of having $0 more accurately represent the actual process name.

Grinnz commented 2 years ago

I don't think that's a better behavior, or at least not one easy to explain as intentional - it furthers the behavioral discrepancy between "correctly" upgraded strings and strings with non-ascii but not wide characters.

khwilliamson commented 2 years ago

Something to consider for this or other similar situations is that it doesn't take many UTF-8 encoded characters in a string to rule out it being in some other encoding. Some years ago I added heuristics to Pod::Simple to better distinguish between CP1252 and UTF-8. Since then there have been no reports of failures. It gets Muvrar\xE1\x9A\x9Aa right for example, which is syntactically legal in both, and a real modern word in CP1252. I use a variety of techniques to exclude syntactically legal UTF-8. In this case, no real person would name something using a mixture of modern Latin and ancient Ogham characters, which stopped being used 1500 years ago.

Grinnz commented 2 years ago

You do have to account for the possibility that the bytes being set aren't in UTF-8 or CP1252 though, or in many other instances, that it isn't text at all

FGasper commented 2 years ago

I don't think that's a better behavior, or at least not one easy to explain as intentional - it furthers the behavioral discrepancy between "correctly" upgraded strings and strings with non-ascii but not wide characters.

All that’s changed in my latest push is that $0 will always match /cmdline. Previously if you did:

$0 = "\x{100}";
Dump $0

… you’d see "\x{100}" as $0’s value, even though /cmdline is the 2 bytes of that character in UTF-8. Now $0 will be the 2 UTF-8 bytes instead.
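For reference, a small sketch of the two bytes in question, using only utf8::encode and not touching $0 itself (so it does not depend on which perl is running it):

```perl
use strict;
use warnings;

my $assigned = "\x{100}";   # one character, code point 0x100
my $bytes    = $assigned;
utf8::encode($bytes);       # now a two-character byte string: C4 80

printf "%d character assigned, %d bytes written\n",
    length($assigned), length($bytes);
print $assigned eq $bytes ? "round-trips\n" : "does not round-trip\n";
```

This is the round-trip break described above: the value read back is the byte string, not the single character that was assigned.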

FGasper commented 2 years ago

Something to consider for this or other similar situations is that it doesn't take many UTF-8 encoded characters in a string to rule out it being in some other encoding. Some years ago I added heuristics to Pod::Simple to better distinguish between CP1252 and UTF-8. Since then there have been no reports of failures. It gets Muvrar\xE1\x9A\x9Aa right for example, which is syntactically legal in both, and a real modern word in CP1252. I use a variety of techniques to exclude syntactically legal UTF-8. In this case, no real person would name something using a mixture of modern Latin and ancient Ogham characters, which stopped being used 1500 years ago.

Do you mind fleshing this out a bit? Right now I don’t see the relevance. (Likewise @Grinnz’s comment.) The UTF8-encode is just there to ensure, after a write to argv, that $0 matches the value we just wrote rather than what we tried—improperly—to write.

Grinnz commented 2 years ago

I don't think that's a better behavior, or at least not one easy to explain as intentional - it furthers the behavioral discrepancy between "correctly" upgraded strings and strings with non-ascii but not wide characters.

All that’s changed in my latest push is that $0 will always match /cmdline. Previously if you did:
$0 = "\x{100}";
Dump $0
… you’d see "\x{100}" as $0’s value, even though /cmdline is the 2 bytes of that character in UTF-8. Now $0 will be the 2 UTF-8 bytes instead.

Which is not the value that was assigned. There is no "correct" behavior here - instead this is less consistent, because the same thing does not happen if you assign \xFF.

FGasper commented 2 years ago

@Grinnz Wouldn’t assigning \x{100} as part of the program name ideally trigger an error, actually, since the program name—or, at least, the raw argv buffer—is, by definition, a byte string?

Assuming that to be the case, it seems more logical for $0 to represent the program name as documented rather than a value that the Perl programmer improperly tried to assign as the program name. This is how %ENV values work, for example:

> perl -MDevel::Peek -e'$ENV{foo} = "\x{100}"; Dump( "". $ENV{foo} )'
Wide character in setenv at -e line 1.
SV = PV(0x7fa4ad00c550) at 0x7fa4ad00bb38
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK)
  PV = 0x7fa4ace04290 "\304\200"\0
  CUR = 2
  LEN = 10

Grinnz commented 2 years ago

I think it is harder to explain, but if it is consistent with %ENV that is fine.

FGasper commented 2 years ago

Marking ready-for-review as I believe the failing Cygwin test is unrelated.

khwilliamson commented 2 years ago

My point had nothing to do with CP1252 really, or necessarily with this PR. My point is that when needed, one can quite accurately distinguish between UTF-8 and not UTF-8. Besides the somewhat restrictive UTF-8 syntax, further analysis can lead to rejecting many more strings that are syntactically valid UTF-8. It's something to keep in mind when the situation arises.
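A minimal sketch of the syntactic half of that idea, assuming a hypothetical helper named looks_like_utf8 (this is not the Pod::Simple heuristic, and Perl's own UTF-8 validation is more permissive than strict Unicode):

```perl
use strict;
use warnings;

# Hypothetical helper: true if the bytes contain non-ASCII data that is
# syntactically valid UTF-8 by Perl's rules.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return 0 unless $bytes =~ /[\x80-\xff]/;   # pure ASCII tells us nothing
    my $copy = $bytes;
    return utf8::decode($copy) ? 1 : 0;        # false when not valid UTF-8
}

my $ogham  = looks_like_utf8("Muvrar\xE1\x9A\x9Aa");   # valid 3-byte sequence
my $latin1 = looks_like_utf8("caf\xe9");               # truncated sequence
my $ascii  = looks_like_utf8("plain");

print "ogham=$ogham latin1=$latin1 ascii=$ascii\n";
```

The first string passes the syntax check, which is exactly why the further analysis mentioned above (rejecting implausible script mixtures) is needed on top of it.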

FGasper commented 2 years ago

Leon, how shall I proceed here? The authors test failed again when I pulled my change there out.

I believe previously when this happened Paul just merged regardless. FWIW.

Thank you!

-FG

On Jan 15, 2022, at 16:33, Leon Timmermans @.***> wrote:

@Leont commented on this pull request.

In Porting/checkAUTHORS.pl:

@@ -900,6 +900,7 @@ sub _raw_address {
 eggert\100twinsun.com eggert\100sea.sm.unisys.com
 etj\100cpan.org mohawk2\100users.noreply.github.com
+felipe\100felipegasper.com fgasper\100users.noreply.github.com

None of the commits is assigned to that email address, that makes no sense.


xenu commented 2 years ago

Hopefully #19351, which was just merged, should fix your problem. Could you rebase on the latest blead?

FGasper commented 2 years ago

@xenu Resubbed … 🤞!

xenu commented 2 years ago

It seems it worked everywhere except Windows. I've made a PR to fix it: #19352

FGasper commented 2 years ago

It seems it worked everywhere except Windows. I've made a PR to fix it: #19352

Thank you. I see that that merged, so I’ve rebased again. 🤞

Leont commented 2 years ago

This is a breaking change. If we're OK with that we might just as well fix all the other syscalls that have a similar problem.

Yeah, I think it may be wise to make this a PSC decision because of that.

FGasper commented 2 years ago

This is a breaking change. If we're OK with that we might just as well fix all the other syscalls that have a similar problem.

Yeah, I think it may be wise to make this a PSC decision because of that.

I think it’s less problematic than other syscalls. Consider:

Most of the affected syscalls concern the filesystem, which brings with it a host of character-encoding questions that don’t apply here because—as I understand?—$0 is always a byte string, whereas filesystem names aren’t always bytes (as they are in POSIX OSes).
$0 is usually more of an advisory/debugging mechanism rather than something mission-critical. So while both $0 and exec deal with byte strings exclusively, far more would break if exec were fixed.
This is not too different from similar updates to %ENV, which happened in 5.18 and 5.32. (%ENV is, unlike $0, local to the process but does propagate to child processes.)

That said, I don’t object to PSC being part of the convo. (@rjbs @leonerd @neilb)

Grinnz commented 2 years ago

$0 is usually more of an advisory/debugging mechanism rather than something mission-critical. So while both $0 and exec deal with byte strings exclusively, far more would break if exec were fixed.

Not to disagree but this is unfortunately not always the case, as it's used by FindBin. Which is a big part of why I recommend against FindBin. (all bets are off with FindBin if you assign to $0 beforehand, anyway)

FGasper commented 2 years ago

Not to disagree but this is unfortunately not always the case, as it's used by FindBin. Which is a big part of why I recommend against FindBin. (all bets are off with FindBin if you assign to $0 beforehand, anyway)

Interesting. Regardless, though, I do think the point still stands that changes to $0 are usually more informational than functional.

khwilliamson commented 2 years ago

LGTM

khwilliamson commented 2 years ago

@Perl/perl-steering-council please respond, if only to say you don't intend to get involved

demerphq commented 2 years ago

On Sat, 8 Jan 2022 at 19:33, Felipe Gasper @.***> wrote:

@.**** commented on this pull request.

In t/op/magic.t https://github.com/Perl/perl5/pull/19334#discussion_r780697112:

+# Check that assigning to $0 properly handles upgraded strings:
+{
+    my $eacute = "\N{U+00e9}";
+    utf8::encode($eacute);
+    utf8::upgrade($eacute);
+
+    my $eacute_downgrade = $eacute;
+    utf8::downgrade($eacute_downgrade);
+
+    $0 = $eacute;
+    is ($0, $eacute, 'checked $0');
+
+    SKIP: {
+        skip "Test is for Linux, not $^O" if $^O ne 'linux';
+
+        my $slurp = `cat /proc/$$/cmdline`;

done

I find this code totally bizarre. Can you please explain why you are double encoding this? Anything created with \N{U+..} is already marked as unicode and encoded as utf8.

"\N{U+00E9}" is not the same as chr(0xe9). The former will be utf8-on and consist of two octets at the wire level. The latter will be utf8-off and consist of one octet at the wire level.

Then you encode it as utf8, so in effect you turn off the utf8 flag, and then you upgrade it, thus you have now double encoded the string. Eg, you have taken the octets in the utf8 encoding of codepoint E9 and encoded them as utf8. I cannot understand why you want to do this. Doing so is in my experience always a bug. So much so that my former workplace has a standard function to decode such things recursively which was our almost universal tool for decoding "utf8" data (recurse_decode_utf8).

You then later downgrade the string. It makes very little sense to me reading the code. If this is truly deliberate there should be a big comment explaining why you are doing something this bizarre. If I had to work on this code my prima-facie assumption would be that the code was broken and I would rip it out before I even started debugging.

cheers, Yves

$ perl -MDevel::Peek -le'my $str="\N{U+00e9}"; Dump($str); utf8::encode($str); Dump($str); utf8::upgrade($str); Dump($str)'
SV = PV(0x5563b34affd0) at 0x5563b34d6a90
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x5563b34c6820 "\303\251"\0 [UTF8 "\x{e9}"]
  CUR = 2
  LEN = 10
  COW_REFCNT = 1
SV = PV(0x5563b34affd0) at 0x5563b34d6a90
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x5563b34c6820 "\303\251"\0
  CUR = 2
  LEN = 10
  COW_REFCNT = 1
SV = PV(0x5563b34affd0) at 0x5563b34d6a90
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x5563b34da830 "\303\203\302\251"\0 [UTF8 "\x{c3}\x{a9}"]
  CUR = 4
  LEN = 10

demerphq commented 2 years ago

On Thu, 24 Feb 2022 at 03:29, Karl Williamson @.***> wrote:

@Perl/perl-steering-council https://github.com/orgs/Perl/teams/perl-steering-council please respond, if only to say you don't intend to get involved

I am not on the steering council, but as currently coded I do not agree that it should be merged as is. See my other post. The t/op/magic.t code is either wrong, or it needs a big comment explaining why it is double encoding utf8 deliberately.

Sorry to come to the party so late.

cheers, Yves

Grinnz commented 2 years ago

On Sat, 8 Jan 2022 at 19:33, Felipe Gasper @.> wrote:

@*.** commented on this pull request.

In t/op/magic.t <#19334 (comment)>:

+# Check that assigning to $0 properly handles upgraded strings:
+{
+    my $eacute = "\N{U+00e9}";
+    utf8::encode($eacute);
+    utf8::upgrade($eacute);
+
+    my $eacute_downgrade = $eacute;
+    utf8::downgrade($eacute_downgrade);
+
+    $0 = $eacute;
+    is ($0, $eacute, 'checked $0');
+
+    SKIP: {
+        skip "Test is for Linux, not $^O" if $^O ne 'linux';
+
+        my $slurp = `cat /proc/$$/cmdline`;

done

I find this code totally bizarre. Can you please explain why you are double encoding this? Anything created with \N{U+..} is already marked as unicode and encoded as utf8.

This is inaccurate. It is not marked as unicode in any way, and may or may not be encoded to UTF-8 in internal storage.

"\N{U+00E9}" is not the same as chr(0xe9).

Also inaccurate, these result in indistinguishable strings at the logical level and there is no guarantee whether they will be stored the same or differently - however it is accurate on EBCDIC, where they result in completely different strings, as 0xe9 is translated to U+005A.

demerphq commented 2 years ago

On Thu, 24 Feb 2022 at 05:14, Dan Book @.***> wrote:

On Sat, 8 Jan 2022 at 19:33, Felipe Gasper @.*> wrote:

@.** commented on this pull request.

In t/op/magic.t <#19334 (comment) https://github.com/Perl/perl5/pull/19334#discussion_r780697112>:

+# Check that assigning to $0 properly handles upgraded strings:
+{
+    my $eacute = "\N{U+00e9}";
+    utf8::encode($eacute);
+    utf8::upgrade($eacute);
+
+    my $eacute_downgrade = $eacute;
+    utf8::downgrade($eacute_downgrade);
+
+    $0 = $eacute;
+    is ($0, $eacute, 'checked $0');
+
+    SKIP: {
+        skip "Test is for Linux, not $^O" if $^O ne 'linux';
+
+        my $slurp = `cat /proc/$$/cmdline`;

done

I find this code totally bizarre. Can you please explain why you are double encoding this? Anything created with \N{U+..} is already marked as unicode and encoded as utf8.

This is inaccurate. It is not marked as unicode in any way, and may or may not be encoded to UTF-8 in internal storage.

It is NOT inaccurate. I directly worked on this code in the internals, and anything using U+ named characters ALWAYS returns a utf8-on string. chr() on the other hand returns a utf8-off codepoint for any codepoint under 256. (Modulo possible pragmata in place which might change this.)

Regardless of that any code that does "utf8::encode" followed by "utf8::upgrade" will result in a double encoded string. It HAS to. utf8::encode means "turn this string whatever it is encoded as currently into octets containing its utf8 representation with the utf8 flag OFF", that is what the function and the internals functions are defined to do. utf8::upgrade means "take this string, regardless of how it is currently encoded, and turn it into the equivalent utf8 encoded string with the utf8 flag ON". Doing both to a string will result in a double encoded string. If perl were to change this then just about every piece of code that does C level or wire level interoperability would break. So it will NOT ever ever change without a pragma being in effect.

$ perl -MDevel::Peek -le'my $str=chr(0xe9); Dump($str); utf8::encode($str); Dump($str); utf8::upgrade($str); Dump($str)'
SV = PV(0x55b397a96fd0) at 0x55b397abdab0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x55b397ad81f0 "\351"\0
  CUR = 1
  LEN = 10
SV = PV(0x55b397a96fd0) at 0x55b397abdab0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x55b397ad81f0 "\303\251"\0
  CUR = 2
  LEN = 10
SV = PV(0x55b397a96fd0) at 0x55b397abdab0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x55b397ad81f0 "\303\203\302\251"\0 [UTF8 "\x{c3}\x{a9}"]
  CUR = 4
  LEN = 10

$ perl -MDevel::Peek -le'my $str="\N{U+00e9}"; Dump($str); utf8::encode($str); Dump($str); utf8::upgrade($str); Dump($str)'
SV = PV(0x561a039b0fd0) at 0x561a039d79d0
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x561a039e8300 "\303\251"\0 [UTF8 "\x{e9}"]
  CUR = 2
  LEN = 10
  COW_REFCNT = 1
SV = PV(0x561a039b0fd0) at 0x561a039d79d0
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x561a039e8300 "\303\251"\0
  CUR = 2
  LEN = 10
  COW_REFCNT = 1
SV = PV(0x561a039b0fd0) at 0x561a039d79d0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x561a039db770 "\303\203\302\251"\0 [UTF8 "\x{c3}\x{a9}"]
  CUR = 4
  LEN = 10
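The same sequence can also be inspected without Devel::Peek. The sketch below compares the characters that string operators see with the octet count of the internal buffer; the `use bytes` block is used here only as a way to peek at storage and is not something recommended for general code:

```perl
use strict;
use warnings;

my $str = "\N{U+00e9}";
utf8::encode($str);                 # two characters: 0xC3 0xA9
my $before = join ' ', map { sprintf '%02X', ord } split //, $str;

utf8::upgrade($str);                # internal storage changes
my $after  = join ' ', map { sprintf '%02X', ord } split //, $str;

# Octets in the internal buffer, as in the CUR field of the dumps.
my $stored = do { use bytes; length $str };

print "characters before upgrade: $before\n";   # C3 A9
print "characters after upgrade:  $after\n";    # C3 A9
print "internal octets after upgrade: $stored\n";   # 4
```

The buffer grows to four octets, matching CUR = 4 in the dumps, while split, ord, and length still see the same two characters.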

"\N{U+00E9}" is not the same as chr(0xe9).

Also inaccurate, these result in indistinguishable strings at the logical level and there is no guarantee whether they will be stored the same or differently.

Again, it is NOT inaccurate, it is precise AND accurate:

$ perl -MDevel::Peek -le'Dump("\N{U+00e9}"); Dump(chr(0xE9));'
SV = PV(0x55912fb031a0) at 0x55912fb29a50
  REFCNT = 1
  FLAGS = (POK,IsCOW,READONLY,PROTECT,pPOK,UTF8)
  PV = 0x55912fb19820 "\303\251"\0 [UTF8 "\x{e9}"]
  CUR = 2
  LEN = 10
  COW_REFCNT = 0
SV = PV(0x55912fb03220) at 0x55912fb29ab0
  REFCNT = 1
  FLAGS = (PADTMP,POK,READONLY,PROTECT,pPOK)
  PV = 0x55912fb39f40 "\351"\0
  CUR = 1
  LEN = 10

If that code was to be agnostic to whether the strings started off as utf8-on or not then it should simply be:

utf8::encode($str)

and that is it. No utf8::upgrade at all, utf8::encode implies utf8::upgrade.

$ perl -MDevel::Peek -le'my $str=chr(0xe9); Dump($str); utf8::encode($str); Dump($str)'
SV = PV(0x559b7b209fd0) at 0x559b7b230a80
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x559b7b2347f0 "\351"\0
  CUR = 1
  LEN = 10
SV = PV(0x559b7b209fd0) at 0x559b7b230a80
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x559b7b2347f0 "\303\251"\0
  CUR = 2
  LEN = 10

$ perl -MDevel::Peek -le'my $str="\N{U+00E9}"; Dump($str); utf8::encode($str); Dump($str)'
SV = PV(0x557721ee8fd0) at 0x557721f0fa70
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x557721f4b600 "\303\251"\0 [UTF8 "\x{e9}"]
  CUR = 2
  LEN = 10
  COW_REFCNT = 1
SV = PV(0x557721ee8fd0) at 0x557721f0fa70
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x557721f4b600 "\303\251"\0
  CUR = 2
  LEN = 10
  COW_REFCNT = 1
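The two dumps above can be condensed into a direct comparison. A short sketch, with variable names chosen only for illustration:

```perl
use strict;
use warnings;

my $from_chr   = chr(0xe9);        # starts without the UTF8 flag
my $from_named = "\N{U+00e9}";     # starts with the UTF8 flag
utf8::encode($_) for $from_chr, $from_named;

my $same  = $from_chr eq $from_named ? 'yes' : 'no';
my $bytes = join ' ', map { sprintf '%02X', ord } split //, $from_chr;

print "identical after utf8::encode: $same\n";   # yes
print "bytes: $bytes\n";                         # C3 A9
```

Whichever representation the string starts in, utf8::encode alone yields the same two octets with the flag off.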

These things cannot change with a pragma, and will not ever change without a pragma, or tons and tons and tons of code would break totally. The code that we use to build perl itself would break totally even. Perl is so backwards compatibility focused that it is utterly unthinkable to someone well versed in the internals like me to think that they will ever change. The breakage would be massive. As I said, we couldn't even build perl itself if these things changed. The macros we use to match utf8 and non-utf8 character types and what not were written by me and use these features directly.

I wish you would stop trying to correct me about the internals with regard to unicode when you are provably and demonstrably wrong. You seem to suffer from misconceptions about how the internals actually work (by default at least). You also seem to be unaware that I worked on the internals of a lot of this stuff over the years, I literally in some cases wrote the code that you are "correcting" me on. Please next time you want to argue with me about this stuff please consult the docs and Devel::Peek first and include output from it to prove me wrong. I really am getting to the point of thinking you are trolling me.

Thanks, Yves

Grinnz commented 2 years ago

I appreciate your view but I am not trolling you, you are focusing on how internals work, which is not relevant to a discussion of how Perl is designed. What you speak of can be changed at any time without backwards compatibility concerns because it is not defined behavior, it is the current implementation.

Grinnz commented 2 years ago

utf8::upgrade means "take this string, regardless of how it is currently encoded, and turn it into the equivalent utf8 encoded string with the utf8 flag ON".

This is not correct, you are talking about what happens in the internal bytes, which does not correspond to the value of the string except by use of the UTF8 bit - this can be demonstrated by using any non-bugged string operator in Perl.

use strict;
use warnings;
my $str = "\xe9";
utf8::upgrade $str;
$str eq "\xe9"; # true

Grinnz commented 2 years ago

I do apologize for not including example code each time, there is only so much time in the day. But unless the discussion is about how strings work internally, the correct behavior I refer to is based on how string operators logically behave.

demerphq commented 2 years ago

On Thu, 24 Feb 2022 at 06:26, Dan Book @.***> wrote:

I appreciate your view but I am not trolling you, you are focusing on how internals work, which is not relevant to a discussion of how Perl is designed. What you speak of can be changed at any time without backwards compatibility concerns because it is not defined behavior, it is the current implementation.

No it can't be changed . We can't even remove a highly esoteric feature like empty regex match; we sure as heck can't change what chr() returns.

I don't know how to get this across clearly, you seem to think we can change things we absolutely cannot change. You seem to think that some of the generalizations about how perl programmers should think about how perl strings work are the defined behavior of perl when they aren't and without realizing there are tens of thousands of scripts out there expecting it to work in very specific ways that are contradictory to what you say. We won't break those scripts. We had a riot when we introduced the idea of making strict always on, that would pale in comparison to the riot that would happen when we change what chr() returns.

We have tons of tests that are based on the assumption that certain sequences of code result in non-utf8 strings or utf8-strings, those tests specify many of the things I have talked about. You seem to take the position that because the docs don't specify that chr(0xDF) produces a non-utf8 string that you can treat as something that we can just change. I guarantee you there is code that tests that it does exactly what I said. Those tests trump the lack of documentation of these issues. They trump it very hard.

cheers, Yves

demerphq commented 2 years ago

On Thu, 24 Feb 2022 at 06:30, Dan Book @.***> wrote:

utf8::upgrade means "take this string, regardless of how it is currently encoded, and turn it into the equivalent utf8 encoded string with the utf8 flag ON".

This is not correct, you are talking about what happens in the internal bytes, which does not correspond to the value of the string except by use of the UTF8 bit - this can be demonstrated by using any non-bugged string operator in Perl.

use strict;
use warnings;
my $str = "\xe9";
utf8::upgrade $str;
$str eq "\xe9"; # true

This code doesn't demonstrate what you think it does.

We define the eq (and other string comparison operators and our hash functions) to hide the difference between physical representation of strings and treat them as logically equivalent. The upgrade function does exactly what I said and is documented to do so as well. Please go and read the relevant docs. here they are:

*   "$num_octets = utf8::upgrade($string)"

    (Since Perl v5.8.0) Converts in-place the internal representation of
    the string from an octet sequence in the native encoding (Latin-1 or
    EBCDIC) to UTF-8. The logical character sequence itself is
    unchanged. If *$string* is already upgraded, then this is a no-op.
    Returns the number of octets necessary to represent the string as
    UTF-8.

That says with slightly more words exactly what I said. Please stop arguing with me until you have actually checked that you are right. If you can produce documentation and tests and Devel::Peek output that shows that something I have said is wrong I am happy to hear it, but until then please just leave me alone and stop trolling me.

Thanks, Yves

FGasper commented 2 years ago

@demerphq The crux of the matter is this statement of yours:

"\N{U+00E9}" is not the same as chr(0xe9).

It depends on what you mean by “the same as”.

For you, a Jedi Master who maintains Perl’s internals, your statement is true: the \N creates an SV whose PV is Unicode U+00E9, encoded to UTF-8, while chr 0xe9 creates an SV whose PV is the same character encoded to Latin-1. You define “the same as” to include Perl’s internals, such as Devel::Peek shows.

For Perl users, however, your statement is false, or else "\N{U+00e9}" eq chr 0xe9 would be falsy (everywhere, not just on EBCDIC systems). For Perl users a string is just an opaque sequence of code points; the internal details that Devel::Peek shows aren’t of practical concern.
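A small sketch of the two viewpoints side by side. The flag values printed on the last line reflect the current implementation as shown in the Devel::Peek output earlier in this thread, so they are printed here rather than asserted:

```perl
use strict;
use warnings;

my $named = "\N{U+00e9}";
my $chr   = chr 0xe9;

# The Perl user's view: string operators and hash keys see one value.
my $equal    = $named eq $chr ? 1 : 0;
my %seen     = ($named => 1);
my $same_key = exists $seen{$chr} ? 1 : 0;

# The internals view: the representation may differ.
my @flags = map { utf8::is_utf8($_) ? 'on' : 'off' } $named, $chr;

print "eq: $equal, same hash key: $same_key\n";
print "UTF8 flag: named=$flags[0] chr=$flags[1]\n";
```

Code that behaves differently depending on the second line while the first line reports equality is the kind of leak this PR is about.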

… except in the case of bugs like this one, #18636, and what Chip S. fixed in 613c63b. In those cases, Perl’s raw PV leaked to the environment, forcing the Perl programmer to consider Perl internals. The correct behaviour is what print() does: if the string is UTF8-flagged, then send its downgraded form to the OS, not its raw PV. The latter two cases fixed %ENV to do this in 5.18 and 5.32; this one proposes to fix $0.

Of note: $0 is, as far as I can tell, the last place in Perl where a Perl programmer has to care about string internals. After this fix, with this header:

use feature qw(unicode_strings unicode_eval);
use Sys::Binmode;

… the resulting Perl block will enjoy a fully-working Unicode abstraction: $0, %ENV, and all of the various “non-Unicode-aware” built-ins (open, exec, etc.).

FGasper commented 2 years ago

@demerphq Is it documented that \N{....} always creates a UTF8-flagged string?

FGasper commented 2 years ago

The t/op/magic.t code is either wrong, or it needs a big comment explaining why it is double encoding utf8 deliberately.

I’ve updated the tests to explain why this happens. I also removed the double-encode; instead it now just writes code point 0xe9.

khwilliamson commented 2 years ago

On 2/24/22 07:07, Felipe Gasper wrote:

@demerphq Is it documented that \N{....} always creates a UTF8-flagged string?

I don't remember if it is documented, but I'm the one who added this, and it has to be this way, though there was some consternation from others at the time. Unless and until 'unicode_strings' can't be turned off, making it UTF-8 is the only way a string can consistently follow Unicode rules across all modules that it may encounter during its life. Hence not doing this could cause spooky action at a distance bugs.

FGasper commented 2 years ago

making it UTF-8 is the only way a string can consistently follow Unicode rules across all modules that it may encounter during its life.

So the idea was to make \N{U+00e9} resilient against that kind of inconsistency in ways that \xe9 is not?

Of course, it comes at the price of other apparent inconsistencies, e.g.:

> perl -e'exec echo => "\N{U+00e9}"' | xxd
00000000: c3a9 0a

> perl -e'exec echo => "\xe9"' | xxd
00000000: e90a

> perl -e'print "\N{U+00e9}" eq "\xe9"'
1

xenu commented 2 years ago

@demerphq: please stop calling people who discuss in good faith "trolls", especially when you're being confidently wrong.

Anyway...

I find this code totally bizarre. Can you please explain why you are double encoding this? Anything created with \N{U+..} is already marked as unicode and encoded as utf8.

"Encoded as utf8" is ambiguous. "Has the utf8 flag" or, ideally, "upgraded" are more precise. It's ambiguous, because the term "utf8 encoded" is usually applied to the output of utf8::encode and similar functions.

"\N{U+00E9}" is not the same as chr(0xe9). The former will be utf8-on and consist of two octets at the wire level. The latter will be utf8-off and consist of one octet at the wire level.

It's true they use different internal representations, but they are logically fully equivalent and only buggy code differentiates between them. Fixing one instance of such buggy code is the purpose of this ticket.

Then you encode it as utf8, so in effect you turn off the utf8 flag, and then you upgrade it, thus you have now double encoded the string.

Nothing is double encoded. Upgrading a string doesn't change its logical contents. Before and after upgrading, the string consists of the same two codepoints: U+00C3 and U+00A9.

Perl strings always consist of codepoints. The internal representation (the UTF8 flag) is an implementation detail and it matters only in buggy code. This kind of bugs is what we call "the Unicode bug". The feature unicode_strings was created to fix the Unicode bug in many of the builtins, but unfortunately, syscalls such as open or $0 (which this ticket attempts to fix) remain broken.
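One well-known instance of the Unicode bug can be reproduced with uc. The following is a sketch assuming an ASCII platform and no locale pragma in effect; the variable names are illustrative only:

```perl
use strict;
use warnings;

my $down = "\xe9";            # e-acute, stored downgraded
my $up   = "\xe9";
utf8::upgrade($up);           # same character, stored upgraded

my ($legacy_down, $legacy_up, $fixed_down, $fixed_up);
{
    no feature 'unicode_strings';
    $legacy_down = uc $down;  # byte semantics: stays E9
    $legacy_up   = uc $up;    # Unicode semantics: becomes C9
}
{
    use feature 'unicode_strings';
    $fixed_down = uc $down;   # C9
    $fixed_up   = uc $up;     # C9
}

printf "without feature: %02X vs %02X\n", ord $legacy_down, ord $legacy_up;
printf "with feature:    %02X vs %02X\n", ord $fixed_down,  ord $fixed_up;
```

Two strings that compare equal give different results without the feature and identical results with it, which is the class of behaviour the feature was introduced to remove.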

Of course, it is true that fixing the behaviour of $0 is a breaking change, so it's up to PSC to decide whether this PR should be merged.

PS. Your persistent confusion about the Perl's Unicode model is one of the reasons why I think renaming the UTF8 flag would be a good change.

demerphq commented 2 years ago

On Thu, 24 Feb 2022 at 18:48, Karl Williamson @.***> wrote:

On 2/24/22 07:07, Felipe Gasper wrote:

@demerphq Is it documented that \N{....} always creates a UTF8-flagged string?

I don't remember if it is documented, but I'm the one who added this, and it has to be this way, though there was some consternation from others at the time. Unless and until 'unicode_strings' can't be turned off, making it UTF-8 is the only way a string can consistently follow Unicode rules across all modules that it may encounter during its life. Hence not doing this could cause spooky action at a distance bugs.

FWIW, I believe it was always true of \N{U+...} from the earliest days, I remember dealing with it myself before you were on the scene. Note, I'm not contradicting you about \N{ ... } in general, I have a hazy recollection that aligns with what you say here. But the U+ case is to me a no-brainer, as U+... is a unicode notation, so having it NOT return a unicode codepoint would be weird as heck, and I definitely remember dealing with it before you started working on perl. If you check 5.8.0 you will find it was first released in that version, and was implemented via logic in the Unicode namespace, so if it didn't return a unicode codepoint it would be weird.

chr() on the other hand traces back to early perl, and predates Unicode entirely, thus it had to return a non-utf8 octet for part of its history as it existed before utf8 did. This is also why it is something we could not and can not change as default behavior (pragmata can change things of course). If you were constructing a data packet to go over the wire in perl 5.6 it would have been a gross backwards compatibility violation if it started returning multiple octets for codepoints 128-255 in perl 5.8.0 which was when Jarkko introduced Unicode. I remember writing Inline::C code in 5.6 that used chr() to construct data structures in perl that would then be processed by Inline::C code for operational use, if 5.8.0 had broken code like that it would have been a major problem.

Cheers, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

Grinnz commented 2 years ago

chr() on the other hand traces back to early perl, and predates Unicode entirely, thus it had to return a non-utf8 octet for part of its history as it existed before utf8 did. This is also why it is something we could not and can not change as default behavior (pragmata can change things of course). If you were constructing a data packet to go over the wire in perl 5.6 it would have been a gross backwards compatibility violation if it started returning multiple octets for codepoints 128-255 in perl 5.8.0 which was when Jarkko introduced Unicode. I remember writing Inline::C code in 5.6 that used chr() to construct data structures in perl that would then be processed by Inline::C code for operational use, if 5.8.0 had broken code like that it would have been a major problem.

It would not, because if different data is sent over the wire based on whether the string is upgraded or downgraded, that is a bug.

Grinnz commented 2 years ago

my $str = "\xe9";
print $str;
utf8::upgrade $str;
print $str;

This code prints two E9 bytes, barring layers on STDOUT. The stored bytes in the string are only relevant to the contents of the string via the UTF8 bit; these strings are equal, and must be treated as such by all non-bugged code.
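To check that claim without looking at a terminal, the same two prints can be sent to an in-memory filehandle. A minimal sketch, assuming no encoding layer on the handle:

```perl
use strict;
use warnings;

my $str = "\xe9";
my $buf = '';
open my $fh, '>', \$buf or die "cannot open in-memory handle: $!";

print {$fh} $str;          # downgraded representation
utf8::upgrade($str);
print {$fh} $str;          # upgraded representation, same character

close $fh;

my $written = join ' ', map { sprintf '%02X', ord } split //, $buf;
print "bytes written: $written\n";   # E9 E9
```

Both prints contribute a single E9 byte, so the output does not reveal which internal representation the string had.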

Grinnz commented 2 years ago

(unrelated, 5.8.0 was rather broken in this regard, I recommend using 5.8.1 as the earliest somewhat sane point)

demerphq commented 2 years ago

On Fri, 25 Feb 2022 at 02:17, xenu wrote:

@demerphq: please stop calling people who discuss in good faith "trolls", especially when you're being confidently wrong.

I don't feel it was in good faith. It directly ignored my request to quote docs and to quote Devel::Peek output. It also told me that I was wrong in my description of what utf8::upgrade does, when my description was essentially a paraphrase of the docs.

Anyway...

I find this code totally bizarre. Can you please explain why you are double encoding this? Anything created with \N{U+..} is already marked as unicode and encoded as utf8.

"Encoded as utf8" is ambiguous. "Has the utf8 flag" or, ideally, "upgraded" are more precise. It's ambiguous, because the term "utf8 encoded" is usually applied to the output of utf8::encode and similar functions.

I meant it was internally encoded as utf8, in other words it did not need to be upgraded as it was already upgraded. Part of the point of my comment was that that code is confusing and unusual and as it included no comments why it would do something so weird I could not discern his intent. Taking a codepoint, and then encoding it as utf8 (in the sense of utf8::encode) and then upgrading it is in my experience a super weird thing to do. Which I said in my original post.

Your experience might be that "utf8 encoded" is "usually" applied in specific circumstances. I am virtually certain the scope of circumstances I have been dealing with these issues is vastly larger than yours, after all my first commit to Perl is from 20 years ago (yours is from 5 years ago), and I have worked on very tricky logic in the regex engine (as far as I can tell you have not) where unicode matters a lot, and I have worked on pretty much every perl since 5.6.1 and I worked in the largest Perl shop in the world for a very long time (and as far as I know you did not). So my guess is I have used that phrase and heard that phrase and dealt with these issues a whole lot more than you have.

"\N{U+00E9}" is not the same as chr(0xe9). The former will be utf8-on and consist of two octets at the wire level. The latter will be utf8-off and consist of one octet at the wire level.

It's true they use different internal representations, but they are logically fully equivalent and only buggy code differentiates between them. Fixing one instance of such buggy code is the purpose of this ticket.

Actually they would be best described as "mostly logically equivalent". It seems unreasonable to me to consider something "logically fully equivalent" when some simple test code demonstrates they aren't:

$ perl -le'print chr(0xDF)=~/ss/i ? "yes" : "no"'
no
$ perl -le'print "\N{U+DF}"=~/ss/i ? "yes" : "no"'
yes

You are welcome to say that this suffers from "the unicode bug" if you wish, but I personally am not going to just ignore well defined behavior as "being buggy" because it doesn't match my mental model. You can if you wish, but I won't.

Also I am aware of the intent of this ticket. My involvement is that saying:

my $str = "\N{U+E9}"; utf8::encode($str); utf8::upgrade($str);

is a very unusual thing to do, and typically leads to double encoding issues, and if it is deliberate it should have a comment explaining why someone would deliberately do something that weird. Heck, calling utf8::encode($str) is a pretty weird thing to do; most people would call Encode::encode_utf8(), and ONLY do so when they plan for the data to egress to another system immediately. At the same time I tried to point out to Felipe that saying "\N{U+E9}" results in a utf8-on string, and saying chr(0xE9) does not. Yet that also was argued with, despite the clear facts visible from Devel::Peek.

Then you encode it as utf8, so in effect you turn off the utf8 flag, and then you upgrade it, thus you have now double encoded the string.

Nothing is double encoded. Upgrading a string doesn't change its logical contents. Before and after upgrading, the string consists of the same two codepoints: U+00C3 and U+00A9.

You and I seem to have different ideas of what double encoded means. This code is a perfect example of what I would call double encoded. To me and in my experience double encoded means treating the octets that make up the utf8 encoding of a codepoint as codepoints themselves, which is what I consider is happening when they are "upgraded" to unicode.

$ perl -MDevel::Peek -le'my $str="\xE9"; Dump($str); utf8::encode $str; Dump($str); utf8::upgrade $str; Dump($str)'
SV = PV(0x56399ef6aeb8) at 0x56399ef9b8b8
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x56399efa7728 "\351"\0
  CUR = 1
  LEN = 10
  COW_REFCNT = 1
SV = PV(0x56399ef6aeb8) at 0x56399ef9b8b8
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x56399efb1a28 "\303\251"\0
  CUR = 2
  LEN = 10
SV = PV(0x56399ef6aeb8) at 0x56399ef9b8b8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x56399efb1a28 "\303\203\302\251"\0 [UTF8 "\x{c3}\x{a9}"]
  CUR = 4
  LEN = 10

If that final string is output "as unicode" and printed through encode_utf8, which it should be if it is output to a utf8 terminal, then it will be double encoded on the wire. It will NOT represent 0xE9 the original codepoint, it will be the utf8 encoding of C3 A9, which is C3 83 C2 A9. THAT is double encoding. You see the process? E9 -> C3 A9 -> C3 83 C2 A9. Boom double encoded.
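The chain E9 -> C3 A9 -> C3 83 C2 A9 can be reproduced with a short sketch. `hex_of` is a hypothetical helper added for illustration. The sketch also shows that utf8::upgrade on its own leaves the characters of the string unchanged, and that the extra octets appear only once the two-character string is encoded a second time.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: render each character of a string as two hex digits.
sub hex_of { join ' ', map { sprintf('%02X', ord($_)) } split //, $_[0] }

my $char = "\xE9";                # one character, U+00E9
my $once = encode_utf8($char);    # two octets
my $up   = $once;
utf8::upgrade($up);               # internal storage changes, characters do not
my $twice = encode_utf8($up);     # encoding the two characters again

print "original character: ", hex_of($char),  "\n";
print "encoded once:       ", hex_of($once),  "\n";
print "after upgrade:      ", hex_of($up),    "\n";
print "encoded twice:      ", hex_of($twice), "\n";
```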

Perl strings always consist of codepoints.

At the logical level yes. But utf8::encode and Encode::encode_utf8 allow you to translate codepoints to octets. Which then allows double encoding.

The internal representation (the UTF8 flag) is an implementation detail and it matters only in buggy code.

This is the perl internal development list where we fix bugs related to unicode; it matters a lot here. And I disagree, it matters a lot to a lot of code. I have worked on and debugged lots of code that gets unicode wrong throughout my career. Usually through dumb errors like encoding and then upgrading things that shouldn't have received either treatment, along with a whole host of other scenarios.

PS. Your persistent confusion about the Perl's Unicode model is one of the reasons why I think renaming the UTF8 flag would be a good change.

We are at an impasse, I think you and Dan want to believe in a mental model that doesn't reflect long standing tradition in perl, and you think I am confused. I have more than 20 years experience doing Perl, and I am comfortable with my mental model of how Perl works. I also think it is a bit rich to complain about me considering Dan's bogus "corrections" trolling, but you think it's ok to describe me as "persistently confused" yet some simple code samples show that things you consider "logically fully equivalent" produce different results... That seems a bit hypocritical frankly.

Anyway, I am done with this thread. Please don't bother replying. If you want to discuss other things in other threads I am happy to, but I don't want to discuss this subject with you or Dan anymore.

Thanks, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

FGasper commented 2 years ago

@demerphq I’ve altered the tests to remove the internal double-encode. It should now be clearer. Would you please recheck it?

Taking a codepoint, and then encoding it as utf8 (in the sense of utf8::encode) and then upgrading it is in my experience a super weird thing to do.

It’s definitely weird. It’s also valid. Given JSON decoders’ default behaviour of outputting upgraded strings, it’s also commonplace at my own $work. Consider:

> perl -MCpanel::JSON::XS -MDevel::Peek -e'Dump( Cpanel::JSON::XS->new->decode(qq<["\xc3\xa9"]>)->[0] )'
SV = PV(0x64b670) at 0x6638f0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x6d96f0 "\303\203\302\251"\0 [UTF8 "\x{c3}\x{a9}"]
  CUR = 4
  LEN = 10

$ perl -le'print chr(0xDF)=~/ss/i ? "yes" : "no"'
no
$ perl -le'print "\N{U+DF}"=~/ss/i ? "yes" : "no"'
yes

You are welcome to say that this suffers from "the unicode bug" if you wish, but I personally am not going to just ignore well defined behavior as "being buggy" because it doesn't match my mental model. You can if you wish, but I won't.

perlre.pod says this is the Unicode Bug, too. Per that document, all new code should use the unicode_strings feature—which the feature bundle has included for a long time—to avoid this. Thus:

perl -lE'print chr(0xDF)=~/ss/i ? "yes" : "no"'
yes

You have clearly forgotten more about Perl internals—the regexp engine in particular—than most of us will ever know. But I do respectfully think that the model you’re applying diverges from both Perl’s documentation and the “modern Perl” that the feature bundle implicitly encourages.

Grinnz commented 2 years ago

perlre.pod says this is the Unicode Bug, too. Per that document, all new code should use the unicode_strings feature—which the feature bundle has included for a long time—to avoid this. Thus:
perl -lE'print chr(0xDF)=~/ss/i ? "yes" : "no"'
yes

Indeed, this is the Unicode bug introduced by regex behavior relying upon the upgraded/downgraded status of the string, avoided by either use of the unicode_strings feature or the /u or /a modifiers. The madness of the default behavior is extensively detailed here, and you can see there are far more ways to accidentally get different behavior than to pass an upgraded string (blead docs linked because the wording was recently fixed): https://perldoc.perl.org/blead/perlre#/d
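As a rough sketch of the difference, the same character can be matched in both storage forms under explicit /d and /u modifiers. The explicit /d is used here only so that the result does not depend on which features happen to be enabled; the variable names are illustrative.

```perl
use strict;
use warnings;

my $down = chr(0xDF);     # U+00DF (sharp s), downgraded storage
my $up   = $down;
utf8::upgrade($up);       # same character, upgraded storage

my %result;
# /d: legacy default rules, where the outcome depends on internal storage.
$result{d_down} = ($down =~ /ss/id) ? "yes" : "no";
$result{d_up}   = ($up   =~ /ss/id) ? "yes" : "no";
# /u: Unicode rules regardless of internal storage.
$result{u_down} = ($down =~ /ss/iu) ? "yes" : "no";
$result{u_up}   = ($up   =~ /ss/iu) ? "yes" : "no";

print "/d downgraded: $result{d_down}\n";
print "/d upgraded:   $result{d_up}\n";
print "/u downgraded: $result{u_down}\n";
print "/u upgraded:   $result{u_up}\n";
```

Under /d the two lines should disagree, matching the one-liners quoted earlier in the thread; under /u they should agree.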

Grinnz commented 2 years ago

If that final string is output "as unicode" and printed through encode_utf8, which it should be if it is output to a utf8 terminal, then it will be double encoded on the wire. It will NOT represent 0xE9 the original codepoint, it will be the utf8 encoding of C3 A9, which is C3 83 C2 A9. THAT is double encoding. You see the process? E9 -> C3 A9 -> C3 83 C2 A9. Boom double encoded.

I'm not sure of the relevance of this statement. Of course encoding an already encoded string would be double encoding. The internal status of the string is irrelevant to this; it would also occur if you printed a downgraded encoded string to a handle with an encoding layer.

Grinnz commented 2 years ago

Anyway, I am done with this thread. Please don't bother replying. If you want to discuss other things in other threads I am happy to, but I don't want to discuss this subject with you or Dan anymore.

You are free to abstain from whatever discussions you wish, but the conditions of correcting the misinformed model you present cannot be "finding yet another person with a sufficient understanding of the Perl string model, the persistence to sustain such an argument, and willing to be called a troll."

rjbs commented 2 years ago

@FGasper @khwilliamson You asked for a PSC weigh-in. LeoNerd has already +1'd. I think this looks right to me, but would like you to confirm my understanding:

the character value in the Perl-space string will be written into the octet buffer of $0
this means that the string "\x20" . "\xFF" will write those two bytes to $0, even if the string's SV was SvUTF8 and the underlying byte array is "\x20\xCF\xBF"
if the string contains the character \x{1f480} then there will be a "wide" warning and the octets F0 9F 92 80 will end up in $0

FGasper commented 2 years ago

@rjbs Correct on all points.

("\xff" is "\xc3\xbf", not "\xcf\xbf", in UTF-8 … which I only mention to save anyone else potential confusion.)
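The three points confirmed above can be approximated in pure Perl. This is only a sketch of the described behaviour, not the C code in the pull request: `octets_for_dollar_zero` and `hex_of` are hypothetical helpers, and the sketch returns a flag instead of emitting the actual warning.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the octets that the described behaviour
# would write for a Perl string, plus a flag that is true when the string
# held a wide character (the case that would trigger the warning).
sub octets_for_dollar_zero {
    my ($name) = @_;                     # work on a copy of the caller's string
    my $wide = 0;
    if (!utf8::downgrade($name, 1)) {    # 1 = FAIL_OK: return false, don't die
        $wide = 1;
        utf8::encode($name);             # fall back to the UTF-8 octets
    }
    return ($name, $wide);
}

# Hypothetical helper: render each octet as two hex digits.
sub hex_of { join ' ', map { sprintf('%02X', ord($_)) } split //, $_[0] }

my $upgraded = "\x20\xFF";
utf8::upgrade($upgraded);                # stored internally as 20 C3 BF

my ($plain_octets, $plain_wide) = octets_for_dollar_zero($upgraded);
my ($wide_octets,  $wide_flag)  = octets_for_dollar_zero("\x{1f480}");

print hex_of($plain_octets), " (wide: $plain_wide)\n";
print hex_of($wide_octets),  " (wide: $wide_flag)\n";
```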

To illustrate, this is status quo:

> perl -MDevel::Peek -e'my $v = "foo-é"; utf8::upgrade($v); $0 = $v; print `ps aux | grep foo | grep -v grep`'
felipe           27089   0.0  0.0  4440512   2908 s003  S+    9:14AM   0:00.01 foo-Ã©

With this branch:

> ./perl -Ilib -MDevel::Peek -e'my $v = "foo-é"; utf8::upgrade($v); $0 = $v; print `ps aux | grep foo | grep -v grep`'
felipe           27106   0.3  0.0  4449088   3088 s003  S+    9:15AM   0:00.01 foo-é

Per your third point: wide characters are treated the same way as now, but with the additional warning. Status quo:

> perl -MDevel::Peek -e'my $v = "foo-\x{1f480}"; $0 = $v; print `ps aux | grep foo | grep -v grep`'
felipe           27118   0.0  0.0  4460992   3016 s003  S+    9:16AM   0:00.01 foo-💀

This branch:

> ./perl -Ilib -MDevel::Peek -e'my $v = "foo-\x{1f480}"; $0 = $v; print `ps aux | grep foo | grep -v grep`'
Wide character in $0 at -e line 1.
felipe           27127   2.3  0.0  4438848   3044 s003  S+    9:16AM   0:00.01 foo-💀

Perl / perl5

Properly handle UTF8-flagged strings when assigning to $0. #19334

@.**** commented on this pull request.