Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.86k stars 527 forks source link

Properly handle UTF8-flagged strings when assigning to $0. #19334

Closed FGasper closed 2 years ago

FGasper commented 2 years ago

Issue #19331: Use of SvPV_const and SvPV_force in S_set_dollarzero() wrote the PV internals directly to argv, which causes an improper UTF-8 encode if the SV is UTF8-flagged/upgraded.

This fixes that doing a downgrade prior to those SvPV* calls. If the string contains wide characters (and thus cannot be downgraded) a warning is thrown; this mirrors preexisting behavior with %ENV, print, and other output channels that convert Perl SVs to bytes.

FGasper commented 2 years ago

Assuming this is OK to merge, would it still be possible to get this into 5.36?

In my upcoming presentation I’d like to be able to say that 5.36 offers, for the first time, a fully-functional Unicode abstraction in Perl. This is the last “piece of the puzzle” for me to be able to say that.

haarg commented 2 years ago

I find this code totally bizarre. Can you please explain why you are double encoding this? Anything created with \N{U+..} is already marked as unicode and encoded as utf8. "\N{U+00E9}" is not the same as chr(0xe9). The former will be utf8-on and consist of two octets at the wire level. The latter will be utf8-off and consist of one octet at the wire level. Then you encode it as utf8, so in effect you turn off the utf8 flag, and then you upgrade it, thus you have now double encoded the string. Eg, you have taken the octets in the utf8 encoding of codepoint E9 and encoded them as utf8. I cannot understand why you want to do this. Doing so is in my experience always a bug. So much so that my former workplace has a standard function to decode such things recursively which was our almost universal tool for decoding "utf8" data (recurse_decode_utf8).

As a point of reference, this is the routine used at Booking:

our $_tail = "\1\1\1\1\1";

sub recurse_decode_utf8_inplace {
    foreach my $s (@_) {
        next if ref $s or not defined $s;
        Encode::_utf8_off($s);
        # work around a bug in Encode - if the last item is a valid start sequence byte
        # it silently swallows it when it should be trated as its latin 1 equivalent.
        # Specifically try: "ba\x{df}" versus "ba\x{df}\x{df}".
        # So we add on a byte that can not be a valid start/interim byte, and then afterwards
        # pull it off. - Yves
        $s .= $_tail;   # do NOT use "\0" here! Our perl doesnt chop strings ending in a utf8 null correctly.

        # remember the old string
        my $old_s= "";
        while (length($s) != length($old_s)) {
            $old_s= $s;
            eval {
                my $octets= $s;
                $s= decode_utf8( $octets, Encode::FB_CROAK );
                $s= $old_s if substr($s, -1, 1) ne "\1";
                1;
            } or do {
                last;
            } unless utf8::is_utf8($s);

            eval {
                my $utf8= $s;
                $s= encode( LATIN1, $utf8, Encode::FB_CROAK);
                1;
            } or do {
                last;
            };
        }
        $s=~s/\Q$_tail\E\z//o or warn "Possibly corrupted string decode: $s\n";
        utf8::upgrade($s); # this is a no-op if the string is already utf8, so dont bother guarding it.
    }
    return wantarray ? @_ : $_[0];
}
FGasper commented 2 years ago

@haarg What is this code meant to do? And how does it relate to this bug/PR?

haarg commented 2 years ago

I'm providing a copy of the routine @demerphq referenced in his post.

FGasper commented 2 years ago

I'm providing a copy of the routine @demerphq referenced in his post.

Ah OK. 👍

Just to reiterate, though: the test no longer does the encode-then-upgrade stuff.

khwilliamson commented 2 years ago

On 3/5/22 20:13, Felipe Gasper wrote:

@.**** commented on this pull request.


In t/op/magic.t https://github.com/Perl/perl5/pull/19334#discussion_r820173457:

@@ -433,6 +433,77 @@ EOP } }

+# Check that assigning to $0 properly handles UTF-8-stored strings: +{ +

  • We want $e9_with_utf8_pv’s PV to be UTF-8-encoded because we need to

  • test that Perl translates UTF-8-stored code points to plain octets when

  • assigning to $0.

  • #
  • my $e9_with_utf8_pv = "\xe9";
  • utf8::upgrade($e9_with_utf8_pv);
  • This will be the same logical code point as $e9_with_utf8_pv, but

  • implemented in Perl internally as a raw byte rather than UTF-8.

  • #
  • my $e9_with_plain_pv = $e9_with_utf8_pv;
  • utf8::downgrade($e9_with_plain_pv);

I didn’t want to assume that |$foo = "\xe9"| results in a downgraded/Latin-1 string.

That's reasonable; how about a comment?

— Reply to this email directly, view it on GitHub https://github.com/Perl/perl5/pull/19334#discussion_r820173457, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA2DH6LKVK6OQVF3GN7IATU6QPD3ANCNFSM5LPPZ3ZA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

khwilliamson commented 2 years ago

On 3/5/22 20:08, Felipe Gasper wrote:

@.**** commented on this pull request.


In t/op/magic.t https://github.com/Perl/perl5/pull/19334#discussion_r820173100:

@@ -433,6 +433,77 @@ EOP } }

+# Check that assigning to $0 properly handles UTF-8-stored strings: +{ +

  • We want $e9_with_utf8_pv’s PV to be UTF-8-encoded because we need to

  • test that Perl translates UTF-8-stored code points to plain octets when

  • assigning to $0.

  • #
  • my $e9_with_utf8_pv = "\xe9";

Nothing here says that \xe9 is any specific character, right? Won’t this work the same way on EBCDIC?

No. E9 is CAPITAL LETTER Z, so is an ASCII equivalent, and hence doesn't exercise the new code at all.

perlport suggests generic characters for tests that are valid on both character sets. B6 would be fine for this case, with no translation needed.

khwilliamson commented 2 years ago

On 3/5/22 20:20, Felipe Gasper wrote:

@.**** commented on this pull request.


In t/op/magic.t https://github.com/Perl/perl5/pull/19334#discussion_r820173885:

+

  • my $linux_cmdline_cr = sub {
  • open my $rfh, '<', "/proc/$$/cmdline" or skip "open: $!", 1;
  • return do { local $/; <$rfh> };
  • };
  • SKIP: {
  • skip "Test is for Linux, not $^O", 2 if $^O ne 'linux';
  • my $slurp = $linux_cmdline_cr->();
  • like( $slurp, qr<\A$e9_with_utf8_pv\0>, '/proc cmdline shows as expected (compare to UTF8-flagged)' );
  • like( $slurp, qr<\A$e9_with_plain_pv\0>, '/proc cmdline shows as expected (compare to non-UTF8-flagged)' );
  • }
  • my $name_unicode = "haha\x{100}hoho";
  • my $name_utf8 = $name_unicode;

/Sigh …/

This is why I wanted to rename “UTF8”: too much confusion between what a Perl maintainer like yourself thinks of as a “UTF-8 string” versus what a Perl /user/ would think that term to mean (which is how I named this variable).

It's hard for me to put myself in the shoes of someone for which this is a black box. Ideally the name should be intuitive to both classes; I suppose I belong to the PIGs and the normal reader is just PG. (Perls Internal Geek)

I'm fine for my needs with the name you suggested at the end.

If we had different terms for Perl’s internal encoding versus logical, application-level encoding, there’d be no confusion in cases like this that “straddle” the two spheres.

Anyway. Would $name_utf8_bytes be clearer to you?

— Reply to this email directly, view it on GitHub https://github.com/Perl/perl5/pull/19334#discussion_r820173885, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA2DH2G2UOXIRL7M54XY7TU6QP7DANCNFSM5LPPZ3ZA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

khwilliamson commented 2 years ago

On 3/5/22 20:23, Felipe Gasper wrote:

@.**** commented on this pull request.


In mg.c https://github.com/Perl/perl5/pull/19334#discussion_r820174105:

@@ -3367,6 +3367,16 @@ Perl_magicset(pTHX SV sv, MAGIC mg) else sv_setiv(mg->mg_obj, (IV)PerlProc_getpid()); break; case '0':

  • if (!sv_utf8_downgrade(sv, / fail_ok / TRUE)) {
  • /* Since we are going to set the string's UTF8-encoded form
  • as the process name we should update $0 itself to contain
  • that same (UTF8-encoded) value. */
  • sv_utf8_encode(GvSV(mg->mg_obj));

Can you talk a bit more about your concern? The tests do flex this change.

It's not a concern; it's just laying bare a blind spot of mine, so I'm not a competent reviewer of this part of the request

— Reply to this email directly, view it on GitHub https://github.com/Perl/perl5/pull/19334#discussion_r820174105, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA2DH5MMHRTBGR3YYBDZS3U6QQMLANCNFSM5LPPZ3ZA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

FGasper commented 2 years ago

@khwilliamson Thank you for your review. Have I addressed your concerns?

khwilliamson commented 2 years ago

On 3/5/22 20:39, Felipe Gasper wrote:

@khwilliamson https://github.com/khwilliamson Thank you for your review. Have I addressed your concerns?

Yes. Please squash so it can be merged into blead

demerphq commented 2 years ago

On Sun, 6 Mar 2022 at 05:06, Karl Williamson @.***> wrote:

Merged #19334 into blead.

I think that was premature:

op/magic.t ................................... 66/208 # Failed test 106 - /proc cmdline shows as expected (compare to UTF8-flagged) at op/magic.t line 476

got '�'

expected /(?^u:\A�\0)/

Failed test 107 - /proc cmdline shows as expected (compare to

non-UTF8-flagged) at op/magic.t line 477

got '�'

expected /(?^:\A�\0)/

Failed test 111 - .. and /proc cmdline shows that at op/magic.t line 497

got "hahaĀhoho"

expected "hahaĀhoho\000"

op/magic.t ................................... Failed 3/208 subtests

Test verbose:

ok 104 - compare $0 to UTF8-flagged ok 105 - compare $0 to non-UTF8-flagged not ok 106 - /proc cmdline shows as expected (compare to UTF8-flagged)

Failed test 106 - /proc cmdline shows as expected (compare to

UTF8-flagged) at t/op/magic.t line 476

got '�'

expected /(?^u:\A�\0)/

not ok 107 - /proc cmdline shows as expected (compare to non-UTF8-flagged)

Failed test 107 - /proc cmdline shows as expected (compare to

non-UTF8-flagged) at t/op/magic.t line 477

got '�'

expected /(?^:\A�\0)/

ok 108 - warning after assignment of wide character ok 109 - .. and the warning is about a wide character ok 110 - .. and the UTF-8 version is written not ok 111 - .. and /proc cmdline shows that

Failed test 111 - .. and /proc cmdline shows that at t/op/magic.t line 497

got "hahaĀhoho"

expected "hahaĀhoho\000"

ok 112 - $0 from wide -> local non-wide: no warning

-- perl -Mre=debug -e "/just|another|perl|hacker/"

demerphq commented 2 years ago

On Mon, 7 Mar 2022 at 14:22, demerphq @.***> wrote:

On Sun, 6 Mar 2022 at 05:06, Karl Williamson @.***> wrote:

Merged #19334 into blead.

I think that was premature:

op/magic.t ................................... 66/208 # Failed test 106 - /proc cmdline shows as expected (compare to UTF8-flagged) at op/magic.t line 476

got '�'

expected /(?^u:\A�\0)/

Failed test 107 - /proc cmdline shows as expected (compare to

non-UTF8-flagged) at op/magic.t line 477

got '�'

expected /(?^:\A�\0)/

Failed test 111 - .. and /proc cmdline shows that at op/magic.t line 497

got "hahaĀhoho"

expected "hahaĀhoho\000"

op/magic.t ................................... Failed 3/208 subtests

Test verbose:

ok 104 - compare $0 to UTF8-flagged ok 105 - compare $0 to non-UTF8-flagged not ok 106 - /proc cmdline shows as expected (compare to UTF8-flagged)

Failed test 106 - /proc cmdline shows as expected (compare to

UTF8-flagged) at t/op/magic.t line 476

got '�'

expected /(?^u:\A�\0)/

not ok 107 - /proc cmdline shows as expected (compare to non-UTF8-flagged)

Failed test 107 - /proc cmdline shows as expected (compare to

non-UTF8-flagged) at t/op/magic.t line 477

got '�'

expected /(?^:\A�\0)/

ok 108 - warning after assignment of wide character ok 109 - .. and the warning is about a wide character ok 110 - .. and the UTF-8 version is written not ok 111 - .. and /proc cmdline shows that

Failed test 111 - .. and /proc cmdline shows that at t/op/magic.t line 497

got "hahaĀhoho"

expected "hahaĀhoho\000"

ok 112 - $0 from wide -> local non-wide: no warning

Fixed in https://github.com/Perl/perl5/pull/19505

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"