Closed FGasper closed 2 years ago
Assuming this is OK to merge, would it still be possible to get this into 5.36?
In my upcoming presentation I’d like to be able to say that 5.36 offers, for the first time, a fully-functional Unicode abstraction in Perl. This is the last “piece of the puzzle” for me to be able to say that.
I find this code totally bizarre. Can you please explain why you are double encoding this? Anything created with \N{U+..} is already marked as unicode and encoded as utf8. "\N{U+00E9}" is not the same as chr(0xe9). The former will be utf8-on and consist of two octets at the wire level. The latter will be utf8-off and consist of one octet at the wire level. Then you encode it as utf8, so in effect you turn off the utf8 flag, and then you upgrade it, thus you have now double encoded the string. Eg, you have taken the octets in the utf8 encoding of codepoint E9 and encoded them as utf8. I cannot understand why you want to do this. Doing so is in my experience always a bug. So much so that my former workplace has a standard function to decode such things recursively which was our almost universal tool for decoding "utf8" data (recurse_decode_utf8).
As a point of reference, this is the routine used at Booking:
our $_tail = "\1\1\1\1\1";
sub recurse_decode_utf8_inplace {
foreach my $s (@_) {
next if ref $s or not defined $s;
Encode::_utf8_off($s);
# work around a bug in Encode - if the last item is a valid start sequence byte
# it silently swallows it when it should be trated as its latin 1 equivalent.
# Specifically try: "ba\x{df}" versus "ba\x{df}\x{df}".
# So we add on a byte that can not be a valid start/interim byte, and then afterwards
# pull it off. - Yves
$s .= $_tail; # do NOT use "\0" here! Our perl doesnt chop strings ending in a utf8 null correctly.
# remember the old string
my $old_s= "";
while (length($s) != length($old_s)) {
$old_s= $s;
eval {
my $octets= $s;
$s= decode_utf8( $octets, Encode::FB_CROAK );
$s= $old_s if substr($s, -1, 1) ne "\1";
1;
} or do {
last;
} unless utf8::is_utf8($s);
eval {
my $utf8= $s;
$s= encode( LATIN1, $utf8, Encode::FB_CROAK);
1;
} or do {
last;
};
}
$s=~s/\Q$_tail\E\z//o or warn "Possibly corrupted string decode: $s\n";
utf8::upgrade($s); # this is a no-op if the string is already utf8, so dont bother guarding it.
}
return wantarray ? @_ : $_[0];
}
@haarg What is this code meant to do? And how does it relate to this bug/PR?
I'm providing a copy of the routine @demerphq referenced in his post.
I'm providing a copy of the routine @demerphq referenced in his post.
Ah OK. 👍
Just to reiterate, though: the test no longer does the encode-then-upgrade stuff.
On 3/5/22 20:13, Felipe Gasper wrote:
@.**** commented on this pull request.
In t/op/magic.t https://github.com/Perl/perl5/pull/19334#discussion_r820173457:
@@ -433,6 +433,77 @@ EOP } }
+# Check that assigning to $0 properly handles UTF-8-stored strings: +{ +
We want $e9_with_utf8_pv’s PV to be UTF-8-encoded because we need to
test that Perl translates UTF-8-stored code points to plain octets when
assigning to $0.
- #
- my $e9_with_utf8_pv = "\xe9";
- utf8::upgrade($e9_with_utf8_pv);
This will be the same logical code point as $e9_with_utf8_pv, but
implemented in Perl internally as a raw byte rather than UTF-8.
- #
- my $e9_with_plain_pv = $e9_with_utf8_pv;
- utf8::downgrade($e9_with_plain_pv);
I didn’t want to assume that |$foo = "\xe9"| results in a downgraded/Latin-1 string.
That's reasonable; how about a comment?
— Reply to this email directly, view it on GitHub https://github.com/Perl/perl5/pull/19334#discussion_r820173457, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA2DH6LKVK6OQVF3GN7IATU6QPD3ANCNFSM5LPPZ3ZA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you were mentioned.Message ID: @.***>
On 3/5/22 20:08, Felipe Gasper wrote:
@.**** commented on this pull request.
In t/op/magic.t https://github.com/Perl/perl5/pull/19334#discussion_r820173100:
@@ -433,6 +433,77 @@ EOP } }
+# Check that assigning to $0 properly handles UTF-8-stored strings: +{ +
We want $e9_with_utf8_pv’s PV to be UTF-8-encoded because we need to
test that Perl translates UTF-8-stored code points to plain octets when
assigning to $0.
- #
- my $e9_with_utf8_pv = "\xe9";
Nothing here says that \xe9 is any specific character, right? Won’t this work the same way on EBCDIC?
No. E9 is CAPITAL LETTER Z, so is an ASCII equivalent, and hence doesn't exercise the new code at all.
perlport suggests generic characters for tests that are valid on both character sets. B6 would be fine for this case, with no translation needed.
On 3/5/22 20:20, Felipe Gasper wrote:
@.**** commented on this pull request.
In t/op/magic.t https://github.com/Perl/perl5/pull/19334#discussion_r820173885:
+
- my $linux_cmdline_cr = sub {
- open my $rfh, '<', "/proc/$$/cmdline" or skip "open: $!", 1;
- return do { local $/; <$rfh> };
- };
- SKIP: {
- skip "Test is for Linux, not $^O", 2 if $^O ne 'linux';
- my $slurp = $linux_cmdline_cr->();
- like( $slurp, qr<\A$e9_with_utf8_pv\0>, '/proc cmdline shows as expected (compare to UTF8-flagged)' );
- like( $slurp, qr<\A$e9_with_plain_pv\0>, '/proc cmdline shows as expected (compare to non-UTF8-flagged)' );
- }
- my $name_unicode = "haha\x{100}hoho";
- my $name_utf8 = $name_unicode;
/Sigh …/
This is why I wanted to rename “UTF8”: too much confusion between what a Perl maintainer like yourself thinks of as a “UTF-8 string” versus what a Perl /user/ would think that term to mean (which is how I named this variable).
It's hard for me to put myself in the shoes of someone for which this is a black box. Ideally the name should be intuitive to both classes; I suppose I belong to the PIGs and the normal reader is just PG. (Perls Internal Geek)
I'm fine for my needs with the name you suggested at the end.
If we had different terms for Perl’s internal encoding versus logical, application-level encoding, there’d be no confusion in cases like this that “straddle” the two spheres.
Anyway. Would $name_utf8_bytes be clearer to you?
— Reply to this email directly, view it on GitHub https://github.com/Perl/perl5/pull/19334#discussion_r820173885, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA2DH2G2UOXIRL7M54XY7TU6QP7DANCNFSM5LPPZ3ZA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you were mentioned.Message ID: @.***>
On 3/5/22 20:23, Felipe Gasper wrote:
@.**** commented on this pull request.
In mg.c https://github.com/Perl/perl5/pull/19334#discussion_r820174105:
@@ -3367,6 +3367,16 @@ Perl_magicset(pTHX SV sv, MAGIC mg) else sv_setiv(mg->mg_obj, (IV)PerlProc_getpid()); break; case '0':
- if (!sv_utf8_downgrade(sv, / fail_ok / TRUE)) {
- /* Since we are going to set the string's UTF8-encoded form
- as the process name we should update $0 itself to contain
- that same (UTF8-encoded) value. */
- sv_utf8_encode(GvSV(mg->mg_obj));
Can you talk a bit more about your concern? The tests do flex this change.
It's not a concern; it's just laying bare a blind spot of mine, so I'm not a competent reviewer of this part of the request
— Reply to this email directly, view it on GitHub https://github.com/Perl/perl5/pull/19334#discussion_r820174105, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA2DH5MMHRTBGR3YYBDZS3U6QQMLANCNFSM5LPPZ3ZA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you were mentioned.Message ID: @.***>
@khwilliamson Thank you for your review. Have I addressed your concerns?
On 3/5/22 20:39, Felipe Gasper wrote:
@khwilliamson https://github.com/khwilliamson Thank you for your review. Have I addressed your concerns?
Yes. Please squash so it can be merged into blead
On Sun, 6 Mar 2022 at 05:06, Karl Williamson @.***> wrote:
Merged #19334 into blead.
I think that was premature:
op/magic.t ................................... 66/208 # Failed test 106 - /proc cmdline shows as expected (compare to UTF8-flagged) at op/magic.t line 476
non-UTF8-flagged) at op/magic.t line 477
op/magic.t ................................... Failed 3/208 subtests
Test verbose:
ok 104 - compare $0 to UTF8-flagged ok 105 - compare $0 to non-UTF8-flagged not ok 106 - /proc cmdline shows as expected (compare to UTF8-flagged)
UTF8-flagged) at t/op/magic.t line 476
not ok 107 - /proc cmdline shows as expected (compare to non-UTF8-flagged)
non-UTF8-flagged) at t/op/magic.t line 477
ok 108 - warning after assignment of wide character ok 109 - .. and the warning is about a wide character ok 110 - .. and the UTF-8 version is written not ok 111 - .. and /proc cmdline shows that
ok 112 - $0 from wide -> local non-wide: no warning
-- perl -Mre=debug -e "/just|another|perl|hacker/"
On Mon, 7 Mar 2022 at 14:22, demerphq @.***> wrote:
On Sun, 6 Mar 2022 at 05:06, Karl Williamson @.***> wrote:
Merged #19334 into blead.
I think that was premature:
op/magic.t ................................... 66/208 # Failed test 106 - /proc cmdline shows as expected (compare to UTF8-flagged) at op/magic.t line 476
got '�'
expected /(?^u:\A�\0)/
Failed test 107 - /proc cmdline shows as expected (compare to
non-UTF8-flagged) at op/magic.t line 477
got '�'
expected /(?^:\A�\0)/
Failed test 111 - .. and /proc cmdline shows that at op/magic.t line 497
got "hahaĀhoho"
expected "hahaĀhoho\000"
op/magic.t ................................... Failed 3/208 subtests
Test verbose:
ok 104 - compare $0 to UTF8-flagged ok 105 - compare $0 to non-UTF8-flagged not ok 106 - /proc cmdline shows as expected (compare to UTF8-flagged)
Failed test 106 - /proc cmdline shows as expected (compare to
UTF8-flagged) at t/op/magic.t line 476
got '�'
expected /(?^u:\A�\0)/
not ok 107 - /proc cmdline shows as expected (compare to non-UTF8-flagged)
Failed test 107 - /proc cmdline shows as expected (compare to
non-UTF8-flagged) at t/op/magic.t line 477
got '�'
expected /(?^:\A�\0)/
ok 108 - warning after assignment of wide character ok 109 - .. and the warning is about a wide character ok 110 - .. and the UTF-8 version is written not ok 111 - .. and /proc cmdline shows that
Failed test 111 - .. and /proc cmdline shows that at t/op/magic.t line 497
got "hahaĀhoho"
expected "hahaĀhoho\000"
ok 112 - $0 from wide -> local non-wide: no warning
Fixed in https://github.com/Perl/perl5/pull/19505
Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
Issue #19331: Use of SvPV_const and SvPV_force in S_set_dollarzero() wrote the PV internals directly to argv, which causes an improper UTF-8 encode if the SV is UTF8-flagged/upgraded.
This fixes that doing a downgrade prior to those SvPV* calls. If the string contains wide characters (and thus cannot be downgraded) a warning is thrown; this mirrors preexisting behavior with %ENV, print, and other output channels that convert Perl SVs to bytes.