Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.95k stars 554 forks source link

Some Invalid UTF-8 Sequences Cause Panic #22597

Closed hiratara closed 1 month ago

hiratara commented 1 month ago

Hello,

Since commit a460925186154b270b7a647a4f30b2f01fd97c4b, utf8n_to_uvchr has been broken, causing some peculiar behavior when handling invalid UTF-8 sequences.

For example, invalid UTF-8 sequences cause a panic message:

# This is expected
$ perl -E 'say "use strict; use warnings; use utf8; \"\xf4\xE3\x81\x82\""' | perl
Malformed UTF-8 character: \xf4\xe3\x81\x82 (unexpected non-continuation byte 0xe3, immediately after start byte 0xf4; need 4 bytes, got 1) at - line 1.
Malformed UTF-8 character (fatal) at - line 1.

# This isn't expected
$ perl -E 'say "use strict; use warnings; use utf8; \"\xff\xE3\x81\x82\""' | perl
panic: _force_out_malformed_utf8_message should be called only when there are errors found at - line 1.

This bug also affects Encode.pm:

# This is expected
$ perl -MEncode -E 'say encode "UTF-8", decode "UTF-8", "\xf4\xE3\x81\x82", Encode::FB_PERLQQ'
\xF4あ

# This isn't expected
$ perl -MEncode -E 'say encode "UTF-8", decode "UTF-8", "\xff\xE3\x81\x82", Encode::FB_PERLQQ'
\xFF\xE3\x81\x82

This PR fixes those problems.

jkeenan commented 1 month ago

You assert that the problem began in https://github.com/Perl/perl5/commit/a460925186154b270b7a647a4f30b2f01fd97c4b (@khwilliamson):

commit a460925186154b270b7a647a4f30b2f01fd97c4b
Author:     Karl Williamson <khw@cpan.org>
AuthorDate: Thu May 13 10:50:23 2021
Commit:     Karl Williamson <khw@cpan.org>
CommitDate: Tue Feb 1 22:33:48 2022

    Refactor utf8 to code point conversion

I wanted to explore this argument in a test-driven manner. So I checked out the commit immediately before that one:

commit c0e63b1341c14ebd84a23fdbba9759d3fd686498
Author:     Karl Williamson <khw@cpan.org>
AuthorDate: Tue Feb 1 20:58:08 2022
Commit:     Karl Williamson <khw@cpan.org>
CommitDate: Tue Feb 1 22:01:13 2022

    builtin.c: Fix up 730f927

    That commit got pushed without a needed fix.

Then configured (-des -Dusedevel) and built an ordinary perl at that point, then added the two unit tests from your pull request to t/op/lex.t.

$ git show | cat
commit 8fbf143586f593a5f599f1b633dc558b7269a3a7
Author: James E Keenan <jkeenan@cpan.org>
Date:   Sun Sep 15 12:51:20 2024 -0400

    Add 2 unit tests from pull request GH #22597

    Added to the commit immediately preceding the commit which the poster
    states broke utf8n_to_uvchr, without adding code changes from that p.r.

diff --git a/t/op/lex.t b/t/op/lex.t
index e78fad2c42..a5703fc53e 100644
--- a/t/op/lex.t
+++ b/t/op/lex.t
@@ -7,7 +7,8 @@ use warnings;

 BEGIN { chdir 't' if -d 't'; require './test.pl'; }

-plan(tests => 36);
+#plan(tests => 36);
+plan(tests => 38);

 {
     print <<'';   # Yow!
@@ -284,3 +285,20 @@ EOM

 fresh_perl_like('flock  _$', qr/Not enough arguments for flock/, {stderr => 1},
                 "[perl #129190] intuit_method() invalidates PL_bufptr");
+
+# test-driven development; first, add the tests from GH #22597
+
+
+fresh_perl_like(
+    qq(use utf8; \xC2\xE3\x81\x82),
+    qr/^Malformed UTF-8 character:/,
+    {stderr => 1},
+    'Error handling for invalid UTF-8 sequences starting with leading bytes',
+);
+
+fresh_perl_like(
+    qq(use utf8; \xFF\xE3\x81\x82),
+    qr/^Malformed UTF-8 character:/,
+    {stderr => 1},
+    'Error handling for invalid UTF-8 sequences starting with unassigned bytes',
+);

I then ran that test program through the harness:

$ cd t;./perl harness -v op/lex.t; cd -

ok 1
ok 2
...
ok 37 - Error handling for invalid UTF-8 sequences starting with leading bytes
ok 38 - Error handling for invalid UTF-8 sequences starting with unassigned bytes
ok
All tests successful.
Files=1, Tests=38,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.04 cusr  0.09 csys =  0.15 CPU)
Result: PASS
/home/jkeenan/gitwork/perl

I then cherry-picked a460925186154b270b7a647a4f30b2f01fd97c4b into that branch, rebuilt and re-tested.

$ cd t;./perl harness -v op/lex.t; cd -

ok 1
ok 2
...
ok 37 - Error handling for invalid UTF-8 sequences starting with leading bytes
not ok 38 - Error handling for invalid UTF-8 sequences starting with unassigned bytes
# Failed test 38 - Error handling for invalid UTF-8 sequences starting with unassigned bytes at ./test.pl line 1088
#      got 'panic: _force_out_malformed_utf8_message should be called only when there are errors found at - line 1.'
# expected /(?^:^Malformed UTF-8 character:)/
# PROG: 
# use utf8; �あ
# STATUS: 65280
Failed 1/38 subtests 

Test Summary Report
-------------------
op/lex.t (Wstat: 0 Tests: 38 Failed: 1)
  Failed test:  38
Files=1, Tests=38,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.05 cusr  0.08 csys =  0.15 CPU)
Result: FAIL

So it appears that one of the two unit tests you are proposing to add would have PASSed both before the breaking commit and at the breaking commit. The breaking commit (a460925186) appears to have broken only the code in your second unit test.

Can you clarify?

(My diagnostic branch can be found here.)

NOTE: If this pull request is accepted, the changes may be backported to maintenance releases for perl-5.36, perl-5.38 and perl-5.40.

jkeenan commented 1 month ago

@hiratara, if there is something specifically wrong with Encode, you should file a bug ticket in that upstream distribution's bug tracker: https://rt.cpan.org/Dist/Display.html?Name=Encode. That would enable @dankogai to make changes for Encode against any version of perl.

khwilliamson commented 1 month ago

The tests belong instead in ext/XS-APItest/t/utf8.t. There already is a function accessible from that file test_utf8n_to_uvchr_error() that allows for arbitrary strings to be tested. If you don't feel like doing this, I will do so, so just pull out that portion of your commits.

I see the tests for translating to code point are minimal.. The functions that check if a string is well-formed UTF-8 have extensive tests, and they don't have this bug. I would adapt some of those tests to work on this.

hiratara commented 1 month ago

@jkeenan, I've updated the commit messages and removed the changes from t/op/lex.t. I attempted to add tests to ext/XS-APItest/t/utf8.t, but found it too complicated for me to handle. Could you please take care of adding the test cases instead?

I've force-pushed the updated commits to this branch and backed up the original branch here: https://github.com/hiratara/perl5/tree/tmp/panic-with-invalid-utf8-BK

khwilliamson commented 1 week ago

To be clear, this PR adds tests for this https://github.com/Perl/perl5/pull/22646