dankogai / p5-encode

Encode - character encodings (for Perl 5.8 or better)
https://metacpan.org/release/Encode

Detect invalid UTF-8 data at end of file when using PerlIO :encoding(utf-8) #59

Open hakonhagland opened 8 years ago

hakonhagland commented 8 years ago

The PerlIO layer :encoding(utf-8) seems to fail to report malformed data at the end of a file. Suppose a file $fn contains valid UTF-8 except for the final character, which has an invalid UTF-8 encoding. I would like a warning about invalid UTF-8 to be printed to STDERR when reading this file, but strangely this does not seem to be possible. For example:

use feature qw(say);
use strict;
use warnings;

binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';

my $bytes = "\x{61}\x{E5}";  # 2 bytes in iso 8859-1: aå
my $fn = 'test.txt';
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $fh $bytes;
close $fh;

Now $fn contains invalid UTF-8 (the last byte). If I then try to read the file using the PerlIO layer :encoding(utf-8):

my $str = '';
open ( $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
$str = do { local $/; <$fh> };
close $fh;
say "Read string: '$str'";

the output is

Read string: 'a'

Note that no warning such as "\xE5" does not map to Unicode is printed in this case.

However, if I read the file as bytes and then call Encode::decode() on the raw data, the warning is printed:

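use Encode qw(decode);  # decode() must be imported for the call below

open ( $fh, "<:raw", $fn ) or die "Could not open file '$fn': $!";
my $raw_data = do { local $/; <$fh> };  # slurp the raw bytes; 'my' is required under strict
close $fh;
my $str2 = decode( 'utf-8', $raw_data, Encode::FB_WARN | Encode::LEAVE_SRC );
# warning is printed to STDERR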

Why can't the same thing be achieved with PerlIO::encoding? Is it a bug?
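As a possible stopgap while the layer stays silent, the raw-read approach above can be turned into a hard check. This is only a minimal sketch: it assumes the data is already in memory as bytes (here the same two bytes as in the test file), and it uses FB_CROAK instead of FB_WARN so that malformed input can be caught with eval. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Same two bytes as in the test file: "a" followed by a lone 0xE5.
my $raw = "\x{61}\x{E5}";

# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps $raw intact.
my $decoded = eval { decode( 'UTF-8', $raw, Encode::FB_CROAK() | Encode::LEAVE_SRC() ) };

if ( defined $decoded ) {
    print "Decoded OK: $decoded\n";
}
else {
    # $@ holds the message from Encode describing the offending byte.
    print STDERR "Invalid UTF-8 in input: $@";
}
```

This does not explain why the :encoding layer misses the error; it only sidesteps the layer by decoding explicitly.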

pali commented 8 years ago

See https://metacpan.org/pod/PerlIO::encoding. There is a variable $PerlIO::encoding::fallback, and by default the WARN_ON_ERR bit is set.

So yes, it is a bug, since you did not get a warning.

hakonhagland commented 8 years ago

@pali Yes. When I add the following to the code above (before starting to read the file):

use PerlIO::encoding;
printf "Current value of \$PerlIO::encoding::fallback is '0x%X'\n", $PerlIO::encoding::fallback;

The output is

Current value of $PerlIO::encoding::fallback is '0x902'

which shows that the bitmask constants WARN_ON_ERR and PERLQQ are set by default. There is also an undefined/undocumented bit, 0x800, that is set by default: (0x902 & 0x800) == 0x800.

Interestingly, if I try to change the value to a code ref before reading:

$PerlIO::encoding::fallback = sub{ sprintf "<U+%04X>", shift };

The code hangs at readline (i.e. at <$fh>). Is this another bug?

pali commented 8 years ago

Look at the PerlIO::encoding source code. By default these bits are set:

our $fallback =
    Encode::PERLQQ()|Encode::WARN_ON_ERR()|Encode::STOP_AT_PARTIAL();
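As a quick check that these three constants account for the 0x902 value printed earlier, the mask can be split into named bits. This is just a sketch; the %flag table and the $rest variable are names made up for the example, while the constants themselves come from Encode.

```perl
use strict;
use warnings;
use Encode ();

# The three CHECK flags named in the PerlIO::encoding default.
my %flag = (
    WARN_ON_ERR     => Encode::WARN_ON_ERR(),
    PERLQQ          => Encode::PERLQQ(),
    STOP_AT_PARTIAL => Encode::STOP_AT_PARTIAL(),
);

my $mask = 0x902;    # value reported for $PerlIO::encoding::fallback
my $rest = $mask;    # bits not yet explained by a named flag

for my $name ( sort { $flag{$a} <=> $flag{$b} } keys %flag ) {
    next unless $mask & $flag{$name};
    printf "%-16s 0x%04X\n", $name, $flag{$name};
    $rest &= ~$flag{$name};
}

printf "unaccounted bits: 0x%X\n", $rest;
```

If the last line prints 0x0, the previously unexplained 0x800 bit is STOP_AT_PARTIAL.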

A coderef as the check value is supported only by some XS Encode modules, probably not by PerlIO::encoding.

pali commented 8 years ago

Looks like this is not an Encode bug, but a PerlIO::encoding bug! And PerlIO is part of Perl itself. Please report this bug directly to Perl.

I used this test script:

use strict;
use warnings;
use Encode;

binmode STDOUT, ':utf8';

my $bytes = "\x{61}\x{E5}";
my $fh;

my $buf;
open $fh, '>:raw', \$buf;
print $fh $bytes;
close $fh;

open $fh, "<:encoding(UTF-8)", \$buf;
my $str = do { local $/; <$fh> };
close $fh;

print "$str\n";

open $fh, "<:raw", \$buf;
my $raw = do { local $/; <$fh> };
close $fh;
my $str2 = decode('UTF-8', $raw, Encode::FB_WARN | Encode::LEAVE_SRC);
print "$str2\n";

tonycoz commented 7 years ago

It turns out this is partly an Encode issue too.

PerlIO::encoding "renew"s the encoding object to ensure it has its own encoding object (per Encode::Encoding), but Encode::decode_xs() treats such a renewed object as always being stop_at_partial, which means that PerlIO::encoding can't use that encoding object to process the little bit of excess data at EOF.

So I'm stuck trying to fix this on the PerlIO::encoding side.

Unfortunately, simply removing that renewed -> stop_at_partial behavior would break PerlIO::encoding on validly encoded files on older perls, so I don't see a simple fix.
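To make the stop_at_partial effect visible outside PerlIO, here is a rough sketch that calls the encoding object directly with the same flags as the default $PerlIO::encoding::fallback. It passes STOP_AT_PARTIAL explicitly rather than going through renew, so it only approximates what the layer does. As I understand the flag, the trailing byte should be held back in the buffer without any warning, but I have not checked this across Encode versions, so treat the printed counts as something to verify locally.

```perl
use strict;
use warnings;
use Encode ();

my $enc = Encode::find_encoding('UTF-8');

# "a" followed by 0xE5, the first byte of an incomplete 3-byte sequence.
my $buf = "\x{61}\x{E5}";

# Same flags as the default $PerlIO::encoding::fallback.
my $check = Encode::PERLQQ() | Encode::WARN_ON_ERR() | Encode::STOP_AT_PARTIAL();

# Without LEAVE_SRC, decode() removes the bytes it consumed from $buf.
my $decoded = $enc->decode( $buf, $check );

printf "decoded %d character(s), %d byte(s) left in the buffer\n",
    length($decoded), length($buf);
```

If a byte is left in the buffer and nothing was warned about, then something has to flush that remainder at EOF with a check value that does not stop at partial input, which is the part that appears to be missing.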

pali commented 7 years ago

The bug is in PerlIO::scalar and was fixed in perl 5.25.8 by this commit: https://perl5.git.perl.org/perl.git/commit/c47992b404786dcb8752239045e21cbcd7e3d103

tonycoz commented 7 years ago

There's an issue in PerlIO::encoding and the way it interacts with Encode too:

$ ./perl -e 'print "\xef\xbe"' >shortuni.txt
$ hd shortuni.txt
00000000 ef be |..|
00000002
$ ./perl -Ilib -e 'binmode STDIN, ":encoding(UTF-8)"; while (<STDIN>) { print }' <shortuni.txt
(no output)

but it should be outputting a warning and \x{00EF}, like the following does:

$ ./perl -e 'print "\xef\xbeA"' >shortuni.txt
$ ./perl -Ilib -e 'binmode STDIN, ":encoding(UTF-8)"; while (<STDIN>) { print }' <shortuni.txt
utf8 "\xEF" does not map to Unicode at -e line 1.
\x{00EF}A

This is blead at v5.25.9-35-g32207c6 which includes the (irrelevant) PerlIO::scalar fix.