Open hakonhagland opened 8 years ago
See https://metacpan.org/pod/PerlIO::encoding. There is a variable $PerlIO::encoding::fallback, and by default its WARN_ON_ERR bit is set.
So yes, it is a bug that you did not get a warning.
@pali Yes. When I add the following to the code above (before starting to read the file):
use PerlIO::encoding;
printf "Current value of \$PerlIO::encoding::fallback is '0x%X'\n", $PerlIO::encoding::fallback;
The output is
Current value of $PerlIO::encoding::fallback is '0x902'
which shows that the bitmask constants WARN_ON_ERR and PERLQQ are set by default. There is also an undefined/undocumented bit 0x800, i.e. (0x902 & 0x800) == 0x800, that is set by default.
Interestingly, if I try to change the value to a code ref before reading:
$PerlIO::encoding::fallback = sub{ sprintf "<U+%04X>", shift };
The code hangs at readline (i.e. <$fh>). Is this another bug?
Look at the PerlIO::encoding source code; by default these bits are set:
our $fallback =
Encode::PERLQQ()|Encode::WARN_ON_ERR()|Encode::STOP_AT_PARTIAL();
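A minimal sketch of how a value such as 0x902 can be broken down into named Encode flags, which may help when checking what $PerlIO::encoding::fallback contains. The helper name flag_names and the %flag table are hypothetical, introduced only for illustration; the constants themselves are the ones exposed by Encode.

```perl
use strict;
use warnings;
use Encode ();

# Map each Encode check-flag name to its bit value.
# (%flag and flag_names are illustrative helpers, not part of Encode.)
my %flag = (
    DIE_ON_ERR      => Encode::DIE_ON_ERR(),
    WARN_ON_ERR     => Encode::WARN_ON_ERR(),
    RETURN_ON_ERR   => Encode::RETURN_ON_ERR(),
    LEAVE_SRC       => Encode::LEAVE_SRC(),
    PERLQQ          => Encode::PERLQQ(),
    HTMLCREF        => Encode::HTMLCREF(),
    XMLCREF         => Encode::XMLCREF(),
    STOP_AT_PARTIAL => Encode::STOP_AT_PARTIAL(),
);

# Return the names of all flags set in $mask, ordered by bit value.
sub flag_names {
    my ($mask) = @_;
    return grep { $mask & $flag{$_} }
           sort { $flag{$a} <=> $flag{$b} } keys %flag;
}

# Decode the default value reported earlier in this thread.
print join('|', flag_names(0x902)), "\n";
```

With the defaults quoted above, the 0x800 bit should show up as STOP_AT_PARTIAL rather than as an unknown flag.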
A coderef check is supported only by some XS Encode modules, and probably not by PerlIO::encoding.
It looks like this is not an Encode bug but a PerlIO::encoding bug! And PerlIO is part of Perl itself. Please report this bug directly to Perl.
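For comparison, here is a small sketch of a place where a coderef CHECK is documented to work: calling Encode::encode() directly. It says nothing about the PerlIO layer, which is the case that hangs above. The helper name ascii_with_placeholders is hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Encode $text to ASCII; characters that cannot be mapped are passed
# (as their ordinal value) to the coderef, whose return value is used instead.
# (ascii_with_placeholders is an illustrative helper, not part of Encode.)
sub ascii_with_placeholders {
    my ($text) = @_;
    return Encode::encode('ascii', $text, sub { sprintf "<U+%04X>", shift });
}

print ascii_with_placeholders("a\x{E5}"), "\n";
```

The same coderef is the one assigned to $PerlIO::encoding::fallback earlier in the thread, so this only shows that the callback itself is fine when used with Encode directly.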
I used this test script:
use strict;
use warnings;
use Encode;
binmode STDOUT, ':utf8';
my $bytes = "\x{61}\x{E5}";
my $fh;
my $buf;
open $fh, '>:raw', \$buf;
print $fh $bytes;
close $fh;
open $fh, "<:encoding(UTF-8)", \$buf;
my $str = do { local $/; <$fh> };
close $fh;
print "$str\n";
open $fh, "<:raw", \$buf;
my $raw = do { local $/; <$fh> };
close $fh;
my $str2 = decode('UTF-8', $raw, Encode::FB_WARN | Encode::LEAVE_SRC);
print "$str2\n";
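The comparison made by the script above can also be done programmatically by capturing warnings with a local $SIG{__WARN__} handler. This is only a sketch: the helper names decode_with_warnings and read_with_warnings are hypothetical, and the number of warnings coming from the :encoding layer depends on the perl and Encode versions in use, which is the point of this report.

```perl
use strict;
use warnings;
use Encode ();

# Decode $raw with Encode::decode() and return ($string, \@warnings).
sub decode_with_warnings {
    my ($raw) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $str = Encode::decode('UTF-8', $raw,
                             Encode::FB_WARN() | Encode::LEAVE_SRC());
    return ($str, \@warnings);
}

# Read the same bytes through the :encoding(UTF-8) layer from an
# in-memory file and return ($string, \@warnings).
sub read_with_warnings {
    my ($raw) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    open my $fh, '<:encoding(UTF-8)', \$raw or die "open: $!";
    my $str = do { local $/; <$fh> };
    close $fh;
    return ($str, \@warnings);
}

my $bytes = "a\xE5";    # valid 'a' followed by a truncated UTF-8 sequence
my (undef, $w_decode) = decode_with_warnings($bytes);
my (undef, $w_layer)  = read_with_warnings($bytes);
printf "decode(): %d warning(s); :encoding layer: %d warning(s)\n",
    scalar @$w_decode, scalar @$w_layer;
```

If the layer count is zero while the decode() count is not, the behaviour described in this issue is being reproduced.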
It turns out this is partly an Encode issue too.
PerlIO::encoding "renew"s the encoding object to ensure it has its own encoding object (per Encode::Encoding), but Encode::decode_xs() treats such a renewed object as always stop_at_partial, which means that PerlIO::encoding can't use that encoding object to process that little bit of excess data at eof.
So I'm stuck trying to fix this on the PerlIO::encoding side.
Unfortunately, simply removing that renewed -> stop_at_partial will break PerlIO::encoding on validly encoded files on older perls, so I don't see a simple fix.
The bug is in PerlIO::scalar and was fixed in perl 5.25.8 by this commit: https://perl5.git.perl.org/perl.git/commit/c47992b404786dcb8752239045e21cbcd7e3d103
There's an issue in PerlIO::encoding and the way it interacts with Encode too:
$ ./perl -e 'print "\xef\xbe"' >shortuni.txt
$ hd shortuni.txt
00000000 ef be |..|
00000002
$ ./perl -Ilib -e 'binmode STDIN, ":encoding(UTF-8)"; while (
but it should be outputting a warning and \x{00EF}, like the following does:
$ ./perl -e 'print "\xef\xbeA"' >shortuni.txt
$ ./perl -Ilib -e 'binmode STDIN, ":encoding(UTF-8)"; while (
This is blead at v5.25.9-35-g32207c6 which includes the (irrelevant) PerlIO::scalar fix.
The PerlIO layer :encoding(utf-8) seems to fail to report malformed data at the end of a file. Suppose a file $fn contains valid UTF-8, except for the final character in the file, which has an invalid UTF-8 encoding. I would like a warning about invalid UTF-8 to be printed to STDERR when reading this file, but strangely this does not seem possible. For example:
Now $fn contains invalid UTF-8 (the last byte). If I now try to read the file using the PerlIO layer :encoding(utf-8):
The output is:
Note that there is no warning "\xE5" does not map to Unicode in this case.
However, if I read the file as bytes and then use Encode::decode() on the raw data, the warning is printed:
Why can't the same thing be achieved with PerlIO::encoding? Is it a bug?