Encoding I/O Layer Difference on Windows

p5pRT commented 8 years ago

Migrated from rt.perl.org#127668 (status was 'open')

Searchable as RT127668$

p5pRT commented 8 years ago

From dwheeler@cpan.org

I have a file to which I’ve written:

"\xc3\xa5\xc3\xa5\xc3\xa5\x0a"

These bytes correspond to the UTF-8 string:

"ååå\n"

I want to read this file into a scalar, so I wrote this Perl:

    sub slurp {
        my ($file) = @_;
        open my $fh, "<:encoding(UTF-8)", $file or die $!;
        return '' if eof $fh;
        local $/;
        return <$fh>;
    }

    my $fn = shift or die "Usage: $0 [file]\n";
    slurp $fn;

Works great on my Mac and on *nix machines, but not on Windows. There it emits a warning:

utf8 "\xA5" does not map to Unicode at try.pl line 6

I can fix it by changing the I/O layer to :raw:encoding(UTF-8). My guess is that it reads it in as raw bytes first, then does the conversion. I can also break it on my Mac by changing the layer to :crlf:encoding(UTF-8).

But I don’t understand why the encoding layer’s parsing of a file with \x0a line endings should vary by platform. Sure, line endings on Windows might typically be \x0d\x0a, but why would the I/O layer care when I’ve told it that the file encoding is UTF-8?
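For reference, a minimal sketch of the :raw workaround described above. The sub name `slurp_raw_utf8` is made up for illustration. Note that with :raw in place no CRLF translation happens at all, so a file that really does use \x0d\x0a line endings will keep its \x0d bytes.

```perl
use strict;
use warnings;

# Sketch of the workaround: ask for :raw first so that a platform-default
# :crlf translation is not active underneath :encoding(UTF-8).
# (slurp_raw_utf8 is a hypothetical name, not part of the original script.)
sub slurp_raw_utf8 {
    my ($file) = @_;
    open my $fh, '<:raw:encoding(UTF-8)', $file
        or die "Cannot open $file: $!";
    return '' if eof $fh;
    local $/;    # slurp mode: read the whole file in one go
    return <$fh>;
}
```

Called as `my $text = slurp_raw_utf8($fn);`, this returns decoded characters rather than bytes.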

Note that this does not occur with data in memory. This works fine on both platforms:

    use Encode qw(decode_utf8);
    my $data = "\xc3\xa5\xc3\xa5\xc3\xa5\x0a";
    decode_utf8 $data;

Is the I/O layer assuming that, because we’re on Windows, line endings need to be converted to \r\n before decoding? Is it implicitly using :crlf on Windows? Doesn’t seem like it’d be necessary if I’ve already told it what encoding to use, and shouldn't bork that encoding in any event.
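One way to look into the "implicitly using :crlf" question is to ask Perl which layers are actually on the handle. The sketch below uses the built-in `PerlIO::get_layers`. The helper name `layers_for` and the throwaway temp file are assumptions for illustration, and the output is platform- and build-dependent, so none is shown here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return the PerlIO layers present on a handle opened with the given mode.
# (layers_for is a hypothetical helper name.)
sub layers_for {
    my ($mode, $file) = @_;
    open my $fh, $mode, $file or die "Cannot open $file: $!";
    my @layers = PerlIO::get_layers($fh);
    close $fh;
    return @layers;
}

# Demo on a throwaway empty file: print the layer stack for each open mode.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
close $tmp_fh;
for my $mode ('<', '<:encoding(UTF-8)', '<:raw:encoding(UTF-8)') {
    printf "%-24s %s\n", $mode, join ' ', layers_for($mode, $tmp_name);
}
```

Comparing the output of this script on a Windows perl and a *nix perl should show whether a crlf layer sits below the encoding layer by default.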

p5pRT commented 8 years ago

From @arc

As requested, I've attached a program and test inputs demonstrating that the problem shows up iff all the following are true:

- The :crlf layer is used
- The :encoding(UTF-8) layer is used (not :utf8)
- The input ends in LF rather than CRLF
- The program tests eof($fh) before reading from the filehandle
- $/ is set to undef

I get the following results when running the program in various ways:

    $ for mode in ':encoding(UTF-8)' :crlf:utf8 ':crlf:encoding(UTF-8)'; do
        for file in crlf.txt lf.txt; do
          for slurp in 0 1; do
            for use_eof in 0 1; do
              echo "mode=$mode file=$file slurp=$slurp use_eof=$use_eof"
              perl script.pl $mode $file $slurp $use_eof
            done
          done
        done
      done
    mode=:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=0
    mode=:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=1
    mode=:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=0
    mode=:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=1
    mode=:encoding(UTF-8) file=lf.txt slurp=0 use_eof=0
    mode=:encoding(UTF-8) file=lf.txt slurp=0 use_eof=1
    mode=:encoding(UTF-8) file=lf.txt slurp=1 use_eof=0
    mode=:encoding(UTF-8) file=lf.txt slurp=1 use_eof=1
    mode=:crlf:utf8 file=crlf.txt slurp=0 use_eof=0
    mode=:crlf:utf8 file=crlf.txt slurp=0 use_eof=1
    mode=:crlf:utf8 file=crlf.txt slurp=1 use_eof=0
    mode=:crlf:utf8 file=crlf.txt slurp=1 use_eof=1
    mode=:crlf:utf8 file=lf.txt slurp=0 use_eof=0
    mode=:crlf:utf8 file=lf.txt slurp=0 use_eof=1
    mode=:crlf:utf8 file=lf.txt slurp=1 use_eof=0
    mode=:crlf:utf8 file=lf.txt slurp=1 use_eof=1
    mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=0
    mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=1
    mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=0
    mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=1
    mode=:crlf:encoding(UTF-8) file=lf.txt slurp=0 use_eof=0
    mode=:crlf:encoding(UTF-8) file=lf.txt slurp=0 use_eof=1
    mode=:crlf:encoding(UTF-8) file=lf.txt slurp=1 use_eof=0
    mode=:crlf:encoding(UTF-8) file=lf.txt slurp=1 use_eof=1
    utf8 "\xA5" does not map to Unicode at foo.pl line 5.
    $
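The contents of the attached script.pl are not reproduced in this thread. As a rough idea of what a test script taking those four arguments could look like, here is a hypothetical stand-in. The sub name `read_with` and every detail of the body are assumptions, not the attached program.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the attached script (its real contents are not
# shown in this ticket). Arguments mirror the shell loop:
#   $mode     layer string, e.g. ':crlf:encoding(UTF-8)'
#   $file     input file, e.g. lf.txt
#   $slurp    true to undef $/ before reading
#   $use_eof  true to test eof($fh) before reading
sub read_with {
    my ($mode, $file, $slurp, $use_eof) = @_;
    open my $fh, "<$mode", $file or die "Cannot open $file: $!";
    return '' if $use_eof && eof $fh;
    local $/ = $slurp ? undef : "\n";
    my $content = join '', <$fh>;
    close $fh;
    return $content;
}

# Only run when invoked with the four command-line arguments.
if (@ARGV == 4) {
    my $content = read_with(@ARGV);
    printf "read %d characters\n", length $content;
}
```

Any decoding warning would be emitted by the read inside `read_with`, so running it with warnings enabled is enough to observe the problem on an affected perl.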

The original p5p thread on this may have some additional information: http://nntp.perl.org/group/perl.perl5.porters/234856

-- Aaron Crane ** http://aaroncrane.co.uk/

p5pRT commented 8 years ago

From @arc

å

p5pRT commented 8 years ago

From @arc

å

p5pRT commented 8 years ago

From @arc

script.pl

p5pRT commented 7 years ago

@jkeenan - Status changed from 'new' to 'open'

p5pRT commented 7 years ago

From @jkeenan

On Fri, 18 Mar 2016 13:19:04 GMT, arc wrote:

As requested, I've attached a program and test inputs demonstrating that the problem shows up iff all the following are true:

- The :crlf layer is used
- The :encoding(UTF-8) layer is used (not :utf8)
- The input ends in LF rather than CRLF
- The program tests eof($fh) before reading from the filehandle
- $/ is set to undef

I get the following results when running the program in various ways:

    $ for mode in ':encoding(UTF-8)' :crlf:utf8 ':crlf:encoding(UTF-8)'; do
        for file in crlf.txt lf.txt; do
          for slurp in 0 1; do
            for use_eof in 0 1; do
              echo "mode=$mode file=$file slurp=$slurp use_eof=$use_eof"
              perl script.pl $mode $file $slurp $use_eof
            done
          done
        done
      done
    mode=:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=0
    mode=:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=1
    mode=:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=0
    mode=:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=1
    mode=:encoding(UTF-8) file=lf.txt slurp=0 use_eof=0
    mode=:encoding(UTF-8) file=lf.txt slurp=0 use_eof=1
    mode=:encoding(UTF-8) file=lf.txt slurp=1 use_eof=0
    mode=:encoding(UTF-8) file=lf.txt slurp=1 use_eof=1
    mode=:crlf:utf8 file=crlf.txt slurp=0 use_eof=0
    mode=:crlf:utf8 file=crlf.txt slurp=0 use_eof=1
    mode=:crlf:utf8 file=crlf.txt slurp=1 use_eof=0
    mode=:crlf:utf8 file=crlf.txt slurp=1 use_eof=1
    mode=:crlf:utf8 file=lf.txt slurp=0 use_eof=0
    mode=:crlf:utf8 file=lf.txt slurp=0 use_eof=1
    mode=:crlf:utf8 file=lf.txt slurp=1 use_eof=0
    mode=:crlf:utf8 file=lf.txt slurp=1 use_eof=1
    mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=0
    mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=1
    mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=0
    mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=1
    mode=:crlf:encoding(UTF-8) file=lf.txt slurp=0 use_eof=0
    mode=:crlf:encoding(UTF-8) file=lf.txt slurp=0 use_eof=1
    mode=:crlf:encoding(UTF-8) file=lf.txt slurp=1 use_eof=0
    mode=:crlf:encoding(UTF-8) file=lf.txt slurp=1 use_eof=1
    utf8 "\xA5" does not map to Unicode at foo.pl line 5.
    $

The original p5p thread on this may have some additional information: http://nntp.perl.org/group/perl.perl5.porters/234856

After the above post from Aaron, Leon Timmermans added the following, which I'm quoting here to get all the current state of discussion into RT:

##### On Sun, Mar 6, 2016 at 2:54 AM, Aaron Crane <arc@cpan.org> wrote:

I'm far from being an expert in the workings of PerlIO, but my guess is that the combination of :crlf and :encoding(UTF-8) layers isn't handling the C<< eof $fh >> test correctly: it looks from the outside like a whole character gets read from the filehandle (to determine whether it has reached its end), but then only the last byte of that character is returned to the buffer.

Shocking, an issue in :crlf or :encoding…

What happens is that a byte gets read and then unread. For a :perlio layer that byte would just go back to the existing buffer (which has space because it just came out of there), but :crlf is uniquely special (PerlIOCrlf_unread in perlio.c, if you're curious). Uncommenting said unique snowflake code and using the slower but more obvious path (PerlIOBase_unread) seems to solve this, so that may be one half of the solution.

The other half would be not to do this read/unread silliness in the first place. We can check for eof without removing anything from the buffer. -T and -B probably have a similar issue, but I have a hard time imagining how someone triggers that accidentally.

- using a file that ends in \r\n rather than \n

That actually was the crucial hint to where the problem is located :-)

Leon #####
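Leon's second point concerns perl's internals, but a script can sidestep the read/unread path in a similar spirit by not calling eof before the read. The sketch below is a script-level workaround only, not the core fix, and `slurp_no_eof` is a hypothetical name. It relies on the condition list above, where the eof($fh) test is one of the required ingredients.

```perl
use strict;
use warnings;

# Slurp a file without testing eof first; an empty file still yields ''.
# (slurp_no_eof is a hypothetical name; the default layer string matches
# the one used in the original report.)
sub slurp_no_eof {
    my ($file, $layers) = @_;
    $layers = ':encoding(UTF-8)' unless defined $layers;
    open my $fh, "<$layers", $file or die "Cannot open $file: $!";
    local $/;                # slurp mode
    my $content = <$fh>;     # no eof() call, so nothing is read and unread
    close $fh;
    return defined $content ? $content : '';
}
```

The `defined` check replaces the `return '' if eof $fh` guard from the original slurp sub.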

Is anyone able to analyze this further?

Thank you very much.

-- James E Keenan (jkeenan@cpan.org)

p5pRT commented 7 years ago

From @Leont

On Tue, Feb 28, 2017 at 2:25 PM, James E Keenan via RT <perlbug-followup@perl.org> wrote:

Is anyone able to analyze this further?

Thank you very much.

I don't think this ticket is in need of analysis, I think it's in need of a fix.

Leon

p5pRT commented 7 years ago

From cm.perl@abtela.com

On 28/02/2017 at 14:25, James E Keenan via RT wrote:

Is anyone able to analyze this further?

Thank you very much.

To me it looks very much like #120797, in which Leon Timmermans suggested just getting rid of PerlIOCrlf_unread in perlio.c

Regards,

--Christian
