Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.85k stars 527 forks source link

PerlIO::scalar does not handle UTF-8 #11938

Closed p5pRT closed 11 years ago

p5pRT commented 12 years ago

Migrated from rt.perl.org#109828 (status was 'resolved')

Searchable as RT109828$

p5pRT commented 12 years ago

From @dgl

If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.

I think this makes sense for output\, although there may be other ramifications.

Here's a todo test​:

Inline Patch ```diff diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t index a02107b..59b65ad 100644 --- a/ext/PerlIO-scalar/t/scalar.t +++ b/ext/PerlIO-scalar/t/scalar.t @@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere. $| = 1; -use Test::More tests => 79; +use Test::More tests => 80; my $fh; my $var = "aaa\n"; @@ -360,3 +360,11 @@ SKIP: { ok has_trailing_nul $memfile, 'write appends null when growing string after seek past end'; } + +# [perl #xxxx] +{ + local $TODO = "UTF-8 support"; + my $string = "\x{ffe}"; + open my $fh, "> :encoding(UTF-8)", \(my $out); + ok $string eq $out; +} ```
p5pRT commented 12 years ago

From @Leont

On Sat\, Feb 4\, 2012 at 6​:10 PM\, David Leadbeater \perlbug\-followup@​perl\.org wrote​:

If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.

I think this makes sense for output\, although there may be other ramifications.

Here's a todo test​:

diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t index a02107b..59b65ad 100644 --- a/ext/PerlIO-scalar/t/scalar.t +++ b/ext/PerlIO-scalar/t/scalar.t @​@​ -16\,7 +16\,7 @​@​ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0\, 1\, 2 everywhere.

 $| = 1;

-use Test​::More tests => 79; +use Test​::More tests => 80;

 my $fh;  my $var = "aaa\n"; @​@​ -360\,3 +360\,11 @​@​ SKIP​: {     ok has_trailing_nul $memfile\,         'write appends null when growing string after seek past end';  } + +# [perl #xxxx] +{ +  local $TODO = "UTF-8 support"; +  my $string = "\x{ffe}"; +  open my $fh\, "> :encoding(UTF-8)"\, \(my $out); +  ok $string eq $out; +}

PerlIO does bytes\, always. It's utf8 support is literally a one bit flag that promises the bytes will be validly encoded utf8. There's no easy way for lower layers to know what the upper layers do with regard utf8. Nor am I sure that really should tinkle down.

The other direction would seem to be more important. When opening a utf8 scalar\, it should automatically be a utf8 handle. Anything else is plain buggy and potentially dangerous.

Leon

p5pRT commented 12 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 12 years ago

From tchrist@perl.com

David Leadbeater (via RT) \perlbug\-followup@​perl\.org wrote   on Sat\, 04 Feb 2012 09​:10​:41 PST​:

+{ + local $TODO = "UTF-8 support"; + my $string = "\x{ffe}";

Why don't you use an assigned Unicode code point there\, please?

+ open my $fh\, "> :encoding(UTF-8)"\, \(my $out);

Why are you involving the Encode module? Why isn't that simply​:

  open(my $fh\, "> :utf8"\, \my $out) || die $!​:

+ ok $string eq $out; +}

I absolutely gave up on this. It was too unreliable. Even if you are careful about decoding your string\, now and then (about 1 in 10) it gets double-encoded no matter what you do. It is not even deterministic in any fashion I can see to make work.

--tom

p5pRT commented 12 years ago

From @ikegami

On Sat\, Feb 4\, 2012 at 12​:10 PM\, David Leadbeater \<perlbug-followup@​perl.org

wrote​:

+# [perl #xxxx] +{ + local $TODO = "UTF-8 support"; + my $string = "\x{ffe}"; + open my $fh\, "> :encoding(UTF-8)"\, \(my $out); + ok $string eq $out; +}

Files can only contain bytes. This makes no sense to me.

- Eric

p5pRT commented 12 years ago

From @ikegami

On Sat\, Feb 4\, 2012 at 5​:49 PM\, Eric Brine \ikegami@&#8203;adaelis\.com wrote​:

On Sat\, Feb 4\, 2012 at 12​:10 PM\, David Leadbeater \< perlbug-followup@​perl.org> wrote​:

+# [perl #xxxx] +{ + local $TODO = "UTF-8 support"; + my $string = "\x{ffe}"; + open my $fh\, "> :encoding(UTF-8)"\, \(my $out); + ok $string eq $out; +}

Files can only contain bytes. This makes no sense to me.

... especially since you specially ask for encode whatever you print. encode "UTF-8" cannot possibly produce something that contains 0xFFE.

And your patch is buggy​: You forgot to actually print to $fh.

p5pRT commented 12 years ago

From @xdg

On Sat\, Feb 4\, 2012 at 12​:10 PM\, David Leadbeater \perlbug\-followup@&#8203;perl\.org wrote​:

If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.

I think that one should expect PerlIO​::scalar to provide a black box -- it's an in-memory substitution for bytes on disk with no associated encoding\, just like a file on disk has no associated encoding.

If the referenced string already has the utf8 flag set\, I think it's sufficient to warn rather than try to guess the correct behavior.

David

p5pRT commented 12 years ago

From @tux

On Sat\, 4 Feb 2012 18​:55​:27 +0100\, Leon Timmermans \fawaka@&#8203;gmail\.com wrote​:

On Sat\, Feb 4\, 2012 at 6​:10 PM\, David Leadbeater \perlbug\-followup@&#8203;perl\.org wrote​:

If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.

I think this makes sense for output\, although there may be other ramifications.

Here's a todo test​:

diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t index a02107b..59b65ad 100644 --- a/ext/PerlIO-scalar/t/scalar.t +++ b/ext/PerlIO-scalar/t/scalar.t @​@​ -16\,7 +16\,7 @​@​ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0\, 1\, 2 everywhere.

 $| = 1;

-use Test​::More tests => 79; +use Test​::More tests => 80;

 my $fh;  my $var = "aaa\n"; @​@​ -360\,3 +360\,11 @​@​ SKIP​: {     ok has_trailing_nul $memfile\,         'write appends null when growing string after seek past end';  } + +# [perl #xxxx] +{ +  local $TODO = "UTF-8 support"; +  my $string = "\x{ffe}"; +  open my $fh\, "> :encoding(UTF-8)"\, \(my $out); +  ok $string eq $out; +}

PerlIO does bytes\, always. It's utf8 support is literally a one bit flag that promises the bytes will be validly encoded utf8. There's no easy way for lower layers to know what the upper layers do with regard utf8. Nor am I sure that really should tinkle down.

The other direction would seem to be more important. When opening a utf8 scalar\, it should automatically be a utf8 handle. Anything else is plain buggy and potentially dangerous.

including pragma's?

use open OUT => "encoding(utf16)"; open my $fh\, ">"\, \my $x; print { $fh } "The \x{20ac} is \x{a71c} again}\n"; close $fh;

-- H.Merijn Brand http​://tux.nl Perl Monger http​://amsterdam.pm.org/ using perl5.00307 .. 5.14 porting perl5 on HP-UX\, AIX\, and openSUSE http​://mirrors.develooper.com/hpux/ http​://www.test-smoke.org/ http​://qa.perl.org http​://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented 12 years ago

From @nwc10

On Sat\, Feb 04\, 2012 at 08​:12​:30PM -0500\, David Golden wrote​:

On Sat\, Feb 4\, 2012 at 12​:10 PM\, David Leadbeater \perlbug\-followup@&#8203;perl\.org wrote​:

If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.

I think that one should expect PerlIO​::scalar to provide a black box -- it's an in-memory substitution for bytes on disk with no associated encoding\, just like a file on disk has no associated encoding.

If the referenced string already has the utf8 flag set\, I think it's sufficient to warn rather than try to guess the correct behavior.

Whoa. I don't think you mean "has the utf8 flag set". That's how 5.6.0 would have exposed it. SvUTF8() shouldn't be visible as a proxy for "characters vs bytes" (yes\, I know there are still holes in that).

I *think* it needs to be strictly bytes-only (just like any real file handle) and refuse to open an existing string that doesn't meet that constraint. (With the inevitable ambiguity that if you only shove characters in the range 0-255 into your string\, you're not going to realise that your code is buggy.)

Nicholas Clark

p5pRT commented 12 years ago

From @nwc10

On Sat\, Feb 04\, 2012 at 06​:55​:27PM +0100\, Leon Timmermans wrote​:

On Sat\, Feb 4\, 2012 at 6​:10 PM\, David Leadbeater \perlbug\-followup@&#8203;perl\.org wrote​:

If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.

I think this makes sense for output\, although there may be other ramifications.

PerlIO does bytes\, always. It's utf8 support is literally a one bit flag that promises the bytes will be validly encoded utf8. There's no easy way for lower layers to know what the upper layers do with regard utf8. Nor am I sure that really should tinkle down.

The other direction would seem to be more important. When opening a utf8 scalar\, it should automatically be a utf8 handle. Anything else is plain buggy and potentially dangerous.

No\, as I replied elsewhere\, I think it should refuse to open any scalar that isn't bytes.

Or\, at least\, the user's code needs to be different to say "I want to open a byte buffer as if it's a file handle" and "I'm expecting characters here". That way allows symmetry between opening for reading and opening for writing.

Having open for reading have some sort of "did they mean characters or bytes? I'll guess for them" ends up with the same mess that unpack is in\, whereby it's a runtime decision based *implicitly* on the *parameters* as to whether it's doing a bytes => characters conversion or a characters => characters mapping. Sure\, it's not as *bad* as unpack\, which can attempt to do both in the same statement\, but trying to have open "DWIM" is in the same ball-park of design misfeature.

Nicholas Clark

p5pRT commented 12 years ago

From @Leont

On Mon\, Feb 6\, 2012 at 12​:05 PM\, Nicholas Clark \nick@&#8203;ccl4\.org wrote​:

No\, as I replied elsewhere\, I think it should refuse to open any scalar that isn't bytes.

Or\, at least\, the user's code needs to be different to say "I want to open a byte buffer as if it's a file handle" and "I'm expecting characters here". That way allows symmetry between opening for reading and opening for writing.

Having open for reading have some sort of "did they mean characters or bytes? I'll guess for them" ends up with the same mess that unpack is in\, whereby it's a runtime decision based *implicitly* on the *parameters* as to whether it's doing a bytes => characters conversion or a characters => characters mapping. Sure\, it's not as *bad* as unpack\, which can attempt to do both in the same statement\, but trying to have open "DWIM" is in the same ball-park of design misfeature.

Yeah\, that is a good point\, how about making things explicit? E.G «open my $fh\, '+\<​:scalar(utf8)'\, \my $scalar». I suspect the current PerlIO/PerlIO​::scalar can't easily support that though.

Leon

p5pRT commented 12 years ago

From @xdg

On Mon\, Feb 6\, 2012 at 9​:17 AM\, Leon Timmermans \fawaka@&#8203;gmail\.com wrote​:

Yeah\, that is a good point\, how about making things explicit? E.G «open my $fh\, '+\<​:scalar(utf8)'\, \my $scalar». I suspect the current PerlIO/PerlIO​::scalar can't easily support that though.

Isn't that just C\<open my $fh\, "+\<​:utf8"\, \my $scalar>?

If you *know* that you have UTF-8 characters in a string\, it's not different than knowing you have UTF-8 characters in a disk file. The *user* needs to be clear what they expect the bytes to be.

Or is the question about what Perl should do about returning bytes from a string that coincidentally happens to be a character string? I.e. how should Perl mimic an on-disk file using its internal string data structure?

Assume that Perl's internal character encoding is a black box. Maybe it's UTF-8\, maybe not (maybe it changes in the future). Whatever. It's an internal implementation detail and nothing external should rely on it.

Then when something wants to use that string as a source of bytes\, should Perl (a) just dump out whatever bytes it uses internally for its implementation? Or (b) should it convert the internal representation to some standard representation? Or (c) should it blow up?

I don't like (a) or (c). (b) is tempting. (Coincidentally\, it's easy\, since the internal encoding is utf8.) My naive inclination is to amend the documentation to clarify that the bytes returned are either raw bytes or utf8 encoded if the string already contains characters. And then I'd *still* leave it up to the user to know what's in the "file" (i.e. string) and set the correct encoding layer on it\, just as if they were using a disk file.

-- David

p5pRT commented 12 years ago

From @nwc10

On Mon\, Feb 06\, 2012 at 10​:18​:39AM -0500\, David Golden wrote​:

Or is the question about what Perl should do about returning bytes from a string that coincidentally happens to be a character string? I.e. how should Perl mimic an on-disk file using its internal string data structure?

That was what I thought the question was.

Assume that Perl's internal character encoding is a black box. Maybe it's UTF-8\, maybe not (maybe it changes in the future). Whatever. It's an internal implementation detail and nothing external should rely on it.

Agree

Then when something wants to use that string as a source of bytes\, should Perl (a) just dump out whatever bytes it uses internally for its implementation? Or (b) should it convert the internal representation to some standard representation? Or (c) should it blow up?

I don't like (a) or (c). (b) is tempting. (Coincidentally\, it's easy\, since the internal encoding is utf8.) My naive inclination is to amend the documentation to clarify that the bytes returned are either raw bytes or utf8 encoded if the string already contains characters. And then I'd *still* leave it up to the user to know

How do you know that the string contains characters?

what's in the "file" (i.e. string) and set the correct encoding layer on it\, just as if they were using a disk file.

Nicholas Clark

p5pRT commented 12 years ago

From @xdg

On Mon\, Feb 6\, 2012 at 10​:24 AM\, Nicholas Clark \nick@&#8203;ccl4\.org wrote​:

I don't like (a) or (c).  (b) is tempting.  (Coincidentally\, it's easy\, since the internal encoding is utf8.)  My naive inclination is to amend the documentation to clarify that the bytes returned are either raw bytes or utf8 encoded if the string already contains characters.  And then I'd *still* leave it up to the user to know

How do you know that the string contains characters?

Which "you" do you mean? The user? How does a user know that *any* file contains characters? Generally\, by knowing what was written there originally or by analyzing the file in some way to guess an encoding\, I'd think. (E.g. read it as bytes and then use Encode​::Guess?)

None of that is the interpreter's concern.

-- David

p5pRT commented 12 years ago

From @Leont

On Mon\, Feb 6\, 2012 at 4​:18 PM\, David Golden \xdaveg@&#8203;gmail\.com wrote​:

Isn't that just C\<open my $fh\, "+\<​:utf8"\, \my $scalar>?

Not at all. :utf8 means «assume the bytestream is utf8 encoded». It does not mean «store as characters» (though doing the latter without the former doesn't make sense).

Or is the question about what Perl should do about returning bytes from a string that coincidentally happens to be a character string? I.e. how should Perl mimic an on-disk file using its internal string data structure?

Yeah\, pretty much.

Then when something wants to use that string as a source of bytes\, should Perl (a) just dump out whatever bytes it uses internally for its implementation?  Or (b) should it convert the internal representation to some standard representation?  Or (c) should it blow up?

(a) Is what we're doing right now\, and I think it's just plain wrong\, and possibly dangerous. (b) Maybe\, but for reasons Nicholas explained guesswork may be rather suboptimal (c) Is sane\, unlike (a) and some versions of (b).

Leon

p5pRT commented 12 years ago

From @nwc10

On Mon\, Feb 06\, 2012 at 11​:02​:36AM -0500\, David Golden wrote​:

On Mon\, Feb 6\, 2012 at 10​:24 AM\, Nicholas Clark \nick@&#8203;ccl4\.org wrote​:

I don't like (a) or (c).  (b) is tempting.  (Coincidentally\, it's easy\, since the internal encoding is utf8.)  My naive inclination is to amend the documentation to clarify that the bytes returned are either raw bytes or utf8 encoded if the string already contains characters.  And then I'd *still* leave it up to the user to know

How do you know that the string contains characters?

Which "you" do you mean? The user? How does a user know that *any* file contains characters? Generally\, by knowing what was written there originally or by analyzing the file in some way to guess an encoding\, I'd think. (E.g. read it as bytes and then use Encode​::Guess?)

None of that is the interpreter's concern.

OK\, which means that the interpreter can't *do* option (b) (or (a) for that matter)​:

On Mon\, Feb 06\, 2012 at 03​:24​:04PM +0000\, Nicholas Clark wrote​:

On Mon\, Feb 06\, 2012 at 10​:18​:39AM -0500\, David Golden wrote​:

Or is the question about what Perl should do about returning bytes from a string that coincidentally happens to be a character string? I.e. how should Perl mimic an on-disk file using its internal string data structure?

Then when something wants to use that string as a source of bytes\, should Perl (a) just dump out whatever bytes it uses internally for its implementation? Or (b) should it convert the internal representation to some standard representation? Or (c) should it blow up?

because you've just stated that the interpreter can't make a determination as to whether a string contains characters or bytes (for the ambiguous case of a string containing one or more code points in the range 128-255\, but no code points outside the range 0-255)

Nicholas Clark

p5pRT commented 12 years ago

From @ikegami

On Mon\, Feb 6\, 2012 at 10​:18 AM\, David Golden \xdaveg@&#8203;gmail\.com wrote​:

On Mon\, Feb 6\, 2012 at 9​:17 AM\, Leon Timmermans \fawaka@&#8203;gmail\.com wrote​:

Yeah\, that is a good point\, how about making things explicit? E.G «open my $fh\, '+\<​:scalar(utf8)'\, \my $scalar». I suspect the current PerlIO/PerlIO​::scalar can't easily support that though.

Isn't that just C\<open my $fh\, "+\<​:utf8"\, \my $scalar>?

No\, that means "decode the input on read". The question is about a buffer that contains decoded data\, so what's needed is a layer or some such that indicates "the underlying data is already decoded". That's his intent for :scalar(utf8).

p5pRT commented 12 years ago

From @xdg

On Mon\, Feb 6\, 2012 at 11​:09 AM\, Nicholas Clark \nick@&#8203;ccl4\.org wrote​:

because you've just stated that the interpreter can't make a determination as to whether a string contains characters or bytes (for the ambiguous case of a string containing one or more code points in the range 128-255\, but no code points outside the range 0-255)

You're right. I was being imprecise. I think if the string contains no wide characters\, it should be "read" by PerlIO​::scalar as bytes. If the string does contain wide characters\, PerlIO​::scalar should either fail or should encode them in some "standard" way and return them as bytes in encoded form.

The whole idea is to provide an in-memory abstraction of a *file*\, which means returning a sequence of bytes.

David

p5pRT commented 12 years ago

From @cpansprout

On Mon Feb 06 07​:19​:37 2012\, xdaveg@​gmail.com wrote​:

Then when something wants to use that string as a source of bytes\, should Perl (a) just dump out whatever bytes it uses internally for its implementation? Or (b) should it convert the internal representation to some standard representation? Or (c) should it blow up?

(a) is what Perl currently does\, as Leon Timmerman said.

By (b) I presume you mean to treat \xff as \xff regardless of how it is stored internally\, which makes sense.

But what happens if I open a reading handle to a scalar containing \x{100}? Here we have a choice between (b) and (c).

An in-memory scalar could be considered a byte stream. Or it could just be considered a string of characters.

The latter does make some sense. If I print \xff to an in-memory file with no layers applied\, I simply get \xff in my scalar. So if I print \x{100}\, it would make sense to get \x{100} in my scalar\, no? But if the scalar is considered byte-sized\, I should get \x{100} utf8-encoded\, accompanied by a wide character warning; and reading a scalar with \x{100} would croak.

That it is currently buggy is not being questioned. But which model should be followed in fixing it is debatable. Would it be reasonable to implement the byte-sized version for now and upgrade it later?

--

Father Chrysostomos

p5pRT commented 12 years ago

From @xdg

On Sun\, Feb 12\, 2012 at 5​:02 PM\, Father Chrysostomos via RT \perlbug\-followup@&#8203;perl\.org wrote​:

On Mon Feb 06 07​:19​:37 2012\, xdaveg@​gmail.com wrote​:

Then when something wants to use that string as a source of bytes\, should Perl (a) just dump out whatever bytes it uses internally for its implementation?  Or (b) should it convert the internal representation to some standard representation?  Or (c) should it blow up?

(a) is what Perl currently does\, as Leon Timmerman said.

By (b) I presume you mean to treat \xff as \xff regardless of how it is stored internally\, which makes sense.

Sort of. What I meant is that (a) is "whatever we do" and (b) is "a specific encoding". Those are likely to be similar\, but one is vague and mutable and the other specific and fixed. Such a promise would persist under the usual back-compatibility rules even if we changed the internal representation in the future for some reason. It could also mean that we could choose give UTF-8 and not "utf8" (i.e. lax\, internal encoding) -- and would croak if we can't translate from the internal to UTF-8.

For example\, for a string with wide characters used as in in-memory file\, we could promise to translate from the internal encoding to UTF-8 when the handle is read. That would make it resemble a disk file encoded in UTF-8\, requiring the "​:encoding(UTF-8)" flag and so on. Thus some function that is passed a handle to read shouldn't know or care whether it's an in memory string or an on-disk file -- though the *programmer* would need to know what encoding they expect to receive given their particular application.

An in-memory scalar could be considered a byte stream.  Or it could just be considered a string of characters.

My bias is strongly that it should be a byte-stream\, which is why I'm only considering how we choose to take a string of (wide) characters and make it into a byte stream in some standard way​: (a) "whatever" (b) "a promise" and (c) "boom!"

-- David

p5pRT commented 12 years ago

From @ap

* Nicholas Clark \nick@&#8203;ccl4\.org [2012-02-06 12​:00]​:

On Sat\, Feb 04\, 2012 at 08​:12​:30PM -0500\, David Golden wrote​:

If the referenced string already has the utf8 flag set\, I think it's sufficient to warn rather than try to guess the correct behavior.

Whoa. I don't think you mean "has the utf8 flag set". That's how 5.6.0 would have exposed it. SvUTF8() shouldn't be visible as a proxy for "characters vs bytes" (yes\, I know there are still holes in that).

This. Thank you. I was despairing as I read the thread\, waiting for someone to interject with it.

As far as the user is concerned\, there is never to be any difference between a string with UTF8 on vs a string with UTF8 off as long as $utf8on eq $utf8off.

I *think* it needs to be strictly bytes-only (just like any real file handle) and refuse to open an existing string that doesn't meet that constraint. (With the inevitable ambiguity that if you only shove characters in the range 0-255 into your string\, you're not going to realise that your code is buggy.)

What it should do on input is treat each character as a byte\, throwing an error if there are any characters > 0xFF in the string\, i.e. the moral equivalent of downgrading the input string and croaking if that fails. That’s it.

Regards\, -- Aristotle Pagaltzis // \<http​://plasmasturm.org/>

p5pRT commented 12 years ago

From @tux

On Mon\, 6 Feb 2012 10​:58​:02 +0000\, Nicholas Clark \nick@&#8203;ccl4\.org wrote​:

On Sat\, Feb 04\, 2012 at 08​:12​:30PM -0500\, David Golden wrote​:

On Sat\, Feb 4\, 2012 at 12​:10 PM\, David Leadbeater \perlbug\-followup@&#8203;perl\.org wrote​:

If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.

I think that one should expect PerlIO​::scalar to provide a black box -- it's an in-memory substitution for bytes on disk with no associated encoding\, just like a file on disk has no associated encoding.

If the referenced string already has the utf8 flag set\, I think it's sufficient to warn rather than try to guess the correct behavior.

Whoa. I don't think you mean "has the utf8 flag set". That's how 5.6.0 would have exposed it. SvUTF8() shouldn't be visible as a proxy for "characters vs bytes" (yes\, I know there are still holes in that).

I *think* it needs to be strictly bytes-only (just like any real file handle) and refuse to open an existing string that doesn't meet that constraint. (With the inevitable ambiguity that if you only shove characters in the range 0-255 into your string\, you're not going to realise that your code is buggy.)

Nicholas Clark

Personally\, I see no harm in doing a decode on close when opened for writing as utf-8

--8\<--- use v5.12; use warnings;

binmode STDOUT\, "​:utf8";

my $data = "";

{ open my $fh\, ">​:encoding(utf-8)"\, \$data;   print { $fh } "\x{20ac}\n";   close $fh;   }

{ open my $fh\, "\<​:encoding(utf-8)"\, \$data;   print \<$fh>;   close $fh;   }

print $data; utf8​::decode ($data); print $data;

{ open my $fh\, "\<​:encoding(utf-8)"\, \$data;   print \<$fh>;   close $fh;   }

{ use open OUT => "​:encoding(utf-8)";   open my $fh\, ">"\, \$data;   print { $fh } "\x{20ac}\n";   close $fh;   }

{ use open IN => "​:encoding(utf-8)";   open my $fh\, "\<"\, \$data;   print \<$fh>;   close $fh;   }

print $data; utf8​::decode ($data); print $data;

{ use open IN => "​:encoding(utf-8)";   open my $fh\, "\<"\, \$data;   print \<$fh>;   close $fh;   } -->8---

$ perl test.pl € ⬠€ € € € € € €

-- H.Merijn Brand http​://tux.nl Perl Monger http​://amsterdam.pm.org/ using perl5.00307 .. 5.14 porting perl5 on HP-UX\, AIX\, and openSUSE http​://mirrors.develooper.com/hpux/ http​://www.test-smoke.org/ http​://qa.perl.org http​://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented 12 years ago

From @ikegami

On Sun\, Feb 12\, 2012 at 5​:02 PM\, Father Chrysostomos via RT \< perlbug-followup@​perl.org> wrote​:

That it is currently buggy is not being questioned.

And the following test will detect regressions once its fixed.

=====

use strict; use warnings;

use Test​::More tests => 1;

sub read_from_scalar {   my ($file\, $perlio) = @​_;   $perlio //= '';   open my $fh\, "\<$perlio"\, \$file or die $!;   local $/;   return \<$fh>; }

sub hexify { join ' '\, map sprintf('%02X'\, ord)\, split //\, $_[0] }

{   my $s = chr(0xE9);   utf8​::upgrade( my $u = $s );   utf8​::downgrade( my $d = $s );   is( hexify(read_from_scalar($u))\, hexify(read_from_scalar($d))\, 'Unicode bug in :scalar read' ); }

1;

p5pRT commented 11 years ago

From @ribasushi

Is there any word on this issue? I just hit this bug in reverse[1] and while there is ample discussion about it being a problem I see the same behavior under current blead. Is there a chance *at least* for a warning to be added so that it lands in 5.18?

Cheers

[1] http​://www.perlmonks.org/?node_id=1010601

p5pRT commented 11 years ago

From @Leont

On Fri\, Dec 28\, 2012 at 8​:34 AM\, Peter Rabbitson \rabbit\+p5p@&#8203;rabbit\.us wrote​:

Is there any word on this issue? I just hit this bug in reverse[1] and while there is ample discussion about it being a problem I see the same behavior under current blead. Is there a chance *at least* for a warning to be added so that it lands in 5.18?

The process kind of fizzled somewhere. I'm in favor of a warning in 5.18\, see attachment.

Leon

p5pRT commented 11 years ago

From @Leont

0001-Warn-when-opening-utf8-string-into-handle.patch ```diff From d486949439c66bd1a6e76af94468c374a58590f8 Mon Sep 17 00:00:00 2001 From: Leon Timmermans Date: Fri, 28 Dec 2012 16:41:53 +0100 Subject: [PATCH] Warn when opening utf8 string into handle --- ext/PerlIO-scalar/scalar.pm | 2 +- ext/PerlIO-scalar/scalar.xs | 2 ++ ext/PerlIO-scalar/t/scalar.t | 11 ++++++++++- 3 files changed, 13 insertions(+), 2 deletions(-) diff --git a/ext/PerlIO-scalar/scalar.pm b/ext/PerlIO-scalar/scalar.pm index 813f5e6..64ecc22 100644 --- a/ext/PerlIO-scalar/scalar.pm +++ b/ext/PerlIO-scalar/scalar.pm @@ -1,5 +1,5 @@ package PerlIO::scalar; -our $VERSION = '0.15'; +our $VERSION = '0.16'; require XSLoader; XSLoader::load(); 1; diff --git a/ext/PerlIO-scalar/scalar.xs b/ext/PerlIO-scalar/scalar.xs index d7b8828..48dbd32 100644 --- a/ext/PerlIO-scalar/scalar.xs +++ b/ext/PerlIO-scalar/scalar.xs @@ -41,6 +41,8 @@ PerlIOScalar_pushed(pTHX_ PerlIO * f, const char *mode, SV * arg, SvREFCNT_inc(perl_get_sv (SvPV_nolen(arg), GV_ADD | GV_ADDMULTI)); } + if (SvUTF8(s->var) && ckWARN(WARN_UTF8)) + Perl_warner(aTHX_ packWARN(WARN_UTF8), "Should only map byte strings into in-memory filehandles\n"); } else { s->var = newSVpvn("", 0); diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t index d255a05..d96e2db 100644 --- a/ext/PerlIO-scalar/t/scalar.t +++ b/ext/PerlIO-scalar/t/scalar.t @@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere. $| = 1; -use Test::More tests => 82; +use Test::More tests => 83; my $fh; my $var = "aaa\n"; @@ -384,3 +384,12 @@ SKIP: { close FILE; is $content, "Foo-Bar\n", 'duping via >&='; } + +{ + use warnings; + my $content = "\x{100}"; + my @warnings; + local $SIG{__WARN__} = sub { push @warnings, $_[0] }; + open my $fh, '<', \$content; + is($warnings[0], "Should only map byte strings into in-memory filehandles\n", 'Trying to open a character string warns'); +} -- 1.7.6.1 ```
p5pRT commented 11 years ago

From @tonycoz

On Fri Dec 28 07​:45​:25 2012\, LeonT wrote​:

On Fri\, Dec 28\, 2012 at 8​:34 AM\, Peter Rabbitson \rabbit\+p5p@&#8203;rabbit\.us wrote​:

Is there any word on this issue? I just hit this bug in reverse[1] and while there is ample discussion about it being a problem I see the same behavior under current blead. Is there a chance *at least* for a warning to be added so that it lands in 5.18?

The process kind of fizzled somewhere. I'm in favor of a warning in 5.18\, see attachment.

It should fail to open. If you open a UTF8 flagged string for append and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 string.

Your patch as written ignores the principle that the SvUTF8() flag only controls the internal encoding\, not other behaviour. If the SV contains only code point 0xFF or lower we should downgrade it and work with that rather than failing (or producing a warning).

This should also be done for _read() and _write()\, since the SV can be modified between I/O operations.

There's an unrelated problem that _pushed() checks flags on both arg and SvRV(arg) without calling SvGETMAGIC().

I'll take a look at these issues when I get home.

Tony

p5pRT commented 11 years ago

From @Leont

On Fri\, Dec 28\, 2012 at 11​:06 PM\, Tony Cook via RT \perlbug\-followup@&#8203;perl\.org wrote​:

It should fail to open. If you open a UTF8 flagged string for append and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 string.

Your patch as written ignores the principle that the SvUTF8() flag only controls the internal encoding\, not other behaviour. If the SV contains only code point 0xFF or lower we should downgrade it and work with that rather than failing (or producing a warning).

I didn't see enough consensus to change it that much\, but I would be in favor.

This should also be done for _read() and _write()\, since the SV can be modified between I/O operations.

There's an unrelated problem that _pushed() checks flags on both arg and SvRV(arg) without calling SvGETMAGIC().

It should just stop peeking and poking into the SV altogether\, and use the proper APIs (sv_insert and friends). For that matter\, I sometimes feel like it should be rewritten from scratch to actually make sense. Pretty much all of it is problematic.

Leon

p5pRT commented 11 years ago

From @ribasushi

On Fri\, Dec 28\, 2012 at 11​:16​:36PM +0100\, Leon Timmermans wrote​:

On Fri\, Dec 28\, 2012 at 11​:06 PM\, Tony Cook via RT \perlbug\-followup@&#8203;perl\.org wrote​:

It should fail to open. If you open a UTF8 flagged string for append and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 string.

Your patch as written ignores the principle that the SvUTF8() flag only controls the internal encoding\, not other behaviour. If the SV contains only code point 0xFF or lower we should downgrade it and work with that rather than failing (or producing a warning).

I didn't see enough consensus to change it that much\, but I would be in favor.

This should also be done for _read() and _write()\, since the SV can be modified between I/O operations.

There's an unrelated problem that _pushed() checks flags on both arg and SvRV(arg) without calling SvGETMAGIC().

It should just stop peeking and poking into the SV altogether\, and use the proper APIs (sv_insert and friends). For that matter\, I sometimes feel like it should be rewritten from scratch to actually make sense. Pretty much all of it is problematic.

This particular bit risks derailing the simpler yet more urgent bugfix. Focuse please ;)

Cheers

p5pRT commented 11 years ago

From @tonycoz

On Fri\, Dec 28\, 2012 at 11​:16​:36PM +0100\, Leon Timmermans wrote​:

On Fri\, Dec 28\, 2012 at 11​:06 PM\, Tony Cook via RT \perlbug\-followup@&#8203;perl\.org wrote​:

It should fail to open. If you open a UTF8 flagged string for append and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 string.

Your patch as written ignores the principle that the SvUTF8() flag only controls the internal encoding\, not other behaviour. If the SV contains only code point 0xFF or lower we should downgrade it and work with that rather than failing (or producing a warning).

I didn't see enough consensus to change it that much\, but I would be in favor.

This should also be done for _read() and _write()\, since the SV can be modified between I/O operations.

There's an unrelated problem that _pushed() checks flags on both arg and SvRV(arg) without calling SvGETMAGIC().

It should just stop peeking and poking into the SV altogether\, and use the proper APIs (sv_insert and friends). For that matter\, I sometimes feel like it should be rewritten from scratch to actually make sense. Pretty much all of it is problematic.

I've attached my suggested changes (in several parts)\, also available on perl5.git.perl.org/perl.git as tonyc/perlio-scalar-sanity.

Reasons for failing instead of warning​:

1) reading - to follow the "SVf_UTF8 is only representation" principle\, we'd need to download where possible\, so a \xA1 (for example) in the stream is always treated as that byte\, but this means we have an inconsistency when the scalar cannot be downgraded - the first bytes of the character sequences "\xA1\x40" and "\xA1\x{101}" would be different.

2) writing - if the SV is flagged UTF8\, and the user of the handle doesn't write correct UTF8 data at the correct offsets\, the SV will no longer be properly formed utf-8\, which I believe we're trying to maintain. One of my tests produced a warning about invalid UTF-8 during before the fix was applied.

It's possible could be avoided if we always treat the written bytes as code points and upgrade them when writing to a UTF8 string\, but then we run into a consitency issue vs reading - what happens when a read on a UTF8 string reaches a code point > 0xFF?

As written I think the warning message could be improved and the documentation of the warning could be improved.

Tony

p5pRT commented 11 years ago

From @tonycoz

0001-TODO-tests-for-opening-upgraded-scalars.patch ```diff From 579291c61c3a4a52c42bf792c31190fa99b95c9d Mon Sep 17 00:00:00 2001 From: Tony Cook Date: Mon, 31 Dec 2012 12:12:17 +1100 Subject: [PATCH 1/8] TODO tests for opening upgraded scalars --- ext/PerlIO-scalar/t/scalar.t | 37 ++++++++++++++++++++++++++++++++++++- 1 file changed, 36 insertions(+), 1 deletion(-) diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t index d255a05..aa28035 100644 --- a/ext/PerlIO-scalar/t/scalar.t +++ b/ext/PerlIO-scalar/t/scalar.t @@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere. $| = 1; -use Test::More tests => 82; +use Test::More tests => 92; my $fh; my $var = "aaa\n"; @@ -384,3 +384,38 @@ SKIP: { close FILE; is $content, "Foo-Bar\n", 'duping via >&='; } + +# [perl #109828] PerlIO::scalar does not handle UTF-8 +{ + use Errno qw(EINVAL); + my $todo = "open doesn't know about UTf-8 scalars"; + local $TODO = $todo; + my @warnings; + local $SIG{__WARN__} = sub { push @warnings, "@_" }; + my $content = "12\x{101}"; + $! = 0; + ok(!open(my $fh, "<", \$content), "non-byte open should fail"); + is(0+$!, EINVAL, "check \$! is updated"); + undef $TODO; + is_deeply(\@warnings, [], "should be no warnings (yet)"); + use warnings "utf8"; + $TODO = $todo; + $! = 0; + ok(!open(my $fh, "<", \$content), "non byte open should fail (and warn)"); + is(0+$!, EINVAL, "check \$! is updated even when we warn"); + $TODO = $todo; + my $warning = "Only byte strings can be mapped into in-memory filehandles\n"; + is_deeply(\@warnings, [ $warning ], "should have warned"); + @warnings = (); + $content = "12\xA1"; + utf8::upgrade($content); + undef $TODO; + ok(open(my $fh, "<", \$content), "open upgraded scalar"); + $TODO = $todo; + my $tmp; + is(read($fh, $tmp, 4), 3, "read should get the downgraded bytes"); + is($tmp, "12\xA1", "check we got the expected bytes"); + close $fh; + undef $TODO; + is_deeply(\@warnings, [], "should be no more warnings"); +} -- 1.7.10.4 ```
p5pRT commented 11 years ago

From @tonycoz

0002-fail-to-open-scalars-containing-characters-that-don-.patch ```diff From 2e4557d3e35c1c8fc8e2d43b1864a153dcc27023 Mon Sep 17 00:00:00 2001 From: Tony Cook Date: Mon, 31 Dec 2012 12:15:10 +1100 Subject: [PATCH 2/8] fail to open scalars containing characters that don't fit in a byte --- ext/PerlIO-scalar/scalar.xs | 8 ++++++++ ext/PerlIO-scalar/t/scalar.t | 8 -------- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/ext/PerlIO-scalar/scalar.xs b/ext/PerlIO-scalar/scalar.xs index d7b8828..f54a655 100644 --- a/ext/PerlIO-scalar/scalar.xs +++ b/ext/PerlIO-scalar/scalar.xs @@ -52,6 +52,14 @@ PerlIOScalar_pushed(pTHX_ PerlIO * f, const char *mode, SV * arg, sv_force_normal(s->var); SvCUR_set(s->var, 0); } + if (SvUTF8(s->var) && !sv_utf8_downgrade(s->var, TRUE)) { + if (ckWARN(WARN_UTF8)) + Perl_warner(aTHX_ packWARN(WARN_UTF8), "Only byte strings can be mapped into in-memory filehandles\n"); + SETERRNO(EINVAL, SS_IVCHAN); + SvREFCNT_dec(s->var); + s->var = Nullsv; + return -1; + } if ((PerlIOBase(f)->flags) & PERLIO_F_APPEND) { sv_force_normal(s->var); diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t index aa28035..888537f 100644 --- a/ext/PerlIO-scalar/t/scalar.t +++ b/ext/PerlIO-scalar/t/scalar.t @@ -388,34 +388,26 @@ SKIP: { # [perl #109828] PerlIO::scalar does not handle UTF-8 { use Errno qw(EINVAL); - my $todo = "open doesn't know about UTf-8 scalars"; - local $TODO = $todo; my @warnings; local $SIG{__WARN__} = sub { push @warnings, "@_" }; my $content = "12\x{101}"; $! = 0; ok(!open(my $fh, "<", \$content), "non-byte open should fail"); is(0+$!, EINVAL, "check \$! is updated"); - undef $TODO; is_deeply(\@warnings, [], "should be no warnings (yet)"); use warnings "utf8"; - $TODO = $todo; $! = 0; ok(!open(my $fh, "<", \$content), "non byte open should fail (and warn)"); is(0+$!, EINVAL, "check \$! is updated even when we warn"); - $TODO = $todo; my $warning = "Only byte strings can be mapped into in-memory filehandles\n"; is_deeply(\@warnings, [ $warning ], "should have warned"); @warnings = (); $content = "12\xA1"; utf8::upgrade($content); - undef $TODO; ok(open(my $fh, "<", \$content), "open upgraded scalar"); - $TODO = $todo; my $tmp; is(read($fh, $tmp, 4), 3, "read should get the downgraded bytes"); is($tmp, "12\xA1", "check we got the expected bytes"); close $fh; - undef $TODO; is_deeply(\@warnings, [], "should be no more warnings"); } -- 1.7.10.4 ```
p5pRT commented 11 years ago

From @tonycoz

0003-document-the-new-warning.patch ```diff From caf1bb365bb0fe2fc55b4693ce4a50fbb3706a17 Mon Sep 17 00:00:00 2001 From: Tony Cook Date: Mon, 31 Dec 2012 12:03:53 +1100 Subject: [PATCH 3/8] document the new warning --- pod/perldiag.pod | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/pod/perldiag.pod b/pod/perldiag.pod index 600436f..1bdcd69 100644 --- a/pod/perldiag.pod +++ b/pod/perldiag.pod @@ -3396,6 +3396,12 @@ call, or call a constructor from the FileHandle package. (W unopened) You tried to invoke a file test operator on a filehandle that isn't open. Check your control flow. See also L. +=item Only byte strings can be mapped into in-memory filehandles + +(W utf8) You tried to open a reference to a scalar for read or append +where the scalar contained character codes over 0xFF. In-memory files +model on-disk files and can only contain bytes. + =item oops: oopsAV (S internal) An internal warning that the grammar is screwed up. -- 1.7.10.4 ```
p5pRT commented 11 years ago

From @tonycoz

0004-bump-PerlIO-scalar-s-version.patch ```diff From 83bf0f3062cb460fd1a76443fa35b5c0ae3e8026 Mon Sep 17 00:00:00 2001 From: Tony Cook Date: Mon, 31 Dec 2012 11:48:24 +1100 Subject: [PATCH 4/8] bump PerlIO::scalar's version --- ext/PerlIO-scalar/scalar.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ext/PerlIO-scalar/scalar.pm b/ext/PerlIO-scalar/scalar.pm index 813f5e6..64ecc22 100644 --- a/ext/PerlIO-scalar/scalar.pm +++ b/ext/PerlIO-scalar/scalar.pm @@ -1,5 +1,5 @@ package PerlIO::scalar; -our $VERSION = '0.15'; +our $VERSION = '0.16'; require XSLoader; XSLoader::load(); 1; -- 1.7.10.4 ```
p5pRT commented 11 years ago

From @tonycoz

0005-TODO-tests-for-reads-from-a-scalar-changed-to-upgrad.patch ```diff From 5dd403bd2ed4c451ebd5239bb4da6b735841b54b Mon Sep 17 00:00:00 2001 From: Tony Cook Date: Mon, 31 Dec 2012 13:33:02 +1100 Subject: [PATCH 5/8] TODO tests for reads from a scalar changed to upgraded after open --- ext/PerlIO-scalar/t/scalar.t | 35 +++++++++++++++++++++++++++++++++-- 1 file changed, 33 insertions(+), 2 deletions(-) diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t index 888537f..93210a3 100644 --- a/ext/PerlIO-scalar/t/scalar.t +++ b/ext/PerlIO-scalar/t/scalar.t @@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere. $| = 1; -use Test::More tests => 92; +use Test::More tests => 101; my $fh; my $var = "aaa\n"; @@ -386,6 +386,7 @@ SKIP: { } # [perl #109828] PerlIO::scalar does not handle UTF-8 +my $byte_warning = "Only byte strings can be mapped into in-memory filehandles\n"; { use Errno qw(EINVAL); my @warnings; @@ -400,7 +401,8 @@ SKIP: { ok(!open(my $fh, "<", \$content), "non byte open should fail (and warn)"); is(0+$!, EINVAL, "check \$! is updated even when we warn"); my $warning = "Only byte strings can be mapped into in-memory filehandles\n"; - is_deeply(\@warnings, [ $warning ], "should have warned"); + is_deeply(\@warnings, [ $byte_warning ], "should have warned"); + @warnings = (); $content = "12\xA1"; utf8::upgrade($content); @@ -411,3 +413,32 @@ SKIP: { close $fh; is_deeply(\@warnings, [], "should be no more warnings"); } +{ # changes after open + my $content = "abc"; + ok(open(my $fh, "<", \$content), "open a scalar"); + my $tmp; + is(read($fh, $tmp, 1), 1, "basic read"); + seek($fh, 1, SEEK_SET); + $content = "\xA1\xA2\xA3"; + utf8::upgrade($content); + is(read($fh, $tmp, 1), 1, "read from post-open upgraded scalar"); + local $TODO = "read doesn't handle a post open non-byte scalar"; + is($tmp, "\xA2", "check we read the correct value"); + seek($fh, 1, SEEK_SET); + $content = "\x{101}\x{102}\x{103}"; + + my @warnings; + local $SIG{__WARN__} = sub { push @warnings, "@_" }; + + $! = 0; + is(read($fh, $tmp, 1), undef, "read from scalar with >0xff chars"); + is(0+$!, EINVAL, "check errno set correctly"); + { + local $TODO; + is_deeply(\@warnings, [], "should be no warning (yet)"); + } + use warnings "utf8"; + seek($fh, 1, SEEK_SET); + is(read($fh, $tmp, 1), undef, "read from scalar with >0xff chars"); + is_deeply(\@warnings, [ $byte_warning ], "check warning"); +} -- 1.7.10.4 ```
p5pRT commented 11 years ago

From @tonycoz

0006-handle-reading-from-a-SVf_UTF8-scalar.patch ```diff From 08daffb492aad53c5a8706d11c9efd841966c514 Mon Sep 17 00:00:00 2001 From: Tony Cook Date: Mon, 31 Dec 2012 14:08:23 +1100 Subject: [PATCH 6/8] handle reading from a SVf_UTF8 scalar if the scalar can be downgradable, it is downgraded and the read succeeds. Otherwise the read fails, producing a warning if enabled and setting errno/$! to EINVAL. --- ext/PerlIO-scalar/scalar.xs | 11 +++++++++++ ext/PerlIO-scalar/t/scalar.t | 8 ++------ 2 files changed, 13 insertions(+), 6 deletions(-) diff --git a/ext/PerlIO-scalar/scalar.xs b/ext/PerlIO-scalar/scalar.xs index f54a655..3d70670 100644 --- a/ext/PerlIO-scalar/scalar.xs +++ b/ext/PerlIO-scalar/scalar.xs @@ -151,6 +151,17 @@ PerlIOScalar_read(pTHX_ PerlIO *f, void *vbuf, Size_t count) STRLEN len; I32 got; p = SvPV(sv, len); + if (SvUTF8(sv)) { + if (sv_utf8_downgrade(sv, TRUE)) { + p = SvPV_nomg(sv, len); + } + else { + if (ckWARN(WARN_UTF8)) + Perl_warner(aTHX_ packWARN(WARN_UTF8), "Only byte strings can be mapped into in-memory filehandles\n"); + SETERRNO(EINVAL, SS_IVCHAN); + return -1; + } + } got = len - (STRLEN)(s->posn); if (got <= 0) return 0; diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t index 93210a3..828a47f 100644 --- a/ext/PerlIO-scalar/t/scalar.t +++ b/ext/PerlIO-scalar/t/scalar.t @@ -415,14 +415,13 @@ my $byte_warning = "Only byte strings can be mapped into in-memory filehandles\n } { # changes after open my $content = "abc"; - ok(open(my $fh, "<", \$content), "open a scalar"); + ok(open(my $fh, "+<", \$content), "open a scalar"); my $tmp; is(read($fh, $tmp, 1), 1, "basic read"); seek($fh, 1, SEEK_SET); $content = "\xA1\xA2\xA3"; utf8::upgrade($content); is(read($fh, $tmp, 1), 1, "read from post-open upgraded scalar"); - local $TODO = "read doesn't handle a post open non-byte scalar"; is($tmp, "\xA2", "check we read the correct value"); seek($fh, 1, SEEK_SET); $content = "\x{101}\x{102}\x{103}"; @@ -433,10 +432,7 @@ my $byte_warning = "Only byte strings can be mapped into in-memory filehandles\n $! = 0; is(read($fh, $tmp, 1), undef, "read from scalar with >0xff chars"); is(0+$!, EINVAL, "check errno set correctly"); - { - local $TODO; - is_deeply(\@warnings, [], "should be no warning (yet)"); - } + is_deeply(\@warnings, [], "should be no warning (yet)"); use warnings "utf8"; seek($fh, 1, SEEK_SET); is(read($fh, $tmp, 1), undef, "read from scalar with >0xff chars"); -- 1.7.10.4 ```
p5pRT commented 11 years ago

From @tonycoz

0007-TODO-tests-for-writing-to-a-SVf_UTF8-scalar.patch ```diff From 114cba8e9a4c3c818a7718bada2b6be2f0202597 Mon Sep 17 00:00:00 2001 From: Tony Cook Date: Mon, 31 Dec 2012 14:52:24 +1100 Subject: [PATCH 7/8] TODO tests for writing to a SVf_UTF8 scalar --- ext/PerlIO-scalar/t/scalar.t | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t index 828a47f..e14bc60 100644 --- a/ext/PerlIO-scalar/t/scalar.t +++ b/ext/PerlIO-scalar/t/scalar.t @@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere. $| = 1; -use Test::More tests => 101; +use Test::More tests => 108; my $fh; my $var = "aaa\n"; @@ -437,4 +437,28 @@ my $byte_warning = "Only byte strings can be mapped into in-memory filehandles\n seek($fh, 1, SEEK_SET); is(read($fh, $tmp, 1), undef, "read from scalar with >0xff chars"); is_deeply(\@warnings, [ $byte_warning ], "check warning"); + + select $fh; # make sure print fails rather tha buffers + $| = 1; + select STDERR; + no warnings "utf8"; + @warnings = (); + $content = "\xA1\xA2\xA3"; + utf8::upgrade($content); + seek($fh, 1, SEEK_SET); + ok((print $fh "A"), "print to an upgraded byte string"); + seek($fh, 1, SEEK_SET); + local $TODO = "write to utf8 flagged strings is broken"; + is($content, "\xA1A\xA3", "check result"); + + $content = "\x{101}\x{102}\x{103}"; + $! = 0; + ok(!(print $fh "B"), "write to an non-downgradable SV"); + is(0+$!, EINVAL, "check errno set"); + + is_deeply(\@warnings, [], "should be no warning"); + + use warnings "utf8"; + ok(!(print $fh "B"), "write to an non-downgradable SV (and warn)"); + is_deeply(\@warnings, [ $byte_warning ], "check warning"); } -- 1.7.10.4 ```
p5pRT commented 11 years ago

From @tonycoz

0008-warn-and-fail-on-writes-to-SVf_UTF8-SVs.patch ```diff From 87f7629cc3cc0f6a3f1796984236f28ded092da3 Mon Sep 17 00:00:00 2001 From: Tony Cook Date: Mon, 31 Dec 2012 15:01:06 +1100 Subject: [PATCH 8/8] warn and fail on writes to SVf_UTF8 SVs --- ext/PerlIO-scalar/scalar.xs | 6 ++++++ ext/PerlIO-scalar/t/scalar.t | 1 - 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/ext/PerlIO-scalar/scalar.xs b/ext/PerlIO-scalar/scalar.xs index 3d70670..1f24864 100644 --- a/ext/PerlIO-scalar/scalar.xs +++ b/ext/PerlIO-scalar/scalar.xs @@ -184,6 +184,12 @@ PerlIOScalar_write(pTHX_ PerlIO * f, const void *vbuf, Size_t count) SvGETMAGIC(sv); if (!SvROK(sv)) sv_force_normal(sv); if (SvOK(sv)) SvPV_force_nomg_nolen(sv); + if (SvUTF8(sv) && !sv_utf8_downgrade(sv, TRUE)) { + if (ckWARN(WARN_UTF8)) + Perl_warner(aTHX_ packWARN(WARN_UTF8), "Only byte strings can be mapped into in-memory filehandles\n"); + SETERRNO(EINVAL, SS_IVCHAN); + return 0; + } if ((PerlIOBase(f)->flags) & PERLIO_F_APPEND) { dst = SvGROW(sv, SvCUR(sv) + count + 1); offset = SvCUR(sv); diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t index e14bc60..bc4c497 100644 --- a/ext/PerlIO-scalar/t/scalar.t +++ b/ext/PerlIO-scalar/t/scalar.t @@ -448,7 +448,6 @@ my $byte_warning = "Only byte strings can be mapped into in-memory filehandles\n seek($fh, 1, SEEK_SET); ok((print $fh "A"), "print to an upgraded byte string"); seek($fh, 1, SEEK_SET); - local $TODO = "write to utf8 flagged strings is broken"; is($content, "\xA1A\xA3", "check result"); $content = "\x{101}\x{102}\x{103}"; -- 1.7.10.4 ```
p5pRT commented 11 years ago

From @tonycoz

On Mon\, Dec 31\, 2012 at 07​:00​:45PM +1100\, Tony Cook wrote​:

1) reading - to follow the "SVf_UTF8 is only representation" principle\, we'd need to *download* where possible\, so a \xA1 (for

Urr\, downgrade.

As written I think the warning message could be improved and the documentation of the warning could be improved.

Suggestions welcome.

Tony

p5pRT commented 11 years ago

From @khwilliamson

On 12/31/2012 01​:00 AM\, Tony Cook wrote​:

On Fri\, Dec 28\, 2012 at 11​:16​:36PM +0100\, Leon Timmermans wrote​:

On Fri\, Dec 28\, 2012 at 11​:06 PM\, Tony Cook via RT \perlbug\-followup@&#8203;perl\.org wrote​:

It should fail to open. If you open a UTF8 flagged string for append and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 string.

Your patch as written ignores the principle that the SvUTF8() flag only controls the internal encoding\, not other behaviour. If the SV contains only code point 0xFF or lower we should downgrade it and work with that rather than failing (or producing a warning).

I didn't see enough consensus to change it that much\, but I would be in favor.

This should also be done for _read() and _write()\, since the SV can be modified between I/O operations.

There's an unrelated problem that _pushed() checks flags on both arg and SvRV(arg) without calling SvGETMAGIC().

It should just stop peeking and poking into the SV altogether\, and use the proper APIs (sv_insert and friends). For that matter\, I sometimes feel like it should be rewritten from scratch to actually make sense. Pretty much all of it is problematic.

I've attached my suggested changes (in several parts)\, also available on perl5.git.perl.org/perl.git as tonyc/perlio-scalar-sanity.

Reasons for failing instead of warning​:

1) reading - to follow the "SVf_UTF8 is only representation" principle\, we'd need to download where possible\, so a \xA1 (for example) in the stream is always treated as that byte\, but this means we have an inconsistency when the scalar cannot be downgraded - the first bytes of the character sequences "\xA1\x40" and "\xA1\x{101}" would be different.

2) writing - if the SV is flagged UTF8\, and the user of the handle doesn't write correct UTF8 data at the correct offsets\, the SV will no longer be properly formed utf-8\, which I believe we're trying to maintain. One of my tests produced a warning about invalid UTF-8 during before the fix was applied.

It's possible could be avoided if we always treat the written bytes as code points and upgrade them when writing to a UTF8 string\, but then we run into a consitency issue vs reading - what happens when a read on a UTF8 string reaches a code point > 0xFF?

As written I think the warning message could be improved and the documentation of the warning could be improved.

Tony

Attached are some suggestions for wording changes. I've never liked our distinction between bytes and character semantics. It makes no sense to me. Everything is ultimately a byte.

p5pRT commented 11 years ago

From @khwilliamson

0002-Suggestion-for-rewording-message-and-pod-for-PerlIO-.patch ```diff From 84e69642b537283043e6ae75a78c58b1078b4685 Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sun, 13 Jan 2013 20:33:18 -0700 Subject: [PATCH 2/2] Suggestion for rewording message and pod for PerlIO::scalar --- ext/PerlIO-scalar/scalar.xs | 6 +++--- pod/perldiag.pod | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/ext/PerlIO-scalar/scalar.xs b/ext/PerlIO-scalar/scalar.xs index 1f24864..f368867 100644 --- a/ext/PerlIO-scalar/scalar.xs +++ b/ext/PerlIO-scalar/scalar.xs @@ -54,7 +54,7 @@ PerlIOScalar_pushed(pTHX_ PerlIO * f, const char *mode, SV * arg, } if (SvUTF8(s->var) && !sv_utf8_downgrade(s->var, TRUE)) { if (ckWARN(WARN_UTF8)) - Perl_warner(aTHX_ packWARN(WARN_UTF8), "Only byte strings can be mapped into in-memory filehandles\n"); + Perl_warner(aTHX_ packWARN(WARN_UTF8), "Only strings not requiring UTF-8 can be mapped into in-memory filehandles\n"); SETERRNO(EINVAL, SS_IVCHAN); SvREFCNT_dec(s->var); s->var = Nullsv; @@ -157,7 +157,7 @@ PerlIOScalar_read(pTHX_ PerlIO *f, void *vbuf, Size_t count) } else { if (ckWARN(WARN_UTF8)) - Perl_warner(aTHX_ packWARN(WARN_UTF8), "Only byte strings can be mapped into in-memory filehandles\n"); + Perl_warner(aTHX_ packWARN(WARN_UTF8), "Only strings not requiring UTF-8 can be mapped into in-memory filehandles\n"); SETERRNO(EINVAL, SS_IVCHAN); return -1; } @@ -186,7 +186,7 @@ PerlIOScalar_write(pTHX_ PerlIO * f, const void *vbuf, Size_t count) if (SvOK(sv)) SvPV_force_nomg_nolen(sv); if (SvUTF8(sv) && !sv_utf8_downgrade(sv, TRUE)) { if (ckWARN(WARN_UTF8)) - Perl_warner(aTHX_ packWARN(WARN_UTF8), "Only byte strings can be mapped into in-memory filehandles\n"); + Perl_warner(aTHX_ packWARN(WARN_UTF8), "Only strings not requiring UTF-8 can be mapped into in-memory filehandles\n"); SETERRNO(EINVAL, SS_IVCHAN); return 0; } diff --git a/pod/perldiag.pod b/pod/perldiag.pod index c90079b..41b98a8 100644 --- a/pod/perldiag.pod +++ b/pod/perldiag.pod @@ -3404,11 +3404,11 @@ call, or call a constructor from the FileHandle package. (W unopened) You tried to invoke a file test operator on a filehandle that isn't open. Check your control flow. See also L. -=item Only byte strings can be mapped into in-memory filehandles +=item Only strings not requiring UTF-8 can be mapped into in-memory filehandles (W utf8) You tried to open a reference to a scalar for read or append -where the scalar contained character codes over 0xFF. In-memory files -model on-disk files and can only contain bytes. +where the scalar contained code points above 0xFF. In-memory files +cannot handle these. =item oops: oopsAV -- 1.7.7.1 ```
p5pRT commented 11 years ago

From @rjbs

I am in favor of your proposed changes Tony\, thanks.

p5pRT commented 11 years ago

From @khwilliamson

Commit 02c3c86bb8fe791df9608437f0844f9a8017e3b6 changed the behavior so one could not successfully open a scalar with code points above 0xFF.

But this test case shows an issue with this​:

use utf8; my $string = qq[aÅb]; my $fh = IO​::File->new(); $fh->open(\$string\, '\<​:encoding(UTF-8)');

-- Karl Williamson

p5pRT commented 11 years ago

From @khwilliamson

On Wed Jan 23 19​:08​:08 2013\, rjbs wrote​:

I am in favor of your proposed changes Tony\, thanks.

-- Karl Williamson

p5pRT commented 11 years ago

From @khwilliamson

I don't know what I pressed to cause it to send while typing the message\, but send it did. So hopefully this will work better.

On Wed Jan 30 15​:20​:46 2013\, khw wrote​:

Commit 02c3c86bb8fe791df9608437f0844f9a8017e3b6 changed the behavior so one could not successfully open a scalar with code points above 0xFF.

But this test case shows an issue with this​:

use utf8; my $string = qq[a�b]; my $fh = IO​::File->new(); $fh->open(\$string\, '\<​:encoding(UTF-8)');

The problem is that the character in the string (which is showing up incorrectly encoded here\, but is a U+00C5) is in Latin 1. Since the string is encodable in Latin1\, the open succeeds\, while silently downgrading from UTF-8 to Latin1\, but the :encoding(UTF-8) doesn't play well with that\, with the result that this silently breaks. -- Karl Williamson

p5pRT commented 11 years ago

From @ap

* Karl Williamson via RT \perlbug\-followup@&#8203;perl\.org [2013-01-31 00​:30]​:

On Wed Jan 30 15​:20​:46 2013\, khw wrote​:

Commit 02c3c86bb8fe791df9608437f0844f9a8017e3b6 changed the behavior so one could not successfully open a scalar with code points above 0xFF.

But this test case shows an issue with this​:

use utf8; my $string = qq[aÅb]; my $fh = IO​::File->new(); $fh->open(\$string\, '\<​:encoding(UTF-8)');

The problem is that the character in the string (which is showing up incorrectly encoded here\, but is a U+00C5) is in Latin 1. Since the string is encodable in Latin1\, the open succeeds\, while silently downgrading from UTF-8 to Latin1\, but the :encoding(UTF-8) doesn't play well with that\, with the result that this silently breaks.

Well the code as written is broken. Whether the utf8 pragma ultimately leaves the string downgraded or upgraded\, the code is wrong either way. Whichever is the case\, the code needs an `encode("UTF8"\, ...)` in there somewhere before the `open` in order to be correct. So the fact that this breaks means things are working as designed.

That it breaks silently is not so great.

But how could that be detected?

You could argue for changing the parser to leave literals encoded and with their UTF8 flag on. But that would break other code – granted\, only code that is already wrong. But the dictate of backcompat demands to try not to needlessly expose their brokenness if so far it wasn’t.

The only way to satisfy both requirements would be if there was a way to mark strings as character strings\, independently of whether their UTF8 flag is turned on. Then the utf8 pragma could turn that flag on for all literals\, even if it leaves them with UTF8 flags off\, and `open` could check for that flag instead of the UTF8 flag.

The current scans for codepoints > 0xFF are a proximate facsimile of such a flag – presence of such codepoints is sufficient evidence for the string being a character string. But in the converse case\, absence of evidence not being evidence of absence applies.

Regards\, -- Aristotle Pagaltzis // \<http​://plasmasturm.org/>

p5pRT commented 11 years ago

From @Leont

On Thu\, Jan 31\, 2013 at 8​:32 AM\, Aristotle Pagaltzis \pagaltzis@&#8203;gmx\.de wrote​:

Well the code as written is broken. Whether the utf8 pragma ultimately leaves the string downgraded or upgraded\, the code is wrong either way. Whichever is the case\, the code needs an `encode("UTF8"\, ...)` in there somewhere before the `open` in order to be correct. So the fact that this breaks means things are working as designed.

That it breaks silently is not so great.

But how could that be detected?

We can return an error instead of trying to downgrade. It would still break this code\, but at least it would do so loudly (and at least it would be sane).

Leon

p5pRT commented 11 years ago

From @ap

* Leon Timmermans \fawaka@&#8203;gmail\.com [2013-01-31 14​:30]​:

On Thu\, Jan 31\, 2013 at 8​:32 AM\, Aristotle Pagaltzis \pagaltzis@&#8203;gmx\.de wrote​:

Well the code as written is broken. Whether the utf8 pragma ultimately leaves the string downgraded or upgraded\, the code is wrong either way. Whichever is the case\, the code needs an `encode("UTF8"\, ...)` in there somewhere before the `open` in order to be correct. So the fact that this breaks means things are working as designed.

That it breaks silently is not so great.

But how could that be detected?

We can return an error instead of trying to downgrade. It would still break this code\, but at least it would do so loudly (and at least it would be sane).

1. You can’t. The string is downgraded far earlier\, by the parser.

  $ perl -MDevel​::Peek -e 'use utf8; Dump qq[aÅb]'   SV = PV(0x7fad5b801090) at 0x7fad5b8267a8   REFCNT = 1   FLAGS = (POK\,READONLY\,pPOK\,UTF8)   PV = 0x10f10f850 "a\303\205b"\0 [UTF8 "a\x{c5}b"]   CUR = 4   LEN = 16

2. It doesn’t matter if the byte value C5 is spelled C5 in the buffer and the   UTF8 flag is off\, or C3 85 and UTF8 is on – both mean the same thing. If   it is downgradable\, then it very well should be downgraded and accepted   silently. (I just realised some of my previous mail was a red herring\, due   to this point.)

  If you’re opening the string as an octet stream\, then you need a string that   contains an octet stream. Not characters. Regardless of the UTF8 flag value.

-- Aristotle Pagaltzis // \<http​://plasmasturm.org/>

p5pRT commented 11 years ago

From @Leont

On Fri\, Feb 1\, 2013 at 8​:16 AM\, Aristotle Pagaltzis \pagaltzis@&#8203;gmx\.de wrote​:

1. You can’t. The string is downgraded far earlier\, by the parser.

$ perl -MDevel​::Peek -e 'use utf8; Dump qq[aÅb]' SV = PV(0x7fad5b801090) at 0x7fad5b8267a8 REFCNT = 1 FLAGS = (POK\,READONLY\,pPOK\,UTF8) PV = 0x10f10f850 "a\303\205b"\0 [UTF8 "a\x{c5}b"] CUR = 4 LEN = 16

That's not downgraded at all\, it has the utf8 flag.

2. It doesn’t matter if the byte value C5 is spelled C5 in the buffer and the UTF8 flag is off\, or C3 85 and UTF8 is on – both mean the same thing. If it is downgradable\, then it very well should be downgraded and accepted silently. (I just realised some of my previous mail was a red herring\, due to this point.)

That abstraction leaks when it comes into contact with PerlIO. No getting around that. Question is only​: how does it leak

Leon

p5pRT commented 11 years ago

From @nwc10

On Fri\, Feb 01\, 2013 at 03​:54​:37PM +0100\, Leon Timmermans wrote​:

On Fri\, Feb 1\, 2013 at 8​:16 AM\, Aristotle Pagaltzis \pagaltzis@&#8203;gmx\.de wrote​:

2. It doesn't matter if the byte value C5 is spelled C5 in the buffer and the UTF8 flag is off\, or C3 85 and UTF8 is on - both mean the same thing. If it is downgradable\, then it very well should be downgraded and accepted silently. (I just realised some of my previous mail was a red herring\, due to this point.)

That abstraction leaks when it comes into contact with PerlIO. No getting around that. Question is only​: how does it leak

I feel that I'm asking a stupid question here\, but why/how does it leak? Is it leaking for the same reason as eval "leaks"? There\, source code from disk is in bytes\, which needs an encoding layered atop it to map to characters (even if it's a 1​:1 mapping). So "obviously"\, that's what the parser expects. Stuff in the range 0-255\, which might be a variable-width encoding. But eval takes strings\, and Perl-code has generated strings of *characters* to feed to the parser. Stuff in the range 0-0x1FFFF (ish)\, abstract representation (as far as Perl-space is concerned)

So\, here\, some code wants to think in terms of using file-like operations on a sequence of octets held in a scalar (which were "obviously" octets because that was what it was dealing with when it assigned to that scalar.)

Whereas other code wants to think in terms of using file-like operations on a sequence of characters. (which were "obviously" characters because that was what it was dealing with when it assigned to that scalar.)

And it's the same syntax to open either.

Is that the leakage you mean? That by the time the code comes to open the scalar\, it simply isn't clear whether the scalar is supposed to be holding sequences of octets\, or sequences of characters\, and so the opening code *can't* get the semantics of the open correct.

Or have I misunderstood?

Nicholas Clark