Closed p5pRT closed 11 years ago
If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
I think this makes sense for output\, although there may be other ramifications.
Here's a todo test:
On Sat\, Feb 4\, 2012 at 6:10 PM\, David Leadbeater \perlbug\-followup@​perl\.org wrote:
If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
I think this makes sense for output\, although there may be other ramifications.
Here's a todo test:
diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t
index a02107b..59b65ad 100644
--- a/ext/PerlIO-scalar/t/scalar.t
+++ b/ext/PerlIO-scalar/t/scalar.t
@@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere.
 
 $| = 1;
 
-use Test::More tests => 79;
+use Test::More tests => 80;
 
 my $fh;
 my $var = "aaa\n";
@@ -360,3 +360,11 @@ SKIP: {
     ok has_trailing_nul $memfile,
         'write appends null when growing string after seek past end';
 }
+
+# [perl #xxxx]
+{
+    local $TODO = "UTF-8 support";
+    my $string = "\x{ffe}";
+    open my $fh, "> :encoding(UTF-8)", \(my $out);
+    ok $string eq $out;
+}
PerlIO does bytes, always. Its utf8 support is literally a one-bit flag that promises the bytes will be validly encoded utf8. There's no easy way for lower layers to know what the upper layers do with regard to utf8. Nor am I sure that really should trickle down.
The other direction would seem to be more important. When opening a utf8 scalar\, it should automatically be a utf8 handle. Anything else is plain buggy and potentially dangerous.
Leon
The RT System itself - Status changed from 'new' to 'open'
David Leadbeater (via RT) \perlbug\-followup@​perl\.org wrote on Sat\, 04 Feb 2012 09:10:41 PST:
+{ + local $TODO = "UTF-8 support"; + my $string = "\x{ffe}";
Why don't you use an assigned Unicode code point there\, please?
+ open my $fh\, "> :encoding(UTF-8)"\, \(my $out);
Why are you involving the Encode module? Why isn't that simply:
open(my $fh, "> :utf8", \my $out) || die $!;
+ ok $string eq $out; +}
I absolutely gave up on this. It was too unreliable. Even if you are careful about decoding your string\, now and then (about 1 in 10) it gets double-encoded no matter what you do. It is not even deterministic in any fashion I can see to make work.
--tom
On Sat, Feb 4, 2012 at 12:10 PM, David Leadbeater <perlbug-followup@perl.org> wrote:
+# [perl #xxxx] +{ + local $TODO = "UTF-8 support"; + my $string = "\x{ffe}"; + open my $fh\, "> :encoding(UTF-8)"\, \(my $out); + ok $string eq $out; +}
Files can only contain bytes. This makes no sense to me.
- Eric
On Sat\, Feb 4\, 2012 at 5:49 PM\, Eric Brine \ikegami@​adaelis\.com wrote:
On Sat\, Feb 4\, 2012 at 12:10 PM\, David Leadbeater \< perlbug-followup@perl.org> wrote:
+# [perl #xxxx] +{ + local $TODO = "UTF-8 support"; + my $string = "\x{ffe}"; + open my $fh\, "> :encoding(UTF-8)"\, \(my $out); + ok $string eq $out; +}
Files can only contain bytes. This makes no sense to me.
... especially since you specifically ask to encode whatever you print. encode "UTF-8" cannot possibly produce something that contains 0xFFE.
And your patch is buggy: You forgot to actually print to $fh.
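Both of these points can be illustrated with a short sketch. The variable names and the choice of U+20AC are assumptions made for illustration, not part of the patch. With an `:encoding(UTF-8)` layer the in-memory file ends up holding the encoded bytes, and nothing is stored until something is actually printed to the handle.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $string = "\x{20ac}";    # EURO SIGN, an assigned code point
my $out    = '';

# The layer encodes characters to UTF-8 bytes on their way into $out.
open my $fh, '>:encoding(UTF-8)', \$out or die "open failed: $!";
print {$fh} $string;        # the step missing from the TODO test
close $fh or die "close failed: $!";

# $out now holds the encoded bytes rather than the original character.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $out;
print "bytes: $hex\n";

# Decoding the bytes recovers the original character string.
my $roundtrip = decode('UTF-8', $out);
print "round trip ok\n" if $roundtrip eq $string;
```

A test written this way would compare `$string` against the decoded bytes, not against `$out` directly.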
On Sat\, Feb 4\, 2012 at 12:10 PM\, David Leadbeater \perlbug\-followup@​perl\.org wrote:
If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
I think that one should expect PerlIO::scalar to provide a black box -- it's an in-memory substitution for bytes on disk with no associated encoding\, just like a file on disk has no associated encoding.
If the referenced string already has the utf8 flag set\, I think it's sufficient to warn rather than try to guess the correct behavior.
David
On Sat\, 4 Feb 2012 18:55:27 +0100\, Leon Timmermans \fawaka@​gmail\.com wrote:
On Sat\, Feb 4\, 2012 at 6:10 PM\, David Leadbeater \perlbug\-followup@​perl\.org wrote:
If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
I think this makes sense for output\, although there may be other ramifications.
Here's a todo test:
diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t
index a02107b..59b65ad 100644
--- a/ext/PerlIO-scalar/t/scalar.t
+++ b/ext/PerlIO-scalar/t/scalar.t
@@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere.
 
 $| = 1;
 
-use Test::More tests => 79;
+use Test::More tests => 80;
 
 my $fh;
 my $var = "aaa\n";
@@ -360,3 +360,11 @@ SKIP: {
     ok has_trailing_nul $memfile,
         'write appends null when growing string after seek past end';
 }
+
+# [perl #xxxx]
+{
+    local $TODO = "UTF-8 support";
+    my $string = "\x{ffe}";
+    open my $fh, "> :encoding(UTF-8)", \(my $out);
+    ok $string eq $out;
+}
PerlIO does bytes, always. Its utf8 support is literally a one-bit flag that promises the bytes will be validly encoded utf8. There's no easy way for lower layers to know what the upper layers do with regard to utf8. Nor am I sure that really should trickle down.
The other direction would seem to be more important. When opening a utf8 scalar\, it should automatically be a utf8 handle. Anything else is plain buggy and potentially dangerous.
Including pragmas?
use open OUT => "encoding(utf16)";
open my $fh, ">", \my $x;
print { $fh } "The \x{20ac} is \x{a71c} again}\n";
close $fh;
-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.14 porting perl5 on HP-UX\, AIX\, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
On Sat\, Feb 04\, 2012 at 08:12:30PM -0500\, David Golden wrote:
On Sat\, Feb 4\, 2012 at 12:10 PM\, David Leadbeater \perlbug\-followup@​perl\.org wrote:
If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
I think that one should expect PerlIO::scalar to provide a black box -- it's an in-memory substitution for bytes on disk with no associated encoding\, just like a file on disk has no associated encoding.
If the referenced string already has the utf8 flag set\, I think it's sufficient to warn rather than try to guess the correct behavior.
Whoa. I don't think you mean "has the utf8 flag set". That's how 5.6.0 would have exposed it. SvUTF8() shouldn't be visible as a proxy for "characters vs bytes" (yes\, I know there are still holes in that).
I *think* it needs to be strictly bytes-only (just like any real file handle) and refuse to open an existing string that doesn't meet that constraint. (With the inevitable ambiguity that if you only shove characters in the range 0-255 into your string\, you're not going to realise that your code is buggy.)
Nicholas Clark
On Sat\, Feb 04\, 2012 at 06:55:27PM +0100\, Leon Timmermans wrote:
On Sat\, Feb 4\, 2012 at 6:10 PM\, David Leadbeater \perlbug\-followup@​perl\.org wrote:
If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
I think this makes sense for output\, although there may be other ramifications.
PerlIO does bytes, always. Its utf8 support is literally a one-bit flag that promises the bytes will be validly encoded utf8. There's no easy way for lower layers to know what the upper layers do with regard to utf8. Nor am I sure that really should trickle down.
The other direction would seem to be more important. When opening a utf8 scalar\, it should automatically be a utf8 handle. Anything else is plain buggy and potentially dangerous.
No\, as I replied elsewhere\, I think it should refuse to open any scalar that isn't bytes.
Or\, at least\, the user's code needs to be different to say "I want to open a byte buffer as if it's a file handle" and "I'm expecting characters here". That way allows symmetry between opening for reading and opening for writing.
Having open for reading have some sort of "did they mean characters or bytes? I'll guess for them" ends up with the same mess that unpack is in\, whereby it's a runtime decision based *implicitly* on the *parameters* as to whether it's doing a bytes => characters conversion or a characters => characters mapping. Sure\, it's not as *bad* as unpack\, which can attempt to do both in the same statement\, but trying to have open "DWIM" is in the same ball-park of design misfeature.
Nicholas Clark
On Mon\, Feb 6\, 2012 at 12:05 PM\, Nicholas Clark \nick@​ccl4\.org wrote:
No\, as I replied elsewhere\, I think it should refuse to open any scalar that isn't bytes.
Or\, at least\, the user's code needs to be different to say "I want to open a byte buffer as if it's a file handle" and "I'm expecting characters here". That way allows symmetry between opening for reading and opening for writing.
Having open for reading have some sort of "did they mean characters or bytes? I'll guess for them" ends up with the same mess that unpack is in\, whereby it's a runtime decision based *implicitly* on the *parameters* as to whether it's doing a bytes => characters conversion or a characters => characters mapping. Sure\, it's not as *bad* as unpack\, which can attempt to do both in the same statement\, but trying to have open "DWIM" is in the same ball-park of design misfeature.
Yeah, that is a good point. How about making things explicit? E.g. «open my $fh, '+<:scalar(utf8)', \my $scalar». I suspect the current PerlIO/PerlIO::scalar can't easily support that, though.
Leon
On Mon\, Feb 6\, 2012 at 9:17 AM\, Leon Timmermans \fawaka@​gmail\.com wrote:
Yeah, that is a good point. How about making things explicit? E.g. «open my $fh, '+<:scalar(utf8)', \my $scalar». I suspect the current PerlIO/PerlIO::scalar can't easily support that, though.
Isn't that just C<open my $fh, "+<:utf8", \my $scalar>?
If you *know* that you have UTF-8 characters in a string\, it's not different than knowing you have UTF-8 characters in a disk file. The *user* needs to be clear what they expect the bytes to be.
Or is the question about what Perl should do about returning bytes from a string that coincidentally happens to be a character string? I.e. how should Perl mimic an on-disk file using its internal string data structure?
Assume that Perl's internal character encoding is a black box. Maybe it's UTF-8\, maybe not (maybe it changes in the future). Whatever. It's an internal implementation detail and nothing external should rely on it.
Then when something wants to use that string as a source of bytes\, should Perl (a) just dump out whatever bytes it uses internally for its implementation? Or (b) should it convert the internal representation to some standard representation? Or (c) should it blow up?
I don't like (a) or (c). (b) is tempting. (Coincidentally\, it's easy\, since the internal encoding is utf8.) My naive inclination is to amend the documentation to clarify that the bytes returned are either raw bytes or utf8 encoded if the string already contains characters. And then I'd *still* leave it up to the user to know what's in the "file" (i.e. string) and set the correct encoding layer on it\, just as if they were using a disk file.
-- David
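Option (b) can already be spelled out by hand today. The following is a minimal sketch under the assumption that the caller, not PerlIO::scalar, does the conversion: the character string is explicitly encoded before it is used as an in-memory file, and the reader declares the encoding just as it would for a disk file. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "caf\x{e9} \x{20ac}";    # a character string with one wide character

# Make the representation explicit: the "file" holds UTF-8 encoded bytes.
my $file = encode('UTF-8', $chars);

# The reader declares the encoding, exactly as it would for a disk file.
open my $fh, '<:encoding(UTF-8)', \$file or die "open failed: $!";
my $read = do { local $/; <$fh> };
close $fh;

print "same characters\n" if $read eq $chars;
printf "%d characters, %d bytes in the file\n", length $chars, length $file;
```

Because the scalar handed to `open` contains only bytes, this does not depend on how perl stores the original string internally.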
On Mon\, Feb 06\, 2012 at 10:18:39AM -0500\, David Golden wrote:
Or is the question about what Perl should do about returning bytes from a string that coincidentally happens to be a character string? I.e. how should Perl mimic an on-disk file using its internal string data structure?
That was what I thought the question was.
Assume that Perl's internal character encoding is a black box. Maybe it's UTF-8\, maybe not (maybe it changes in the future). Whatever. It's an internal implementation detail and nothing external should rely on it.
Agree
Then when something wants to use that string as a source of bytes\, should Perl (a) just dump out whatever bytes it uses internally for its implementation? Or (b) should it convert the internal representation to some standard representation? Or (c) should it blow up?
I don't like (a) or (c). (b) is tempting. (Coincidentally\, it's easy\, since the internal encoding is utf8.) My naive inclination is to amend the documentation to clarify that the bytes returned are either raw bytes or utf8 encoded if the string already contains characters. And then I'd *still* leave it up to the user to know
How do you know that the string contains characters?
what's in the "file" (i.e. string) and set the correct encoding layer on it\, just as if they were using a disk file.
Nicholas Clark
On Mon\, Feb 6\, 2012 at 10:24 AM\, Nicholas Clark \nick@​ccl4\.org wrote:
I don't like (a) or (c). (b) is tempting. (Coincidentally\, it's easy\, since the internal encoding is utf8.) My naive inclination is to amend the documentation to clarify that the bytes returned are either raw bytes or utf8 encoded if the string already contains characters. And then I'd *still* leave it up to the user to know
How do you know that the string contains characters?
Which "you" do you mean? The user? How does a user know that *any* file contains characters? Generally\, by knowing what was written there originally or by analyzing the file in some way to guess an encoding\, I'd think. (E.g. read it as bytes and then use Encode::Guess?)
None of that is the interpreter's concern.
-- David
On Mon\, Feb 6\, 2012 at 4:18 PM\, David Golden \xdaveg@​gmail\.com wrote:
Isn't that just C<open my $fh, "+<:utf8", \my $scalar>?
Not at all. :utf8 means «assume the bytestream is utf8 encoded». It does not mean «store as characters» (though doing the latter without the former doesn't make sense).
Or is the question about what Perl should do about returning bytes from a string that coincidentally happens to be a character string? I.e. how should Perl mimic an on-disk file using its internal string data structure?
Yeah\, pretty much.
Then when something wants to use that string as a source of bytes\, should Perl (a) just dump out whatever bytes it uses internally for its implementation? Or (b) should it convert the internal representation to some standard representation? Or (c) should it blow up?
(a) Is what we're doing right now, and I think it's just plain wrong, and possibly dangerous.
(b) Maybe, but for reasons Nicholas explained guesswork may be rather suboptimal.
(c) Is sane, unlike (a) and some versions of (b).
Leon
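The distinction Leon draws about `:utf8` (a statement about how the bytes are interpreted, not about how the scalar stores them) can be seen by reading the same byte buffer with and without a decoding layer. This is only a sketch; `slurp_scalar` is a hypothetical helper, and `:encoding(UTF-8)` stands in for the stricter form of the layer.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp an in-memory "file" through the given layers.
sub slurp_scalar {
    my ($buffer, $layers) = @_;
    open my $fh, "<$layers", \$buffer or die "open failed: $!";
    local $/;
    return scalar <$fh>;
}

my $bytes = "\xE2\x82\xAC";    # the UTF-8 encoding of U+20AC

my $raw     = slurp_scalar($bytes, '');                  # no decoding
my $decoded = slurp_scalar($bytes, ':encoding(UTF-8)');  # decode on read

printf "raw length: %d, decoded length: %d\n", length $raw, length $decoded;
```

The buffer itself is identical in both calls; only the reader's declared interpretation changes.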
On Mon\, Feb 06\, 2012 at 11:02:36AM -0500\, David Golden wrote:
On Mon\, Feb 6\, 2012 at 10:24 AM\, Nicholas Clark \nick@​ccl4\.org wrote:
I don't like (a) or (c). (b) is tempting. (Coincidentally\, it's easy\, since the internal encoding is utf8.) My naive inclination is to amend the documentation to clarify that the bytes returned are either raw bytes or utf8 encoded if the string already contains characters. And then I'd *still* leave it up to the user to know
How do you know that the string contains characters?
Which "you" do you mean? The user? How does a user know that *any* file contains characters? Generally\, by knowing what was written there originally or by analyzing the file in some way to guess an encoding\, I'd think. (E.g. read it as bytes and then use Encode::Guess?)
None of that is the interpreter's concern.
OK\, which means that the interpreter can't *do* option (b) (or (a) for that matter):
On Mon\, Feb 06\, 2012 at 03:24:04PM +0000\, Nicholas Clark wrote:
On Mon\, Feb 06\, 2012 at 10:18:39AM -0500\, David Golden wrote:
Or is the question about what Perl should do about returning bytes from a string that coincidentally happens to be a character string? I.e. how should Perl mimic an on-disk file using its internal string data structure?
Then when something wants to use that string as a source of bytes\, should Perl (a) just dump out whatever bytes it uses internally for its implementation? Or (b) should it convert the internal representation to some standard representation? Or (c) should it blow up?
because you've just stated that the interpreter can't make a determination as to whether a string contains characters or bytes (for the ambiguous case of a string containing one or more code points in the range 128-255\, but no code points outside the range 0-255)
Nicholas Clark
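The ambiguous case described here is easy to reproduce. A minimal sketch, assuming an ASCII-based platform: two copies of a string with one code point in the range 128-255 compare equal and have the same length, even though perl stores them differently. Nothing in the string itself says whether it is "bytes" or "characters".

```perl
use strict;
use warnings;

my $down = "\xE9";     # one code point, stored as one byte
my $up   = $down;
utf8::upgrade($up);    # same code point, stored as two bytes internally

print "equal strings\n" if $down eq $up;
print "equal lengths\n" if length($down) == length($up);

# Only the internal storage differs, which is all the flag records.
printf "flag: %d vs %d\n",
    utf8::is_utf8($down) ? 1 : 0,
    utf8::is_utf8($up)   ? 1 : 0;
my $internal = do { use bytes; length $up };
print "internal bytes of upgraded copy: $internal\n";
```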
On Mon\, Feb 6\, 2012 at 10:18 AM\, David Golden \xdaveg@​gmail\.com wrote:
On Mon\, Feb 6\, 2012 at 9:17 AM\, Leon Timmermans \fawaka@​gmail\.com wrote:
Yeah, that is a good point. How about making things explicit? E.g. «open my $fh, '+<:scalar(utf8)', \my $scalar». I suspect the current PerlIO/PerlIO::scalar can't easily support that, though.
Isn't that just C<open my $fh, "+<:utf8", \my $scalar>?
No\, that means "decode the input on read". The question is about a buffer that contains decoded data\, so what's needed is a layer or some such that indicates "the underlying data is already decoded". That's his intent for :scalar(utf8).
On Mon\, Feb 6\, 2012 at 11:09 AM\, Nicholas Clark \nick@​ccl4\.org wrote:
because you've just stated that the interpreter can't make a determination as to whether a string contains characters or bytes (for the ambiguous case of a string containing one or more code points in the range 128-255\, but no code points outside the range 0-255)
You're right. I was being imprecise. I think if the string contains no wide characters\, it should be "read" by PerlIO::scalar as bytes. If the string does contain wide characters\, PerlIO::scalar should either fail or should encode them in some "standard" way and return them as bytes in encoded form.
The whole idea is to provide an in-memory abstraction of a *file*\, which means returning a sequence of bytes.
David
On Mon Feb 06 07:19:37 2012\, xdaveg@gmail.com wrote:
Then when something wants to use that string as a source of bytes\, should Perl (a) just dump out whatever bytes it uses internally for its implementation? Or (b) should it convert the internal representation to some standard representation? Or (c) should it blow up?
(a) is what Perl currently does, as Leon Timmermans said.
By (b) I presume you mean to treat \xff as \xff regardless of how it is stored internally\, which makes sense.
But what happens if I open a reading handle to a scalar containing \x{100}? Here we have a choice between (b) and (c).
An in-memory scalar could be considered a byte stream. Or it could just be considered a string of characters.
The latter does make some sense. If I print \xff to an in-memory file with no layers applied\, I simply get \xff in my scalar. So if I print \x{100}\, it would make sense to get \x{100} in my scalar\, no? But if the scalar is considered byte-sized\, I should get \x{100} utf8-encoded\, accompanied by a wide character warning; and reading a scalar with \x{100} would croak.
That it is currently buggy is not being questioned. But which model should be followed in fixing it is debatable. Would it be reasonable to implement the byte-sized version for now and upgrade it later?
--
Father Chrysostomos
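The byte-sized model for writing can be checked with a small sketch. It assumes a handle with no layers and default settings (no `use open`, no `PERL_UNICODE`); the warning handler and variable names are illustrative only. Printing a wide character to such a handle is expected to warn and store the UTF-8 encoded bytes, as it would for a disk file.

```perl
use strict;
use warnings;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $out = '';
open my $fh, '>', \$out or die "open failed: $!";
print {$fh} "\x{100}";    # wide character, no encoding layer on $fh
close $fh;

my $hex = join ' ', map { sprintf '%02X', ord } split //, $out;
print "stored: $hex\n";
print "warned: $warnings[0]" if @warnings;
```

Under the "string of characters" model discussed above, `$out` would instead hold the single code point U+0100.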
On Sun\, Feb 12\, 2012 at 5:02 PM\, Father Chrysostomos via RT \perlbug\-followup@​perl\.org wrote:
On Mon Feb 06 07:19:37 2012\, xdaveg@gmail.com wrote:
Then when something wants to use that string as a source of bytes\, should Perl (a) just dump out whatever bytes it uses internally for its implementation? Or (b) should it convert the internal representation to some standard representation? Or (c) should it blow up?
(a) is what Perl currently does, as Leon Timmermans said.
By (b) I presume you mean to treat \xff as \xff regardless of how it is stored internally\, which makes sense.
Sort of. What I meant is that (a) is "whatever we do" and (b) is "a specific encoding". Those are likely to be similar, but one is vague and mutable and the other specific and fixed. Such a promise would persist under the usual back-compatibility rules even if we changed the internal representation in the future for some reason. It could also mean that we could choose to give UTF-8 and not "utf8" (i.e. lax, internal encoding) -- and would croak if we can't translate from the internal to UTF-8.
For example, for a string with wide characters used as an in-memory file, we could promise to translate from the internal encoding to UTF-8 when the handle is read. That would make it resemble a disk file encoded in UTF-8, requiring the ":encoding(UTF-8)" flag and so on. Thus some function that is passed a handle to read shouldn't know or care whether it's an in-memory string or an on-disk file -- though the *programmer* would need to know what encoding they expect to receive given their particular application.
An in-memory scalar could be considered a byte stream. Or it could just be considered a string of characters.
My bias is strongly that it should be a byte-stream\, which is why I'm only considering how we choose to take a string of (wide) characters and make it into a byte stream in some standard way: (a) "whatever" (b) "a promise" and (c) "boom!"
-- David
* Nicholas Clark \nick@​ccl4\.org [2012-02-06 12:00]:
On Sat\, Feb 04\, 2012 at 08:12:30PM -0500\, David Golden wrote:
If the referenced string already has the utf8 flag set\, I think it's sufficient to warn rather than try to guess the correct behavior.
Whoa. I don't think you mean "has the utf8 flag set". That's how 5.6.0 would have exposed it. SvUTF8() shouldn't be visible as a proxy for "characters vs bytes" (yes\, I know there are still holes in that).
This. Thank you. I was despairing as I read the thread\, waiting for someone to interject with it.
As far as the user is concerned\, there is never to be any difference between a string with UTF8 on vs a string with UTF8 off as long as $utf8on eq $utf8off.
I *think* it needs to be strictly bytes-only (just like any real file handle) and refuse to open an existing string that doesn't meet that constraint. (With the inevitable ambiguity that if you only shove characters in the range 0-255 into your string\, you're not going to realise that your code is buggy.)
What it should do on input is treat each character as a byte\, throwing an error if there are any characters > 0xFF in the string\, i.e. the moral equivalent of downgrading the input string and croaking if that fails. That’s it.
Regards\, -- Aristotle Pagaltzis // \<http://plasmasturm.org/>
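The input rule proposed here can be approximated in user code. A minimal sketch, assuming a hypothetical helper named `open_as_bytes` that works on a copy of the string: it downgrades the copy, croaks if any character is above 0xFF, and only then opens the read handle.

```perl
use strict;
use warnings;

# Hypothetical helper: open a read handle on a byte-only copy of $string,
# dying when the string cannot be represented as bytes.
sub open_as_bytes {
    my ($string) = @_;
    my $copy = $string;
    utf8::downgrade($copy, 1)    # second argument: return false instead of dying
        or die "string contains code points above 0xFF\n";
    open my $fh, '<', \$copy or die "open failed: $!";
    return $fh;
}

my $upgraded = "\xE9";
utf8::upgrade($upgraded);        # same string, different internal storage

my $fh   = open_as_bytes($upgraded);
my $byte = do { local $/; <$fh> };
printf "read %d byte(s), first is 0x%02X\n", length $byte, ord $byte;

my $ok = eval { open_as_bytes("\x{100}"); 1 };
print "wide string rejected: $@" unless $ok;
```

The upgraded and downgraded forms of the same string read back identically, which is the property the flag-agnostic rule asks for.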
On Mon\, 6 Feb 2012 10:58:02 +0000\, Nicholas Clark \nick@​ccl4\.org wrote:
On Sat\, Feb 04\, 2012 at 08:12:30PM -0500\, David Golden wrote:
On Sat\, Feb 4\, 2012 at 12:10 PM\, David Leadbeater \perlbug\-followup@​perl\.org wrote:
If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
I think that one should expect PerlIO::scalar to provide a black box -- it's an in-memory substitution for bytes on disk with no associated encoding\, just like a file on disk has no associated encoding.
If the referenced string already has the utf8 flag set\, I think it's sufficient to warn rather than try to guess the correct behavior.
Whoa. I don't think you mean "has the utf8 flag set". That's how 5.6.0 would have exposed it. SvUTF8() shouldn't be visible as a proxy for "characters vs bytes" (yes\, I know there are still holes in that).
I *think* it needs to be strictly bytes-only (just like any real file handle) and refuse to open an existing string that doesn't meet that constraint. (With the inevitable ambiguity that if you only shove characters in the range 0-255 into your string\, you're not going to realise that your code is buggy.)
Nicholas Clark
Personally\, I see no harm in doing a decode on close when opened for writing as utf-8
--8<---
use v5.12;
use warnings;

binmode STDOUT, ":utf8";

my $data = "";

{
    open my $fh, ">:encoding(utf-8)", \$data;
    print { $fh } "\x{20ac}\n";
    close $fh;
}

{
    open my $fh, "<:encoding(utf-8)", \$data;
    print <$fh>;
    close $fh;
}

print $data;
utf8::decode ($data);
print $data;

{
    open my $fh, "<:encoding(utf-8)", \$data;
    print <$fh>;
    close $fh;
}

{
    use open OUT => ":encoding(utf-8)";
    open my $fh, ">", \$data;
    print { $fh } "\x{20ac}\n";
    close $fh;
}

{
    use open IN => ":encoding(utf-8)";
    open my $fh, "<", \$data;
    print <$fh>;
    close $fh;
}

print $data;
utf8::decode ($data);
print $data;

{
    use open IN => ":encoding(utf-8)";
    open my $fh, "<", \$data;
    print <$fh>;
    close $fh;
}
-->8---
$ perl test.pl € ⬠€ € € € € € €
-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.14 porting perl5 on HP-UX\, AIX\, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
On Sun\, Feb 12\, 2012 at 5:02 PM\, Father Chrysostomos via RT \< perlbug-followup@perl.org> wrote:
That it is currently buggy is not being questioned.
And the following test will detect regressions once it's fixed.
=====
use strict;
use warnings;

use Test::More tests => 1;

sub read_from_scalar {
    my ($file, $perlio) = @_;
    $perlio //= '';
    open my $fh, "<$perlio", \$file or die $!;
    local $/;
    return <$fh>;
}

sub hexify { join ' ', map sprintf('%02X', ord), split //, $_[0] }

{
    my $s = chr(0xE9);
    utf8::upgrade( my $u = $s );
    utf8::downgrade( my $d = $s );
    is( hexify(read_from_scalar($u)), hexify(read_from_scalar($d)), 'Unicode bug in :scalar read' );
}

1;
Is there any word on this issue? I just hit this bug in reverse[1] and while there is ample discussion about it being a problem I see the same behavior under current blead. Is there a chance *at least* for a warning to be added so that it lands in 5.18?
Cheers
[1] http://www.perlmonks.org/?node_id=1010601
On Fri\, Dec 28\, 2012 at 8:34 AM\, Peter Rabbitson \rabbit\+p5p@​rabbit\.us wrote:
Is there any word on this issue? I just hit this bug in reverse[1] and while there is ample discussion about it being a problem I see the same behavior under current blead. Is there a chance *at least* for a warning to be added so that it lands in 5.18?
The process kind of fizzled somewhere. I'm in favor of a warning in 5.18\, see attachment.
Leon
On Fri Dec 28 07:45:25 2012\, LeonT wrote:
On Fri\, Dec 28\, 2012 at 8:34 AM\, Peter Rabbitson \rabbit\+p5p@​rabbit\.us wrote:
Is there any word on this issue? I just hit this bug in reverse[1] and while there is ample discussion about it being a problem I see the same behavior under current blead. Is there a chance *at least* for a warning to be added so that it lands in 5.18?
The process kind of fizzled somewhere. I'm in favor of a warning in 5.18\, see attachment.
It should fail to open. If you open a UTF8 flagged string for append and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 string.
Your patch as written ignores the principle that the SvUTF8() flag only controls the internal encoding\, not other behaviour. If the SV contains only code point 0xFF or lower we should downgrade it and work with that rather than failing (or producing a warning).
This should also be done for _read() and _write()\, since the SV can be modified between I/O operations.
There's an unrelated problem that _pushed() checks flags on both arg and SvRV(arg) without calling SvGETMAGIC().
I'll take a look at these issues when I get home.
Tony
On Fri\, Dec 28\, 2012 at 11:06 PM\, Tony Cook via RT \perlbug\-followup@​perl\.org wrote:
It should fail to open. If you open a UTF8 flagged string for append and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 string.
Your patch as written ignores the principle that the SvUTF8() flag only controls the internal encoding\, not other behaviour. If the SV contains only code point 0xFF or lower we should downgrade it and work with that rather than failing (or producing a warning).
I didn't see enough consensus to change it that much\, but I would be in favor.
This should also be done for _read() and _write()\, since the SV can be modified between I/O operations.
There's an unrelated problem that _pushed() checks flags on both arg and SvRV(arg) without calling SvGETMAGIC().
It should just stop peeking and poking into the SV altogether\, and use the proper APIs (sv_insert and friends). For that matter\, I sometimes feel like it should be rewritten from scratch to actually make sense. Pretty much all of it is problematic.
Leon
On Fri\, Dec 28\, 2012 at 11:16:36PM +0100\, Leon Timmermans wrote:
On Fri\, Dec 28\, 2012 at 11:06 PM\, Tony Cook via RT \perlbug\-followup@​perl\.org wrote:
It should fail to open. If you open a UTF8 flagged string for append and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 string.
Your patch as written ignores the principle that the SvUTF8() flag only controls the internal encoding\, not other behaviour. If the SV contains only code point 0xFF or lower we should downgrade it and work with that rather than failing (or producing a warning).
I didn't see enough consensus to change it that much\, but I would be in favor.
This should also be done for _read() and _write()\, since the SV can be modified between I/O operations.
There's an unrelated problem that _pushed() checks flags on both arg and SvRV(arg) without calling SvGETMAGIC().
It should just stop peeking and poking into the SV altogether\, and use the proper APIs (sv_insert and friends). For that matter\, I sometimes feel like it should be rewritten from scratch to actually make sense. Pretty much all of it is problematic.
This particular bit risks derailing the simpler yet more urgent bugfix. Focus please ;)
Cheers
On Fri\, Dec 28\, 2012 at 11:16:36PM +0100\, Leon Timmermans wrote:
On Fri\, Dec 28\, 2012 at 11:06 PM\, Tony Cook via RT \perlbug\-followup@​perl\.org wrote:
It should fail to open. If you open a UTF8 flagged string for append and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 string.
Your patch as written ignores the principle that the SvUTF8() flag only controls the internal encoding\, not other behaviour. If the SV contains only code point 0xFF or lower we should downgrade it and work with that rather than failing (or producing a warning).
I didn't see enough consensus to change it that much\, but I would be in favor.
This should also be done for _read() and _write()\, since the SV can be modified between I/O operations.
There's an unrelated problem that _pushed() checks flags on both arg and SvRV(arg) without calling SvGETMAGIC().
It should just stop peeking and poking into the SV altogether\, and use the proper APIs (sv_insert and friends). For that matter\, I sometimes feel like it should be rewritten from scratch to actually make sense. Pretty much all of it is problematic.
I've attached my suggested changes (in several parts)\, also available on perl5.git.perl.org/perl.git as tonyc/perlio-scalar-sanity.
Reasons for failing instead of warning:
1) reading - to follow the "SVf_UTF8 is only representation" principle\, we'd need to download where possible\, so a \xA1 (for example) in the stream is always treated as that byte\, but this means we have an inconsistency when the scalar cannot be downgraded - the first bytes of the character sequences "\xA1\x40" and "\xA1\x{101}" would be different.
2) writing - if the SV is flagged UTF8, and the user of the handle doesn't write correct UTF8 data at the correct offsets, the SV will no longer be properly formed utf-8, which I believe we're trying to maintain. One of my tests produced a warning about invalid UTF-8 before the fix was applied.
It's possible this could be avoided if we always treat the written bytes as code points and upgrade them when writing to a UTF8 string, but then we run into a consistency issue vs reading - what happens when a read on a UTF8 string reaches a code point > 0xFF?
As written I think the warning message could be improved and the documentation of the warning could be improved.
Tony
On Mon\, Dec 31\, 2012 at 07:00:45PM +1100\, Tony Cook wrote:
1) reading - to follow the "SVf_UTF8 is only representation" principle\, we'd need to *download* where possible\, so a \xA1 (for
Urr\, downgrade.
As written I think the warning message could be improved and the documentation of the warning could be improved.
Suggestions welcome.
Tony
On 12/31/2012 01:00 AM\, Tony Cook wrote:
On Fri\, Dec 28\, 2012 at 11:16:36PM +0100\, Leon Timmermans wrote:
On Fri\, Dec 28\, 2012 at 11:06 PM\, Tony Cook via RT \perlbug\-followup@​perl\.org wrote:
It should fail to open. If you open a UTF8 flagged string for append and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 string.
Your patch as written ignores the principle that the SvUTF8() flag only controls the internal encoding\, not other behaviour. If the SV contains only code point 0xFF or lower we should downgrade it and work with that rather than failing (or producing a warning).
I didn't see enough consensus to change it that much\, but I would be in favor.
This should also be done for _read() and _write()\, since the SV can be modified between I/O operations.
There's an unrelated problem that _pushed() checks flags on both arg and SvRV(arg) without calling SvGETMAGIC().
It should just stop peeking and poking into the SV altogether\, and use the proper APIs (sv_insert and friends). For that matter\, I sometimes feel like it should be rewritten from scratch to actually make sense. Pretty much all of it is problematic.
I've attached my suggested changes (in several parts)\, also available on perl5.git.perl.org/perl.git as tonyc/perlio-scalar-sanity.
Reasons for failing instead of warning:
1) reading - to follow the "SVf_UTF8 is only representation" principle\, we'd need to download where possible\, so a \xA1 (for example) in the stream is always treated as that byte\, but this means we have an inconsistency when the scalar cannot be downgraded - the first bytes of the character sequences "\xA1\x40" and "\xA1\x{101}" would be different.
2) writing - if the SV is flagged UTF8\, and the user of the handle doesn't write correct UTF8 data at the correct offsets\, the SV will no longer be well-formed UTF-8\, which I believe we're trying to maintain. One of my tests produced a warning about invalid UTF-8 before the fix was applied.
It's possible this could be avoided if we always treat the written bytes as code points and upgrade them when writing to a UTF8 string\, but then we run into a consistency issue vs reading - what happens when a read on a UTF8 string reaches a code point > 0xFF?
As written I think the warning message could be improved and the documentation of the warning could be improved.
Tony
Attached are some suggestions for wording changes. I've never liked our distinction between bytes and character semantics. It makes no sense to me. Everything is ultimately a byte.
I am in favor of your proposed changes Tony\, thanks.
Commit 02c3c86bb8fe791df9608437f0844f9a8017e3b6 changed the behavior so one could not successfully open a scalar with code points above 0xFF.
But this test case shows an issue with this:

    use utf8;
    use IO::File;
    my $string = qq[aÅb];
    my $fh = IO::File->new();
    $fh->open(\$string, '<:encoding(UTF-8)');

-- Karl Williamson
On Wed Jan 23 19:08:08 2013\, rjbs wrote:
I am in favor of your proposed changes Tony\, thanks.
-- Karl Williamson
I don't know what I pressed to cause it to send while typing the message\, but send it did. So hopefully this will work better.
On Wed Jan 30 15:20:46 2013\, khw wrote:
Commit 02c3c86bb8fe791df9608437f0844f9a8017e3b6 changed the behavior so one could not successfully open a scalar with code points above 0xFF.
But this test case shows an issue with this:

    use utf8;
    use IO::File;
    my $string = qq[a�b];
    my $fh = IO::File->new();
    $fh->open(\$string, '<:encoding(UTF-8)');

The problem is that the character in the string (which is showing up incorrectly encoded here\, but is a U+00C5) is in Latin 1. Since the string is encodable in Latin1\, the open succeeds\, while silently downgrading from UTF-8 to Latin1\, but the :encoding(UTF-8) doesn't play well with that\, with the result that this silently breaks. -- Karl Williamson
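The two behaviours mentioned here (failure above 0xFF, silent downgrade within Latin-1) can be observed with a sketch like the following. It assumes a perl that includes the commit named above, which the `use 5.018` line is meant to enforce; the variable names are illustrative only, and `utf8::upgrade` stands in for a literal parsed under `use utf8`.

```perl
use 5.018;      # assumption: a perl that includes the commit discussed above
use strict;
use warnings;

# Collect warnings instead of printing them, so the run stays quiet.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# A scalar holding a code point above 0xFF cannot be mapped to bytes.
my $wide = "a\x{101}b";
my $wide_opened = open(my $wide_fh, '<', \$wide) ? 1 : 0;

# U+00C5 fits in Latin-1, so a UTF8-flagged "a\x{c5}b" is downgraded and opens.
my $latin = "a\x{c5}b";
utf8::upgrade($latin);                  # mimic a literal parsed under "use utf8"
open(my $latin_fh, '<', \$latin) or die "open failed: $!";
my $octets = <$latin_fh>;
close $latin_fh;

printf "wide opened: %d (warnings captured: %d)\n", $wide_opened, scalar @warnings;
printf "bytes read:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $octets;
```

The handle delivers U+00C5 as the single byte C5, which is not valid UTF-8 on its own; that is why stacking `:encoding(UTF-8)` on top of such a handle cannot recover the original character.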
* Karl Williamson via RT \perlbug\-followup@​perl\.org [2013-01-31 00:30]:
On Wed Jan 30 15:20:46 2013\, khw wrote:
Commit 02c3c86bb8fe791df9608437f0844f9a8017e3b6 changed the behavior so one could not successfully open a scalar with code points above 0xFF.
But this test case shows an issue with this:

    use utf8;
    use IO::File;
    my $string = qq[aÅb];
    my $fh = IO::File->new();
    $fh->open(\$string, '<:encoding(UTF-8)');

The problem is that the character in the string (which is showing up incorrectly encoded here\, but is a U+00C5) is in Latin 1. Since the string is encodable in Latin1\, the open succeeds\, while silently downgrading from UTF-8 to Latin1\, but the :encoding(UTF-8) doesn't play well with that\, with the result that this silently breaks.
Well the code as written is broken. Whether the utf8 pragma ultimately leaves the string downgraded or upgraded\, the code is wrong either way. Whichever is the case\, the code needs an `encode("UTF8", ...)` in there somewhere before the `open` in order to be correct. So the fact that this breaks means things are working as designed.
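A minimal sketch of the corrected pattern, assuming the intent is to read characters back through an `:encoding(UTF-8)` layer: the character string is encoded to octets first, and the handle is opened on those octets. The encoding name `UTF-8` is used here to match the layer name; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "a\x{c5}b";     # three characters; U+00C5 written as an escape

# Turn the characters into UTF-8 octets before handing them to the handle.
my $octets = encode('UTF-8', $string);

open(my $fh, '<:encoding(UTF-8)', \$octets) or die "open failed: $!";
my $decoded = <$fh>;
close $fh;

printf "octets: %d, characters read back: %d, round trip %s\n",
    length $octets, length $decoded, ($decoded eq $string ? 'ok' : 'differs');
```

With the octets made explicit, the result no longer depends on whether the original string happened to be stored upgraded or downgraded.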
That it breaks silently is not so great.
But how could that be detected?
You could argue for changing the parser to leave literals encoded and with their UTF8 flag on. But that would break other code – granted\, only code that is already wrong. But backward compatibility dictates that we try not to needlessly expose brokenness that has so far gone unnoticed.
The only way to satisfy both requirements would be if there was a way to mark strings as character strings\, independently of whether their UTF8 flag is turned on. Then the utf8 pragma could turn that flag on for all literals\, even if it leaves them with UTF8 flags off\, and `open` could check for that flag instead of the UTF8 flag.
The current scans for codepoints > 0xFF are a proximate facsimile of such a flag – presence of such codepoints is sufficient evidence for the string being a character string. But in the converse case\, absence of evidence not being evidence of absence applies.
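The one-sided nature of that check can be sketched as follows. `has_wide_char` is a hypothetical helper written for this illustration, not the code used by PerlIO::scalar.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when the string holds a code point above 0xFF.
sub has_wide_char {
    my ($str) = @_;
    return $str =~ /[^\x00-\xFF]/ ? 1 : 0;
}

my $wide  = "caf\x{e9} \x{263A}";   # contains U+263A, so it must be characters
my $latin = "caf\x{e9}";            # could be characters or Latin-1 octets
utf8::upgrade($latin);              # the UTF8 flag alone does not settle it

printf "wide:  %d\n", has_wide_char($wide);
printf "latin: %d\n", has_wide_char($latin);
```

A positive result proves the string holds characters; a negative result says nothing either way, which is the gap a separate "character string" marker would close.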
Regards\, -- Aristotle Pagaltzis // \<http://plasmasturm.org/>
On Thu\, Jan 31\, 2013 at 8:32 AM\, Aristotle Pagaltzis \pagaltzis@​gmx\.de wrote:
Well the code as written is broken. Whether the utf8 pragma ultimately leaves the string downgraded or upgraded\, the code is wrong either way. Whichever is the case\, the code needs an `encode("UTF8", ...)` in there somewhere before the `open` in order to be correct. So the fact that this breaks means things are working as designed.
That it breaks silently is not so great.
But how could that be detected?
We can return an error instead of trying to downgrade. It would still break this code\, but at least it would do so loudly (and at least it would be sane).
Leon
* Leon Timmermans \fawaka@​gmail\.com [2013-01-31 14:30]:
On Thu\, Jan 31\, 2013 at 8:32 AM\, Aristotle Pagaltzis \pagaltzis@​gmx\.de wrote:
Well the code as written is broken. Whether the utf8 pragma ultimately leaves the string downgraded or upgraded\, the code is wrong either way. Whichever is the case\, the code needs an `encode("UTF8", ...)` in there somewhere before the `open` in order to be correct. So the fact that this breaks means things are working as designed.
That it breaks silently is not so great.
But how could that be detected?
We can return an error instead of trying to downgrade. It would still break this code\, but at least it would do so loudly (and at least it would be sane).
1. You can’t. The string is downgraded far earlier\, by the parser.

    $ perl -MDevel::Peek -e 'use utf8; Dump qq[aÅb]'
    SV = PV(0x7fad5b801090) at 0x7fad5b8267a8
      REFCNT = 1
      FLAGS = (POK,READONLY,pPOK,UTF8)
      PV = 0x10f10f850 "a\303\205b"\0 [UTF8 "a\x{c5}b"]
      CUR = 4
      LEN = 16

2. It doesn’t matter if the byte value C5 is spelled C5 in the buffer and the UTF8 flag is off\, or C3 85 and UTF8 is on – both mean the same thing. If it is downgradable\, then it very well should be downgraded and accepted silently. (I just realised some of my previous mail was a red herring\, due to this point.)
If you’re opening the string as an octet stream\, then you need a string that contains an octet stream. Not characters. Regardless of the UTF8 flag value.
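Point 2 can be checked from Perl space with a small sketch: the same three characters stored in both representations compare equal, and only the internal flag differs. The variable names are illustrative.

```perl
use strict;
use warnings;

my $down = "a\xC5b";         # stored as the bytes 61 C5 62, UTF8 flag off
my $up   = $down;
utf8::upgrade($up);          # stored as 61 C3 85 62, UTF8 flag on

my $same_string  = $down eq $up ? 1 : 0;
my $same_length  = length($down) == length($up) ? 1 : 0;
my $flags_differ = (utf8::is_utf8($down) xor utf8::is_utf8($up)) ? 1 : 0;

printf "eq: %d, same length: %d, flags differ: %d\n",
    $same_string, $same_length, $flags_differ;
```

Since ordinary string operations cannot tell the two apart, code that opens a scalar as a file should not behave differently depending on which form it was handed.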
-- Aristotle Pagaltzis // \<http://plasmasturm.org/>
On Fri\, Feb 1\, 2013 at 8:16 AM\, Aristotle Pagaltzis \pagaltzis@​gmx\.de wrote:
1. You can’t. The string is downgraded far earlier\, by the parser.

    $ perl -MDevel::Peek -e 'use utf8; Dump qq[aÅb]'
    SV = PV(0x7fad5b801090) at 0x7fad5b8267a8
      REFCNT = 1
      FLAGS = (POK,READONLY,pPOK,UTF8)
      PV = 0x10f10f850 "a\303\205b"\0 [UTF8 "a\x{c5}b"]
      CUR = 4
      LEN = 16

That's not downgraded at all\, it has the utf8 flag.
2. It doesn’t matter if the byte value C5 is spelled C5 in the buffer and the UTF8 flag is off\, or C3 85 and UTF8 is on – both mean the same thing. If it is downgradable\, then it very well should be downgraded and accepted silently. (I just realised some of my previous mail was a red herring\, due to this point.)
That abstraction leaks when it comes into contact with PerlIO. No getting around that. Question is only: how does it leak?
Leon
On Fri\, Feb 01\, 2013 at 03:54:37PM +0100\, Leon Timmermans wrote:
On Fri\, Feb 1\, 2013 at 8:16 AM\, Aristotle Pagaltzis \pagaltzis@​gmx\.de wrote:
2. It doesn't matter if the byte value C5 is spelled C5 in the buffer and the UTF8 flag is off\, or C3 85 and UTF8 is on - both mean the same thing. If it is downgradable\, then it very well should be downgraded and accepted silently. (I just realised some of my previous mail was a red herring\, due to this point.)
That abstraction leaks when it comes into contact with PerlIO. No getting around that. Question is only: how does it leak?
I feel that I'm asking a stupid question here\, but why/how does it leak? Is it leaking for the same reason as eval "leaks"? There\, source code from disk is in bytes\, which needs an encoding layered atop it to map to characters (even if it's a 1:1 mapping). So "obviously"\, that's what the parser expects. Stuff in the range 0-255\, which might be a variable-width encoding. But eval takes strings\, and Perl-code has generated strings of *characters* to feed to the parser. Stuff in the range 0-0x10FFFF (ish)\, abstract representation (as far as Perl-space is concerned).
So\, here\, some code wants to think in terms of using file-like operations on a sequence of octets held in a scalar (which were "obviously" octets because that was what it was dealing with when it assigned to that scalar.)
Whereas other code wants to think in terms of using file-like operations on a sequence of characters. (which were "obviously" characters because that was what it was dealing with when it assigned to that scalar.)
And it's the same syntax to open either.
Is that the leakage you mean? That by the time the code comes to open the scalar\, it simply isn't clear whether the scalar is supposed to be holding sequences of octets\, or sequences of characters\, and so the opening code *can't* get the semantics of the open correct.
Or have I misunderstood?
Nicholas Clark
Migrated from rt.perl.org#109828 (status was 'resolved')
Searchable as RT109828$