$ perl -le 'print unpack("U*", "\xDD")'
Malformed UTF-8 character at -e line 1.
65533
I would expect it to print "219\n".
On Sun, Sep 17, 2000 at 12:59:15AM -0000, gisle@aas.no wrote:
$ perl -le 'print unpack("U*", "\xDD")'
Malformed UTF-8 character at -e line 1.
65533
I would expect it to print "219\n".
% ./perl -le 'print "So would I" if "\xDD" eq pack("U*", 219)'
%
It would break symmetry, but I understand why you'd want it. Do you still want it?
On Sun, Sep 17, 2000 at 12:59:15AM -0000, gisle@aas.no wrote:
$ perl -le 'print unpack("U*", "\xDD")'
Malformed UTF-8 character at -e line 1.
65533
Ah, sorry, I understand this now; what you'd like is that "U" detects whether each character in a string is valid UTF8 or not, and behaves like "A" if it isn't. To be blunt, I don't believe that's what "U" is for; it's for decoding UTF8, and "\xDD" isn't UTF8. Was the problem that you thought \xDD *should* be UTF8?
Simon Cozens <simon@cozens.net> writes:
On Sun, Sep 17, 2000 at 12:59:15AM -0000, gisle@aas.no wrote:
$ perl -le 'print unpack("U*", "\xDD")'
Malformed UTF-8 character at -e line 1.
65533
I would expect it to print "219\n".
Oops! I should have written 221 here.
% ./perl -le 'print "So would I" if "\xDD" eq pack("U*", 219)'
%
And eq is broken too!
What is frustrating is that you seem to patch the UTF8 support in a different direction than I did half a year ago. For instance I fixed eq in change #5921 (and sv_cmp in change #5138). Then you basically undid that in change #6465. You seem to think it matters if a string is upgraded to UTF8. I think it is a bug if it matters.
And why did you end the UTF8 test you introduced for sv_eq with "&& 0"? Even if we fix that, I still think my approach was the better one.
For reference I include some direct links to the patches mentioned:
ftp://ftp.linux.activestate.com/pub/staff/gsar/APC/5.5.660/diffs/5138.gz
ftp://ftp.linux.activestate.com/pub/staff/gsar/APC/5.7.0/diffs/5921.gz
ftp://ftp.linux.activestate.com/pub/staff/gsar/APC/5.7.0/diffs/6465.gz
Regards, Gisle
Simon Cozens <simon@cozens.net> writes:
On Sun, Sep 17, 2000 at 12:59:15AM -0000, gisle@aas.no wrote:
$ perl -le 'print unpack("U*", "\xDD")'
Malformed UTF-8 character at -e line 1.
65533
Ah, sorry, I understand this now; what you'd like is that "U" detects whether each character in a string is valid UTF8 or not, and behaves like "A" if it isn't.
No. I actually think that pack("U") should go away and that pack("C") should deal with values > 255. But if we keep it then it should be made to work.
To be blunt, I don't believe that's what "U" is for; it's for decoding UTF8, and "\xDD" isn't UTF8.
"U" is *not* for decoding UTF8. "U" is "C" extended to work for a wider range of character ordinals. The internal UTF8 representation is not supposed to leak out like that.
Was the problem that you thought \xDD *should* be UTF8?
Not exactly.
Regards, Gisle
Just another example to illustrate my point. IMHO, all of these should print the same thing:
$ perl -le 'print unpack("U", v221.300)'
221
$ perl -le 'print unpack("U", v221.200)'
Malformed UTF-8 character at -e line 1.
65533
$ perl -le 'print unpack("C", v221.300)'
195
$ perl -le 'print unpack("C", v221.200)'
221
On Sun, Sep 17, 2000 at 01:46:53PM +0200, Gisle Aas wrote:
What is frustrating is that you seem to patch the UTF8 support in a different direction than I did half a year ago. For instance I fixed eq in change #5921 (and sv_cmp in change #5138). Then you basically undid that in change #6465. You seem to think it matters if a string is upgraded to UTF8. I think it is a bug if it matters.
I did that in response to a bug report, where for someone it *did* matter. I agree that it's a bug if it matters. However, I don't think we should do it.
When binary operators start modifying their operands in *any* way, that's screwy and I don't like it.
And why did you end the UTF8 test you introduced for sv_eq with "&& 0"?
I don't remember doing so. I looked at that earlier this morning and realised that was what was stopping eq from doing the right thing. I wouldn't have put && 0 in there because it blatantly negates the entire point of my patch.
Simon
On Sun, Sep 17, 2000 at 01:57:46PM +0200, Gisle Aas wrote:
To be blunt, I don't believe that's what "U" is for; it's for decoding UTF8, and "\xDD" isn't UTF8.
"U" is *not* for decoding UTF8. "U" is "C" extended to work for a wider range of character ordinals.
Incorrect.
U A Unicode character number. Encodes to UTF-8 internally. Works even if C\
Simon
Gisle Aas <gisle@ActiveState.com> writes:
What is frustrating is that you seem to patch the UTF8 support in a different direction than I did half a year ago. For instance I fixed eq in change #5921 (and sv_cmp in change #5138). Then you basically undid that in change #6465. You seem to think it matters if a string is upgraded to UTF8. I think it is a bug if it matters.
We need to agree what the goal is or we are not going to get there...
FWIW - my leaning is towards Gisle's view here. Upgrading a string to UTF8 is no more a bug than upgrading to SvPVIV or whatever is, and binary ops do that kind of thing.
It is not supposed to matter which form the string is in. In essence perl strings are arrays of chars where chars can be bigger than 255. The SvUTF8_on flag says that those chars are UTF8 encoded. The "lack" of the SvUTF8 flag says "all these chars are < 256 so we did not encode them".
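A minimal sketch of the model described above, assuming a present-day perl (5.8.1 or later, where the utf8:: helper functions are built in) rather than the 5.7.0 tree under discussion; the variable names are illustrative only. It takes two copies of chr(200), forces one of them into the internal UTF-8 form, and checks that the flag differs while the length and the result of eq do not:

```perl
use strict;
use warnings;

# Two copies of the same one-character string.
my $bytes    = chr(200);
my $upgraded = chr(200);

# Force the second copy into the internal UTF-8 representation.
utf8::upgrade($upgraded);

printf "flag on plain copy:    %d\n", utf8::is_utf8($bytes)    ? 1 : 0;
printf "flag on upgraded copy: %d\n", utf8::is_utf8($upgraded) ? 1 : 0;
printf "lengths: %d and %d\n", length($bytes), length($upgraded);
print  "eq holds regardless of representation\n" if $bytes eq $upgraded;
```

If the representation really is transparent, only the first two lines of output should differ between the copies.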
On Sun, Sep 17, 2000 at 09:21:41PM +0100, Nick Ing-Simmons wrote:
We need to agree what the goal is or we are not going to get there...
Hmm. OK. We've got two separate issues here.
1) eq was broken. It's my contention that it could be debroken by removing
that dodgy "&& 0", but things like that are usually in because other stuff
broke elsewhere. As I said, I don't remember doing that and it would
completely negate the point of the patch if it *was* me. (That's not, of
course, to say that I *didn't* put it there, just that it was fundamentally
stupid if I did.)
2) There's some disagreement on what pack("U") does, or should do. I don't have to worry about what it should do, I only work here. I think it encodes into UTF8, and the documentation and perl's behaviour are consistent with that belief. Gisle thinks that it's an extension of pack("C"). If it *should* be an extension of pack("C"), it's not behaving like one and it's buggy. If it should encode to UTF8, no bug. My Camel3 hasn't arrived yet, so I can't even turn to that.
FWIW - my leaning is towards Gisle's view here. Upgrading a string to UTF8 is no more a bug than upgrading to SvPVIV or whatever is, and binary ops do that kind of thing.
NO! Look, scary things can happen.

    $x = chr(200);
    $y = pack("U*", 200);

    print "Good, eq works now.\n" if $x eq $y;
    byte_write($x);

    sub byte_write {
        # Write out our arguments faithfully to a file
        use bytes; # [1]
        open OUT, ">file" or die $!;
        print OUT @_;
        close OUT
    }

Now, how many bytes do you expect to be put in that file? I want one. chr(200) is one byte, as far as I'm concerned. And $x was one byte, until it got stealthily upgraded to UTF8. Now it's two bytes, even though we never told Perl to modify it. So, what, due to spooky upgrading, we now can't reliably mix binary (byte) and UTF8 data? And this *ISN'T* a bug?
[1] We don't need that, actually, because at present, printing a UTF8 string prints the bytes, not the characters. I no longer have any idea at all whether that's a bug. You asked it to print a UTF8 string, it printed UTF8. Isn't that what you asked for?
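The byte count in question can be checked without touching the disk. The following is only a sketch, assuming a present-day perl (5.8.1 or later) and not the 5.7.0 tree discussed here; it uses an in-memory filehandle opened with the :raw layer as a stand-in for the output file, and utf8::upgrade to simulate the "stealthy" upgrade:

```perl
use strict;
use warnings;

my $x = chr(200);
utf8::upgrade($x);    # simulate the upgrade a binary op might perform

# Write to an in-memory "file" opened as a raw byte stream.
open my $out, '>:raw', \my $file or die "open: $!";
print {$out} $x;
close $out or die "close: $!";

printf "bytes written: %d\n", length $file;
printf "first byte:    %d\n", ord $file;
```

On such a perl the upgraded scalar is written as a single byte, which is the outcome asked for above.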
On Sun, Sep 17, 2000 at 10:39:41PM +0100, Simon Cozens wrote:
1) eq was broken. It's my contention that it could be debroken by removing that dodgy "&& 0"
Yep:
% ./perl -le 'print "Yes" if chr(200) eq pack("U*",200)'
Yes
There's probably a good reason for not applying the below. No, I don't know what it is.
Simon
Simon Cozens <simon@cozens.net> writes:
FWIW - my leaning is towards Gisle's view here. Upgrading a string to UTF8 is no more a bug than upgrading to SvPVIV or whatever is, and binary ops do that kind of thing.
NO! Look, scary things can happen.

    $x = chr(200);
    $y = pack("U*", 200);

    print "Good, eq works now.\n" if $x eq $y;
    byte_write($x);

    sub byte_write {
        # Write out our arguments faithfully to a file
        use bytes; # [1]
        open OUT, ">file" or die $!;
        print OUT @_;
        close OUT
    }

Now, how many bytes do you expect to be put in that file? I want one. chr(200) is one byte, as far as I'm concerned. And $x was one byte, until it got stealthily upgraded to UTF8. Now it's two bytes, even though we never told Perl to modify it. So, what, due to spooky upgrading, we now can't reliably mix binary (byte) and UTF8 data? And this *ISN'T* a bug?
Yes it is a bug - in print! it should downgrade the UTF8 version of the character. But then you did put in a 'use bytes'. Ilya and I argued hard that 'use bytes' is alien to the whole "transparent representation" concept. But as I understand the eventual definition, in the scope of a 'use bytes' any UTF8 encoded chars are downgraded, with a hard fail (= die) if any are > 255.
[1] We don't need that, actually, because at present, printing a UTF8 string prints the bytes, not the characters. I no longer have any idea at all whether that's a bug. You asked it to print a UTF8 string, it printed UTF8. Isn't that what you asked for?
This is the fundamental breakage - we have no way to "ask" for anything on IO.
On Mon, Sep 18, 2000 at 08:48:39AM +0100, Nick Ing-Simmons wrote:
Yes it is a bug - in print! it should downgrade the UTF8 version of the character.
Okay, I shall make it so. On your head be it. :)
But then you did put in a 'use bytes'. Ilya and I argued hard that 'use bytes' is alien to the whole "transparent representation" concept. But as I understand the eventual definition, in the scope of a 'use bytes' any UTF8 encoded chars are downgraded, with a hard fail (= die) if any are > 255.
With all due respect, that's not what "use bytes" does, and it doesn't do that for a reason.
I believe that "use bytes" treats everything as a string of bytes, not as a string of characters. The string that is represented as "\304\254", being character 300 in Unicode, suddenly finds itself treated as two independent bytes, character 196 and character 172.
That's what it claims to do...
The C\
...and that's indeed what it does:
% ./perl -Ilib -le '$x = chr(300); print "As characters: ", length $x; { use bytes; print "As a series of bytes: ", length $x }'
As characters: 1
As a series of bytes: 2
No downgrading there at all. But don't tell me: that's a bug, right?
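To make the two byte values concrete, here is a small sketch, again assuming a present-day perl (5.8.1 or later) rather than the 5.7.0 tree in question. It compares the character length of chr(300) with its length under "use bytes", then encodes a copy explicitly with utf8::encode to list the individual bytes:

```perl
use strict;
use warnings;

my $x = chr(300);

# Character view.
my $chars = length $x;

# Byte view of the same scalar, as code under "use bytes" sees it.
my $octets = do { use bytes; length $x };

# Encode a copy explicitly to see which bytes those are.
my $copy = $x;
utf8::encode($copy);
my @values = unpack "C*", $copy;

print "characters: $chars\n";
print "bytes:      $octets\n";
printf "values:     %s (octal %s)\n",
    join(" ", @values),
    join(" ", map { sprintf "%o", $_ } @values);
```

The values printed should be 196 and 172, which are octal 304 and 254, matching the "\304\254" quoted above.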
On Mon, Sep 18, 2000 at 09:15:35AM +0100, Simon Cozens wrote:
On Mon, Sep 18, 2000 at 08:48:39AM +0100, Nick Ing-Simmons wrote:
Yes it is a bug - in print! it should downgrade the UTF8 version of the character.
Okay, I shall make it so. On your head be it. :)
This'll make print downgrade output *if possible*. Did you want it to croak on print chr(300) or leave it as UTF8? If the former, change TRUE to FALSE.
Simon Cozens <simon@cozens.net> writes:
But then you did put in a 'use bytes'. Ilya and I argued hard that 'use bytes' is alien to the whole "transparent representation" concept. But as I understand the eventual definition, in the scope of a 'use bytes' any UTF8 encoded chars are downgraded, with a hard fail (= die) if any are > 255.
With all due respect,
Thanks but I have a thick skin ;-) I may well have mis-remembered the resolution.
With a view to clearing up the meaning I have copied a few folk I know have discussed this stuff in the past.
that's not what "use bytes" does, and it doesn't do that for a reason.
What is the reason?
I can understand a pragma which does what I think it does, but given that (in my view quite correctly) the UTF8 flag's state depends on the value's history I cannot see how blindly treating SvPV as 'bytes' is any use whatsoever.
Note that in the "normal" case of (say)
#!perl
use bytes;
...
It is moot as that file never turns the thing on. It may still get UTF8 from modules and things though.
I believe that "use bytes" treats everything as a string of bytes, not as a string of characters.
So do I. And I believe it should squeal loudly if any "byte" turns out to be >= 256.
The string that is represented as "\304\254", being character 300 in Unicode, suddenly finds itself treated as two independent bytes, character 196 and character 172.
I could argue that the result should be chr(300 & 255) i.e. chr(44). I am not sure I want to - I think I prefer the 'die'.
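The two options can be put side by side. This is a sketch only, assuming a present-day perl (5.8.1 or later); it uses utf8::downgrade as a stand-in for whatever check the pragma might perform, which is an assumption for illustration and not a description of how "use bytes" is implemented:

```perl
use strict;
use warnings;

my $wide = chr(300);

# Option 1: silently keep only the low eight bits. 300 & 255 is 44.
my $masked = chr(300 & 255);

# Option 2: refuse. With a true second argument (FAIL_OK),
# utf8::downgrade() returns false when the string does not fit in bytes ...
my $copy = $wide;
my $fits = utf8::downgrade($copy, 1) ? 1 : 0;

# ... and without FAIL_OK it dies instead.
my $died = eval { utf8::downgrade($copy); 1 } ? 0 : 1;

# A character below 256 that merely happens to be stored as UTF-8
# downgrades cleanly back to a single byte.
my $small = chr(255);
utf8::upgrade($small);
my $small_fits = utf8::downgrade($small, 1) ? 1 : 0;

printf "masked:          ord %d (%s)\n", ord $masked, $masked;
printf "chr(300) fits:   %d\n", $fits;
printf "downgrade died:  %d\n", $died;
printf "chr(255) fits:   %d\n", $small_fits;
```

Masking turns character 300 into a comma, which is the kind of silent change the 'die' option is meant to avoid.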
That's what it claims to do...
The C\
Which is a tad vague.
I would prefer it if it said something like:
The C\
...and that's indeed what it does:
% ./perl -Ilib -le '$x = chr(300); print "As characters: ", length $x; { use bytes; print "As a series of bytes: ", length $x }'
As characters: 1
As a series of bytes: 2
No downgrading there at all. But don't tell me: that's a bug, right?
_I_ think so - but if my mental model of this stuff turns out to be wrong then it may not be.
Simon Cozens <simon@cozens.net> writes:
On Mon, Sep 18, 2000 at 09:15:35AM +0100, Simon Cozens wrote:
On Mon, Sep 18, 2000 at 08:48:39AM +0100, Nick Ing-Simmons wrote:
Yes it is a bug - in print! it should downgrade the UTF8 version of the character.
Okay, I shall make it so. On your head be it. :)
This'll make print downgrade output *if possible*. Did you want it to croak on print chr(300)
I think it should croak or print \x{12C} or some other "escaped" representation. That is, it should croak. When we have 'em, the default output discipline should probably make some representation of out-of-bounds chars.
or leave it as UTF8? If the former, change TRUE to FALSE.
--- perl/doio.c.~1~ Mon Sep 18 09:25:40 2000
+++ perl/doio.c Mon Sep 18 09:25:40 2000
@@ -1168,6 +1168,7 @@
         }
         /* FALL THROUGH */
     default:
+        sv_utf8_downgrade(sv, TRUE);
         tmps = SvPV(sv, len);
         break;
     }
End of Patch.
On Mon, 18 Sep 2000, Nick Ing-Simmons wrote:
The string that is represented as "\304\254", being character 300 in Unicode, suddenly finds itself treated as two independent bytes, character 196 and character 172.
I don't have a deep understanding of Unicode, nor am I particularly interested in becoming an expert. However, I do sometimes process binary data in perl, and sequences of bytes in that binary data are often the same as sequences of bytes used to represent some Unicode character.
What I want is some way to be *sure* that my data isn't mangled.
C<use bytes;> is, I hope, one way to do that.
Yes, I know that perl will usually "Do the Right Thing", but
the following caveat in perlunicode.pod
Whether an arbitrary piece of data will be treated as "characters" or "bytes" by internal operations cannot be divined at the current time
doesn't give me overwhelming confidence that perl will _always_ do the right thing :-).
I like the clarity of having a positive assertion "use bytes" to put at the top.
I could argue that the result should be chr(300 & 255) i.e. chr(44). I am not sure I want to - I think I prefer the 'die'.
chr(44) would definitely be mangling my data. "Silent" mangling of this sort would probably not make me happy. I expect 'die' would be acceptable. It's a sign from perl that my data isn't what I thought it was. If some module somewhere has incorrectly tagged my binary data as Unicode, then I would prefer to find out as soon as possible so I can work around the problem.
Dealing with binary data is not Perl's primary focus, and I don't mind jumping through a few extra hoops in order to do so, but please let's not make it hard to do so reliably.
I would prefer it if it said something like:
The C\
pragma asserts that all strings are composed of characters in the range 0..255 (as in perl5.005) for its lexical scope. New strings will not be UTF8 encoded. If code with C\ in scope encounters a string which is UTF8 encoded (e.g. return from a module which does not have C\ ) then string will be decoded, if any larger than 255 is found then perl will C\ giving the value of the first out of range character.
If I correctly understand what you mean by "decoded", then that sounds reasonable to me.
Saying "use bytes" means to enforce pre-Unicode Perl semantics, which includes a healthy dose of agnosticism. I think that means "use bytes" treats all strings as buckets of bits regardless of whether the SvUTF8 bit is set. Garbage in, garbage out. If you want to deal with utf8 intelligently within the scope of a "use bytes", you have to look at the SvUTF8 bit yourself.
We can certainly have a pragma that forces all interfaces to iso-8859-1 semantics and tries to do the right thing with any utf8 strings, but "use bytes" isn't that pragma.
Larry
On Mon, Sep 18, 2000 at 09:59:52AM -0700, Larry Wall wrote:
I think that means "use bytes" treats all strings as buckets of bits regardless of whether the SvUTF8 bit is set.
Gotcha. Perhaps this calls for a docpatch.
End of Patch.
Now it also seems to me that it would make sense to print a string of bytes
when calling C\
On Mon, 18 Sep 2000, Simon Cozens wrote:
+Perl normally assumes character semantics in the presence of character
+data (i.e. data that has come from a source that has been marked as
+being of a particular character encoding). When C\ is in
+effect, the encoding is temporarily ignored, and each string is treated
+as a series of bytes.
Thanks. One question I still have is: How exactly does data get marked as being of a particular encoding?
Andy Dougherty doughera@lafayette.edu Dept. of Physics Lafayette College, Easton PA 18042
On Mon, 18 Sep 2000, Larry Wall wrote:
Saying "use bytes" means to enforce pre-Unicode Perl semantics, which includes a healthy dose of agnosticism. I think that means "use bytes" treats all strings as buckets of bits regardless of whether the SvUTF8 bit is set. Garbage in, garbage out.
Sounds good (except in my present case that's Garbage in, Grant-proposal out :-). "treats all strings as buckets of bits" sounds like a fine description understandable to someone like me who just wants the raw bits.
Thanks,
Andy Dougherty doughera@lafayette.edu
On Mon, Sep 18, 2000 at 02:53:04PM -0400, Andy Dougherty wrote:
Thanks. One question I still have is: How exactly does data get marked as being of a particular encoding?
Ah, yes. Uhm. Line disciplines, of course. Plus anything in Perl which yields something which must be expressed as UTF8:

    pack("U*", ...)
    chr( $x )            # $x > 255
    vx.y.z               # max(x,y,z) > 255
    \x{BIGNUM}
    \N{UNICODE THING}

I think that's about it.
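One way to see which of these constructs mark their result is to inspect the flag directly. A minimal sketch, assuming a present-day perl (5.8.1 or later, where utf8::is_utf8 is built in); the flag is an internal detail, and what it reports on the 5.7.0 tree discussed here may well differ:

```perl
use strict;
use warnings;

# Each entry pairs a label with the string that construct produces.
my @samples = (
    [ 'chr(200)'        => chr(200) ],
    [ 'chr(300)'        => chr(300) ],
    [ '"\x{12C}"'       => "\x{12C}" ],
    [ 'v1.300'          => v1.300 ],
    [ 'pack("U*", 200)' => pack("U*", 200) ],
);

# Report whether the internal UTF-8 flag is set on each result.
for my $sample (@samples) {
    my ($label, $string) = @$sample;
    printf "%-18s flag=%d\n", $label, utf8::is_utf8($string) ? 1 : 0;
}
```

The chr(200) row is the interesting control: it holds a character below 256, so nothing forces it into the UTF8 form.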
On Mon, Sep 18, 2000 at 08:08:59PM +0100, Simon Cozens wrote:
On Mon, Sep 18, 2000 at 02:53:04PM -0400, Andy Dougherty wrote:
Thanks. One question I still have is: How exactly does data get marked as being of a particular encoding?
Ah, yes. Uhm. Line disciplines, of course. Plus anything in Perl which yields something which must be expressed as UTF8:

    pack("U*", ...)
    chr( $x )            # $x > 255
    vx.y.z               # max(x,y,z) > 255

Is there a reason a string expressed as vx.y.z should ever not be UTF8? It's already confusing enough to find out what is UTF8 and what isn't; rules like 'a v-string, but if and only if at least one of the components is 256 or more' don't make things any clearer.
It isn't really useful to use v strings for binary data anyway, is it?

    \x{BIGNUM}
    \N{UNICODE THING}

I think that's about it.
Abigail
On Mon, Sep 18, 2000 at 06:24:25PM +0100, Simon Cozens wrote:
Now it also seems to me that it would make sense to print a string of bytes when calling C\
in the scope of C\ , so this should modify my previous patch to C<do_print>.
This would leave us in the wonderful position where C\
Could we, perhaps, make this less confusing somehow?
Andy Dougherty <doughera@lafayette.edu> writes:
On Mon, 18 Sep 2000, Nick Ing-Simmons wrote:
The string that is represented as "\304\254", being character 300 in Unicode, suddenly finds itself treated as two independent bytes, character 196 and character 172.
I like the clarity of having a positive assertion "use bytes" to put at the top.
What are you "asserting"? That is what we are discussing: the _meaning_ of "use bytes".
Does it mean :
A. Everything should fit in a byte in here.
B. Give me any old thing that happens to be about - they are "just bytes".
I could argue that the result should be chr(300 & 255) i.e. chr(44). I am not sure I want to - I think I prefer the 'die'.
chr(44) would definitely be mangling my data.
If you had 'use bytes' in scope then "\304\254" with UTF8 flag set is not _YOUR_ data (your data never has the UTF8 flag set). Someone else gave you it. What do you want to happen in that case?
"Silent" mangling of this sort would probably not make me happy. I expect 'die' would be acceptable. It's a sign from perl that my data isn't what I thought it was. If some module somewhere has incorrectly tagged my binary data as Unicode, then I would prefer to find out as soon as possible so I can work around the problem.
Good.
Dealing with binary data is not Perl's primary focus, and I don't mind jumping through a few extra hoops in order to do so, but please let's not make it hard to do so reliably.
I would prefer it if it said something like:
The C\
pragma asserts that all strings are composed of characters in the range 0..255 (as in perl5.005) for its lexical scope. New strings will not be UTF8 encoded. If code with C\ in scope encounters a string which is UTF8 encoded (e.g. return from a module which does not have C\ ) then string will be decoded, if any larger than 255 is found then perl will C\ giving the value of the first out of range character.
If I correctly understand what you mean by "decoded", then that sounds reasonable to me.
What I mean is if 'ÿ' (say), which is a legal iso-8859-1 8-bit char, has got itself UTF8 encoded, it gets mapped back to its byte value of 0xFF and given to you as such.
Larry Wall <larry@wall.org> writes:
Saying "use bytes" means to enforce pre-Unicode Perl semantics, which includes a healthy dose of agnosticism.
But pre-Unicode perl did not go round UTF8 encoding things. Given that perl-5.6+ may, if code outside the scope of 'use bytes' gets called, what should happen when such a thing gets back to 'use bytes' code?
I think that means "use bytes" treats all strings as buckets of bits regardless of whether the SvUTF8 bit is set. Garbage in, garbage out. If you want to deal with utf8 intelligently within the scope of a "use bytes", you have to look at the SvUTF8 bit yourself.
Which isn't "just like pre-Unicode perl".
In the scope of use bytes what should happen to the SvUTF8 flag as a result of ops?
While this may be "Garbage in, garbage out", the input garbage was neatly separated for re-cycling, and labeled "beware broken glass" where appropriate - now it is all mixed up again.
We can certainly have a pragma that forces all interfaces to iso-8859-1 semantics and tries to do the right thing with any utf8 strings, but "use bytes" isn't that pragma.
Fair enough as that is not what I want for binary data anyway.
Maybe I want yet-another-pragma
use strict 'bytes';
Which asserts all 'characters' are 0..255
On Mon, 18 Sep 2000 20:08:59 +0100, Simon Cozens <simon@cozens.net> wrote:
On Mon, Sep 18, 2000 at 02:53:04PM -0400, Andy Dougherty wrote:
Thanks. One question I still have is: How exactly does data get marked as being of a particular encoding?
Ah, yes. Uhm. Line disciplines, of course. Plus anything in Perl which yields something which must be expressed as UTF8:

    pack("U*", ...)
    chr( $x )            # $x > 255
    vx.y.z               # max(x,y,z) > 255
    \x{BIGNUM}
    \N{UNICODE THING}

I think that's about it.
\p{...} \P{...} \X in regexes?
On Tue, Sep 19, 2000 at 11:14:18AM +0200, H.Merijn Brand wrote:
Ah, yes. Uhm. Line disciplines, of course. Plus anything in Perl which yields something which must be expressed as UTF8:
\p{...} \P{...} \X in regexes?
Well, hmm. Only if the input data is UTF8 in the first place. \p{} doesn't *create* UTF8 data, it just selects a chunk of some already existing UTF8 data:
% ./perl -Ilib -MDevel::Peek -e '"abcd"=~/(\p{IsAlpha})/; $x = $1; Dump($x)'
SV = PVMG(0x81696d0) at 0x817ff14
  REFCNT = 1
  FLAGS = (POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x8185318 "a"\0
  CUR = 1
  LEN = 2
On Tue, Sep 19, 2000 at 08:59:12AM +0100, Nick Ing-Simmons wrote:
Maybe I want yet-another-pragma
use strict 'bytes';
Which asserts all 'characters' are 0..255
I'm reasonably sure you can't have it. :)
Look, when do we test this assertion? Consider:
$x = v300.400.500;
{
use strict 'bytes';
print $x;
}
OK, so we have to test it inside the "print" operator, since that's when our naughty non-byte data gets used.
$x = v300.400.500;
{
use strict 'bytes';
$x .= $x;
}
So we have to test both sides of the concat operator.
In fact, you should be able to see that you have to test all the data coming into each operator. And you have to do this when the operator is used, because you can't test the data any other time, as it may be UTF8 data created outside the scope of your pragma.
So, you seem to be wanting a run-time assertion inserted into every single op, looking at every single byte in every single piece of data used by that op.
I don't want to even *think* about implementing that.
Alternatively, you can just make it the scope of the entire program, and have it turn off the ability to use Unicode data in any way, shape or form.
package unicode;
sub unimport { exec "perl5.005", $0, @ARGV }
At 20:08 +0100 2000-09-18, Simon Cozens wrote:
Plus anything in Perl which yields something which must be expressed as UTF8:

    pack("U*", ...)
    chr( $x )            # $x > 255
    vx.y.z               # max(x,y,z) > 255
    \x{BIGNUM}
    \N{UNICODE THING}

My researches suggest that, like the second and third cases, the last only produces a UTF8 string (and marks the scalar appropriately) if the resulting code is > 255:
ppp100 domo$ perl -Mcharnames=:full -MDevel::Peek -e 'Dump("\N{LATIN SMALL LETTER THORN}")'
SV = PV(0x7886be8) at 0x787a7ec
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK)
  PV = 0x7888398 "\376"\0
  CUR = 1
  LEN = 2
$ perl -Mcharnames=:full -MDevel::Peek -e 'Dump("\N{RUNIC LETTER THURISAZ THURS THORN}")'
SV = PV(0x7886be8) at 0x787a7ec
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK,UTF8)
  PV = 0x7888388 "\341\232\246"\0
  CUR = 3
  LEN = 4
(Unpatched perl5.7.0, BTW.)
On Mon, 18 Sep 2000, Simon Cozens wrote:
On Mon, Sep 18, 2000 at 02:53:04PM -0400, Andy Dougherty wrote:
Thanks. One question I still have is: How exactly does data get marked as being of a particular encoding?
Ah, yes. Uhm. Line disciplines, of course. Plus anything in Perl which yields something which must be expressed as UTF8:
Ok. I see. Most of those are within the programmer's immediate control and so are not really a problem for me. But until line disciplines get nailed down[*] (a big job, I know) I guess it's not completely settled. That's ok with me for the moment. I'm not looking for final answers now, just trying to get a good handle on the questions I should be asking.
Thanks,
Andy Dougherty doughera@lafayette.edu
[*] e.g. when/how do they kick in? Do they affect read(), sysread(), <>? How about recv()? How about System V IPC?
On Tue, 19 Sep 2000, Nick Ing-Simmons wrote:
Larry Wall <larry@wall.org> writes:
Saying "use bytes" means to enforce pre-Unicode Perl semantics, which includes a healthy dose of agnosticism.
But pre-Unicode perl did not go round UTF8 encoding things. Given that perl-5.6+ may, if code outside the scope of 'use bytes' gets called, what should happen when such a thing gets back to 'use bytes' code?
Ideally, that shouldn't happen :-). So perhaps, as a first pass, we just die? Then, if we find out we're dying way too often, we try to figure out something smarter to do.
Andy Dougherty <doughera@lafayette.edu> writes:
On Tue, 19 Sep 2000, Nick Ing-Simmons wrote:
Larry Wall <larry@wall.org> writes:
Saying "use bytes" means to enforce pre-Unicode Perl semantics, which includes a healthy dose of agnosticism.
But pre-Unicode perl did not go round UTF8 encoding things. Given that perl-5.6+ may, if code outside the scope of 'use bytes' gets called, what should happen when such a thing gets back to 'use bytes' code?
Ideally, that shouldn't happen :-). So perhaps, as a first pass, we just die? Then, if we find out we're dying way too often, we try to figure out something smarter to do.
Fine by me - anything which does not just return the UTF8 encoding is better than what we have now :-(
Simon Cozens <simon@cozens.net> writes:
I (foolishly) said:
Maybe I want yet-another-pragma
use strict 'bytes';
Which asserts all 'characters' are 0..255
use strict 'bytes'; $x .= $x;
So we have to test both sides of the concat operator.
So, you seem to be wanting a run-time assertion inserted into every single op, looking at every single byte in every single piece of data used by that op.
I don't want to even *think* about implementing that.
Eeek, thanks for thinking it through for me. That is quite horrible. I guess all I can hope for is an output line discipline that checks the end result.
you can't test the data any other time, as it may be UTF8 data created outside the scope of your pragma.
Well you could see if 'BYTES' was on anywhere up the call stack, but we had this daft idea way back when we were not tagging the data; the data-tag was because that got silly.
On Mon, Sep 18, 2000 at 08:08:59PM +0100, Simon Cozens <simon@cozens.net> wrote:
Ah, yes. Uhm. Line disciplines, of course. Plus anything in Perl which yields something which must be expressed as UTF8:
You forgot:
$utf8 = "string";
(currently you have to enable this with use utf8, though ;)
PS: eval currently ignores the utf8-setting on the string, which IMHO is a very-low-priority-bug. Here is a testcase:

    use Convert::Scalar qw(:utf8);
    $x = "'\x{1234}'";
    utf8 $x or die; # test for utf8, should not and does not die.
    $y = eval $x;
    utf8 $y or die; # test for utf8, should not but _does_ die

I'm closing this as a bug, though as a discussion thread about UTF-8 this is still good reading. Note also that even if closed as a bug the original behaviour remains (now with a slightly more verbose error message). In other words, as far as I am able to interpret the current plan, this is not a bug.
Migrated from rt.perl.org#4322 (status was 'resolved')
Searchable as RT4322$