Perl / perl5

unpack("U*", "\xDD") #2634

Closed p5pRT closed 20 years ago

p5pRT commented 23 years ago

Migrated from rt.perl.org#4322 (status was 'resolved')

Searchable as RT4322$

p5pRT commented 23 years ago

From @gisle

Created by @gisle

    $ perl -le 'print unpack("U*", "\xDD")'
    Malformed UTF-8 character at -e line 1.
    65533

I would expect it to print "219\n".

Perl Info

```
Flags:
    category=core
    severity=low

Site configuration information for perl v5.7.0:

Configured by gisle at Tue Sep 5 09:56:22 CEST 2000.

Summary of my perl5 (revision 5.0 version 7 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.2.14, archname=i686-linux-thread-multi
    uname='linux eik 2.2.14 #1 fri mar 17 11:59:50 gmt 2000 i686 unknown '
    config_args='-Dusedevel -Dprefix=/local/perl/5.7.0_thr -Dusethreads -Doptimize=-g -ders'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=undef d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g',
    cppflags='-D_REENTRANT -DDEBUGGING -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='2.95.2 19991024 (release)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -lgdbm -ldbm -ldb -ldl -lm -lpthread -lc -lposix -lcrypt -lutil
    libc=, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

@INC for perl v5.7.0:
    /local/perl/5.7.0_thr/lib/5.7.0/i686-linux-thread-multi
    /local/perl/5.7.0_thr/lib/5.7.0
    /local/perl/5.7.0_thr/lib/site_perl/5.7.0/i686-linux-thread-multi
    /local/perl/5.7.0_thr/lib/site_perl/5.7.0
    /local/perl/5.7.0_thr/lib/site_perl
    .

Environment for perl v5.7.0:
    HOME=/home/gisle
    LANG=POSIX
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/local/perl/5.7.0_thr/bin:...
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 23 years ago

From @simoncozens

On Sun, Sep 17, 2000 at 12:59:15AM -0000, gisle@aas.no wrote:

>     $ perl -le 'print unpack("U*", "\xDD")'
>     Malformed UTF-8 character at -e line 1.
>     65533
>
> I would expect it to print "219\n".

    % ./perl -le 'print "So would I" if "\xDD" eq pack("U*", 219)'
    %

It would break symmetry, but I understand why you'd want it. Do you still want it?

p5pRT commented 23 years ago

From @simoncozens

On Sun, Sep 17, 2000 at 12:59:15AM -0000, gisle@aas.no wrote:

>     $ perl -le 'print unpack("U*", "\xDD")'
>     Malformed UTF-8 character at -e line 1.
>     65533

Ah, sorry, I understand this now; what you'd like is that "U" detects whether each character in a string is valid UTF8 or not, and behaves like "A" if it isn't. To be blunt, I don't believe that's what "U" is for; it's for decoding UTF8, and "\xDD" isn't UTF8. Was the problem that you thought \xDD *should* be UTF8?

p5pRT commented 23 years ago

From @gisle

Simon Cozens <simon@cozens.net> writes:

> On Sun, Sep 17, 2000 at 12:59:15AM -0000, gisle@aas.no wrote:
>
> >     $ perl -le 'print unpack("U*", "\xDD")'
> >     Malformed UTF-8 character at -e line 1.
> >     65533
> >
> > I would expect it to print "219\n".

Oops! I should have written 221 here.

>     % ./perl -le 'print "So would I" if "\xDD" eq pack("U*", 219)'
>     %

And eq is broken too!

What is frustrating is that you seem to patch the UTF8 support in a different direction than I did half a year ago. For instance I fixed eq in change #5921 (and sv_cmp in change #5138). Then you basically undid that in change #6465. You seem to think it matters if a string is upgraded to UTF8. I think it is a bug if it matters.

And why did you end the UTF8 test you introduced for sv_eq with "&& 0"? Even if we fix that, I still think my approach was the better one.

For reference I include some direct links to the patches mentioned:

    ftp://ftp.linux.activestate.com/pub/staff/gsar/APC/5.5.660/diffs/5138.gz
    ftp://ftp.linux.activestate.com/pub/staff/gsar/APC/5.7.0/diffs/5921.gz
    ftp://ftp.linux.activestate.com/pub/staff/gsar/APC/5.7.0/diffs/6465.gz

Regards,
Gisle

p5pRT commented 23 years ago

From @gisle

Simon Cozens <simon@cozens.net> writes:

> On Sun, Sep 17, 2000 at 12:59:15AM -0000, gisle@aas.no wrote:
>
> >     $ perl -le 'print unpack("U*", "\xDD")'
> >     Malformed UTF-8 character at -e line 1.
> >     65533
>
> Ah, sorry, I understand this now; what you'd like is that "U" detects whether each character in a string is valid UTF8 or not, and behaves like "A" if it isn't.

No. I actually think that pack("U") should go away and that pack("C") should deal with values > 255. But if we keep it then it should be made to work.

> To be blunt, I don't believe that's what "U" is for; it's for decoding UTF8, and "\xDD" isn't UTF8.

"U" is *not* for decoding UTF8. "U" is "C" extended to work for a wider range of character ordinals. The internal UTF8 representation is not supposed to leak out like that.

> Was the problem that you thought \xDD *should* be UTF8?

Not exactly.

Regards,
Gisle

p5pRT commented 23 years ago

From @gisle

Just another example to illustrate my point. IMHO, all of these should print the same thing:

    $ perl -le 'print unpack("U", v221.300)'
    221
    $ perl -le 'print unpack("U", v221.200)'
    Malformed UTF-8 character at -e line 1.
    65533
    $ perl -le 'print unpack("C", v221.300)'
    195
    $ perl -le 'print unpack("C", v221.200)'
    221
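
For comparison, the `U` cases above can be written as a small script. This is only a sketch: the variable names are illustrative, and the values in the comments are what I would expect from a recent perl (5.10 or later, where unpack works character by character), not from the 5.7.0 build reported in this ticket.

```perl
use strict;
use warnings;

# One character, ordinal 221, stored as a single byte (UTF8 flag off).
my $narrow = "\xDD";

# The same first character followed by chr(300); the wide character makes
# perl store the whole string UTF-8 encoded internally (UTF8 flag on).
my $wide = "\xDD" . chr(300);

my @from_narrow = unpack("U*", $narrow);
my @from_wide   = unpack("U*", $wide);

print "narrow: @from_narrow\n";   # 221 on a recent perl
print "wide:   @from_wide\n";     # 221 300 on a recent perl

# pack("U") and the plain byte string compare equal as characters.
print "round trip ok\n" if pack("U*", 221) eq $narrow;
```

On such a perl the first character comes back as 221 whichever way the string happens to be stored, which is the behaviour asked for here.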

p5pRT commented 23 years ago

From @simoncozens

On Sun, Sep 17, 2000 at 01:46:53PM +0200, Gisle Aas wrote:

> What is frustrating is that you seem to patch the UTF8 support in a different direction than I did half a year ago. For instance I fixed eq in change #5921 (and sv_cmp in change #5138). Then you basically undid that in change #6465. You seem to think it matters if a string is upgraded to UTF8. I think it is a bug if it matters.

I did that in response to a bug report, where for someone it *did* matter. I agree that it's a bug if it matters. However, I don't think we should do it.

When binary operators start modifying their operands in *any* way, that's screwy and I don't like it.

> And why did you end the UTF8 test you introduced for sv_eq with "&& 0"?

I don't remember doing so. I looked at that earlier this morning and realised that was what was stopping eq from doing the right thing. I wouldn't have put && 0 in there because it blatantly negates the entire point of my patch.

Simon

p5pRT commented 23 years ago

From @simoncozens

On Sun, Sep 17, 2000 at 01:57:46PM +0200, Gisle Aas wrote:

> > To be blunt, I don't believe that's what "U" is for; it's for decoding UTF8, and "\xDD" isn't UTF8.
>
> "U" is *not* for decoding UTF8. "U" is "C" extended to work for a wider range of character ordinals.

Incorrect.

    U   A Unicode character number. Encodes to UTF-8 internally.
        Works even if C<use utf8> is not in effect.

Simon

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Gisle Aas <gisle@ActiveState.com> writes:

> What is frustrating is that you seem to patch the UTF8 support in a different direction than I did half a year ago. For instance I fixed eq in change #5921 (and sv_cmp in change #5138). Then you basically undid that in change #6465. You seem to think it matters if a string is upgraded to UTF8. I think it is a bug if it matters.

We need to agree what the goal is or we are not going to get there...

FWIW - my leaning is towards Gisle's view here. Upgrading a string to UTF8 is no more a bug than upgrading to SvPVIV or whatever is, and binary ops do that kind of thing.

It is not supposed to matter which form the string is in. In essence perl strings are arrays of chars where chars can be bigger than 255. The SvUTF8_on flag says that those chars are UTF8 encoded. The "lack" of SvUTF8 flag says "all these chars are < 256 so we did not encode them".
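
The "array of chars" model described above can be checked from Perl code on a perl that provides the `utf8::upgrade` and `utf8::is_utf8` helpers (they postdate this thread). A minimal sketch, with illustrative variable names:

```perl
use strict;
use warnings;

my $bytes    = chr(200);   # stored as one byte, UTF8 flag off
my $upgraded = $bytes;
utf8::upgrade($upgraded);  # same character, now stored UTF-8 encoded, flag on

printf "flag off/on: %d/%d\n",
    utf8::is_utf8($bytes) ? 1 : 0, utf8::is_utf8($upgraded) ? 1 : 0;

# Seen as arrays of characters the two strings are indistinguishable.
print "same length\n"   if length($bytes) == length($upgraded);
print "same ordinal\n"  if ord($bytes) == ord($upgraded);
print "compare equal\n" if $bytes eq $upgraded;
```

Only the flag differs between the two scalars; length, ordinal and `eq` see the same single character.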

p5pRT commented 23 years ago

From @simoncozens

On Sun, Sep 17, 2000 at 09:21:41PM +0100, Nick Ing-Simmons wrote:

> We need to agree what the goal is or we are not going to get there...

Hmm. OK. We've got two separate issues here.

1) eq was broken. It's my contention that it could be debroken by removing that dodgy "&& 0", but things like that are usually in because other stuff broke elsewhere. As I said, I don't remember doing that and it would completely negate the point of the patch if it *was* me. (That's not, of course, to say that I *didn't* put it there, just that it was fundamentally stupid if I did.)

2) There's some disagreement on what pack("U") does, or should do. I don't have to worry about what it should do, I only work here. I think it encodes into UTF8, and the documentation and perl's behaviour are consistent with that belief. Gisle thinks that it's an extension of pack("C"). If it *should* be an extension of pack("C"), it's not behaving like one and it's buggy. If it should encode to UTF8, no bug. My Camel3 hasn't arrived yet, so I can't even turn to that.

> FWIW - my leaning is towards Gisle's view here. Upgrading a string to UTF8 is no more a bug than upgrading to SvPVIV or whatever is, and binary ops do that kind of thing.

NO! Look, scary things can happen.

    $x = chr(200);
    $y = pack("U*", 200);

    print "Good, eq works now.\n" if $x eq $y;
    byte_write($x);

    sub byte_write {
        # Write out our arguments faithfully to a file
        use bytes; # [1]
        open OUT, ">file" or die $!;
        print OUT @_;
        close OUT
    }

Now, how many bytes do you expect to be put in that file? I want one. chr(200) is one byte, as far as I'm concerned. And $x was one byte, until it got stealthily upgraded to UTF8. Now it's two bytes, even though we never told Perl to modify it. So, what, due to spooky upgrading, we now can't reliably mix binary (byte) and UTF8 data? And this *ISN'T* a bug?

[1] We don't need that, actually, because at present, printing a UTF8 string prints the bytes, not the characters. I no longer have any idea at all whether that's a bug. You asked it to print a UTF8 string, it printed UTF8. Isn't that what you asked for?

p5pRT commented 23 years ago

From @simoncozens

On Sun, Sep 17, 2000 at 10:39:41PM +0100, Simon Cozens wrote:

> 1) eq was broken. It's my contention that it could be debroken by removing that dodgy "&& 0"

Yep:

    % ./perl -le 'print "Yes" if chr(200) eq pack("U*",200)'
    Yes

There's probably a good reason for not applying the below. No, I don't know what it is.

Simon

Inline Patch

```diff
--- perl/sv.c.~1~ Sun Sep 17 23:37:35 2000
+++ perl/sv.c Sun Sep 17 23:37:35 2000
@@ -4085,7 +4085,7 @@
     pv2 = SvPV(sv2, cur2);
 
     /* do not utf8ize the comparands as a side-effect */
-    if (cur1 && cur2 && SvUTF8(sv1) != SvUTF8(sv2) && !IN_BYTE && 0) {
+    if (cur1 && cur2 && SvUTF8(sv1) != SvUTF8(sv2) && !IN_BYTE) {
         if (SvUTF8(sv1)) {
             pv2 = (char*)bytes_to_utf8((U8*)pv2, &cur2);
             pv2tmp = TRUE;
End of Patch.
```

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Simon Cozens <simon@cozens.net> writes:

> > FWIW - my leaning is towards Gisle's view here. Upgrading a string to UTF8 is no more a bug than upgrading to SvPVIV or whatever is, and binary ops do that kind of thing.
>
> NO! Look, scary things can happen.
>
>     $x = chr(200);
>     $y = pack("U*", 200);
>
>     print "Good, eq works now.\n" if $x eq $y;
>     byte_write($x);
>
>     sub byte_write {
>         # Write out our arguments faithfully to a file
>         use bytes; # [1]
>         open OUT, ">file" or die $!;
>         print OUT @_;
>         close OUT
>     }
>
> Now, how many bytes do you expect to be put in that file? I want one. chr(200) is one byte, as far as I'm concerned. And $x was one byte, until it got stealthily upgraded to UTF8. Now it's two bytes, even though we never told Perl to modify it. So, what, due to spooky upgrading, we now can't reliably mix binary (byte) and UTF8 data? And this *ISN'T* a bug?

Yes it is a bug - in print! It should downgrade the UTF8 version of the character. But then you did put in a 'use bytes'. Ilya and I argued hard that 'use bytes' is alien to the whole "transparent representation" concept. But as I understand the eventual definition, in the scope of a 'use bytes' any UTF8 encoded chars are downgraded, with a hard fail (= die) if any are > 255.

> [1] We don't need that, actually, because at present, printing a UTF8 string prints the bytes, not the characters. I no longer have any idea at all whether that's a bug. You asked it to print a UTF8 string, it printed UTF8. Isn't that what you asked for?

This is the fundamental breakage - we have no way to "ask" for anything on IO.

p5pRT commented 23 years ago

From @simoncozens

On Mon, Sep 18, 2000 at 08:48:39AM +0100, Nick Ing-Simmons wrote:

> Yes it is a bug - in print! It should downgrade the UTF8 version of the character.

Okay, I shall make it so. On your head be it. :)

> But then you did put in a 'use bytes'. Ilya and I argued hard that 'use bytes' is alien to the whole "transparent representation" concept. But as I understand the eventual definition, in the scope of a 'use bytes' any UTF8 encoded chars are downgraded, with a hard fail (= die) if any are > 255.

With all due respect, that's not what "use bytes" does, and it doesn't do that for a reason.

I believe that "use bytes" treats everything as a string of bytes, not as a string of characters. The string that is represented as "\304\254", being character 300 in Unicode, suddenly finds itself treated as two independent bytes, character 196 and character 172.

That's what it claims to do...

    The C<use bytes> pragma disables character semantics for the rest of the
    lexical scope in which it appears.

...and that's indeed what it does:

    % ./perl -Ilib -le '$x = chr(300); print "As characters: ", length $x; { use bytes; print "As a series of bytes: ", length $x }'
    As characters: 1
    As a series of bytes: 2

No downgrading there at all. But don't tell me: that's a bug, right?

p5pRT commented 23 years ago

From @simoncozens

On Mon, Sep 18, 2000 at 09:15:35AM +0100, Simon Cozens wrote:

> On Mon, Sep 18, 2000 at 08:48:39AM +0100, Nick Ing-Simmons wrote:
>
> > Yes it is a bug - in print! It should downgrade the UTF8 version of the character.
>
> Okay, I shall make it so. On your head be it. :)

This'll make print downgrade output *if possible*. Did you want it to croak on

    print chr(300)

or leave it as UTF8? If the former, change TRUE to FALSE.

Inline Patch

```diff
--- perl/doio.c.~1~ Mon Sep 18 09:25:40 2000
+++ perl/doio.c Mon Sep 18 09:25:40 2000
@@ -1168,6 +1168,7 @@
         }
         /* FALL THROUGH */
     default:
+        sv_utf8_downgrade(sv, TRUE);
         tmps = SvPV(sv, len);
         break;
     }
End of Patch.
```

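
The TRUE/FALSE choice in the patch above has a Perl-level counterpart on later perls, where `utf8::downgrade($string, $fail_ok)` takes the same kind of "failure is OK" flag. The sketch below is an illustration of the two policies under that assumption, not the code path `print` itself uses; the variable names are made up.

```perl
use strict;
use warnings;

my $small = chr(200);
utf8::upgrade($small);          # force the UTF-8 encoded representation
my $big = chr(300);             # cannot be stored one byte per character

# FAIL_OK true: downgrade when possible, otherwise leave the string alone.
my $small_ok = utf8::downgrade($small, 1) ? 1 : 0;
my $big_ok   = utf8::downgrade($big, 1) ? 1 : 0;
print "small downgraded: $small_ok, big downgraded: $big_ok\n";

# FAIL_OK false (the default): a character above 255 is a fatal error.
my $died = eval { utf8::downgrade($big); 1 } ? 0 : 1;
print "strict downgrade died: $died\n";
```

With the flag set, `chr(300)` is left as UTF8; without it, the attempt dies, which corresponds to the "croak" option being asked about.
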
p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Simon Cozens <simon@cozens.net> writes:

> > But then you did put in a 'use bytes'. Ilya and I argued hard that 'use bytes' is alien to the whole "transparent representation" concept. But as I understand the eventual definition, in the scope of a 'use bytes' any UTF8 encoded chars are downgraded, with a hard fail (= die) if any are > 255.
>
> With all due respect,

Thanks but I have a thick skin ;-) I may well have mis-remembered the resolution.

With a view to clearing up the meaning I have copied a few folk I know have discussed this stuff in the past.

> that's not what "use bytes" does, and it doesn't do that for a reason.

What is the reason?

I can understand a pragma which does what I think it does, but given that (in my view quite correctly) the UTF8 flag's state depends on the value's history I cannot see how blindly treating SvPV as 'bytes' is any use whatsoever.

Note that in the "normal" case of (say)

    #!perl
    use bytes;
    ...

It is moot as that file never turns the thing on. It may still get UTF8 from modules and things though.

> I believe that "use bytes" treats everything as a string of bytes, not as a string of characters.

So do I. And I believe it should squeal loudly if any "byte" turns out to be >= 256.

> The string that is represented as "\304\254", being character 300 in Unicode, suddenly finds itself treated as two independent bytes, character 196 and character 172.

I could argue that result should be chr(300 & 255) i.e. chr(44). I am not sure I want to - I think I prefer the 'die'.

> That's what it claims to do...
>
>     The C<use bytes> pragma disables character semantics for the rest of the
>     lexical scope in which it appears.

Which is a tad vague.

I would prefer it if it said something like:

    The C<use bytes> pragma asserts that all strings are composed of characters
    in the range 0..255 (as in perl5.005) for its lexical scope. New strings
    will not be UTF8 encoded. If code with C<use bytes> in scope encounters a
    string which is UTF8 encoded (e.g. return from a module which does not have
    C<use bytes>) then string will be decoded, if any larger than 255 is found
    then perl will C<die> giving the value of the first out of range character.

> ...and that's indeed what it does:
>
>     % ./perl -Ilib -le '$x = chr(300); print "As characters: ", length $x; { use bytes; print "As a series of bytes: ", length $x }'
>     As characters: 1
>     As a series of bytes: 2
>
> No downgrading there at all. But don't tell me: that's a bug, right?

_I_ think so - but if my mental model of this stuff turns out to be wrong then it may not be.

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Simon Cozens <simon@cozens.net> writes:

> On Mon, Sep 18, 2000 at 09:15:35AM +0100, Simon Cozens wrote:
>
> > On Mon, Sep 18, 2000 at 08:48:39AM +0100, Nick Ing-Simmons wrote:
> >
> > > Yes it is a bug - in print! It should downgrade the UTF8 version of the character.
> >
> > Okay, I shall make it so. On your head be it. :)
>
> This'll make print downgrade output *if possible*. Did you want it to croak on print chr(300)

I think it should croak or print \x{12C} or some other "escaped" representation. That is, it should croak. When we have 'em, the default output discipline should probably make some representation of out-of-bounds chars.

> or leave it as UTF8? If the former, change TRUE to FALSE.
>
>     --- perl/doio.c.~1~ Mon Sep 18 09:25:40 2000
>     +++ perl/doio.c Mon Sep 18 09:25:40 2000
>     @@ -1168,6 +1168,7 @@
>              }
>              /* FALL THROUGH */
>          default:
>     +        sv_utf8_downgrade(sv, TRUE);
>              tmps = SvPV(sv, len);
>              break;
>          }
>     End of Patch.

p5pRT commented 23 years ago

From @doughera88

On Mon, 18 Sep 2000, Nick Ing-Simmons wrote:

> > The string that is represented as "\304\254", being character 300 in Unicode, suddenly finds itself treated as two independent bytes, character 196 and character 172.

I don't have a deep understanding of Unicode, nor am I particularly interested in becoming an expert. However, I do sometimes process binary data in perl, and sequences of bytes in that binary data are often the same as sequences of bytes used to represent some Unicode character.

What I want is some way to be *sure* that my data isn't mangled. C<use bytes;> is, I hope, one way to do that. Yes, I know that perl will usually "Do the Right Thing", but the following caveat in perlunicode.pod

    Whether an arbitrary piece of data will be treated as
    "characters" or "bytes" by internal operations cannot be
    divined at the current time

doesn't give me overwhelming confidence that perl will _always_ do the right thing :-).

I like the clarity of having a positive assertion "use bytes" to put at the top.

> I could argue that result should be chr(300 & 255) i.e. chr(44). I am not sure I want to - I think I prefer the 'die'.

chr(44) would definitely be mangling my data. "Silent" mangling of this sort would probably not make me happy. I expect 'die' would be acceptable. It's a sign from perl that my data isn't what I thought it was. If some module somewhere has incorrectly tagged my binary data as Unicode, then I would prefer to find out as soon as possible so I can work around the problem.

Dealing with binary data is not Perl's primary focus, and I don't mind jumping through a few extra hoops in order to do so, but please let's not make it hard to do so reliably.

> I would prefer it if it said something like:
>
>     The C<use bytes> pragma asserts that all strings are composed of characters
>     in the range 0..255 (as in perl5.005) for its lexical scope. New strings
>     will not be UTF8 encoded. If code with C<use bytes> in scope encounters a
>     string which is UTF8 encoded (e.g. return from a module which does not have
>     C<use bytes>) then string will be decoded, if any larger than 255 is found
>     then perl will C<die> giving the value of the first out of range character.

If I correctly understand what you mean by "decoded", then that sounds reasonable to me.

p5pRT commented 23 years ago

From @TimToady

Saying "use bytes" means to enforce pre-Unicode Perl semantics, which includes a healthy dose of agnosticism. I think that means "use bytes" treats all strings as buckets of bits regardless of whether the SvUTF8 bit is set. Garbage in, garbage out. If you want to deal with utf8 intelligently within the scope of a "use bytes", you have to look at the SvUTF8 bit yourself.

We can certainly have a pragma that forces all interfaces to iso-8859-1 semantics and tries to do the right thing with any utf8 strings, but "use bytes" isn't that pragma.

Larry

p5pRT commented 23 years ago

From @simoncozens

On Mon, Sep 18, 2000 at 09:59:52AM -0700, Larry Wall wrote:

> I think that means "use bytes" treats all strings as buckets of bits regardless of whether the SvUTF8 bit is set.

Gotcha. Perhaps this calls for a docpatch.

Inline Patch

```diff
--- perl/lib/bytes.pm.~1~ Mon Sep 18 18:19:24 2000
+++ perl/lib/bytes.pm Mon Sep 18 18:19:24 2000
@@ -38,11 +38,28 @@
 lexical scope in which it appears.  C<no bytes> can be used to reverse
 the effect of C<use bytes> within the current lexical scope.
 
-Perl normally assumes character semantics in the presence of
-character data (i.e. data that has come from a source that has
-been marked as being of a particular character encoding).
+Perl normally assumes character semantics in the presence of character
+data (i.e. data that has come from a source that has been marked as
+being of a particular character encoding). When C<use bytes> is in
+effect, the encoding is temporarily ignored, and each string is treated
+as a series of bytes.
+
+As an example, when Perl sees C<$x = chr(400)>, it encodes the character
+in UTF8 and stores it in $x. Then it is marked as character data, so,
+for instance, C<length $x> returns C<1>. However, in the scope of the
+C<bytes> pragma, $x is treated as a series of bytes - the bytes that make
+up the UTF8 encoding - and C<length $x> returns C<2>:
+
+    $x = chr(400);
+    print "Length is ", length $x, "\n";     # "Length is 1"
+    printf "Contents are %vd\n", $x;         # "Contents are 400"
+    {
+        use bytes;
+        print "Length is ", length $x, "\n"; # "Length is 2"
+        printf "Contents are %vd\n", $x;     # "Contents are 198.144"
+    }
 
-To understand the implications and differences between character
+For more on the implications and differences between character
 semantics and byte semantics, see L<perlunicode>.
 
 =head1 SEE ALSO
```

End of Patch.

Now it also seems to me that it would make sense to print a string of bytes when calling C<print> in the scope of C<use bytes>, so this should modify my previous patch to C<do_print>.

Inline Patch

```diff
--- perl/doio.c.~1~ Mon Sep 18 18:24:07 2000
+++ perl/doio.c Mon Sep 18 18:24:07 2000
@@ -1168,7 +1168,8 @@
         }
         /* FALL THROUGH */
     default:
-        sv_utf8_downgrade(sv, TRUE);
+        if (!IN_BYTE)
+            sv_utf8_downgrade(sv, TRUE);
         tmps = SvPV(sv, len);
         break;
     }
End of Patch.
```

p5pRT commented 23 years ago

From @doughera88

On Mon, 18 Sep 2000, Simon Cozens wrote:

>     +Perl normally assumes character semantics in the presence of character
>     +data (i.e. data that has come from a source that has been marked as
>     +being of a particular character encoding). When C<use bytes> is in
>     +effect, the encoding is temporarily ignored, and each string is treated
>     +as a series of bytes.

Thanks. One question I still have is: How exactly does data get marked as being of a particular encoding?

    Andy Dougherty doughera@lafayette.edu
    Dept. of Physics
    Lafayette College, Easton PA 18042

p5pRT commented 23 years ago

From @doughera88

On Mon, 18 Sep 2000, Larry Wall wrote:

> Saying "use bytes" means to enforce pre-Unicode Perl semantics, which includes a healthy dose of agnosticism. I think that means "use bytes" treats all strings as buckets of bits regardless of whether the SvUTF8 bit is set. Garbage in, garbage out.

Sounds good (except in my present case that's Garbage in, Grant-proposal out :-). "treats all strings as buckets of bits" sounds like a fine description understandable to someone like me who just wants the raw bits.

Thanks,

    Andy Dougherty doughera@lafayette.edu

p5pRT commented 23 years ago

From @simoncozens

On Mon, Sep 18, 2000 at 02:53:04PM -0400, Andy Dougherty wrote:

> Thanks. One question I still have is: How exactly does data get marked as being of a particular encoding?

Ah, yes. Uhm. Line disciplines, of course. Plus anything in Perl which yields something which must be expressed as UTF8:

    pack("U*", ...)
    chr( $x )            # $x > 255
    vx.y.z               # max(x,y,z) > 255
    \x{BIGNUM}
    \N{UNICODE THING}

I think that's about it.
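
A few entries from that list can be checked directly on a perl that has `utf8::is_utf8` (a later addition than this thread). The sketch below covers only `chr`, `pack("U")` and `\x{...}`; the hash and its labels are purely illustrative.

```perl
use strict;
use warnings;

# Label => value; the labels are just text describing how each value was made.
my %sample = (
    'chr(200)'       => chr(200),
    'chr(300)'       => chr(300),
    'pack("U", 200)' => pack("U", 200),
    '"\x{12C}"'      => "\x{12C}",
    '"\xC8"'         => "\xC8",
);

# Record whether each value carries the internal UTF8 flag.
my %flagged = map { $_ => (utf8::is_utf8($sample{$_}) ? 1 : 0) } keys %sample;

printf "%-16s UTF8 flag: %d\n", $_, $flagged{$_} for sort keys %sample;
```

`chr(200)` and `pack("U", 200)` hold the same character, yet only the latter is flagged, which is the representation difference the rest of this thread argues about.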

p5pRT commented 23 years ago

From @abigail

On Mon, Sep 18, 2000 at 08:08:59PM +0100, Simon Cozens wrote:

> On Mon, Sep 18, 2000 at 02:53:04PM -0400, Andy Dougherty wrote:
>
> > Thanks. One question I still have is: How exactly does data get marked as being of a particular encoding?
>
> Ah, yes. Uhm. Line disciplines, of course. Plus anything in Perl which yields something which must be expressed as UTF8:
>
>     pack("U*", ...)
>     chr( $x )            # $x > 255
>     vx.y.z               # max(x,y,z) > 255

Is there a reason a string expressed as vx.y.z should ever not be UTF8? It's already confusing enough to find out what is UTF8 and what isn't; rules like 'a v-string, but if and only if at least one of the components is 256 or more' don't make things any clearer.

It isn't really useful to use v strings for binary data anyway, is it?

>     \x{BIGNUM}
>     \N{UNICODE THING}
>
> I think that's about it.

Abigail

p5pRT commented 23 years ago

From @simoncozens

On Mon, Sep 18, 2000 at 06:24:25PM +0100, Simon Cozens wrote:

> Now it also seems to me that it would make sense to print a string of bytes when calling C<print> in the scope of C<use bytes>, so this should modify my previous patch to C<do_print>.

This would leave us in the wonderful position where C<print> in C<no bytes> would only print out bytes, and C<print> in C<use bytes> would print out UTF8.

Could we, perhaps, make this less confusing somehow?

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Andy Dougherty <doughera@lafayette.edu> writes:

> On Mon, 18 Sep 2000, Nick Ing-Simmons wrote:
>
> > > The string that is represented as "\304\254", being character 300 in Unicode, suddenly finds itself treated as two independent bytes, character 196 and character 172.
>
> I like the clarity of having a positive assertion "use bytes" to put at the top.

What are you "asserting"? That is what we are discussing: the _meaning_ of "use bytes".

Does it mean:

A. Everything should fit in a byte in here.
B. Give me any old thing that happens to be about - they are "just bytes".

> > I could argue that result should be chr(300 & 255) i.e. chr(44). I am not sure I want to - I think I prefer the 'die'.
>
> chr(44) would definitely be mangling my data.

If you had 'use bytes' in scope then "\304\254" with UTF8 flag set is not _YOUR_ data (your data never has UTF8 flag set). Someone else gave you it. What do you want to happen in that case?

> "Silent" mangling of this sort would probably not make me happy. I expect 'die' would be acceptable. It's a sign from perl that my data isn't what I thought it was. If some module somewhere has incorrectly tagged my binary data as Unicode, then I would prefer to find out as soon as possible so I can work around the problem.

Good.

> Dealing with binary data is not Perl's primary focus, and I don't mind jumping through a few extra hoops in order to do so, but please let's not make it hard to do so reliably.
>
> > I would prefer it if it said something like:
> >
> >     The C<use bytes> pragma asserts that all strings are composed of characters
> >     in the range 0..255 (as in perl5.005) for its lexical scope. New strings
> >     will not be UTF8 encoded. If code with C<use bytes> in scope encounters a
> >     string which is UTF8 encoded (e.g. return from a module which does not have
> >     C<use bytes>) then string will be decoded, if any larger than 255 is found
> >     then perl will C<die> giving the value of the first out of range character.
>
> If I correctly understand what you mean by "decoded", then that sounds reasonable to me.

What I mean is if 'ÿ' (say), which is a legal iso-8859-1 8-bit char, has got itself UTF8 encoded, it gets mapped back to its byte value of 0xFF and given to you as such.
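
The "decode, or die naming the first out-of-range character" policy can be sketched in plain Perl. `bytes_or_die` below is a hypothetical helper written for illustration only; it is not part of perl or of the `bytes` pragma, and it assumes a perl with the `utf8::` helper functions.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Perl: return a one-byte-per-character
# copy of $string, or die naming the first character that does not fit.
sub bytes_or_die {
    my ($string) = @_;
    if ($string =~ /([^\x00-\xFF])/) {
        die sprintf("character %d is out of byte range\n", ord $1);
    }
    utf8::downgrade($string);   # safe: every character is <= 255
    return $string;
}

my $y_diaeresis = chr(0xFF);
utf8::upgrade($y_diaeresis);                 # two bytes internally
my $decoded = bytes_or_die($y_diaeresis);    # back to the single byte 0xFF

my $error = eval { bytes_or_die("abc" . chr(300)); 1 } ? '' : $@;
print "decoded ordinal: ", ord($decoded), "\n";
print "error: $error";
```

The helper works on a copy, so the caller's string is left in whatever representation it already had.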

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Larry Wall <larry@wall.org> writes:

> Saying "use bytes" means to enforce pre-Unicode Perl semantics, which includes a healthy dose of agnosticism.

But pre-Unicode perl did not go round UTF8 encoding things. Given that perl-5.6+ may, if code outside the scope of 'use bytes' gets called, what should happen when such a thing gets back to 'use bytes' code?

> I think that means "use bytes" treats all strings as buckets of bits regardless of whether the SvUTF8 bit is set. Garbage in, garbage out. If you want to deal with utf8 intelligently within the scope of a "use bytes", you have to look at the SvUTF8 bit yourself.

Which isn't "just like pre-Unicode perl".

In the scope of use bytes what should happen to the SvUTF8 flag as a result of ops?

While this may be "Garbage in, garbage out", the input garbage was neatly separated for re-cycling, and labeled "beware broken glass" where appropriate - now it is all mixed up again.

> We can certainly have a pragma that forces all interfaces to iso-8859-1 semantics and tries to do the right thing with any utf8 strings, but "use bytes" isn't that pragma.

Fair enough as that is not what I want for binary data anyway.

Maybe I want yet-another-pragma

    use strict 'bytes';

Which asserts all 'characters' are 0..255

p5pRT commented 23 years ago

From @tux

On Mon, 18 Sep 2000 20:08:59 +0100, Simon Cozens <simon@cozens.net> wrote:

> On Mon, Sep 18, 2000 at 02:53:04PM -0400, Andy Dougherty wrote:
>
> > Thanks. One question I still have is: How exactly does data get marked as being of a particular encoding?
>
> Ah, yes. Uhm. Line disciplines, of course. Plus anything in Perl which yields something which must be expressed as UTF8:
>
>     pack("U*", ...)
>     chr( $x )            # $x > 255
>     vx.y.z               # max(x,y,z) > 255
>     \x{BIGNUM}
>     \N{UNICODE THING}
>
> I think that's about it.

\p{...} \P{...} \X in regexes?

p5pRT commented 23 years ago

From @simoncozens

On Tue, Sep 19, 2000 at 11:14:18AM +0200, H.Merijn Brand wrote:

> > Ah, yes. Uhm. Line disciplines, of course. Plus anything in Perl which yields something which must be expressed as UTF8:
>
> \p{...} \P{...} \X in regexes?

Well, hmm. Only if the input data is UTF8 in the first place. \p{} doesn't *create* UTF8 data, it just selects a chunk of some already existing UTF8 data:

    % ./perl -Ilib -MDevel::Peek -e '"abcd"=~/(\p{IsAlpha})/; $x = $1; Dump($x)'
    SV = PVMG(0x81696d0) at 0x817ff14
      REFCNT = 1
      FLAGS = (POK,pPOK)
      IV = 0
      NV = 0
      PV = 0x8185318 "a"\0
      CUR = 1
      LEN = 2

p5pRT commented 23 years ago

From @simoncozens

On Tue\, Sep 19\, 2000 at 08​:59​:12AM +0100\, Nick Ing-Simmons wrote​:

Maybe I want yet-another-pragma

use strict 'bytes';

Which asserts all 'characters' are 0..255

I'm reasonably sure you can't have it. :) Look\, when do we test this assertion? Consider:

    $x = v300.400.500;
    {
        use strict 'bytes';
        print $x;
    }

OK\, so we have to test it inside the "print" operator\, since that's when our naughty non-byte data gets used.

    $x = v300.400.500;
    {
        use strict 'bytes';
        $x .= $x;
    }

So we have to test both sides of the concat operator.

In fact\, you should be able to see that you have to test all the data coming into each operator. And you have to do this when the operator is used\, because you can't test the data any other time\, as it may be UTF8 data created outside the scope of your pragma.

So\, you seem to be wanting a run-time assertion inserted into every single op\, looking at every single byte in every single piece of data used by that op.

I don't want to even *think* about implementing that.

Alternatively\, you can just make it the scope of the entire program\, and have it turn off the ability to use Unicode data in any way\, shape or form.

    package unicode;
    sub unimport { exec "perl5.005", $0, @ARGV }
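To make the cost argument above concrete, here is a minimal sketch of what the check would look like for a single value. `all_bytes` is a hypothetical helper, not part of any existing or proposed pragma, and it assumes a perl that provides the core `utf8::downgrade` function. A real `use strict 'bytes'` would have to run something like this on every operand of every op.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character of $str is in 0..255.
sub all_bytes {
    my ($str) = @_;
    my $copy = $str;    # work on a copy, because downgrade modifies in place
    # The second argument asks downgrade to return false instead of dying
    # when the string holds a character that does not fit in one byte.
    return utf8::downgrade($copy, 1) ? 1 : 0;
}

my $x = v300.400.500;
for my $value ("plain", chr(233), $x) {
    printf "%d chars: %s\n", length $value,
        all_bytes($value) ? "all 0..255" : "contains a character > 255";
}
```

The check is linear in the length of the string, which is the per-op overhead the message objects to.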

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

At 20​:08 +0100 2000-09-18\, Simon Cozens wrote​:

Plus anything in Perl which yields something which must be expressed as UTF8​:

    pack("U*", ...)
    chr($x)              # $x > 255
    vx.y.z               # max(x,y,z) > 255
    \x{BIGNUM}
    \N{UNICODE THING}

My researches suggest that\, like the second and third cases\, the last only produces a UTF8 string (and marks the scalar appropriately) if the resulting code is > 255​:

    ppp100 domo$ perl -Mcharnames=:full -MDevel::Peek -e 'Dump("\N{LATIN SMALL LETTER THORN}")'
    SV = PV(0x7886be8) at 0x787a7ec
      REFCNT = 1
      FLAGS = (POK,READONLY,pPOK)
      PV = 0x7888398 "\376"\0
      CUR = 1
      LEN = 2
    $ perl -Mcharnames=:full -MDevel::Peek -e 'Dump("\N{RUNIC LETTER THURISAZ THURS THORN}")'
    SV = PV(0x7886be8) at 0x787a7ec
      REFCNT = 1
      FLAGS = (POK,READONLY,pPOK,UTF8)
      PV = 0x7888388 "\341\232\246"\0
      CUR = 3
      LEN = 4

(Unpatched perl5.7.0\, BTW.)
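The experiment above can be repeated as a small script, sketched here on the assumption that the core `utf8::is_utf8` function is available. It prints the code point and the flag state for both names. Whether the Latin thorn comes back flagged may differ between perl versions, so the script reports the flag rather than assuming the 5.7.0 result shown in the dump.

```perl
use strict;
use warnings;
use charnames ':full';

my $thorn = "\N{LATIN SMALL LETTER THORN}";            # U+00FE, fits in a byte
my $runic = "\N{RUNIC LETTER THURISAZ THURS THORN}";   # U+16A6, needs UTF8

for my $s ($thorn, $runic) {
    printf "U+%04X  flag=%s\n", ord $s, utf8::is_utf8($s) ? "UTF8" : "none";
}
```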

p5pRT commented 23 years ago

From @doughera88

On Mon\, 18 Sep 2000\, Simon Cozens wrote​:

On Mon\, Sep 18\, 2000 at 02​:53​:04PM -0400\, Andy Dougherty wrote​:

Thanks. One question I still have is​: How exactly does data get marked as being of a particular encoding?

Ah\, yes. Uhm. Line disciplines\, of course. Plus anything in Perl which yields something which must be expressed as UTF8​:

Ok. I see. Most of those are within the programmer's immediate control and so are not really a problem for me. But until line disciplines get nailed down[*] (a big job\, I know) I guess it's not completely settled. That's ok with me for the moment. I'm not looking for final answers now\, just trying to get a good handle on the questions I should be asking.

Thanks\,

  Andy Dougherty doughera@​lafayette.edu

[*] e.g. when/how do they kick in? Do they affect read()\, sysread()\, \<>? How about recv()? How about System V IPC?

p5pRT commented 23 years ago

From @doughera88

On Tue\, 19 Sep 2000\, Nick Ing-Simmons wrote​:

Larry Wall \<larry@wall\.org\> writes:

Saying "use bytes" means to enforce pre-Unicode Perl semantics\, which includes a healthy dose of agnosticism.

But pre-Unicode perl did not go round UTF8 encoding things. Given that perl-5.6+ may do so if code outside the scope of 'use bytes' gets called\, what should happen when such a thing gets back to 'use bytes' code?

Ideally\, that shouldn't happen :-). So perhaps\, as a first pass\, we just die? Then\, if we find out we're dying way too often\, we try to figure out something smarter to do.

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Andy Dougherty \<doughera@lafayette\.edu\> writes:

On Tue\, 19 Sep 2000\, Nick Ing-Simmons wrote:

Larry Wall \<larry@wall\.org\> writes:

Saying "use bytes" means to enforce pre-Unicode Perl semantics\, which includes a healthy dose of agnosticism.

But pre-Unicode perl did not go round UTF8 encoding things. Given that perl-5.6+ may do so if code outside the scope of 'use bytes' gets called\, what should happen when such a thing gets back to 'use bytes' code?

Ideally\, that shouldn't happen :-). So perhaps\, as a first pass\, we just die? Then\, if we find out we're dying way too often\, we try to figure out something smarter to do.

Fine by me - anything which does not just return the UTF8 encoding is better than what we have now :-(

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Simon Cozens \<simon@cozens\.net\> writes:

I (foolishly) said​:

Maybe I want yet-another-pragma

use strict 'bytes';

Which asserts all 'characters' are 0..255

    use strict 'bytes';
    $x .= $x;

So we have to test both sides of the concat operator.

So\, you seem to be wanting a run-time assertion inserted into every single op\, looking at every single byte in every single piece of data used by that op.

I don't want to even *think* about implementing that.

Eeek\, thanks for thinking it through for me. That is quite horrible. I guess all I can hope for is an output line discipline that checks the end result.

you can't test the data any other time\, as it may be UTF8 data created outside the scope of your pragma.

Well\, you could check whether 'BYTES' was on anywhere up the call stack\, but we had that daft idea way back when we were not tagging the data; the data tag was introduced because that got silly.

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

On Mon\, Sep 18\, 2000 at 08:08:59PM +0100\, Simon Cozens \<simon@cozens\.net\> wrote:

Ah\, yes. Uhm. Line disciplines\, of course. Plus anything in Perl which yields something which must be expressed as UTF8​:

You forgot​:

$utf8 = "string";

(currently you have to enable this with use utf8\, though ;)

PS​: eval currently ignores the utf8-setting on the string\, which IMHO is a very-low-priority-bug. Here is a testcase​:

    use Convert::Scalar qw(:utf8);
    $x = "'\x{1234}'";
    utf8 $x or die;    # test for utf8, should not and does not die.
    $y = eval $x;
    utf8 $y or die;    # test for utf8, should not but _does_ die
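For anyone without Convert::Scalar installed, the testcase can be approximated with the core `utf8::is_utf8` function. Treating it as an equivalent check is an assumption of this sketch. Instead of dying on the second check, it reports what `eval` returned, so the result can be compared across perl versions.

```perl
use strict;
use warnings;

my $x = "'\x{1234}'";    # source text of a single-quoted literal
utf8::is_utf8($x) or die "source string is not flagged";

my $y = eval $x;         # evaluate the literal
die $@ if $@;

# If eval honours the flag on $x, this reports one character, U+1234.
printf "length=%d ord=U+%04X flag=%s\n",
    length $y, ord $y, utf8::is_utf8($y) ? "UTF8" : "none";
```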

p5pRT commented 23 years ago

From The RT System itself

I'm closing this as a bug\, though as a discussion thread about UTF-8 it is still good reading. Note also that even though it is closed as a bug\, the original behaviour remains (now with a slightly more verbose error message). In other words\, as far as I am able to interpret the current plan\, this is not a bug.