Closed p5pRT closed 21 years ago
The ~ operation on UTF8 flagged strings does not do the right thing:
$ perl -MDevel::Peek -e 'Dump(~v300)'
SV = PV(0x8160cac) at 0x8160740
  REFCNT = 1
  FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
  PV = 0x816dee0 ";S"\0
  CUR = 2
  LEN = 3
It just flips the bits, but does not even turn off the UTF8 flag.
It is not clear to me what the operation should do. One way is to use 0..10FFFF (the official range of UTF8) and flip bits based on that. That seems kind of wrong. I would suggest that we simply flip bits as if the character was an 'int'. (That would create a fairly long string internally on 64-bit machines.)
I would also argue that ~"\0" should evaluate to the same as chr(~0) unless inside 'use bytes' scope. Currently it evaluates to chr(255).
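The ";S" in the dump above is what you get when the two UTF-8 bytes of character 300 are complemented one at a time. A minimal C sketch of that arithmetic follows; the helper names are made up for illustration and are not taken from the perl source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a code point below 0x800 as UTF-8; returns the number of bytes written. */
static size_t encode_utf8_small(uint32_t cp, uint8_t *out)
{
    assert(out != NULL && cp < 0x800);
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    out[0] = (uint8_t)(0xC0 | (cp >> 6));    /* lead byte: 110xxxxx */
    out[1] = (uint8_t)(0x80 | (cp & 0x3F));  /* continuation byte: 10xxxxxx */
    return 2;
}

/* Complement every byte in place, ignoring character boundaries. */
static void complement_bytes(uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)~buf[i];
}
```

Character 300 encodes as 0xC4 0xAC, and complementing those bytes gives 0x3B 0x53, that is ";S". Neither byte is a valid UTF-8 lead byte for a two-byte sequence, so the result no longer matches the UTF8 flag it still carries.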
In article <20000918174601.2012.qmail@eik.g.aas.no>, gisle@aas.no wrote:
The ~ operation on UTF8 flagged strings does not do the right thing:
$ perl -MDevel::Peek -e 'Dump(~v300)'
SV = PV(0x8160cac) at 0x8160740
  REFCNT = 1
  FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
  PV = 0x816dee0 ";S"\0
  CUR = 2
  LEN = 3
It just flips the bits, but does not even turn off the UTF8 flag.
It is not clear to me what the operation should do. One way is to use 0..10FFFF (the official range of UTF8) and flip bits based on that. That seems kind of wrong. I would suggest that we simply flip bits as if the character was an 'int'. (That would create a fairly long string internally on 64-bit machines.)
I would also argue that ~"\0" should evaluate to the same as chr(~0) unless inside 'use bytes' scope. Currently it evaluates to chr(255).
Hmm. I don't really see a reasonable use for this (~ on strings with chars > 255). The others (^, |, &) lend themselves to a convenient definition for what to do with chars > 255. Perhaps then the best thing would be to maintain as much backward-compatibility as possible and truncate each char to 8 bits after ~-ing.
On the other hand, if one is creating a bitmask to later use with ^, &, or |, it would make sense to set the maximum number of bits in a perl-utf8 char. But that produces pretty long strings from e.g. "\0\0\0". As well as the surprise UTF8-encoded string resulting from ~ on a non-UTF8-encoded string.
Either way, I see no reason to limit it to official UTF8 or int size.
On Mon, Sep 18, 2000 at 08:45:08PM -0700, Yitzchak Scott-Thoennes wrote:
In article <20000918174601.2012.qmail@eik.g.aas.no>, gisle@aas.no wrote:
The ~ operation on UTF8 flagged strings does not do the right thing:
$ perl -MDevel::Peek -e 'Dump(~v300)'
SV = PV(0x8160cac) at 0x8160740
  REFCNT = 1
  FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
  PV = 0x816dee0 ";S"\0
  CUR = 2
  LEN = 3
It just flips the bits, but does not even turn off the UTF8 flag.
It is not clear to me what the operation should do. One way is to use 0..10FFFF (the official range of UTF8) and flip bits based on that. That seems kind of wrong. I would suggest that we simply flip bits as if the character was an 'int'. (That would create a fairly long string internally on 64-bit machines.)
How about this: if the $string is in utf8:
~"$string" eq join("", map { ~ord($_) } split //, $string);
and preserve the utf8ness, because we want ~~X eq X.
I would also argue that ~"\0" should evaluate to the same as chr(~0) unless inside 'use bytes' scope.
Sounds like the above.
Currently it evaluates to chr(255).
On the other hand, if one is creating a bitmask to later use with ^, &, or |, it would make sense to set the maximum number of bits in a perl-utf8 char. But that produces pretty long strings from e.g. "\0\0\0". As well as the
So does ~0 (it produces pretty "long integers").
surprise UTF8-encoded string resulting from ~ on a non-UTF8-encoded string.
How about this: if the $string is in utf8:
~"$string" eq join("", map { ~ord($_) } split //, $string);
and preserve the utf8ness, because we want ~~X eq X.
Argblebargle. That didn't come out right. I meant
~"$string" eq join("", map { chr(~ord($_)) } split //, $string);
So ~(chr(200).chr(2000)) would be chr(~200).chr(~2000).
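The corrected definition complements each character's ordinal rather than each byte of its encoding. A rough C model of that mapping is sketched below. It assumes a string is represented as an array of code points, and uv_t is a stand-in for Perl's UV whose 64-bit width is an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t uv_t;  /* stand-in for Perl's UV; the width is an assumption */

/* out[i] = ~in[i] for each code point: the chr(~ord($_)) mapping. */
static void complement_chars(const uv_t *in, uv_t *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = ~in[i];
}
```

Applying complement_chars twice returns the original code points, which is the ~~X eq X property asked for above.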
On Sat, Oct 14, 2000 at 01:12:11PM -0500, Jarkko Hietaniemi wrote:
~"$string" eq join("", map { chr(~ord($_)) } split //, $string);
So ~(chr(200).chr(2000)) would be chr(~200).chr(~2000).
Make UTF8 ~chr($x) == chr(~$x)
==== //depot/bleadperl/pp.c#7 (text) ==== Index: perl/pp.c
==== //depot/bleadperl/t/op/bop.t#5 (xtext) ==== Index: perl/t/op/bop.t
==== //depot/bleadperl/utf8.h#4 (text) ==== Index: perl/utf8.h
On Sat, Oct 14, 2000 at 08:52:13PM +0100, Simon Cozens wrote:
On Sat, Oct 14, 2000 at 01:12:11PM -0500, Jarkko Hietaniemi wrote:
~"$string" eq join("", map { chr(~ord($_)) } split //, $string);
So ~(chr(200).chr(2000)) would be chr(~200).chr(~2000).
Make UTF8 ~chr($x) == chr(~$x)
==== //depot/bleadperl/pp.c#7 (text) ==== Index: perl/pp.c --- perl/pp.c.~1~ Sat Oct 14 20:50:48 2000 +++ perl/pp.c Sat Oct 14 20:50:48 2000
Does not work on an alpha. Here's the output:
1..37 ok 1 ok 2 ok 3 ok 4 ok 5 ok 6 ok 7 ok 8 ok 9 ok 10 ok 11 ok 12 ok 13 ok 14 ok 15 ok 16 ok 17 ok 18 ok 19 ok 20 ok 21 ok 22 ok 23 ok 24 ok 25 ok 26 ok 27 ok 28 ok 29 ok 30 ok 31 ok 32 ok 33 ok 34 ok 35  36  37
-- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen
Jarkko Hietaniemi <jhi@iki.fi> wrote:
On Mon, Sep 18, 2000 at 08:45:08PM -0700, Yitzchak Scott-Thoennes wrote:
In article <20000918174601.2012.qmail@eik.g.aas.no>, gisle@aas.no wrote:
The ~ operation on UTF8 flagged strings does not do the right thing:
$ perl -MDevel::Peek -e 'Dump(~v300)'
SV = PV(0x8160cac) at 0x8160740
  REFCNT = 1
  FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
  PV = 0x816dee0 ";S"\0
  CUR = 2
  LEN = 3
It just flips the bits, but does not even turn off the UTF8 flag.
It is not clear to me what the operation should do. One way is to use 0..10FFFF (the official range of UTF8) and flip bits based on that. That seems kind of wrong. I would suggest that we simply flip bits as if the character was an 'int'. (That would create a fairly long string internally on 64-bit machines.)
How about this: if the $string is in utf8:
Sorry for the very late response to this. But doesn't that "if the $string is in utf8" violate our cardinal rule that the encoding shouldn't affect the results (e.g. ~($x=v1) should be the same as ~(chop($x=v1.300),$x))?
~"$string" eq join("", map { ~ord($_) } split //, $string);
Correction in followup email noted. The patch based on this looks ok except that it should be checking !IN_BYTE, not SvUTF8.
and preserve the utf8ness, because we want ~~X eq X.
I would also argue that ~"\0" should evaluate to the same as chr(~0) unless inside 'use bytes' scope.
Sounds like the above.
Currently it evaluates to chr(255).
On the other hand, if one is creating a bitmask to later use with ^, &, or |, it would make sense to set the maximum number of bits in a perl-utf8 char. But that produces pretty long strings from e.g. "\0\0\0". As well as the
So does ~0 (it produces pretty "long integers").
surprise UTF8-encoded string resulting from ~ on a non-UTF8-encoded string.
How about this: if the $string is in utf8:
Sorry for the very late response to this. But doesn't that "if the $string is in utf8" violate our cardinal rule that the encoding shouldn't affect the results (e.g. ~($x=v1) should be the same as ~(chop($x=v1.300),$x))?
I'm sorry but I'm very dense today. Please explain your reasoning. Make certain that your definition of ~ obeys the rules
(1) ~~x == x
(2) ~(x|y) == ~x&~y
(3) ~(x&y) == ~x|~y
(4) x|~x == 1
(5) x&~x == 0
or there is not much point in implementing ~ at all...
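For a single character treated as a full-width unsigned integer, the five rules can be checked mechanically. The C sketch below does that, reading the "1" in rule (4) as "all bits set", which is an interpretation rather than something stated in the thread. uv_t is again an assumed 64-bit stand-in for Perl's UV.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t uv_t;  /* stand-in for Perl's UV; the width is an assumption */

/* Aborts via assert() if any of rules (1)-(5) fails for the pair x, y. */
static void check_complement_rules(uv_t x, uv_t y)
{
    const uv_t all_ones = ~(uv_t)0;

    assert(~~x == x);                 /* (1) */
    assert(~(x | y) == (~x & ~y));    /* (2) */
    assert(~(x & y) == (~x | ~y));    /* (3) */
    assert((x | ~x) == all_ones);     /* (4), with 1 read as all bits set */
    assert((x & ~x) == 0);            /* (5) */
}
```

The rules hold here only because the complement is taken at one fixed width. Mixing a byte-wide ~ with a UV-wide | or & is where they start to break.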
~"$string" eq join("", map { ~ord($_) } split //, $string);
Correction in followup email noted. The patch based on this looks ok except that it should be checking !IN_BYTE, not SvUTF8.
Ummm, why should we pay the speed hit of utf* function calls for byte data?
Jarkko Hietaniemi <jhi@iki.fi> wrote:
How about this: if the $string is in utf8:
Sorry for the very late response to this. But doesn't that "if the $string is in utf8" violate our cardinal rule that the encoding shouldn't affect the results (e.g. ~($x=v1) should be the same as ~(chop($x=v1.300),$x))?
I'm sorry but I'm very dense today. Please explain your reasoning. Make certain that your definition of ~ obeys the rules
(1) ~~x == x
(2) ~(x|y) == ~x&~y
(3) ~(x&y) == ~x|~y
(4) x|~x == 1
(5) x&~x == 0
or there is not much point in implementing ~ at all...
Yes. (Even with s/==/eq/.) Though ~~x will upgrade to utf8 encoding if it wasn't already on.
What's missing is:
(6) $x eq $y implies ~$x eq ~$y.
[D:\perl-current].\perl -wlIlib
$x = v200; chop($y = v200.300);
print "\$x eq \$y" if $x eq $y;
print "~\$x eq ~\$y" if ~$x eq ~$y;
__END__
$x eq $y
The cardinal rule to which I refer was stated as:
  * It doesn't matter if data gets upgraded to UTF8 internally; if there is a place where it does matter, that's a bug.
in This Week on perl5-porters (9--23 October 2000).
~"$string" eq join("", map { ~ord($_) } split //, $string);
Correction in followup email noted. The patch based on this looks ok except that it should be checking !IN_BYTE, not SvUTF8.
Ummm, why should we pay the speed hit of utf* function calls for byte data?
I'm sorry, but I'm very dense today too. What do you mean?
In article <20001030181150.A27977@chaos.wustl.edu>, Jarkko Hietaniemi <jhi@iki.fi> wrote:
How about this: if the $string is in utf8:
Sorry for the very late response to this. But doesn't that "if the $string is in utf8" violate our cardinal rule that the encoding shouldn't affect the results (e.g. ~($x=v1) should be the same as ~(chop($x=v1.300),$x))?
I'm sorry but I'm very dense today. Please explain your reasoning. Make certain that your definition of ~ obeys the rules
(1) ~~x == x
(2) ~(x|y) == ~x&~y
(3) ~(x&y) == ~x|~y
(4) x|~x == 1
(5) x&~x == 0
or there is not much point in implementing ~ at all...
Well, I was going to demonstrate how things currently fail rules 2 and 3 above:
#!/usr/bin/perl -w
$x = v200; $y = v300;
print "1..2\n";
print 'not ' if ~("$x"|$y) ne (~$x&~$y);
print "ok 1\n";
print 'not ' if ~("$x"&$y) ne (~$x|~$y);
print "ok 2\n";
__END__
But I quickly discovered that ~$y is pretty useless with utf8 since just about anything you try to do gets you a Malformed utf warning. I think this is a reasonable fix for that (though you might or might not want the pp.c change--it's pp_ord):
Applied, thanks, sans the pp_ord() change. Further patches welcome, though please consider carefully the UTF8_ALLOW flags. If we allow anything everywhere, the UTF-8 decoding checking might as well be removed.
How about this: if the $string is in utf8:
Sorry for the very late response to this. But doesn't that "if the $string is in utf8" violate our cardinal rule that the encoding shouldn't affect the results (e.g. ~($x=v1) should be the same as ~(chop($x=v1.300),$x))?
Yes, looks like we need to define the semantics of this more tightly.
Consider
$a0 = "\0"; $b0 = substr("\0\x{100}", 0, 1);
$a1 = ~$a0; $b1 = ~$b0;
Yes, I agree it would be nice to have $a1 eq $b1 since $a0 eq $b0. But that's not how it currently goes. $a1 is a pure "byte string", it has never been touched by "a wide character" -- but $b1 is a "character string" since its "parent" was. Bytewise the $a0 and $b0 are identical but $b1 carries the evil UTF8 flag. Ergo, with the current ~ implementation $a1 will be "\xFF" and $b1 will be "\x{ffff...}" (machine-dependent width).
This goes for all the bytes \x00..\x7F since they cannot be told apart from maybe being "in UTF-8".
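The $a0/$b0 situation can be modelled in a few lines of C. This is only a sketch: struct one_char and its utf8_flag field are invented for illustration and do not mirror perl's SV layout, and uv_t assumes a 64-bit UV.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t uv_t;  /* stand-in for Perl's UV; the width is an assumption */

/* A one-character string: its code point plus a made-up internal flag. */
struct one_char {
    uv_t cp;
    bool utf8_flag;
};

/* Flag-dependent ~: byte-wide without the flag, UV-wide with it. */
static uv_t complement_by_flag(struct one_char c)
{
    assert(c.utf8_flag || c.cp <= 0xFF);
    return c.utf8_flag ? ~c.cp : (uv_t)(uint8_t)~c.cp;
}
```

With cp == 0 in both cases, the unflagged value complements to 0xFF and the flagged one to all bits set. The two inputs compare equal but their complements do not, which is the violation being discussed.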
And remember backward compatibility: we shouldn't break old code that expects the bit string arithmetics to work on bytes, not characters.
I see two ways out of this:
(1) The UV-wide ~ is used only if
(1a) SvUTF8 is true
(1b) the whole character string needs to be scanned first and if a single character > 0xff is met
Otherwise, that is, if no SvUTF8, or all the characters in the string are <= 0xff, we use byte-wide ~.
(2) We give up completely trying to define *any* bit arithmetics for character strings and say that ~ | & ^ always work on bytes.
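Option (1) amounts to a pre-scan followed by a choice of width. The C sketch below models it on an array of code points. The function names and the utf8_flag parameter are hypothetical, and uv_t assumes a 64-bit UV.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t uv_t;  /* stand-in for Perl's UV; the width is an assumption */

/* True if any code point does not fit in one byte. */
static bool has_wide_char(const uv_t *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (s[i] > 0xFF)
            return true;
    return false;
}

/* Option (1): UV-wide ~ only when flagged and a char > 0xff is present. */
static void complement_option1(const uv_t *in, uv_t *out, size_t n, bool utf8_flag)
{
    assert(in != NULL && out != NULL);
    bool wide = utf8_flag && has_wide_char(in, n);
    for (size_t i = 0; i < n; i++)
        out[i] = wide ? ~in[i] : (uv_t)(uint8_t)~in[i];
}
```

Under this rule "\0" complements to 0xFF whether or not the flag is set, so the result no longer depends on the internal encoding alone. The cost is the extra scan, and the width of the result for any one character now depends on its neighbours.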
Jarkko Hietaniemi <jhi@iki.fi> wrote:
How about this: if the $string is in utf8:
Sorry for the very late response to this. But doesn't that "if the $string is in utf8" violate our cardinal rule that the encoding shouldn't affect the results (e.g. ~($x=v1) should be the same as ~(chop($x=v1.300),$x))?
Yes, looks like we need to define the semantics of this more tightly.
Consider
$a0 = "\0"; $b0 = substr("\0\x{100}", 0, 1);
$a1 = ~$a0; $b1 = ~$b0;
Yes, I agree it would be nice to have $a1 eq $b1 since $a0 eq $b0. But that's not how it currently goes. $a1 is a pure "byte string", it has never been touched by "a wide character" -- but $b1 is a "character string" since its "parent" was. Bytewise the $a0 and $b0 are identical but $b1 carries the evil UTF8 flag. Ergo, with the current ~ implementation $a1 will be "\xFF" and $b1 will be "\x{ffff...}" (machine-dependent width).
This goes for all the bytes \x00..\x7F since they cannot be told apart from maybe being "in UTF-8".
And remember backward compatibility: we shouldn't break old code that expects the bit string arithmetics to work on bytes, not characters.
I see two ways out of this:
(1) The UV-wide ~ is used only if
(1a) SvUTF8 is true
(1b) the whole character string needs to be scanned first and if a single character > 0xff is met
Otherwise, that is, if no SvUTF8, or all the characters in the string are <= 0xff, we use byte-wide ~.
(2) We give up completely trying to define *any* bit arithmetics for character strings and say that ~ | & ^ always work on bytes.
(2) makes a little more sense to me than (1). (Assuming you mean truncating each character to 8 bits, not just ignoring the UTF8 flag).
But perhaps you are being too concerned about backward compatibility. What do you imagine they are going to do with the result of ~$x that might cause a problem?
How about:
(3) Unless IN_BYTE, do ~ character by character. Note that this will almost certainly produce a string that will only work with the string bitwise operators, since UTF8_ALLOW_* will be needed [1].
If the expense of utf8_to_uv function calls is a concern:
Rename Perl_utf8_to_uv to Perl_utf8_to_uv_hibit
Make a macro something like (untested):
#define Perl_utf8_to_uv(s,curlen,retlen,flags) \
    ((UV)*s < 0x80 ? ((retlen ? (*(STRLEN*)retlen = 1) : 0), (UV)*s) \
                   : utf8_to_uv_hibit(s,curlen,retlen,flags))
Note that at least one place (pp_ord) is already doing a <0x80 check before calling utf8_to_uv. This kind of thing really should be encapsulated with the utf8 decoding code, not scattered hither and yon.
Let me know if you'd like to at least see a patch for this.
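The fast-path idea in the macro above can also be written as an inline function, which avoids evaluating its argument more than once. The following is a self-contained C sketch, not the perl API: decode_utf8 and decode_multibyte are hypothetical names, and the slow path handles only standard 2- to 4-byte sequences with no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slow path: decodes a 2- to 4-byte sequence, no validation. */
static uint32_t decode_multibyte(const uint8_t *s, size_t *retlen)
{
    size_t len = (*s >= 0xF0) ? 4 : (*s >= 0xE0) ? 3 : 2;
    uint32_t cp = *s & (0xFFu >> (len + 1));   /* payload bits of the lead byte */
    for (size_t i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3Fu);       /* six payload bits per continuation */
    if (retlen)
        *retlen = len;
    return cp;
}

/* Single-byte characters are answered inline; everything else is delegated. */
static inline uint32_t decode_utf8(const uint8_t *s, size_t *retlen)
{
    assert(s != NULL);
    if (*s < 0x80) {
        if (retlen)
            *retlen = 1;
        return *s;
    }
    return decode_multibyte(s, retlen);
}
```

Callers then need no 0x80 test of their own, which is the encapsulation argued for above.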
BTW, I noticed there is a utf8_to_uv_simple that doesn't seem to be used (at least in the core):
=for apidoc Am|U8* s|utf8_to_uv_simple|STRLEN *retlen
Returns the character value of the first character in the string C<s> which is assumed to be in UTF8 encoding; C<retlen> will be set to the length, in bytes, of that character, and the pointer C<s> will be advanced to the end of the character.
From this description, I'd expect it to take a U8**, not a U8*,STRLEN*. It certainly would be more useful that way.
[1] Which of the UTF8_ALLOW_* flags are needed to allow characters 0..2^64-1? All of them? Or do some really indicate malformedness even with perl-extended-utf8? If the latter, should we have a macro UTF8_ALLOW_ANY_UV? And should this differ with uvsize=4 or 8? I'm inclined to say so.
(2) We give up completely trying to define *any* bit arithmetics for character strings and say that ~ | & ^ always work on bytes.
(2) makes a little more sense to me than (1). (Assuming you mean truncating each character to 8 bits, not just ignoring the UTF8 flag).
No, I didn't mean truncating to 8 bits, I meant ignoring the UTF8ness. Bytes. Bytes. Bytes.
But perhaps you are being too concerned about backward compatibility. What do you imagine they are going to do with the result of ~$x that might cause a problem?
ord() it, for example. 255 is mighty different from 4294967295 or 18446744073709551615.
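The three numbers are simply ~0 taken at 8, 32 and 64 bits. A small C sketch of that arithmetic follows; the function name is illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* ~0 truncated to the given width in bits (8, 32 or 64). */
static uint64_t complement_of_zero(unsigned width_bits)
{
    assert(width_bits == 8 || width_bits == 32 || width_bits == 64);
    if (width_bits == 64)
        return ~(uint64_t)0;
    return ((uint64_t)1 << width_bits) - 1;
}
```

So the value that old code would see from ord(~"\0") depends on whether the complement is byte-wide or UV-wide, and in the latter case on the platform's UV size.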
(3) Unless IN_BYTE, do ~ character by character. Note that this will almost certainly produce a string that will only work with the string bitwise operators, since UTF8_ALLOW_* will be needed [1].
Patches welcome.
BTW, I noticed there is a utf8_to_uv_simple that doesn't seem to be used (at least in the core):
Uh? It is used in utf8_to_bytes(), which is used in sv_utf8_downgrade(), which is used in e.g. do_vecget().
=for apidoc Am|U8* s|utf8_to_uv_simple|STRLEN *retlen
Returns the character value of the first character in the string C<s> which is assumed to be in UTF8 encoding; C<retlen> will be set to the length, in bytes, of that character, and the pointer C<s> will be advanced to the end of the character.
From this description, I'd expect it to take a U8**, not a U8*,STRLEN*. It certainly would be more useful that way.
[1] Which of the UTF8_ALLOW_* flags are needed to allow characters 0..2^64-1? All of them? Or do some really indicate malformedness even with perl-extended-utf8? If the latter, should we have a macro UTF8_ALLOW_ANY_UV? And should this differ with uvsize=4 or 8? I'm inclined to say so.
Have to think about this... off-hand, I do not think all of them, however, since at least overlong sequences should still be a no-no, they serve no useful purpose. See Markus Kuhn's UTF-8 pages:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
ftp://sunsite.doc.ic.ac.uk/packages/rfc/rfc2279.txt
The middle one is more or less the UTF-8 decoding law I try to follow in utf8_to_uv().
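An overlong sequence is one that spends more bytes than its value needs, for example 0xC0 0x80 for character 0. A minimal C sketch of that test for standard 1- to 4-byte UTF-8 is shown below. The names are illustrative and this is not how utf8_to_uv() is written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest code point that legitimately needs a sequence of `len` bytes (1..4). */
static uint32_t min_cp_for_len(size_t len)
{
    static const uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    assert(len >= 1 && len <= 4);
    return min_cp[len];
}

/* A decoded value is overlong if a shorter sequence could have carried it. */
static bool is_overlong(uint32_t cp, size_t len)
{
    return cp < min_cp_for_len(len);
}
```

A decoder would call is_overlong with the value it just decoded and the number of bytes it consumed, and reject the input when the result is true.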
On Wed, 1 Nov 2000, Jarkko Hietaniemi wrote:
(2) We give up completely trying to define *any* bit arithmetics for character strings and say that ~ | & ^ always work on bytes.
(2) makes a little more sense to me than (1). (Assuming you mean truncating each character to 8 bits, not just ignoring the UTF8 flag).
No, I didn't mean truncating to 8 bits, I meant ignoring the UTF8ness. Bytes. Bytes. Bytes.
Which means unexpected behavior when there is a UTF8 upgrade. I'll try to come up with a summary of the different approaches suggested so far and their problems and compatibility issues.
But perhaps you are being too concerned about backward compatibility. What do you imagine they are going to do with the result of ~$x that might cause a problem?
ord() it, for example. 255 is mighty different from 4294967295 or 18446744073709551615.
ord()ing it will get a Malformed utf warning. This is probably a Good Thing(TM).
(3) Unless IN_BYTE, do ~ character by character. Note that this will almost certainly produce a string that will only work with the string bitwise operators, since UTF8_ALLOW_* will be needed [1].
Patches welcome.
BTW, I noticed there is a utf8_to_uv_simple that doesn't seem to be used (at least in the core):
Uh? It is used in utf8_to_bytes(), which is used in sv_utf8_downgrade(), which is used in e.g. do_vecget().
Oops, I missed that.
=for apidoc Am|U8* s|utf8_to_uv_simple|STRLEN *retlen
Returns the character value of the first character in the string C<s> which is assumed to be in UTF8 encoding; C<retlen> will be set to the length, in bytes, of that character, and the pointer C<s> will be advanced to the end of the character.
From this description, I'd expect it to take a U8**, not a U8*,STRLEN*. It certainly would be more useful that way.
[1] Which of the UTF8_ALLOW_* flags are needed to allow characters 0..2^64-1? All of them? Or do some really indicate malformedness even with perl-extended-utf8? If the latter, should we have a macro UTF8_ALLOW_ANY_UV? And should this differ with uvsize=4 or 8? I'm inclined to say so.
Have to think about this... off-hand, I do not think all of them, however, since at least overlong sequences should still be a no-no, they serve no useful purpose. See Markus Kuhn's UTF-8 pages:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
ftp://sunsite.doc.ic.ac.uk/packages/rfc/rfc2279.txt
The middle one is more or less the UTF-8 decoding law I try to follow in utf8_to_uv().
Shouldn't we try to give the warning where the value is created too? (e.g. on the chr, not just the ord, of: ord chr 0xffffffff) Just something to think about...
Note that at least one place (pp_ord) is already doing a <0x80 check before calling utf8_to_uv. This kind of thing really should be encapsulated with the utf8 decoding code, not scattered hither and yon.
Yes, I heartily agree. There is testing involving 0x80 and 0xc0 all over the code, all that needs to be removed and rerouted to use the official utf8 routines, or macros if speed is the issue.
Let me know if you'd like to at least see a patch for this.
Yes.
On 1 Nov 2000, at 10:03, Yitzchak Scott-Thoennes wrote:
Jarkko Hietaniemi <jhi@iki.fi> wrote:
(2) We give up completely trying to define *any* bit arithmetics for character strings and say that ~ | & ^ always work on bytes.
(2) makes a little more sense to me than (1). (Assuming you mean truncating each character to 8 bits, not just ignoring the UTF8 flag).
I'd tend to go with this, too (and Yitzchak's specification that characters should be truncated rather than operating on the UTF-8 representation makes a lot of sense, too).
Cheers, Philip
I *think* we have reached the only sensible compromise in this.
Migrated from rt.perl.org#4332 (status was 'resolved')
Searchable as RT4332