Open p5pRT opened 13 years ago
perl5.12.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"' # 1090
perl5.13.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"' # 1090
perl5.10.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"' # 1090
while
perl5.8.9 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"' # 209
perl5.6.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"' # 209
but with use bytes
perl5.12.0 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"' # 209
perl5.13.2 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"' # 209
It's worth either adding sub unpack to bytes.pm and fixing the documentation, or fixing this issue.
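For readers wondering where the two numbers come from, here is a minimal sketch (not part of the original report). It assumes only core Perl: 1090 is the code point of U+0442, and 209 is the first octet of that character's UTF-8 encoding. The variable names are illustrative.

```perl
use strict;
use warnings;

my $char = "\x{442}";               # one character, U+0442

# Character view: the code point itself.
my $code_point = ord $char;

# Byte view: encode a copy to UTF-8 and list its octets.
my $octets = $char;
utf8::encode($octets);              # in place; the copy now holds 0xD1 0x82
my @bytes = unpack "C*", $octets;

printf "code point:  %d\n", $code_point;         # 1090
printf "UTF-8 bytes: %s\n", join " ", @bytes;    # 209 130
```

So the pre-5.10 result corresponds to the first byte of the internal encoding, while 5.10 and later report the character.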
On Tue, Feb 22, 2011 at 04:46:15AM -0800, mons@cpan.org wrote:
> It's worth either adding sub unpack to bytes.pm and fixing the documentation, or fixing this issue.
It's a documented behaviour change introduced in 5.10, as described in perl5100delta.pod:
=head1 Incompatible Changes
=head2 Packing and UTF-8 strings
The semantics of pack() and unpack() regarding UTF-8-encoded data has been
changed. Processing is now by default character per character instead of
byte per byte on the underlying encoding. Notably, code that used things
like C<pack("a*", $string)> to see through the encoding of string will now
simply get back the original $string. Packed strings can also get upgraded
during processing when you store upgraded characters. You can get the old
behaviour by using C<use bytes>.
Nicholas Clark
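A small sketch of what that perldelta entry describes, assuming a perl of 5.10 or later. It is only an illustration; note that the bytes.pm documentation itself discourages `use bytes` for anything other than debugging.

```perl
use strict;
use warnings;

my $string = "\x{442}";

# Default since 5.10: unpack processes the string character by character.
my @chars = unpack "C*", $string;

my @bytes;
{
    # Lexically scoped: only this block gets the old byte-by-byte view
    # of the underlying UTF-8 encoding.
    use bytes;
    @bytes = unpack "C*", $string;
}

print "characters: @chars\n";    # 1090
print "bytes:      @bytes\n";    # 209 130
```

Because the pragma is lexical, the same template gives different answers in the two scopes, which matches the -Mbytes results in the report.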
The RT System itself - Status changed from 'new' to 'open'
Then maybe note this behaviour in perldoc -f pack? Since if I see
c A signed char (8-bit) value.
C An unsigned char (octet) value.
then I expect an octet. At the least, I think we should add a reference to "Pack and unpack can operate in two modes...".
And since it is documented, it seems to be a good idea to add bytes::pack() and bytes::unpack().
On Tue, Feb 22, 2011 at 7:46 AM, mons@cpan.org <perlbug-followup@perl.org> wrote:
> # New Ticket Created by mons@cpan.org
> # Please include the string: [perl #84670]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=84670 >
> This is a bug report for perl from mons@cpan.org, generated with the help of perlbug 1.39 running under perl 5.12.0.
> perl5.12.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"' # 1090
> perl5.13.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"' # 1090
> perl5.10.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"' # 1090
It's doing the best it can when given garbage input. It requires a string of bytes, but it's given a string that contains non-bytes.
You didn't say what you expect it to do. I suppose it could throw an exception, but the current behaviour is quite reasonable to me.
> while
> perl5.8.9 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"' # 209
> perl5.6.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"' # 209
These versions peeked into the internals and produced unpredictable results for bytes above 0x80. When this was fixed, the nonsense answer they gave for inputs >0xFF was fixed as well.
- Eric
On Tue Feb 22 10:13:05 2011, ikegami@adaelis.com wrote:
> You didn't say what you expect it to do. I suppose it could throw an exception, but the current behaviour is quite reasonable to me.
$ perl -we'printf "%02X\n", unpack "N", "\0\0\0\x{442}"'
Character(s) in 'N' format wrapped in unpack at -e line 1.
42
$ perl -wle'printf "%02X\n", unpack "C", "\x{442}"'
442
I suppose the latter could do like the former (warn and "& 0xFF" the input), but the latter's behaviour is so much more useful.
In article <rt-3.6.HEAD-24085-1298398878-801.84670-15-0@perl.org>, "Eric Brine via RT" <perlbug-followup@perl.org> writes:
> I suppose the latter could do like the former (warn and "& 0xFF" the input), but the latter's behaviour is so much more useful.
Actually, when I made the Unicode pack/unpack patch, the "C" format was seen as a possible backward-incompatibility problem, and on p5p I was asked to add another character to mean "full single character semantics", which became the "W" (word) character. But I only did that for pack, it seems:
perl -wle 'print ord pack("C", 1000)'
Character in 'C' format wrapped in pack at -e line 1.
232
perl -wle 'print ord pack("W", 1000)'
1000
So the "C" format basically works "modulo 256".
I think it's entirely reasonable to have the same behaviour for unpack, so that
unpack "C", "\x{442}" would give 66 (1090 % 256) together with a format wrap warning (notice that it still won't give 209, which is a nonsense answer corresponding to internal details).
The admittedly much more sane behaviour of returning 1090 would still be available with W:
unpack "W", "\x{442}" would give 1090
This would be completely in line with the documented (in perldoc -f pack)
C An unsigned char (octet) value.
W An unsigned char value (can be greater than 255).
"W" was always meant as the Unicode-sane version of "C".
I can make a patch if people agree with this...
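To make the proposal concrete, here is a rough emulation in plain Perl. The helper `proposed_unpack_C` is hypothetical (it is not part of Perl or of any patch in this ticket), it handles only a single "C" item, and the warning text is borrowed from the existing pack message as an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: emulates the proposed unpack "C" for one item.
sub proposed_unpack_C {
    my ($string) = @_;
    return unless length $string;       # nothing to unpack

    my $code = ord $string;             # first character of the string
    if ($code > 0xFF) {
        # Assumed wording, modelled on the pack warning quoted above.
        warn "Character in 'C' format wrapped in unpack\n";
        $code &= 0xFF;                  # same as $code % 256
    }
    return $code;
}

print proposed_unpack_C("\x{442}"), "\n";   # 66, with a warning on STDERR
```

The result is 66 (0x442 & 0xFF = 0x42), not 209, because the wrap applies to the code point rather than to the internal UTF-8 bytes.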
On Mon, 28 Feb 2011 15:06:10 -0800, perl5-porters@ton.iguana.be wrote:
> I can make a patch if people agree with this...
No one responded to this. It looks ok to me. -- Karl Williamson
Actually, W appears to have been added to unpack already, so the one remaining issue in this ticket is unpack C
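A quick round trip that illustrates the point, assuming a perl of 5.10 or later where "W" is accepted by both pack and unpack:

```perl
use strict;
use warnings;

my $packed  = pack "W", 1090;          # a one-character string, U+0442
my ($value) = unpack "W", $packed;     # full code point, no wrapping

printf "length %d, value %d\n", length $packed, $value;   # length 1, value 1090
```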
Isn't it kinda late to be changing unpack C now?
$ for v in 8 10 12 14 16 18 20 22 24 26 30; do printf '5.%st: ' $v; 5.${v}t/bin/perl -wle'print unpack "C", "\x{442}"'; done
5.8t: 209
5.10t: 1090
5.12t: 1090
5.14t: 1090
5.16t: 1090
5.18t: 1090
5.20t: 1090
5.22t: 1090
5.24t: 1090
5.26t: 1090
5.30t: 1090
Who would be harmed from the change? People using "C" that should be using "W".
Who would be helped by the change? People inadvertently passing a string of Unicode Code Points to unpack C and getting garbage. They would still get garbage, but they would start receiving a warning.
I have no idea how large these groups of people are.
It seems to me that adding a warning without changing the output would be the most beneficial?
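A sketch of that alternative, written as an ordinary wrapper rather than a change to unpack itself. The name `unpack_C_with_warning` and the warning wording are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical wrapper: same values as today's unpack "C*", plus a warning
# whenever a value does not fit in an octet.
sub unpack_C_with_warning {
    my ($string) = @_;
    my @values = unpack "C*", $string;      # output left untouched
    for my $value (@values) {
        warn "Character in 'C' format exceeds 255 in unpack\n"
            if $value > 0xFF;
    }
    return @values;
}

my @result = unpack_C_with_warning("\x{442}A");
print "@result\n";                          # 1090 65, with one warning on STDERR
```

Code that relies on "C" returning full code points would keep working, and only the diagnostics would change.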
On 3/9/20 2:36 AM, Eric Brine wrote:
> It seems to me that adding a warning without changing the output would be the most beneficial?
That sounds reasonable to me.
Migrated from rt.perl.org#84670 (status was 'open')
Searchable as RT84670$