unpack(C => ...) on string with UTF8 FLAG without <use bytes> may return value more than 255

p5pRT commented 13 years ago

Migrated from rt.perl.org#84670 (status was 'open')

Searchable as RT84670$

p5pRT commented 13 years ago

From @moonlibs

Created by @moonlibs

perl5.12.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 1090 perl5.13.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 1090 perl5.10.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 1090

while. perl5.8.9 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 209 perl5.6.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 209

but with use bytes

perl5.12.0 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 209 perl5.13.2 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 209

It's either worth adding sub unpack into bytes.pm and fix documentation or fix this issue.

Perl Info

``` Flags: category=core severity=medium Site configuration information for perl 5.12.0: Configured by mons at Fri May 7 14:44:18 MSD 2010. Summary of my perl5 (revision 5 version 12 subversion 0) configuration: Platform: osname=freebsd, osvers=7.1-release-p2, archname=amd64-freebsd uname='freebsd veda.park.rambler.ru 7.1-release-p2 freebsd 7.1-release-p2 #0: thu feb 12 22:34:21 msk 2009 root@veda.park.rambler.ru:usrobjusrsrcsysdevel amd64 ' config_args='-des -Dprefix=/home/mons -Duse64bitint -DDEBUG_LEAKING_SCALARS -DDEBUGGING -Dinc_version_list=none -Dccflags=-O2 -march=athlon64 -fomit-frame-pointer -pipe -ggdb -g3 -Doptimize=-O2 -g3' hint=recommended, useposix=true, d_sigaction=define useithreads=undef, usemultiplicity=undef useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=define, use64bitall=define, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-O2 -march=athlon64 -fomit-frame-pointer -pipe -ggdb -g3 -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -DDEBUGGING -fno-strict-aliasing -fstack-protector -I/usr/local/include', optimize='-O2 -g3', cppflags='-O2 -march=athlon64 -fomit-frame-pointer -pipe -ggdb -g3 -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -DDEBUGGING -fno-strict-aliasing -fstack-protector -I/usr/local/include' ccversion='', gccversion='4.2.1 20070719 [FreeBSD]', gccosandvers='' intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16 ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib' libpth=/usr/lib /usr/local/lib libs=-lgdbm -lm -lcrypt -lutil -lc perllibs=-lm -lcrypt -lutil -lc libc=, so=so, useshrplib=false, libperl=libperl.a gnulibc_version='' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' ' cccdlflags='-DPIC -fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector' Locally applied patches: @INC for perl 5.12.0: /home/mons/perl/perl-ex /home/mons/lib/perl5/site_perl/5.12.0/amd64-freebsd /home/mons/lib/perl5/site_perl/5.12.0 /home/mons/lib/perl5/5.12.0/amd64-freebsd /home/mons/lib/perl5/5.12.0 . Environment for perl 5.12.0: HOME=/home/mons LANG=en_US.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/mons/home-bin:/home/mons/bin:/home/mons/flex/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/home/mons/bin PERL5LIB=/home/mons/perl/perl-ex PERL_BADLANG (unset) SHELL=/usr/local/bin/bash ```

p5pRT commented 13 years ago

From @nwc10

On Tue\, Feb 22\, 2011 at 04:46:15AM -0800\, mons @ cpan. org wrote:

[Please describe your issue here]

perl5.12.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 1090 perl5.13.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 1090 perl5.10.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 1090

while. perl5.8.9 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 209 perl5.6.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 209

but with use bytes

perl5.12.0 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 209 perl5.13.2 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 209

It's either worth adding sub unpack into bytes.pm and fix documentation or fix this issue.

It's a documented behaviour change introduced in 5.10\, as described in perl5100delta.pod:

=head1 Incompatible Changes
=head2 Packing and UTF-8 strings
The semantics of pack() and unpack() regarding UTF-8-encoded data has been changed. Processing is now by default character per character instead of byte per byte on the underlying encoding. Notably\, code that used things like C\<pack("a*"\, $string)> to see through the encoding of string will now simply get back the original $string. Packed strings can also get upgraded during processing when you store upgraded characters. You can get the old behaviour by using C\.
To be consistent with pack()\, the C\ in unpack() templates indicates that the data is to be processed in character mode\, i.e. character by character; on the contrary\, C\ in unpack() indicates UTF-8 mode\, where the packed string is processed in its UTF-8-encoded Unicode form on a byte by byte basis. This is reversed with regard to perl 5.8.X\, but now consistent between pack() and unpack().

Nicholas Clark

p5pRT commented 13 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 13 years ago

From @moonlibs

Than maybe note this behaviuor in perldoc -f pack? Since if I see

c A signed char (8-bit) value. C An unsigned char (octet) value.

Then I expect an octet. At least I think we should add reference to "Pack and unpack can operate in two modes..."

And since it is documented\, It is seems to be a good idea to add bytes::pack() and bytes::unpack()

p5pRT commented 13 years ago

From @ikegami

On Tue\, Feb 22\, 2011 at 7:46 AM\, mons@cpan.org \perlbug\-followup@perl\.orgwrote:

# New Ticket Created by mons@cpan.org # Please include the string: [perl #84670] # in the subject line of all future correspondence about this issue. # \<URL: http://rt.perl.org/rt3/Ticket/Display.html?id=84670 >

This is a bug report for perl from mons@cpan.org\, generated with the help of perlbug 1.39 running under perl 5.12.0.

----------------------------------------------------------------- [Please describe your issue here]

perl5.12.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 1090 perl5.13.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 1090 perl5.10.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 1090

It's doing the best it can when given garbage input. It requries a string of bytes\, but it's given a string that contains non-bytes.

You didn't say what you expect it to do. I suppose it could throw an exception\, but the current behaviour is quite reasonable to me.

while. perl5.8.9 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 209 perl5.6.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ )\, "\n"' # 209

These versions peeked into the internals and produced unpredictable results for bytes above 0x80. When this was fixed\, the nonsense answer they gave for inputs >0xFF was fixed as well.

- Eric

p5pRT commented 13 years ago

From @ikegami

On Tue Feb 22 10:13:05 2011\, ikegami@adaelis.com wrote:

You didn't say what you expect it to do. I suppose it could throw an exception\, but the current behaviour is quite reasonable to me.

$ perl -we'printf "%02X\n"\, unpack "N"\, "\0\0\0\x{442}"' Character(s) in 'N' format wrapped in unpack at -e line 1. 42

$ perl -wle'printf "%02X\n"\, unpack "C"\, "\x{442}"' 442

I suppose the latter could do like the former (warn and "& 0xFF" the input)\, but the latter's behaviour is so much more useful.

p5pRT commented 13 years ago

From perl5-porters@ton.iguana.be

In article \rt\-3\.6\.HEAD\-24085\-1298398878\-801\.84670\-15\-0@perl\.org\, "Eric Brine via RT" \perlbug\-followup@perl\.org writes:

On Tue Feb 22 10:13:05 2011\, ikegami@adaelis.com wrote:

You didn't say what you expect it to do. I suppose it could throw an exception\, but the current behaviour is quite reasonable to me.

$ perl -we'printf "%02X\n"\, unpack "N"\, "\0\0\0\x{442}"' Character(s) in 'N' format wrapped in unpack at -e line 1. 42

$ perl -wle'printf "%02X\n"\, unpack "C"\, "\x{442}"' 442

I suppose the latter could do like the former (warn and "& 0xFF" the input)\, but the latter's behaviour is so much more useful.

Actually when I made the unicode pack/unpack patch the "C" format was seen as a possible backward incompatibility problem and on p5p I was asked to add another character to mean "full single character semantics"\, which became the "W" (word) character. But I only did that for pack it seems:

perl -wle 'print ord pack("C"\, 1000)' Character in 'C' format wrapped in pack at -e line 1. 232

perl -wle 'print ord pack("W"\, 1000)' 1000

So the "C" format basically works "modulo 256"

I think its entirely reasonable to have the same behaviour for unpack so that

unpack "C"\, "\x{442}" would give 66 (1090 % 256) together with a format wrap warning (notice that it still won't give 209 which is a nonsense answer corresponding to internal details)

The admittedly much more sane behaviour of returning 1090 would still be available with W\,

unpack "W"\, "\x{442}" would give 1090

This woould be completely in line with the documented (in perldoc -f pack)

C An unsigned char (octet) value. W An unsigned char value (can be greater than 255).

"W" was always meant as the unicode sane version of "C"

I can make a patch if people agree with this...

p5pRT commented 5 years ago

From @khwilliamson

On Mon\, 28 Feb 2011 15:06:10 -0800\, perl5-porters@ton.iguana.be wrote:

In article \rt\-3\.6\.HEAD\-24085\-1298398878\-801\.84670\-15\-0@perl\.org\, "Eric Brine via RT" \perlbug\-followup@perl\.org writes:

On Tue Feb 22 10:13:05 2011\, ikegami@adaelis.com wrote:

You didn't say what you expect it to do. I suppose it could throw an exception\, but the current behaviour is quite reasonable to me.

$ perl -we'printf "%02X\n"\, unpack "N"\, "\0\0\0\x{442}"' Character(s) in 'N' format wrapped in unpack at -e line 1. 42

$ perl -wle'printf "%02X\n"\, unpack "C"\, "\x{442}"' 442

I suppose the latter could do like the former (warn and "& 0xFF" the input)\, but the latter's behaviour is so much more useful.

Actually when I made the unicode pack/unpack patch the "C" format was seen as a possible backward incompatibility problem and on p5p I was asked to add another character to mean "full single character semantics"\, which became the "W" (word) character. But I only did that for pack it seems:

perl -wle 'print ord pack("C"\, 1000)' Character in 'C' format wrapped in pack at -e line 1. 232

perl -wle 'print ord pack("W"\, 1000)' 1000

So the "C" format basically works "modulo 256"

I think its entirely reasonable to have the same behaviour for unpack so that

unpack "C"\, "\x{442}" would give 66 (1090 % 256) together with a format wrap warning (notice that it still won't give 209 which is a nonsense answer corresponding to internal details)

The admittedly much more sane behaviour of returning 1090 would still be available with W\,

unpack "W"\, "\x{442}" would give 1090

This woould be completely in line with the documented (in perldoc -f pack)

C An unsigned char (octet) value. W An unsigned char value (can be greater than 255).

"W" was always meant as the unicode sane version of "C"

I can make a patch if people agree with this...

No one responded to this. It looks ok to me. -- Karl Williamson

khwilliamson commented 4 years ago

Actually, W appears to have been added to unpack already, so the one remaining issue in this ticket is unpack C

ikegami commented 4 years ago

Isn't it kinda late to be changing unpack C now?

$ for v in 8 10 12 14 16 18 20 22 24 26 30; do printf '5.%st: ' $v; 5.${v}t/bin/perl -wle'print unpack "C", "\x{442}"'; done
5.8t: 209
5.10t: 1090
5.12t: 1090
5.14t: 1090
5.16t: 1090
5.18t: 1090
5.20t: 1090
5.22t: 1090
5.24t: 1090
5.26t: 1090
5.30t: 1090

ikegami commented 4 years ago

Who would be harmed from the change? People using "C" that should be using "W".

Who would be helped by the change? People inadvertently passing a string of Unicode Code Points to unpack C and getting garbage. They would still get garbage, but they would start receiving a warning.

I have no idea how large these groups of people are.

It seems to me that adding a warning without changing the output would be the most beneficial?

khwilliamson commented 4 years ago

On 3/9/20 2:36 AM, Eric Brine wrote:

Isn't it kinda late to be changing unpack C now?

|$ for v in 8 10 12 14 16 18 20 22 24 26 30; do printf '5.%st: ' $v;
5.${v}t/bin/perl -wle'print unpack "C", "\x{442}"'; done 5.8t: 209
5.10t: 1090 5.12t: 1090 5.14t: 1090 5.16t: 1090 5.18t: 1090 5.20t:
1090 5.22t: 1090 5.24t: 1090 5.26t: 1090 5.30t: 1090 |
Who would be harmed from the change? People using "C" that should be using "W".

Who would be helped by the change? People inadvertently passing a string of Unicode Code Points to unpack C and getting garbage. They would still get garbage, but they would start receiving a warning.

I have no idea how large these groups of people are.

It seems to me that adding a warning without changing the output would be the most beneficial?

That sounds reasonable to me.

Perl / perl5