Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

"use bytes" doesn't apply byte semantics to concatenation #7114

Closed p5pRT closed 16 years ago

p5pRT commented 20 years ago

Migrated from rt.perl.org#26905 (status was 'resolved')

Searchable as RT26905$

p5pRT commented 20 years ago

From @jlokier

Created by @jlokier

The "use bytes" pragma is useful for code which only wants to handle bytes.

substr(), length(), index(), pos() and regex matching all ignore the UTF-8 flag on strings in the scope of this pragma.
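As a minimal sketch of what the pragma changes, the same string reports different lengths depending on whether `use bytes` is in scope. The sub names `char_len` and `byte_len` are illustrative only and do not come from the report.

```perl
use strict;
use warnings;

# Character semantics: length() counts code points.
sub char_len {
    my ($s) = @_;
    return length $s;
}

# Byte semantics: "use bytes" is lexically scoped, so it only
# affects the rest of this block.
sub byte_len {
    my ($s) = @_;
    use bytes;
    return length $s;
}

my $x = "\x{100}abc";    # U+0100 needs two bytes in Perl's internal UTF-8
printf "%d chars, %d bytes\n", char_len($x), byte_len($x);   # 4 chars, 5 bytes
```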

However, string concatenation does not take this pragma into account. Just like without the pragma, it upgrades strings to UTF-8 if any of them are UTF-8.

This is quite inconsistent with the algebraic properties expected of byte strings, such as:

  length(substr($a,0,1).substr($a,1)) == length($a)

Here's an example program which illustrates this:

  $x="\x{100}abc";
  $y="\x{80}def";
  use bytes;
  print length($x), ",", length($y), "\n";
  $z = $x.substr($x,0,1).substr($x,1).$y;
  print length($x), ",", length($y), ",", length($z), "\n";

The program prints:

  5,4
  5,5,17

Those numbers make no sense. In bytes, length($x) is 5 and length($y) is 4. After the concatenation, the total is 17, when it should logically be 14.
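The 17 can be accounted for if every operand that is not already UTF-8 gets upgraded before being joined. The sketch below checks that arithmetic with explicit `utf8::upgrade` and `bytes::length` calls, so it does not depend on how any particular perl concatenates. `raw_len` and `upgraded_len` are hypothetical helper names, and the operand list is my reading of what byte semantics hands to the concatenation.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling byte semantics

# Byte length of the string as currently stored.
sub raw_len { return bytes::length($_[0]) }

# Byte length after forcing the UTF-8 internal representation
# (works on a copy, so the caller's string is untouched).
sub upgraded_len {
    my ($s) = @_;
    utf8::upgrade($s);
    return bytes::length($s);
}

# The four operands of the concatenation, as byte semantics sees them:
# $x, substr($x,0,1), substr($x,1), $y
my @parts = ("\x{100}abc", "\xC4", "\x80abc", "\x{80}def");

my ($expected, $observed) = (0, 0);
$expected += raw_len($_)      for @parts;   # 5 + 1 + 4 + 4
$observed += upgraded_len($_) for @parts;   # 5 + 2 + 5 + 5
print "$expected $observed\n";              # 14 17
```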

(This also shows length($y) is modified simply by $y being read, reported as [perl #26901]. In this case, length($y) is 4 before the concatenation but 5 after.)

Summary: I think string concatenation should _not_ upgrade non-UTF-8 strings to UTF-8 when they are concatenated inside the scope of "use bytes". A warning or even an exception may be appropriate.
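One way to get the additive behaviour without depending on how concatenation treats the pragma is to turn every operand into a plain octet string first. This is only a sketch of a workaround, not what any patch in this ticket does. `as_octets` is a hypothetical helper, and it deliberately exposes Perl's internal UTF-8 bytes the way `use bytes` does.

```perl
use strict;
use warnings;

# Return a copy of the string as plain octets. Strings carrying the
# UTF-8 flag are replaced by their UTF-8 encoding; others are already
# octets and are returned unchanged.
sub as_octets {
    my ($s) = @_;
    utf8::encode($s) if utf8::is_utf8($s);
    return $s;
}

my $x = as_octets("\x{100}abc");   # 5 octets: C4 80 61 62 63
my $y = as_octets("\x{80}def");    # 4 octets, unchanged

# No operand carries the UTF-8 flag, so nothing is upgraded.
my $z = $x . substr($x, 0, 1) . substr($x, 1) . $y;
print join(",", length($x), length($y), length($z)), "\n";   # 5,4,14
```

Because all operands are octet strings, the ordinary `length` is additive here even without `use bytes`.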

Perl Info
```
Flags:
    category=core
    severity=medium

Site configuration information for perl v5.8.0:

Configured by bhcompile' cf_email='bhcompile at Wed Aug 13 11:45:59 EDT 2003.

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.4.21-1.1931.2.382.entsmp, archname=i386-linux-thread-multi
    uname='linux str'
    config_args='-des -Doptimize=-O2 -g -pipe -march=i386 -mcpu=i686 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Dotherlibdirs=/usr/lib/perl5/5.8.0 -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef' useithreads=define usemultiplicity=
    useperlio= d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=un uselongdouble=
    usemymalloc=, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='3.2.2 20030222 (Red Hat Linux 3.2.2-5)', gccosandvers='' gccversion='3.2.2 200302'
    intsize=r, longsize=r, ptrsize=5, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long' k', ivsize=4' ivtype='l, nvtype='double' o_nonbl', nvsize=, Off_t='', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc' l', ldflags =' -L/u'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lpthread -lc -lcrypt -lutil
    perllibs=
    libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libper
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so', d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE'
    cccdlflags='-fPIC' ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5', lddlflags='s Unicode/Normalize XS/A'

Locally applied patches:
    MAINT18379

@INC for perl v5.8.0:
    /usr/lib/perl5/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/5.8.0
    /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.0
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.0
    /usr/lib/perl5/vendor_perl
    /usr/lib/perl5/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/5.8.0
    .

Environment for perl v5.8.0:
    HOME=/home/jamie
    LANG=en_GB.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/jamie/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash dlflags='-share (unset)
```

p5pRT commented 20 years ago

From @rgs

SADAHIRO Tomoyuki wrote:

This is because join() internally uses sv_catsv() which considers bytes.pm.

Here is a patch against perl-current. After this patch the above example prints: [snip]

After my patch for pp_hot.c, some tests for Encode fail.

  Failed Test  Stat Wstat Total Fail  Failed  List of Failed
  t/CJKT.t        1   256    60    1   1.67%  22
  t/at-cn.t       2   512    29    2   6.90%  18 20
  t/perlio.t      2   512    38    2   5.26%  7-8

This is unnecessary (I think) declaration of \ in Encode::CN::HZ. If E::CN::HZ fixed, all the tests for Encode should succeed.

In addition perlio_ok returning constantly true is wrong. (it should return false if PerlIO::encoding is not available) So the default method in Encode::Encoding:: should be used.

Thanks, both patches applied to bleadperl as change #22363. Note that I've changed the version number of Encode::CN::HZ to 1.05_01. The change to Encode::CN::HZ should probably be made conditional on perl version >= 5.9.1.

p5pRT commented 20 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 20 years ago

From dankogai@dan.co.jp

On Feb 23, 2004, at 01:26, Autrijus Tang wrote:

On Sun, Feb 22, 2004 at 06:41:43PM +0900, SADAHIRO Tomoyuki wrote:

After my patch for pp_hot.c, some tests for Encode fail.

  Failed Test  Stat Wstat Total Fail  Failed  List of Failed
  t/CJKT.t        1   256    60    1   1.67%  22
  t/at-cn.t       2   512    29    2   6.90%  18 20
  t/perlio.t      2   512    38    2   5.26%  7-8

This is unnecessary (I think) declaration of \ in Encode::CN::HZ. If E::CN::HZ fixed, all the tests for Encode should succeed.

As the author of HZ.pm, I think the patch makes perfect sense. :-)

Sorry for my slow response. I was too busy to be online for the last few days.

I just checked the patch on both 5.8.0 and 5.8.3 and it worked fine. So it is backward-compatible. Now there is no reason not to let your patch in. I already did so in my repository.

Pumpking(s), please go ahead and apply his patch.

Dan the Encode Maintainer

p5pRT commented 16 years ago

p5p@spam.wizbit.be - Status changed from 'open' to 'resolved'