Bug in lc and uc (interaction between UTF-8, substr, and lc/uc)

p5pRT commented 18 years ago

Migrated from rt.perl.org#38619 (status was 'resolved')

Searchable as RT38619$

p5pRT commented 18 years ago

From perl@benizi.com

Created by perl@benizi.com

Problem with lc/uc interacting with substr and _utf8_on.

The second substr(lc($var), 0) on a $var that has had _utf8_on applied returns a string of the wrong length. In preliminary results, it seems to be limited to the same length as the first substr(lc($var), 0). Adding further iterations leads to further weirdness. The test program below can be called as:

    perl bug.pl [test-string]

The test string will be split on /:/ and defaults to 'a:bc'.

For each string in the split, the program calls _utf8_on and prints the string, a tab, and substr(lc(string), 0).

Output should be:

    string1    string1
    string2    string2
    ...

Actual output is:

    string1    string1
    string2    string3
    ...

(where string3 is the first length(string1) characters of string2)

    # sample program demonstrating problem
    $ cat bug.pl
    #!/usr/bin/perl -l
    use strict;
    use warnings;
    use Encode qw/_utf8_on/;
    for (split /:/, shift||'a:bc') {
        _utf8_on($_);
        print "$_\t", substr(lc($_), 0);
    }

    # expected results
    $ cat expected_output
    a    a
    bc   bc

    # actual results
    $ perl bug.pl
    a    a
    bc   b

    # golfed test case (should produce 'abc', not 'ab')
    $ perl -MEncode=_utf8_on -e '_utf8_on($_),print substr lc,0 for qw\\,$/' ab

Additional oddness/data:

Affected versions: >=5.8.1

Confirmed unaffected: linux-i686 5.8.0, solaris 5.8.0

Affected functions: only lc/uc (not ucfirst/lcfirst), and only in substr(lc(), 0) order (i.e. lc(substr($_, 0)) is not affected).
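The order dependence described above can be checked on any installed perl with a small self-contained script. This is only a sketch and not part of the original report: it uses `utf8::upgrade` instead of `_utf8_on` to get a UTF-8-flagged string, and the variable names and the extra `def` item are illustrative assumptions. On an unaffected perl it should find no mismatches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Every mismatch found is recorded here; the list stays empty on an
# unaffected perl.
my @mismatches;

for my $word (split /:/, 'a:bc:def') {
    utf8::upgrade($word);    # give the string the UTF-8 flag

    # substr() applied directly to the result of lc/uc: the order
    # reported as affected.
    my $lc_direct = substr(lc($word), 0);
    my $uc_direct = substr(uc($word), 0);

    # lc() applied to the result of substr(): reported as not affected.
    my $lc_inner = lc(substr($word, 0));

    # Reference values, copied into fresh scalars before any substr().
    my $lc_plain = lc $word;
    my $uc_plain = uc $word;

    push @mismatches, "substr(lc(\$_), 0): got '$lc_direct', want '$lc_plain'"
        if $lc_direct ne $lc_plain;
    push @mismatches, "substr(uc(\$_), 0): got '$uc_direct', want '$uc_plain'"
        if $uc_direct ne $uc_plain;
    push @mismatches, "lc(substr(\$_, 0)): got '$lc_inner', want '$lc_plain'"
        if $lc_inner ne $lc_plain;
}

if (@mismatches) {
    print "not ok # $_\n" for @mismatches;
}
else {
    print "ok # no mismatches\n";
}
```

The `substr(lc(...), 0)` expressions are written out inline rather than wrapped in a helper sub, because the report ties the problem to substr() being applied directly to the value lc/uc returns.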

Perl Info

```
Flags:
    category=core
    severity=low

Site configuration information for perl v5.8.7:

Configured by Gentoo at Sat Feb 4 23:34:18 EST 2006.

Summary of my perl5 (revision 5 version 8 subversion 7) configuration:
  Platform:
    osname=linux, osvers=2.6.11-gentoo-r6, archname=i686-linux
    uname='linux elation 2.6.11-gentoo-r6 #4 thu may 12 16:36:25 edt 2005 i686 intel(r) pentium(r) 4 cpu 3.00ghz genuineintel gnulinux '
    config_args='-des -Darchname=i686-linux -Dcccdlflags=-fPIC -Dccdlflags=-rdynamic -Dcc=i686-pc-linux-gnu-gcc -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr -Dlocincpth= -Doptimize=-O2 -march=pentium4 -fomit-frame-pointer -Duselargefiles -Dd_semctl_semun -Dscriptdir=/usr/bin -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dinstallman1dir=/usr/share/man/man1 -Dinstallman3dir=/usr/share/man/man3 -Dman1ext=1 -Dman3ext=3pm -Dinc_version_list=5.8.0 5.8.0/i686-linux 5.8.2 5.8.2/i686-linux 5.8.4 5.8.4/i686-linux 5.8.5 5.8.5/i686-linux 5.8.6 5.8.6/i686-linux -Dcf_by=Gentoo -Ud_csh -Di_ndbm -Di_gdbm -Di_db'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='i686-pc-linux-gnu-gcc', ccflags ='-fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -march=pentium4 -fomit-frame-pointer',
    cppflags='-fno-strict-aliasing -pipe'
    ccversion='', gccversion='3.4.4 (Gentoo 3.4.4-r1, ssp-3.4.4-1.0, pie-8.7.8)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='i686-pc-linux-gnu-gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lpthread -lnsl -lndbm -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.3.5.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.3.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

@INC for perl v5.8.7:
    /etc/perl
    /usr/lib/perl5/site_perl/5.8.7/i686-linux
    /usr/lib/perl5/site_perl/5.8.7
    /usr/lib/perl5/site_perl/5.8.5
    /usr/lib/perl5/site_perl/5.8.5/i686-linux
    /usr/lib/perl5/site_perl/5.8.6
    /usr/lib/perl5/site_perl/5.8.6/i686-linux
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.7/i686-linux
    /usr/lib/perl5/vendor_perl/5.8.7
    /usr/lib/perl5/vendor_perl/5.8.5
    /usr/lib/perl5/vendor_perl/5.8.5/i686-linux
    /usr/lib/perl5/vendor_perl/5.8.6
    /usr/lib/perl5/vendor_perl/5.8.6/i686-linux
    /usr/lib/perl5/vendor_perl
    /usr/lib/perl5/5.8.7/i686-linux
    /usr/lib/perl5/5.8.7
    /usr/local/lib/site_perl
    .

Environment for perl v5.8.7:
    HOME=/home/bhaskell
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/bhaskell/bin:/home/bhaskell/wn/bin:/usr/kde/3.4/bin:/bin:/usr/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/3.4.4:/opt/ati/bin:/opt/ghc/bin:/opt/blackdown-jdk-1.4.2.02/bin:/opt/blackdown-jdk-1.4.2.02/jre/bin:/usr/qt/3/bin:/usr/kde/3.4/bin:/usr/kde/3.3/bin:/usr/games/bin:/var/qmail/bin:/usr/cogsci/bin:/people/bhaskell/bin
    PERL_BADLANG (unset)
    SHELL=/bin/zsh
```

p5pRT commented 18 years ago

From perl@benizi.com

Still in 5.9.3 for i686-linux. (Tested that before I submitted, but forgot to mention it.)

p5pRT commented 18 years ago

perl@benizi.com - Status changed from 'new' to 'open'

p5pRT commented 18 years ago

From @andk

Looks like a hairy troll has jhidden for quite a while:)

----Program----

    use strict;
    use warnings;
    use Encode qw/_utf8_on/;
    for (split /:/, 'a:bc') {
        _utf8_on($_);
        my $p = join "", "$_ ", substr(lc($_), 0);
        print $p =~ /^(a a|bc bc)$/ ? "ok # $p\n" : "not ok # $p\n";
    }

----Output of .../pHIziQK/perl-5.8.0@18529/bin/perl----

    ok # a a
    ok # bc bc

----EOF ($?='0')----

----Output of .../pCRFA94/perl-5.8.0@18530/bin/perl----

    ok # a a
    not ok # bc b

----EOF ($?='0')----

Change 18530 by hv@hv-crypt.org on 2003/01/21 01:37:03

integrate (by hand) #18353 and #18359 from maint-5.8:

OK, maybe it helps to binary search along the maint-5.8 stretch...

----Program----

    use strict;
    use warnings;
    use Encode qw/_utf8_on/;
    for (split /:/, 'a:bc') {
        _utf8_on($_);
        my $p = join "", "$_ ", substr(lc($_), 0);
        print $p =~ /^(a a|bc bc)$/ ? "ok # $p\n" : "not ok # $p\n";
    }

----Output of .../pAyq3oR/perl-5.8.0@18352/bin/perl----

    ok # a a
    ok # bc bc

----EOF ($?='0')----

----Output of .../pZpX8E8/perl-5.8.0@18353/bin/perl----

    ok # a a
    not ok # bc b

----EOF ($?='0')----

Change 18353 by jhi@lyta on 2002/12/26 02:07:06

Introduce a cache for UTF-8 data: length and byte<->char mapping are stored in a new type of magic. Speeds up length(), substr(), index(), rindex(), pos(), and some parts of s///.

The speedup varies a lot (on the usual suspects: what is the access pattern of the data, compiler, CPU), but should be at least one order of magnitude, and getting to the same magnitude as byte string speeds, and in some cases (length on unchanged data) even reaching the byte string speed. On the other hand, in some cases (index) the byte speed is still faster by a factor of five or so, but the bottleneck there does not seem to be any more the byte<->char mapping (instead, the fbm_instr() speed).

There is one cache slot for the speed, and only two for the byte<->char mapping (the first one for the start->offset, and the second for the offset->offset+length, when talking in substr() terms).

Code this hairy is bound to have hairy trolls hiding under it.

-- andreas
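For readers unfamiliar with the byte<->char mapping that change 18353 caches, the following sketch shows why substr() on a UTF-8 string needs such a mapping at all. It is an illustration only and does not show the cache itself; the helper name `char_to_byte_offset` is hypothetical, and the approach (encode the prefix and measure it) is chosen for clarity, not speed.

```perl
use strict;
use warnings;

my $str = "a\x{100}bc";    # U+0100 occupies two bytes in UTF-8

# Length in characters.
my $char_len = length $str;

# Length in bytes: encode a copy to UTF-8 octets and measure that.
my $bytes = $str;
utf8::encode($bytes);
my $byte_len = length $bytes;

# Hypothetical helper: map a character offset to a byte offset by
# encoding everything before that offset and counting the bytes.
sub char_to_byte_offset {
    my ($string, $char_offset) = @_;
    my $prefix = substr($string, 0, $char_offset);
    utf8::encode($prefix);
    return length $prefix;
}

# Character offset 2 is the "b"; it sits after "a" (1 byte) and U+0100 (2 bytes).
my $byte_off = char_to_byte_offset($str, 2);

print "chars=$char_len bytes=$byte_len char offset 2 = byte offset $byte_off\n";
```

Without a cache, every character offset has to be recomputed by scanning from the start of the string, which is the cost the change set out to avoid. A cached mapping that is not invalidated when the string changes is exactly the kind of stale state this ticket is about.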

p5pRT commented 18 years ago

From @nwc10

On Fri, Feb 24, 2006 at 04:46:09AM +0100, Andreas J. Koenig wrote:

> Looks like a hairy troll has jhidden for quite a while:)
>
> Change 18353 by jhi@lyta on 2002/12/26 02:07:06
>
> Introduce a cache for UTF-8 data: length and byte<->char mapping are stored in a new type of magic. Speeds up length, substr, index, rindex, pos, and some parts of s///.

Thanks. This confirms my suspicion that it was the UTF-8 caching code introduced with 5.8.1

As part of my TPF grant I'm going to look at all this, so if no-one else beats me to finding the specific bug in the existing code, it will be resolved in the next 3 months.

As a workaround, I think that re-assigning the value to itself before the lc or uc will clear the cache, and lc and uc will then give the correct answer.

Nicholas Clark

p5pRT commented 18 years ago

From BQW10602@nifty.com

On Fri, 24 Feb 2006 10:13:15 +0000, Nicholas Clark <nick@ccl4.org> wrote

> On Fri, Feb 24, 2006 at 04:46:09AM +0100, Andreas J. Koenig wrote:
>
> > Looks like a hairy troll has jhidden for quite a while:)
> >
> > Change 18353 by jhi@lyta on 2002/12/26 02:07:06
> >
> > Introduce a cache for UTF-8 data: length and byte<->char mapping are stored in a new type of magic. Speeds up length, substr, index, rindex, pos, and some parts of s///.
>
> Thanks. This confirms my suspicion that it was the UTF-8 caching code introduced with 5.8.1
>
> As part of my TPF grant I'm going to look at all this, so if no-one else beats me to finding the specific bug in the existing code, it will be resolved in the next 3 months.
>
> As a workaround, I think that re-assigning the value to itself before the lc or uc will clear the cache, and lc and uc will then give the correct answer.

Should the magic on TARG be reset? (Or don't use TARG?)

ucfirst() also has this bug, when ulen != tculen (see pp_ucfirst).

    for (split /:/, shift||"a:ßbc") {
        utf8::upgrade($_);
        print "$_\t", substr(ucfirst($_), 0), "\t$_\n";
    }
    __END__
    a     A     a
    ßbc   S     ßbc

cf. The result on Perl 5.8.0

    a     A     a
    ßbc   Ssbc  ßbc
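The ulen != tculen condition means the first character and its titlecase form have different lengths. A minimal sketch of that length change, independent of the bug itself, is below; the variable names are illustrative, and the output layer is set to UTF-8 only so the ß prints cleanly.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

my $word = "\x{df}bc";    # "ßbc"
utf8::upgrade($word);     # make sure Unicode case rules apply

# The titlecase of U+00DF is the two-character sequence "Ss", so the
# result is one character longer than the input.
my $title = ucfirst $word;

printf "%s (%d chars) -> %s (%d chars)\n",
    $word, length($word), $title, length($title);
```

This is why ucfirst() only misbehaves for some inputs: a plain ASCII first letter keeps the same length and never takes the code path where the result is rebuilt at a different length.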

postincrement $_++ is also buggy.

    #!perl
    use strict;
    use warnings;
    use Encode qw/_utf8_on/;
    for (split /:/, shift||'a:bc') {
        _utf8_on($_);
        print "$_\t", substr($_++, 0), "\t$_\n";
    }
    __END__
    a     a     b
    bc    b     bd

cf. The result on Perl 5.8.0

    a     a     b
    bc    bc    bd

In contrast, preincrement ++$_ is good (pp_preinc doesn't use TARG).

    #!perl
    use strict;
    use warnings;
    use Encode qw/_utf8_on/;
    for (split /:/, shift||'a:bc') {
        _utf8_on($_);
        print "$_\t", substr(++$_, 0), "\t$_\n";
    }
    __END__
    a     b     b
    bc    bd    bd
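The difference between the two increment forms comes down to what each expression returns. As a sketch with illustrative variable names: post-increment has to hand back a copy of the old value, while pre-increment hands back the variable after it has been updated.

```perl
use strict;
use warnings;

# Post-increment: the expression yields the old value, so perl must
# keep a separate copy of it while the variable itself moves on.
my $post       = 'bc';
my $post_value = $post++;

# Pre-increment: the variable is updated first and is itself the result.
my $pre       = 'bc';
my $pre_value = ++$pre;

print "post: value=$post_value variable=$post\n";
print "pre:  value=$pre_value variable=$pre\n";
```

Both forms use Perl's string increment here, which is why 'bc' becomes 'bd'. Only the post-increment form needs a separate scalar to hold its result, which matches the observation above that pp_preinc does not use TARG.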

Regards SADAHIRO Tomoyuki

p5pRT commented 18 years ago

From @jhi

Nicholas Clark wrote:

> On Fri, Feb 24, 2006 at 04:46:09AM +0100, Andreas J. Koenig wrote:
>
> > Looks like a hairy troll has jhidden for quite a while:)
> >
> > Change 18353 by jhi@lyta on 2002/12/26 02:07:06
> >
> > Introduce a cache for UTF-8 data: length and byte<->char mapping are stored in a new type of magic. Speeds up length, substr, index, rindex, pos, and some parts of s///.
>
> Thanks. This confirms my suspicion that it was the UTF-8 caching code introduced with 5.8.1
>
> As part of my TPF grant I'm going to look at all this, so if no-one else beats me to finding the specific bug in the existing code, it will be resolved in the next 3 months.

Not to put too much pressure on Sadahiro-san but he has traditionally been able to fix all UTF-8 bugs extremely fast... Our brains must be similarly wired. Or, rather, his brain is wired the right way to *untangle* the spaghetti I left behind.

> As a workaround, I think that re-assigning the value to itself before the lc or uc will clear the cache, and lc and uc will then give the correct answer.
>
> Nicholas Clark

p5pRT commented 18 years ago

From BQW10602@nifty.com

On Fri, 24 Feb 2006 19:15:53 +0200, Jarkko Hietaniemi <jhietaniemi@gmail.com> wrote

> Nicholas Clark wrote:
>
> > On Fri, Feb 24, 2006 at 04:46:09AM +0100, Andreas J. Koenig wrote:
> >
> > > Looks like a hairy troll has jhidden for quite a while:)
> > >
> > > Change 18353 by jhi@lyta on 2002/12/26 02:07:06
> > >
> > > Introduce a cache for UTF-8 data: length and byte<->char mapping are stored in a new type of magic. Speeds up length, substr, index, rindex, pos, and some parts of s///.
> >
> > Thanks. This confirms my suspicion that it was the UTF-8 caching code introduced with 5.8.1
> >
> > As part of my TPF grant I'm going to look at all this, so if no-one else beats me to finding the specific bug in the existing code, it will be resolved in the next 3 months.
>
> Not to put too much pressure on Sadahiro-san but he has traditionally been able to fix all UTF-8 bugs extremely fast... Our brains must be similarly wired. Or, rather, his brain is wired the right way to *untangle* the spaghetti I left behind.

However, this bug against *other magics* seems to have existed even in perl 5.6.1, as shown below for m//g in scalar context.

SvSETMAGIC(sv) does not affect TARG if sv != TARG.

Regards, SADAHIRO Tomoyuki

The changes to pp.c are for pp_ucfirst, pp_uc, and pp_lc.

Inline Patch

```diff
diff -ur perl-current@27323/pp.c perl/pp.c
--- perl-current@27323/pp.c Sat Feb 25 09:41:08 2006
+++ perl/pp.c Sat Feb 25 16:59:13 2006
@@ -3350,7 +3350,8 @@
         if (slen > ulen)
             sv_catpvn(TARG, (char*)(s + ulen), slen - ulen);
         SvUTF8_on(TARG);
-        SETs(TARG);
+        sv = TARG;
+        SETs(sv);
     }
     else {
         s = (U8*)SvPV_force_nomg(sv, slen);
@@ -3402,7 +3403,8 @@
         if (!len) {
             SvUTF8_off(TARG); /* decontaminate */
             sv_setpvn(TARG, "", 0);
-            SETs(TARG);
+            sv = TARG;
+            SETs(sv);
         }
         else {
             STRLEN min = len + 1;
@@ -3435,7 +3437,8 @@
             *d = '\0';
             SvUTF8_on(TARG);
             SvCUR_set(TARG, d - (U8*)SvPVX_const(TARG));
-            SETs(TARG);
+            sv = TARG;
+            SETs(sv);
         }
     }
     else {
@@ -3487,7 +3490,8 @@
         if (!len) {
             SvUTF8_off(TARG); /* decontaminate */
             sv_setpvn(TARG, "", 0);
-            SETs(TARG);
+            sv = TARG;
+            SETs(sv);
         }
         else {
             STRLEN min = len + 1;
@@ -3540,7 +3544,8 @@
             *d = '\0';
             SvUTF8_on(TARG);
             SvCUR_set(TARG, d - (U8*)SvPVX_const(TARG));
-            SETs(TARG);
+            sv = TARG;
+            SETs(sv);
         }
     }
     else {
diff -ur perl-current@27323/t/op/lc.t perl/t/op/lc.t
--- perl-current@27323/t/op/lc.t Tue Nov 08 00:50:29 2005
+++ perl/t/op/lc.t Sat Feb 25 18:08:54 2006
@@ -6,7 +6,7 @@
     require './test.pl';
 }
 
-plan tests => 59;
+plan tests => 77;
 
 $a = "HELLO.* world";
 $b = "hello.* WORLD";
@@ -163,3 +163,39 @@
         is($a, v10, "[perl #18857]");
     }
 }
+
+
+# [perl #38619] Bug in lc and uc (interaction between UTF-8, substr, and lc/uc)
+
+for ("a\x{100}", "xyz\x{100}") {
+    is(substr(uc($_), 0), uc($_), "[perl #38619] uc");
+}
+for ("A\x{100}", "XYZ\x{100}") {
+    is(substr(lc($_), 0), lc($_), "[perl #38619] lc");
+}
+for ("a\x{100}", "ßyz\x{100}") { # ß to Ss (different length)
+    is(substr(ucfirst($_), 0), ucfirst($_), "[perl #38619] ucfirst");
+}
+
+# Related to [perl #38619]
+# the original report concerns PERL_MAGIC_utf8.
+# these cases concern PERL_MAGIC_regex_global.
+
+for (map { $_ } "a\x{100}", "abc\x{100}", "\x{100}") {
+    chop; # get ("a", "abc", "") in utf8
+    my $return = uc($_) =~ /\G(.?)/g;
+    my $result = $return ? $1 : "not";
+    my $expect = (uc($_) =~ /(.?)/g)[0];
+    is($return, 1, "[perl #38619]");
+    is($result, $expect, "[perl #38619]");
+}
+
+for (map { $_ } "A\x{100}", "ABC\x{100}", "\x{100}") {
+    chop; # get ("A", "ABC", "") in utf8
+    my $return = lc($_) =~ /\G(.?)/g;
+    my $result = $return ? $1 : "not";
+    my $expect = (lc($_) =~ /(.?)/g)[0];
+    is($return, 1, "[perl #38619]");
+    is($result, $expect, "[perl #38619]");
+}
+
##### END OF PATCH
```
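The tests in the patch rely on the core test harness (`./test.pl`). To run similar checks against an installed perl, they can be adapted to Test::More. The script below is a sketch under that assumption and is not part of the applied change; the test names and the use of `utf8::upgrade` in place of the `chop` trick are illustrative choices.

```perl
use strict;
use warnings;
use Test::More;

# substr() applied directly to the result of a case-changing function
# should see the whole result, whatever an earlier iteration produced.
for my $input ("a\x{100}", "xyz\x{100}") {
    is(substr(uc($input), 0),      uc($input),      'uc: substr sees whole result');
    is(substr(lc($input), 0),      lc($input),      'lc: substr sees whole result');
    is(substr(ucfirst($input), 0), ucfirst($input), 'ucfirst: substr sees whole result');
}

# A scalar-context m//g on the result should start at position 0 each
# time, not at a pos() left over from an earlier iteration.
for my $input ("a", "abc", "") {
    my $copy = $input;
    utf8::upgrade($copy);

    my $matched = uc($copy) =~ /\G(.?)/g;
    my $first   = $matched ? $1 : 'not';

    ok($matched, 'uc: m//g matches from the start');
    is($first, uc(substr($copy, 0, 1)), 'uc: first character captured');
}

done_testing;
```

Run it with `perl` or `prove`. On a perl that includes change 27329 all tests are expected to pass.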

p5pRT commented 18 years ago

From @nwc10

On Sat, Feb 25, 2006 at 06:16:45PM +0900, SADAHIRO Tomoyuki wrote:

> > Not to put too much pressure on Sadahiro-san but he has traditionally been able to fix all UTF-8 bugs extremely fast... Our brains must be

:-)

> However this bug against *other magics* seemed to exist even in perl 5.6.1, as shown below for m//g in scalar context.
>
> SvSETMAGIC(sv) does not affect TARG if sv != TARG.

D'oh!

Thanks, applied (change 27329)

Nicholas Clark

p5pRT commented 18 years ago

@iabyn - Status changed from 'open' to 'resolved'
