Closed p5pRT closed 18 years ago
Problem with lc/uc interacting with substr and _utf8_on.
The second substr(lc($var), 0) on the same _utf8_on'ed $var is the wrong length, and, in preliminary results, seems to be limited to the same length as the first substr(lc($var), 0). Adding further iterations leads to further weirdness. The test program below can be called as:
perl bug.pl [test-string]
The test string will be split on /:/ and defaults to 'a:bc'.
For each string in the split:
_utf8_on, and print string \
Output should be:
string1 \
Actual output is:
string1 \
# sample program demonstrating problem
$ cat bug.pl
#!/usr/bin/perl -l
use strict;
use warnings;
use Encode qw/_utf8_on/;
for (split /:/, shift||'a:bc') {
    _utf8_on($_);
    print "$_\t", substr(lc($_), 0);
}

# expected results
$ cat expected_output
a	a
bc	bc

# actual results
$ perl bug.pl
a	a
bc	b
# golfed test case (should produce 'abc', not 'ab')
$ perl -MEncode=_utf8_on -e '_utf8_on($_)\,print substr lc\,0 for qw\\,$/'
ab
Additional oddness/data:
Affected versions: >=5.8.1
Confirmed unaffected: linux-i686 5.8.0, solaris 5.8.0
Affected functions: only lc/uc (not ucfirst/lcfirst), and only in substr(lc(), 0) order (i.e. lc(substr($_, 0)) is not affected).
Still present in 5.9.3 for i686-linux. (I tested that before I submitted, but forgot to mention it.)
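The report above can be turned into a self-checking script. The following is a minimal sketch, not part of the original report: the helper name case_substr_failures is hypothetical, and the input list is arbitrary. It keeps substr(lc(...), 0) and substr(uc(...), 0) inline inside one loop so that the same lc/uc op is executed repeatedly, which is the pattern the report describes.

```perl
use strict;
use warnings;
use Encode qw/_utf8_on/;

# Hypothetical helper: returns a description of every input for which
# substr(lc(...), 0) or substr(uc(...), 0) disagrees with plain lc/uc.
sub case_substr_failures {
    my @failures;
    for my $s (@_) {
        my $v = $s;
        _utf8_on($v);                       # same flagging as in bug.pl
        my $lower = substr(lc($v), 0);      # pattern from the report
        my $upper = substr(uc($v), 0);
        push @failures, "lc($s) gave '$lower'" if $lower ne lc($s);
        push @failures, "uc($s) gave '$upper'" if $upper ne uc($s);
    }
    return @failures;
}

my @failures = case_substr_failures(qw/a bc def ghij/);
print @failures ? join("\n", @failures) . "\n" : "all results consistent\n";
```

On a perl without the bug the failure list should be empty; on an affected build the later, longer strings would be expected to show up truncated, as in the actual results above.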
perl@benizi.com - Status changed from 'new' to 'open'
Looks like a hairy troll has jhidden for quite a while:)
----Program----
use strict;
use warnings;
use Encode qw/_utf8_on/;
for (split /:/, 'a:bc') {
    _utf8_on($_);
    my $p = join "", "$_ ", substr(lc($_), 0);
    print $p =~ /^(a a|bc bc)$/ ? "ok # $p\n" : "not ok # $p\n";
}
----Output of .../pHIziQK/perl-5.8.0@18529/bin/perl----
ok # a a
ok # bc bc
----EOF ($?='0')----
----Output of .../pCRFA94/perl-5.8.0@18530/bin/perl----
ok # a a
not ok # bc b
----EOF ($?='0')----
Change 18530 by hv@hv-crypt.org on 2003/01/21 01:37:03
integrate (by hand) #18353 and #18359 from maint-5.8:
OK, maybe it helps to binary search along the maint-5.8 stretch...
----Program----
use strict;
use warnings;
use Encode qw/_utf8_on/;
for (split /:/, 'a:bc') {
    _utf8_on($_);
    my $p = join "", "$_ ", substr(lc($_), 0);
    print $p =~ /^(a a|bc bc)$/ ? "ok # $p\n" : "not ok # $p\n";
}
----Output of .../pAyq3oR/perl-5.8.0@18352/bin/perl----
ok # a a
ok # bc bc
----EOF ($?='0')----
----Output of .../pZpX8E8/perl-5.8.0@18353/bin/perl----
ok # a a
not ok # bc b
----EOF ($?='0')----
Change 18353 by jhi@lyta on 2002/12/26 02:07:06
Introduce a cache for UTF-8 data: length and byte<->char mapping
are stored in a new type of magic. Speeds up length(), substr(),
index(), rindex(), pos(), and some parts of s///.
The speedup varies a lot (on the usual suspects: what is the
access pattern of the data, compiler, CPU), but should be at
least one order of magnitude, and getting to the same magnitude
as byte string speeds, and in some cases (length on unchanged data)
even reaching the byte string speed. On the other hand, in some
cases (index) the byte speed is still faster by a factor of five
or so, but the bottleneck there does not seem to be any more
the byte<->char mapping (instead, the fbm_instr() speed).
There is one cache slot for the speed, and only two for the
byte<->char mapping (the first one for the start->offset,
and the second for the offset->offset+length, when talking
in substr() terms).
Code this hairy is bound to have hairy trolls hiding under it.
-- andreas
On Fri, Feb 24, 2006 at 04:46:09AM +0100, Andreas J. Koenig wrote:
Thanks. This confirms my suspicion that it was the UTF-8 caching code introduced with 5.8.1.
As part of my TPF grant I'm going to look at all this, so if no one else beats me to finding the specific bug in the existing code, it will be resolved in the next 3 months.
As a workaround, I think that re-assigning the value to itself before the lc or uc will clear the cache, and lc and uc will then give the correct answer.
Nicholas Clark
On Fri, 24 Feb 2006 10:13:15 +0000, Nicholas Clark <nick@ccl4.org> wrote:
Should the magic on TARG be reset? (Or don't use TARG?)
ucfirst() also has this bug, when ulen != tculen (see pp_ucfirst).

for (split /:/, shift||"a:ßbc") {
    utf8::upgrade($_);
    print "$_\t", substr(ucfirst($_), 0), "\t$_\n";
}
__END__
a	A	a
ßbc	S	ßbc

cf. The result on Perl 5.8.0
a	A	a
ßbc	Ssbc	ßbc

Postincrement $_++ is also buggy.

#!perl
use strict;
use warnings;
use Encode qw/_utf8_on/;
for (split /:/, shift||'a:bc') {
    _utf8_on($_);
    print "$_\t", substr($_++, 0), "\t$_\n";
}
__END__
a	a	b
bc	b	bd

cf. The result on Perl 5.8.0
a	a	b
bc	bc	bd

In contrast, preincrement ++$_ is good (pp_preinc doesn't use TARG).

#!perl
use strict;
use warnings;
use Encode qw/_utf8_on/;
for (split /:/, shift||'a:bc') {
    _utf8_on($_);
    print "$_\t", substr(++$_, 0), "\t$_\n";
}
__END__
a	b	b
bc	bd	bd
Regards SADAHIRO Tomoyuki
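The ucfirst and postincrement cases above can be checked mechanically in the same spirit. This is a rough sketch only: the helper names ucfirst_substr_ok and postinc_substr_ok are hypothetical, and the inputs are the ones used in the message above ("\x{DF}" is ß).

```perl
use strict;
use warnings;
use Encode qw/_utf8_on/;

# Hypothetical helper: true if substr(ucfirst(...), 0) matches a plain
# ucfirst(...) result for every input string.
sub ucfirst_substr_ok {
    for my $s (@_) {
        my $v = $s;
        utf8::upgrade($v);
        my $whole = ucfirst($v);            # result copied into a variable
        return 0 if substr(ucfirst($v), 0) ne $whole;
    }
    return 1;
}

# Hypothetical helper: true if substr($v++, 0) yields the value that $v
# held before the increment, for every input string.
sub postinc_substr_ok {
    for my $s (@_) {
        my $v = $s;
        _utf8_on($v);
        return 0 if substr($v++, 0) ne $s;
    }
    return 1;
}

print "ucfirst: ", (ucfirst_substr_ok("a", "\x{DF}bc") ? "ok" : "not ok"), "\n";
print "postinc: ", (postinc_substr_ok(qw/a bc/) ? "ok" : "not ok"), "\n";
```

Each helper runs its loop over a short string first and a longer one second, since the truncation reported above only shows on the later iteration.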
Nicholas Clark wrote:
Not to put too much pressure on Sadahiro-san but he has traditionally been able to fix all UTF-8 bugs extremely fast... Our brains must be similarly wired. Or, rather, his brain is wired the right way to *untangle* the spaghetti I left behind.
On Fri, 24 Feb 2006 19:15:53 +0200, Jarkko Hietaniemi <jhietaniemi@gmail.com> wrote:
However this bug against *other magics* seemed to exist even in perl 5.6.1, as shown below for m//g in scalar context.
SvSETMAGIC(sv) does not affect TARG if sv != TARG.
Regards, SADAHIRO Tomoyuki
The changes of pp.c are for pp_ucfirst\, pp_uc\, and pp_lc.
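The diagnosis above (a cache attached to a reused temporary that is never invalidated when the temporary is overwritten) can be illustrated with a toy model. This is only a conceptual sketch in plain Perl: the ToyTarg package, its fields, and run_model are all hypothetical and do not reflect how pp.c or the magic system is actually written.

```perl
use strict;
use warnings;

# Toy stand-in for a reused temporary with a cached character length.
# Every name here is hypothetical; this is not perl's internal code.
package ToyTarg;

sub new { return bless { buf => '', cached_len => undef }, shift }

# Overwrite the buffer. Without invalidate => 1 a stale cached length
# survives, mimicking a write that never triggers set-magic.
sub store {
    my ($self, $value, %opt) = @_;
    $self->{buf} = $value;
    $self->{cached_len} = undef if $opt{invalidate};
    return $self;
}

# A substr-like reader that trusts the cached length when one exists.
sub read_all {
    my ($self) = @_;
    $self->{cached_len} = length $self->{buf}
        unless defined $self->{cached_len};
    return substr($self->{buf}, 0, $self->{cached_len});
}

package main;

# Store 'a' then 'bc' into one reused temporary, reading after each store.
sub run_model {
    my ($invalidate) = @_;
    my $targ = ToyTarg->new;
    my @seen;
    push @seen, $targ->store($_, invalidate => $invalidate)->read_all
        for qw/a bc/;
    return "@seen";
}

print "without invalidation: ", run_model(0), "\n";   # a b
print "with invalidation:    ", run_model(1), "\n";   # a bc
```

Without invalidation the model yields 'a b', the same shape as the reported output; clearing the cache on every store yields 'a bc'.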
On Sat, Feb 25, 2006 at 06:16:45PM +0900, SADAHIRO Tomoyuki wrote:
Not to put too much pressure on Sadahiro-san but he has traditionally been able to fix all UTF-8 bugs extremely fast... Our brains must be
:-)
However this bug against *other magics* seemed to exist even in perl 5.6.1, as shown below for m//g in scalar context.
SvSETMAGIC(sv) does not affect TARG if sv != TARG.
D'oh!
Thanks, applied (change 27329)
Nicholas Clark
@iabyn - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#38619 (status was 'resolved')
Searchable as RT38619$