Closed p5pRT closed 21 years ago
The following snippet demonstrated corrupted utf8 strings for lc()\, uc() and ucfirst()\, but not lcfirst():
utf8::upgrade($_="t.est"); s/([a-z]+)/lc($1)/ge; print "lc: $_\n"; utf8::upgrade($_="t.est"); s/([a-z]+)/uc($1)/ge; print "uc: $_\n"; utf8::upgrade($_="t.est"); s/([a-z]+)/lcfirst($1)/ge; print "lcfirst: $_\n"; utf8::upgrade($_="t.est"); s/([a-z]+)/ucfirst($1)/ge; print "ucfirst: $_\n";
It seem to only happen when the following combination is met: - String with unicode flag on - Regular expression with captured variables - Variables manipulated by lc\, uc\, ucfirst or the equivalent \L escapes
Since it occurred in Mail::Header it effectively corrupted utf8-enabled mail applications writing in perl.
Thanks\, /Autrijus/
At 2002-10-28 00:11:02 -0000\, perlbug@perl.org wrote:
utf8::upgrade($_="t.est"); s/([a-z]+)/lc($1)/ge; print "lc: $_\n"; utf8::upgrade($_="t.est"); s/([a-z]+)/uc($1)/ge; print "uc: $_\n"; utf8::upgrade($_="t.est"); s/([a-z]+)/lcfirst($1)/ge; print "lcfirst: $_\n"; utf8::upgrade($_="t.est"); s/([a-z]+)/ucfirst($1)/ge; print "ucfirst: $_\n";
This is a horrible morass of interacting bugs. Let's look at one of the cases above in more detail:
utf8::upgrade($_="t.est.t.est"); s/(\w+)/uc($1)/eg; # $_ eq "T.E0B.T.EST"
Perl_pp_lc() is called twice each for "t" and "est" respectively. The first bug is that $1 does not have SVf_UTF8 set on the first call. This makes no difference with "t"\, but using "\N{greek:sigma}" instead causes more brokenness.
SVf_UTF8 is set the next time\, and we enter the while loop at pp.c:3452. The first time toUPPER_utf8() is called\, Perl_to_utf8() case decides to call swash_init()\, which wipes out the PV we're stepping over somewhere down the line (and causes the brokenness seen in the bug report). Once the swash is initialised\, however\, subsequent calls to pp_lc() work as expected. (The other three cases are essentially similar.)
The code in the bug report avoids breaking lcfirst only by a fortunate coincidence. I extended this to ucfirst in #18107\, but it doesn't fix any of the real problems.
Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first call to pp_lc? Fixing that would solve one class of bugs\, and help me to fix the other (and\, I suspect\, discover a third).
-- ams\, who foolishly saw "a nice opportunity to explore Unicode."
On Tue\, Nov 05\, 2002 at 02:36:46PM +0530\, Abhijit Menon-Sen wrote:
At 2002-10-28 00:11:02 -0000\, perlbug@perl.org wrote:
utf8::upgrade($_="t.est"); s/([a-z]+)/lc($1)/ge; print "lc: $_\n"; utf8::upgrade($_="t.est"); s/([a-z]+)/uc($1)/ge; print "uc: $_\n"; utf8::upgrade($_="t.est"); s/([a-z]+)/lcfirst($1)/ge; print "lcfirst: $_\n"; utf8::upgrade($_="t.est"); s/([a-z]+)/ucfirst($1)/ge; print "ucfirst: $_\n";
This is a horrible morass of interacting bugs. Let's look at one of the cases above in more detail:
utf8​::upgrade\($\_="t\.est\.t\.est"\); s/\(\\w\+\)/uc\($1\)/eg; \# $\_ eq "T\.E0B\.T\.EST"
Perl_pp_lc() is called twice each for "t" and "est" respectively. The first bug is that $1 does not have SVf_UTF8 set on the first call. This makes no difference with "t"\, but using "\N{greek:sigma}" instead causes more brokenness.
Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first call to pp_lc? Fixing that would solve one class of bugs\, and help me to fix the other (and\, I suspect\, discover a third).
OK. I can do that one in latin1\, which may make it easier to view for many readers (like me) living in 8 bit land:
perl -wle '$a = chr (201) . chr 256; chop $a; print $a; $b = $a; $c = $a; $b =~ s/./lc $&/e; print $b; $c =~ s/./lc $&/e; print $c; ' É É é
I think it is one I mailed to p5p a year or two ago\, but never got the tuity things organised to make a TODO test.
-- ams\, who foolishly saw "a nice opportunity to explore Unicode."
"Foolish mortal"
I think that there ought to be inspiration for a good spoof of Tolkien's verse about the ring here\, but I can't create one on the spur of the moment.
Nicholas Clark
Abhijit Menon-Sen \ams@​wiw\.org wrote: :Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first :call to pp_lc? Fixing that would solve one class of bugs\, and help me to :fix the other (and\, I suspect\, discover a third).
It is because we're working on $1\, which is magic\, but we're checking utfness before we check for magic.
Could you try some tests with the patch below? This at least starts to fix things\, and may be the whole story.
Hugo
On Tue 05 Nov 2002 14:30\, hv@crypt.org wrote:
Abhijit Menon-Sen \ams@​wiw\.org wrote: :Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first :call to pp_lc? Fixing that would solve one class of bugs\, and help me to :fix the other (and\, I suspect\, discover a third).
It is because we're working on $1\, which is magic\, but we're checking utfness before we check for magic.
Could you try some tests with the patch below? This at least starts to fix things\, and may be the whole story.
What kind of tests? A full smoke for HP-UX 11.00?
Hugo --- pp.c.old Sat Oct 19 16:21:46 2002 +++ pp.c Tue Nov 5 13:24:06 2002 @@ -3323\,12 +3323\,13 @@ register U8 *s; STRLEN slen;
+ SvGETMAGIC(sv); if (DO_UTF8(sv)) { U8 tmpbuf[UTF8_MAXLEN_UCLC+1]; STRLEN ulen; STRLEN tculen;
- s = (U8*)SvPV(sv\, slen); + s = (U8*)SvPV_nomg(sv\, slen); utf8_to_uvchr(s\, &ulen);
toTITLE\_utf8\(s\, tmpbuf\, &tculen\);
@@ -3342\,7 +3343\,7 @@ SETs(TARG); } else { - s = (U8*)SvPV_force(sv\, slen); + s = (U8*)SvPV_force_nomg(sv\, slen); Copy(tmpbuf\, s\, tculen\, U8); } } @@ -3350\,11 +3351\,11 @@ if (!SvPADTMP(sv) || SvREADONLY(sv)) { dTARGET; SvUTF8_off(TARG); /* decontaminate */ - sv_setsv(TARG\, sv); + sv_setsv_nomg(TARG\, sv); sv = TARG; SETs(sv); } - s = (U8*)SvPV_force(sv\, slen); + s = (U8*)SvPV_force_nomg(sv\, slen); if (*s) { if (IN_LOCALE_RUNTIME) { TAINT; @@ -3365\,8 +3366\,7 @@ *s = toUPPER(*s); } } - if (SvSMAGICAL(sv)) - mg_set(sv); + SvSETMAGIC(sv); RETURN; }
@@ -3377\,7 +3377\,8 @@ register U8 *s; STRLEN slen;
- if (DO_UTF8(sv) && (s = (U8*)SvPV(sv\, slen)) && slen && UTF8_IS_START(*s)) { + SvGETMAGIC(sv); + if (DO_UTF8(sv) && (s = (U8*)SvPV_nomg(sv\, slen)) && slen && UTF8_IS_START(*s)) { STRLEN ulen; U8 tmpbuf[UTF8_MAXLEN_UCLC+1]; U8 *tend; @@ -3396\,7 +3397\,7 @@ SETs(TARG); } else { - s = (U8*)SvPV_force(sv\, slen); + s = (U8*)SvPV_force_nomg(sv\, slen); Copy(tmpbuf\, s\, ulen\, U8); } } @@ -3404\,11 +3405\,11 @@ if (!SvPADTMP(sv) || SvREADONLY(sv)) { dTARGET; SvUTF8_off(TARG); /* decontaminate */ - sv_setsv(TARG\, sv); + sv_setsv_nomg(TARG\, sv); sv = TARG; SETs(sv); } - s = (U8*)SvPV_force(sv\, slen); + s = (U8*)SvPV_force_nomg(sv\, slen); if (*s) { if (IN_LOCALE_RUNTIME) { TAINT; @@ -3419\,8 +3420\,7 @@ *s = toLOWER(*s); } } - if (SvSMAGICAL(sv)) - mg_set(sv); + SvSETMAGIC(sv); RETURN; }
@@ -3431\,6 +3431\,7 @@ register U8 *s; STRLEN len;
+ SvGETMAGIC(sv); if (DO_UTF8(sv)) { dTARGET; STRLEN ulen; @@ -3438\,7 +3439\,7 @@ U8 *send; U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
- s = (U8*)SvPV(sv\,len); + s = (U8*)SvPV_nomg(sv\,len); if (!len) { SvUTF8_off(TARG); /* decontaminate */ sv_setpvn(TARG\, ""\, 0); @@ -3468\,11 +3469\,11 @@ if (!SvPADTMP(sv) || SvREADONLY(sv)) { dTARGET; SvUTF8_off(TARG); /* decontaminate */ - sv_setsv(TARG\, sv); + sv_setsv_nomg(TARG\, sv); sv = TARG; SETs(sv); } - s = (U8*)SvPV_force(sv\, len); + s = (U8*)SvPV_force_nomg(sv\, len); if (len) { register U8 *send = s + len;
@@ -3488\,8 +3489\,7 @@ } } } - if (SvSMAGICAL(sv)) - mg_set(sv); + SvSETMAGIC(sv); RETURN; }
@@ -3500\,6 +3500\,7 @@ register U8 *s; STRLEN len;
+ SvGETMAGIC(sv); if (DO_UTF8(sv)) { dTARGET; STRLEN ulen; @@ -3507\,7 +3508\,7 @@ U8 *send; U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
- s = (U8*)SvPV(sv\,len); + s = (U8*)SvPV_nomg(sv\,len); if (!len) { SvUTF8_off(TARG); /* decontaminate */ sv_setpvn(TARG\, ""\, 0); @@ -3554\,12 +3555\,12 @@ if (!SvPADTMP(sv) || SvREADONLY(sv)) { dTARGET; SvUTF8_off(TARG); /* decontaminate */ - sv_setsv(TARG\, sv); + sv_setsv_nomg(TARG\, sv); sv = TARG; SETs(sv); }
- s = (U8*)SvPV_force(sv\, len); + s = (U8*)SvPV_force_nomg(sv\, len); if (len) { register U8 *send = s + len;
@@ -3575\,8 +3576\,7 @@ } } } - if (SvSMAGICAL(sv)) - mg_set(sv); + SvSETMAGIC(sv); RETURN; }
-- H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/) using perl-5.6.1\, 5.8.0 & 633 on HP-UX 10.20 & 11.00\, AIX 4.2\, AIX 4.3\, WinNT 4\, Win2K pro & WinCE 2.11. Smoking perl CORE: smokers@perl.org http://archives.develooper.com/daily-build@perl.org/ perl-qa@perl.org send smoke reports to: smokers-reports@perl.org\, QA: http://qa.perl.org
On Tue\, Nov 05\, 2002 at 02:38:49PM +0100\, H.Merijn Brand wrote:
On Tue 05 Nov 2002 14:30\, hv@crypt.org wrote:
Abhijit Menon-Sen \ams@​wiw\.org wrote: :Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first :call to pp_lc? Fixing that would solve one class of bugs\, and help me to :fix the other (and\, I suspect\, discover a third).
It is because we're working on $1\, which is magic\, but we're checking utfness before we check for magic.
Could you try some tests with the patch below? This at least starts to fix things\, and may be the whole story.
What kind of tests? A full smoke for HP-UX 11.00?
This hunk failed:
--- pp.c.old Sat Oct 19 16:21:46 2002 +++ pp.c Tue Nov 5 13:24:06 2002 @@ -3323\,12 +3323\,13 @@ register U8 *s; STRLEN slen;
+ SvGETMAGIC(sv); if (DO_UTF8(sv)) { U8 tmpbuf[UTF8_MAXLEN_UCLC+1]; STRLEN ulen; STRLEN tculen;
- s = (U8*)SvPV(sv\, slen); + s = (U8*)SvPV_nomg(sv\, slen); utf8_to_uvchr(s\, &ulen);
toTITLE\_utf8\(s\, tmpbuf\, &tculen\);
..../perl -Ilib -wle '$a = chr (201) . chr 256; chop $a; print $a; $b = $a; $c = $a; $b =~ +s/./lc $&/e; print $b; $c =~ s/./lc $&/e; print $c; ' É é é
Well\, that one works now\, as do all variants except ucfirst. I can't see how to apply the rejected hunk.
Nicholas Clark
\hv@​crypt\.org writes:
Abhijit Menon-Sen \ams@​wiw\.org wrote: :Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first :call to pp_lc? Fixing that would solve one class of bugs\, and help me to :fix the other (and\, I suspect\, discover a third).
It is because we're working on $1\, which is magic\, but we're checking utfness before we check for magic.
Could you try some tests with the patch below? This at least starts to fix things\, and may be the whole story.
Hugo --- pp.c.old Sat Oct 19 16:21:46 2002 +++ pp.c Tue Nov 5 13:24:06 2002 @@ -3323\,12 +3323\,13 @@ register U8 *s; STRLEN slen;
+ SvGETMAGIC(sv); if (DO_UTF8(sv)) { U8 tmpbuf[UTF8_MAXLEN_UCLC+1]; STRLEN ulen; STRLEN tculen;
- s = (U8*)SvPV(sv\, slen); + s = (U8*)SvPV_nomg(sv\, slen);
......
And presumably doing one SvGETMAGIC at the top should be faster than re-doing it in various spots below :-)
-- Nick Ing-Simmons http://www.ni-s.u-net.com/
At 2002-11-05 13:30:00 +0000\, hv@crypt.org wrote:
Could you try some tests with the patch below? This at least starts to fix things\, and may be the whole story.
Now I feel stupid.
#18107 (my earlier patch) and #18109 (your patch with one hunk fixed) fix all four cases in the original bug report\, but the swash_init() problem remains:
$_ = "\N{greek:sigma}foo.bar"; s/(\w+)/uc($1)/ge; print "broken\n" unless $_ eq "\N{greek:Sigma}FOO.BAR";
-- ams
On Tue\, Nov 05\, 2002 at 07:55:23PM +0530\, Abhijit Menon-Sen wrote:
At 2002-11-05 13:30:00 +0000\, hv@crypt.org wrote:
Could you try some tests with the patch below? This at least starts to fix things\, and may be the whole story.
Now I feel stupid.
#18107 (my earlier patch) and #18109 (your patch with one hunk fixed) fix all four cases in the original bug report\, but the swash_init() problem remains:
$\_ = "\\N\{greek​:sigma\}foo\.bar"; s/\(\\w\+\)/uc\($1\)/ge; print "broken\\n" unless $\_ eq "\\N\{greek​:Sigma\}FOO\.BAR";
Bah. You need full unicode for your example. That's tantamount to cheating Can I do it in mere 8 bit? No. I can do better than that. Here we go (wrong) in glorious 7 bit ASCII:
perl5.00503 -lwe '$_ = "foo.bar" . chr 256; chop $_; s/(\w+)/uc($1)/ge; print $_' FOO.BAR
..../perl -Ilib -lwe '$_ = "foo.bar" . chr 256; chop $_; s/(\w+)/uc($1)/ge; print $_' FOO.B0B
Who is this B0B character? :-(
Nicholas Clark
PS I'm not really using chr 256. It's just a quantum effect until it meets its anticharacter\, chr -256\, and annihilates. :-) [AKA I need a way to set the utf8 flag on an ASCII scalar]
Abhijit Menon-Sen \ams@​wiw\.org wrote: :At 2002-11-05 13:30:00 +0000\, hv@crypt.org wrote: :> Could you try some tests with the patch below? This at least starts to :> fix things\, and may be the whole story. : :Now I feel stupid.
Happens to the best of us. :)
:#18107 (my earlier patch) and #18109 (your patch with one hunk fixed)
That one hunk needed fixing only because #18107 changed the context\, right?
:fix all four cases in the original bug report\, but the swash_init() :problem remains: : : $_ = "\N{greek:sigma}foo.bar"; : s/(\w+)/uc($1)/ge; : print "broken\n" unless $_ eq "\N{greek:Sigma}FOO.BAR";
I'll try to look at that one later.
Hugo
"H.Merijn Brand" \h\.m\.brand@​hccnet\.nl wrote: :> Could you try some tests with the patch below? This at least starts to :> fix things\, and may be the whole story. : :What kind of tests? A full smoke for HP-UX 11.00?
No\, I just meant 'try some lc/uc/s{}{}e combinations'. I'm sure the smoke will out in any case.
Hugo
\hv@​crypt\.org writes:
Abhijit Menon-Sen \ams@​wiw\.org wrote: :At 2002-11-05 13:30:00 +0000\, hv@crypt.org wrote: :> Could you try some tests with the patch below? This at least starts to :> fix things\, and may be the whole story. : :Now I feel stupid.
Happens to the best of us. :)
With the swash code it happens to _all_ of us.
For the uninitiated SWASHes are hashes that represent property data from the Unicode tables. The hashes are consulted by the C code to to upper/lower case and similar things. BUT the hashes are (lazily) populated by running perl code which uses regular expressions to poke about in the Unicode data files. So we re-enter lots of things (like the RE engine) which were not really expecting it. So in the swash code we attempt to save state - call the perl code - and then restore state.
-- Nick Ing-Simmons http://www.ni-s.u-net.com/
On Tue\, 05 Nov 2002 16:41:21 +0000\, Nick Ing-Simmons \nick\.ing\-simmons@​elixent\.com said:
> \hv@​crypt\.org writes:
Abhijit Menon-Sen \ams@​wiw\.org wrote: :At 2002-11-05 13:30:00 +0000\, hv@crypt.org wrote: :> Could you try some tests with the patch below? This at least starts to :> fix things\, and may be the whole story. : :Now I feel stupid.
Happens to the best of us. :)
> With the swash code it happens to _all_ of us.
> For the uninitiated SWASHes are hashes that represent property data > from the Unicode tables. The hashes are consulted by the C code to > to upper/lower case and similar things. BUT the hashes are (lazily) populated > by running perl code which uses regular expressions to poke about > in the Unicode data files. So we re-enter lots of things (like the RE engine) > which were not really expecting it. So in the swash code we attempt to save > state - call the perl code - and then restore state.
While you are at it\, may I ask you what you think about this:
when code is running in the debugger and triggers that swash code\, it may happen that things become really sloooow. I've seen this but couldn't write a simple test case\, so I delayed talking about it. Do you see any possibility to disable the debugger for *this* perl code?
Thanks for any insight\, -- andreas
At 2002-11-05 14:44:28 +0000\, nick@ccl4.org wrote:
../perl -Ilib -lwe '$_ = "foo.bar" . chr 256; chop $_; s/(\w+)/uc($1)/ge; print $_' FOO.B0B
Who is this B0B character? :-(
Nicholas\, say goodbye to B0B.
The appended patch teaches save_re_context() to save the existing $1..$n SV's\, and fixes all the problems I've seen so far from the original bug report. I'll apply it shortly\, unless somebody has a better suggestion for how to fetch the SV's to be saved.
(By the way\, I liked p4d2p. Can p4genpatch be made to read from stdin?)
-- ams
==== //depot/perl/regcomp.c#311 - /home/ams/build/perl/current/regcomp.c ==== @@ -5070\,6 +5070\,23 @@ SAVEVPTR(PL_reg_curpm); /* from regexec.c */ SAVEI32(PL_regnpar); /* () count. */ SAVEI32(PL_regsize); /* from regexec.c */ + + { + /* Save $1..$n (#18107: UTF-8 s/(\w+)/uc($1)/e); AMS 20021106. */ + int i; + GV *mgv; + REGEXP *rx; + char digits[16]; + + if (PL_curpm && (rx = PM_GETRE(PL_curpm))) { + for (i = 1; i \<= rx->nparens; i++) { + sprintf(digits\, "%lu"\, i); + if ((mgv = gv_fetchpv(digits\, FALSE\, SVt_PV))) + save_scalar(mgv); + } + } + } + #ifdef DEBUGGING SAVEPPTR(PL_reg_starttry); /* from regexec.c */ #endif
==== //depot/perl/t/op/lc.t#10 - /home/ams/build/perl/current/t/op/lc.t ==== @@ -1\,6 +1\,6 @@ #!./perl
-print "1..51\n"; +print "1..55\n";
my $test = 1;
@@ -136\,3 +136\,18 @@ ok(uc("\x{1C5}") eq "\x{1C4}"\, "U+01C5 uc is U+01C4"); ok(uc("\x{1C6}") eq "\x{1C4}"\, "U+01C6 uc is U+01C4\, too");
+# #18107: A host of bugs involving [ul]c{\,first}. AMS 20021106 +$a = "\x{3c3}foo.bar"; # \x{3c3} == GREEK SMALL LETTER SIGMA. +$b = "\x{3a3}FOO.BAR"; # \x{3a3} == GREEK CAPITAL LETTER SIGMA. + +($c = $b) =~ s/(\w+)/lc($1)/ge; +ok($c eq $a\, "Using s///e to change case."); + +($c = $a) =~ s/(\w+)/uc($1)/ge; +ok($c eq $b\, "Using s///e to change case."); + +($c = $b) =~ s/(\w+)/lcfirst($1)/ge; +ok($c eq "\x{3c3}FOO.bAR"\, "Using s///e to change case."); + +($c = $a) =~ s/(\w+)/ucfirst($1)/ge; +ok($c eq "\x{3a3}foo.Bar"\, "Using s///e to change case.");
On Wed\, 6 Nov 2002 19:38:11 +0530\, Abhijit Menon-Sen \ams@​wiw\.org said:
> (By the way\, I liked p4d2p. Can p4genpatch be made to read from stdin?)
That would do it without re-indentation (to keep the patch short). Not well tested.
* accepts more than one patchnumber on commandline * reads patchnumbers from STDIN\, one per line
-- andreas
Abhijit Menon-Sen \ams@​wiw\.org wrote: :The appended patch teaches save_re_context() to save the existing $1..$n :SV's\, and fixes all the problems I've seen so far from the original bug :report. I'll apply it shortly\, unless somebody has a better suggestion :for how to fetch the SV's to be saved.
If I correctly understand what this is doing\, this feels like an approach that is in danger of slowing things down substantially. Is there really no cheaper way to fix this?
Can you give a detailed trace of what's going wrong in a sample testcase? Is the problem specific to swash creation\, or does this also affect other intruded perl code\, such as re_eval?
Hugo
On Tue\, 05 Nov 2002 13:30:00 +0000\, hv@crypt.org wrote:
Abhijit Menon-Sen \ams@​wiw\.org wrote: :Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first :call to pp_lc? Fixing that would solve one class of bugs\, and help me to :fix the other (and\, I suspect\, discover a third).
It is because we're working on $1\, which is magic\, but we're checking utfness before we check for magic.
Could you try some tests with the patch below? This at least starts to fix things\, and may be the whole story.
Nope :( I think SvPV has to come before DO_UTF8 for AMAGIC to work:
~/bleadperl/perl $./perl -Ilib -MTest::More=no_plan -wle' use overload q{""}=>sub{v256}; for $try ( lc(v256)\, lc(bless{}) ) { is(("v".join"."\,map ord\, split //\, $try)\, "v257") }' ok 1 not ok 2 # Failed test (-e at line 2) # got: 'v256' # expected: 'v257' 1..2 # Looks like you failed 1 tests of 2.
Also\, it appears that the utf8 branches for [ul]c(first)? are doing setmagic on the source\, not the target (the non-utf8 branch does sv = TARG except when suborning the source).
And [ul]cfirst are assuming (in the SvPADTMP true branch) that the upper/lower-cased first character will have exactly the same byte length as the original. Is this a valid assumtion?
I'm going to try to come up with first a test patch and then a code patch.
sthoenna@efn.org (Yitzchak Scott-Thoennes) wrote: :On Tue\, 05 Nov 2002 13:30:00 +0000\, hv@crypt.org wrote: :>Abhijit Menon-Sen \ams@​wiw\.org wrote: :>:Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first :>:call to pp_lc? Fixing that would solve one class of bugs\, and help me to :>:fix the other (and\, I suspect\, discover a third). :> :>It is because we're working on $1\, which is magic\, but we're checking :>utfness before we check for magic. :> :>Could you try some tests with the patch below? This at least starts to :>fix things\, and may be the whole story. : :Nope :( I think SvPV has to come before DO_UTF8 for AMAGIC to work:
Ah\, yes.
:Also\, it appears that the utf8 branches for [ul]c(first)? are doing :setmagic on the source\, not the target (the non-utf8 branch does :sv = TARG except when suborning the source).
Yup. Though my patch didn't change that.
:And [ul]cfirst are assuming (in the SvPADTMP true branch) that :the upper/lower-cased first character will have exactly the same :byte length as the original. Is this a valid assumtion?
I don't think so. Hmm\, I wonder if there is also any possibility that the locale+nonutf8 branch would end up needing a utf8 result.
:I'm going to try to come up with first a test patch and then a :code patch.
Thank you. In case it is helpful\, I attach below the patch that I think would fix ucfirst; as far as I can tell the C\< utf8_to_uvchr(tmpbuf\, 0); > does nothing\, and the C\< SvUTF8_off(TARG); /* decontaminate */ > should not be needed before an sv_setsv(). This does at least pass existing tests and the moral equivalent of your example overloading test.
Hugo
Yitzchak Scott-Thoennes \sthoenna@​efn\.org writes:
And [ul]cfirst are assuming (in the SvPADTMP true branch) that the upper/lower-cased first character will have exactly the same byte length as the original. Is this a valid assumtion?
No.
-- Nick Ing-Simmons http://www.ni-s.u-net.com/
\hv@​crypt\.org writes:
I don't think so. Hmm\, I wonder if there is also any possibility that the locale+nonutf8 branch would end up needing a utf8 result.
There are problem cases. iso-8859-1 0xff (y diaresis) does not have upper case version anywhere "near" it. Upper-case of accented characters in some locales (French?) drops the accent and makes it shorter in UTF-8.
-- Nick Ing-Simmons http://www.ni-s.u-net.com/
On Fri\, 08 Nov 2002 12:59:49 +0000\, hv@crypt.org wrote:
sthoenna@efn.org (Yitzchak Scott-Thoennes) wrote: :On Tue\, 05 Nov 2002 13:30:00 +0000\, hv@crypt.org wrote: :>Abhijit Menon-Sen \ams@​wiw\.org wrote: :>:Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first :>:call to pp_lc? Fixing that would solve one class of bugs\, and help me to :>:fix the other (and\, I suspect\, discover a third). :> :>It is because we're working on $1\, which is magic\, but we're checking :>utfness before we check for magic. :> :>Could you try some tests with the patch below? This at least starts to :>fix things\, and may be the whole story. : :Nope :( I think SvPV has to come before DO_UTF8 for AMAGIC to work:
Ah\, yes.
:Also\, it appears that the utf8 branches for [ul]c(first)? are doing :setmagic on the source\, not the target (the non-utf8 branch does :sv = TARG except when suborning the source).
Yup. Though my patch didn't change that.
Sorry\, I didn't mean to imply it did. Been that way since at least 5.6.1.
:And [ul]cfirst are assuming (in the SvPADTMP true branch) that :the upper/lower-cased first character will have exactly the same :byte length as the original. Is this a valid assumtion?
I don't think so. Hmm\, I wonder if there is also any possibility that the locale+nonutf8 branch would end up needing a utf8 result.
Sounded like NI-S thought so. What needs to be done? Will toupper(255) ever really return >255?
:I'm going to try to come up with first a test patch and then a :code patch.
Thank you. In case it is helpful\, I attach below the patch that I think would fix ucfirst; as far as I can tell the C\< utf8_to_uvchr(tmpbuf\, 0); > does nothing\, and the C\< SvUTF8_off(TARG); /* decontaminate */ > should not be needed before an sv_setsv(). This does at least pass existing tests and the moral equivalent of your example overloading test.
Thanks. I'm still working on a test patch but have been busy with other things. Some further notes:
I'm not sure we should even be doing plain sv_setsv. That's passing through any IV or NV values:
~/bleadperl/perl $./perl -Ilib -MDevel::Peek -MScalar::Util=/./ -we'
Dump dualvar(123\,"abc")' SV = PVNV(0x1034e4a8) at 0x10261164 REFCNT = 1 FLAGS = (TEMP\,IOK\,POK\,pIOK\,pPOK) IV = 123 NV = 0 PV = 0x102989f8 "abc"\0 CUR = 3 LEN = 4
or in pure perl:
~/bleadperl/perl $./miniperl -wle'print length(~ucfirst !1)' 20
(which I think should print 0)
On Tue\, 12 Nov 2002\, Yitzchak Scott-Thoennes wrote:
~/bleadperl/perl $./perl -Ilib -MDevel::Peek -MScalar::Util=/./ -we'
Dump dualvar(123\,"abc")' SV = PVNV(0x1034e4a8) at 0x10261164 REFCNT = 1 FLAGS = (TEMP\,IOK\,POK\,pIOK\,pPOK) IV = 123 NV = 0 PV = 0x102989f8 "abc"\0 CUR = 3 LEN = 4
I need to work on my cut-and-pasting skills. That should have been: ~/bleadperl/perl $./perl -Ilib -MDevel::Peek -MScalar::Util=/./ -we'
Dump ucfirst dualvar(123\,"abc")' SV = PVNV(0x1034e4c8) at 0x1027391c REFCNT = 1 FLAGS = (PADTMP\,IOK\,POK\,pIOK\,pPOK) IV = 123 NV = 0 PV = 0x102989f8 "Abc"\0 CUR = 3 LEN = 4
At 2002-11-06 16:32:25 +0000\, hv@crypt.org wrote:
The appended patch teaches save_re_context() to save the existing $1..$n SV's [...]
Is there really no cheaper way to fix this?
I don't see any cheaper way (and as Jarkko has shown\, this way doesn't seem unacceptably expensive). One could skimp and save a few selected bits instead of the entire scalar\, but I don't think that's worth it.
Can you give a detailed trace of what's going wrong in a sample testcase?
Perl_to_utf8_case (which is called by toLOWER_utf8 and siblings) calls swatch_init() if the swatch it needs isn't loaded. swatch_init() calls the SWASHNEW method\, which wipes out $1 somewhere in the Perl code.
#0 Perl_call_sv (sv=0x8180454\, flags=64) at perl.c:1760 #1 Perl_call_method (methname=0x8120ad2 "SWASHNEW"\, flags=0) at perl.c:1743 #2 Perl_swash_init (pkg=0x81209e4 "utf8"\, name=0x8120a7e "ToUpper"\, listsv=0x81302d8\, minbits=4\, none=0) at utf8.c:1549 #3 Perl_to_utf8_case (p=0x817af90 "foo"\, ustrp=0xbffff5a4 "\030"\, lenp=0xbffff59c\, swashp=0x8130be4\, normal=0x8120a7e "ToUpper"\, special=0x8120a6c "utf8::ToSpecUpper") at utf8.c:1349 #4 Perl_to_utf8_upper (p=0x817af90 "foo"\, ustrp=0xbffff5a4 "\030"\, lenp=0xbffff59c) at utf8.c:1441 #5 Perl_pp_uc () at pp.c:3454
(From perl -le '$_="foo.bar".chr(256);chop;s/([^.]+)/uc($1)/ge;print')
-- ams
I don't see any cheaper way (and as Jarkko has shown\, this way doesn't seem unacceptably expensive). One could skimp and save a few selected bits instead of the entire scalar\, but I don't think that's worth it.
I tried the speed in two ways:
(1) run op/regexp.t and op/pat.100 both 10 times on an almost zero-load quad-Alpha box: the user+system time difference was within plus minus 1.4% (the pat.t was slower\, but regexp.t was faster with the patch\, but I think the tests were so fast that all that was measuring noise).
(2) Since neither of the above tests really stress test UTF-8 substitutions\, I ran also this (on an empty single-cpu x86 box) (the repetition counts were hand-crafted to give something in the 3..30 seconds range)
use Benchmark; sub inita { $a = (("abc" x $_[0]) . " ") x $_[1]; utf8::upgrade($a) } inita( 1\, 1000); timethis( 50\, '$a =~ s/(\w+)/\U$1/g'); inita( 1\, 10000); timethis( 1\, '$a =~ s/(\w+)/\U$1/g'); inita( 10\, 100); timethis( 200\, '$a =~ s/(\w+)/\U$1/g'); inita( 10\, 1000); timethis( 10\, '$a =~ s/(\w+)/\U$1/g'); inita( 10\, 10000); timethis( 1\, '$a =~ s/(\w+)/\U$1/g'); inita( 100\, 10); timethis( 200\, '$a =~ s/(\w+)/\U$1/g'); inita( 100\, 100); timethis( 10\, '$a =~ s/(\w+)/\U$1/g'); inita( 100\, 1000); timethis( 5\, '$a =~ s/(\w+)/\U$1/g'); inita( 1000\, 1); timethis( 200\, '$a =~ s/(\w+)/\U$1/g'); inita( 10000\, 1); timethis( 20\, '$a =~ s/(\w+)/\U$1/g');
The results of 2+2 runs cleaned up:
a) b) c) d) with1 with2 out1 out2 1 1000 4000 50 3.90 3.89 3.97 3.95 1 10000 40000 1 3.44 3.44 3.45 3.45 10 100 3100 200 5.72 5.71 6.11 6.11 10 1000 31000 10 5.04 5.03 5.25 5.25 10 10000 310000 1 27.01 27.19 27.11 27.46 100 10 3010 200 4.98 4.97 5.38 5.37 100 100 30100 10 2.73 2.73 2.96 2.96 100 1000 301000 5 26.75 26.70 26.36 27.95 1000 1 3001 200 4.86 4.85 5.30 5.31 10000 1 30001 20 4.96 4.96 5.45 5.46
a) number of "abc":s in one (\w+) run b) number of "abc... ":in $a c) total size of $a d) number of repetitions
Yes\, quite consistently *with* the patch things get *faster*. No\, don't ask me *why*.
-- Jarkko Hietaniemi \jhi@​iki\.fi http://www.iki.fi/jhi/ "There is this special biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen
Since no better ideas have come along\, I'm applying Abhijit's patch the on maint-5.8/perl branch and marking the issue as resolved.
@jhi - Status changed from 'new' to 'resolved'
Abhijit Menon-Sen \ams@​wiw\.org wrote: :At 2002-11-05 14:44:28 +0000\, nick@ccl4.org wrote: :> :> ../perl -Ilib -lwe '$_ = "foo.bar" . chr 256; chop $_; s/(\w+)/uc($1)/ge; print $_' :> FOO.B0B :> :> Who is this B0B character? :-( : :Nicholas\, say goodbye to B0B. : :The appended patch teaches save_re_context() to save the existing $1..$n :SV's\, and fixes all the problems I've seen so far from the original bug :report. I'll apply it shortly\, unless somebody has a better suggestion :for how to fetch the SV's to be saved.
Thanks\, belatedly applied as #18266.
Hugo
@jhi - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#18107 (status was 'resolved')
Searchable as RT18107$