Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.94k stars 554 forks source link

lc(), uc() and ucfirst() broken inside utf8 regex #6038

Closed p5pRT closed 21 years ago

p5pRT commented 21 years ago

Migrated from rt.perl.org#18107 (status was 'resolved')

Searchable as RT18107$

p5pRT commented 21 years ago

From @audreyt

Created by autrijus@autrijus.org

The following snippet demonstrated corrupted utf8 strings for lc()\, uc() and ucfirst()\, but not lcfirst()​:

utf8​::upgrade($_="t.est"); s/([a-z]+)/lc($1)/ge; print "lc​: $_\n"; utf8​::upgrade($_="t.est"); s/([a-z]+)/uc($1)/ge; print "uc​: $_\n"; utf8​::upgrade($_="t.est"); s/([a-z]+)/lcfirst($1)/ge; print "lcfirst​: $_\n"; utf8​::upgrade($_="t.est"); s/([a-z]+)/ucfirst($1)/ge; print "ucfirst​: $_\n";

It seem to only happen when the following combination is met​: - String with unicode flag on - Regular expression with captured variables - Variables manipulated by lc\, uc\, ucfirst or the equivalent \L escapes

Since it occurred in Mail​::Header it effectively corrupted utf8-enabled mail applications writing in perl.

Thanks\, /Autrijus/

Perl Info ``` Flags: category=core severity=high This perlbug was built using Perl v5.8.0 - Mon Jun 10 19:48:03 CST 2002 It is being executed now by Perl v5.8.0 - Sat Oct 5 11:17:02 GMT 2002. Site configuration information for perl v5.8.0: Configured by root at Sat Oct 5 11:17:02 GMT 2002. Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration: Platform: osname=freebsd, osvers=4.7-rc, archname=i386-freebsd uname='freebsd gohan17.freebsd.org 4.7-rc freebsd 4.7-rc #0: sun apr 1 02:34:56 pst 2002 asami@bento.freebsd.org:usrsrcsyscompilebento i386 ' config_args='-sde -Dprefix=/usr/local -Darchlib=/usr/local/lib/perl5/5.8.0/mach -Dprivlib=/usr/local/lib/perl5/5.8.0 -Dman3dir=/usr/local/lib/perl5/5.8.0/man/man3 -Dsitearch=/usr/local/lib/perl5/site_perl/5.8.0/mach -Dsitelib=/usr/local/lib/perl5/site_perl/5.8.0 -Ui_malloc -Ui_iconv -Uinstallusrbinperl -Dccflags=-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.0/BSDPAN" -Ui_gdbm -Dusemymalloc=n' hint=recommended, useposix=true, d_sigaction=define usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef useperlio=define d_sfio=undef uselargefiles=define usesocks=undef use64bitint=undef use64bitall=undef uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.0/BSDPAN" -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -I/usr/local/include', optimize='-O -pipe ', cppflags='-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.0/BSDPAN" -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -I/usr/local/include' ccversion='', gccversion='2.95.4 20020320 [FreeBSD]', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=4, prototype=define Linker and Libraries: ld='cc', ldflags ='-Wl,-E -L/usr/local/lib' libpth=/usr/lib /usr/local/lib libs=-lm -lc -lcrypt -lutil perllibs=-lm -lc -lcrypt -lutil libc=, so=so, useshrplib=false, libperl=libperl.a gnulibc_version='' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' ' cccdlflags='-DPIC -fPIC', lddlflags='-shared -L/usr/local/lib' Locally applied patches: DEVEL17060 @INC for perl v5.8.0: /usr/local/lib/perl5/site_perl/5.8.0/mach /usr/local/lib/perl5/site_perl/5.8.0 /usr/local/lib/perl5/site_perl /usr/local/lib/perl5/5.8.0/BSDPAN /usr/local/lib/perl5/5.8.0/mach /usr/local/lib/perl5/5.8.0 . Environment for perl v5.8.0: HOME=/home/autrijus LANG (unset) LANGUAGE (unset) LC_CTYPE=en_US.ISO_8859-1 LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/autrijus/bin:/usr/X11R6/bin:/usr/games:/bin:/usr/bin:/usr/local/bin:/usr/local/sbin:/sbin:/usr/sbin:/home/autrijus/Komodo-2.0 PERL_BADLANG (unset) SHELL=/bin/tcsh ```
p5pRT commented 21 years ago

From ams@wiw.org

At 2002-10-28 00​:11​:02 -0000\, perlbug@​perl.org wrote​:

utf8​::upgrade($_="t.est"); s/([a-z]+)/lc($1)/ge; print "lc​: $_\n"; utf8​::upgrade($_="t.est"); s/([a-z]+)/uc($1)/ge; print "uc​: $_\n"; utf8​::upgrade($_="t.est"); s/([a-z]+)/lcfirst($1)/ge; print "lcfirst​: $_\n"; utf8​::upgrade($_="t.est"); s/([a-z]+)/ucfirst($1)/ge; print "ucfirst​: $_\n";

This is a horrible morass of interacting bugs. Let's look at one of the cases above in more detail​:

  utf8​::upgrade($_="t.est.t.est");   s/(\w+)/uc($1)/eg; # $_ eq "T.E0B.T.EST"

Perl_pp_lc() is called twice each for "t" and "est" respectively. The first bug is that $1 does not have SVf_UTF8 set on the first call. This makes no difference with "t"\, but using "\N{greek​:sigma}" instead causes more brokenness.

SVf_UTF8 is set the next time\, and we enter the while loop at pp.c​:3452. The first time toUPPER_utf8() is called\, Perl_to_utf8() case decides to call swash_init()\, which wipes out the PV we're stepping over somewhere down the line (and causes the brokenness seen in the bug report). Once the swash is initialised\, however\, subsequent calls to pp_lc() work as expected. (The other three cases are essentially similar.)

The code in the bug report avoids breaking lcfirst only by a fortunate coincidence. I extended this to ucfirst in #18107\, but it doesn't fix any of the real problems.

Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first call to pp_lc? Fixing that would solve one class of bugs\, and help me to fix the other (and\, I suspect\, discover a third).

-- ams\, who foolishly saw "a nice opportunity to explore Unicode."

p5pRT commented 21 years ago

From @nwc10

On Tue\, Nov 05\, 2002 at 02​:36​:46PM +0530\, Abhijit Menon-Sen wrote​:

At 2002-10-28 00​:11​:02 -0000\, perlbug@​perl.org wrote​:

utf8​::upgrade($_="t.est"); s/([a-z]+)/lc($1)/ge; print "lc​: $_\n"; utf8​::upgrade($_="t.est"); s/([a-z]+)/uc($1)/ge; print "uc​: $_\n"; utf8​::upgrade($_="t.est"); s/([a-z]+)/lcfirst($1)/ge; print "lcfirst​: $_\n"; utf8​::upgrade($_="t.est"); s/([a-z]+)/ucfirst($1)/ge; print "ucfirst​: $_\n";

This is a horrible morass of interacting bugs. Let's look at one of the cases above in more detail​:

utf8​::upgrade\($\_="t\.est\.t\.est"\);
s/\(\\w\+\)/uc\($1\)/eg;                  \# $\_ eq "T\.E0B\.T\.EST"

Perl_pp_lc() is called twice each for "t" and "est" respectively. The first bug is that $1 does not have SVf_UTF8 set on the first call. This makes no difference with "t"\, but using "\N{greek​:sigma}" instead causes more brokenness.

Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first call to pp_lc? Fixing that would solve one class of bugs\, and help me to fix the other (and\, I suspect\, discover a third).

OK. I can do that one in latin1\, which may make it easier to view for many readers (like me) living in 8 bit land​:

perl -wle '$a = chr (201) . chr 256; chop $a; print $a; $b = $a; $c = $a; $b =~ s/./lc $&/e; print $b; $c =~ s/./lc $&/e; print $c; ' É É é

I think it is one I mailed to p5p a year or two ago\, but never got the tuity things organised to make a TODO test.

-- ams\, who foolishly saw "a nice opportunity to explore Unicode."

"Foolish mortal"

I think that there ought to be inspiration for a good spoof of Tolkien's verse about the ring here\, but I can't create one on the spur of the moment.

Nicholas Clark

p5pRT commented 21 years ago

From @hvds

Abhijit Menon-Sen \ams@​wiw\.org wrote​: :Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first :call to pp_lc? Fixing that would solve one class of bugs\, and help me to :fix the other (and\, I suspect\, discover a third).

It is because we're working on $1\, which is magic\, but we're checking utfness before we check for magic.

Could you try some tests with the patch below? This at least starts to fix things\, and may be the whole story.

Hugo

Inline Patch ```diff --- pp.c.old Sat Oct 19 16:21:46 2002 +++ pp.c Tue Nov 5 13:24:06 2002 @@ -3323,12 +3323,13 @@ register U8 *s; STRLEN slen; + SvGETMAGIC(sv); if (DO_UTF8(sv)) { U8 tmpbuf[UTF8_MAXLEN_UCLC+1]; STRLEN ulen; STRLEN tculen; - s = (U8*)SvPV(sv, slen); + s = (U8*)SvPV_nomg(sv, slen); utf8_to_uvchr(s, &ulen); toTITLE_utf8(s, tmpbuf, &tculen); @@ -3342,7 +3343,7 @@ SETs(TARG); } else { - s = (U8*)SvPV_force(sv, slen); + s = (U8*)SvPV_force_nomg(sv, slen); Copy(tmpbuf, s, tculen, U8); } } @@ -3350,11 +3351,11 @@ if (!SvPADTMP(sv) || SvREADONLY(sv)) { dTARGET; SvUTF8_off(TARG); /* decontaminate */ - sv_setsv(TARG, sv); + sv_setsv_nomg(TARG, sv); sv = TARG; SETs(sv); } - s = (U8*)SvPV_force(sv, slen); + s = (U8*)SvPV_force_nomg(sv, slen); if (*s) { if (IN_LOCALE_RUNTIME) { TAINT; @@ -3365,8 +3366,7 @@ *s = toUPPER(*s); } } - if (SvSMAGICAL(sv)) - mg_set(sv); + SvSETMAGIC(sv); RETURN; } @@ -3377,7 +3377,8 @@ register U8 *s; STRLEN slen; - if (DO_UTF8(sv) && (s = (U8*)SvPV(sv, slen)) && slen && UTF8_IS_START(*s)) { + SvGETMAGIC(sv); + if (DO_UTF8(sv) && (s = (U8*)SvPV_nomg(sv, slen)) && slen && UTF8_IS_START(*s)) { STRLEN ulen; U8 tmpbuf[UTF8_MAXLEN_UCLC+1]; U8 *tend; @@ -3396,7 +3397,7 @@ SETs(TARG); } else { - s = (U8*)SvPV_force(sv, slen); + s = (U8*)SvPV_force_nomg(sv, slen); Copy(tmpbuf, s, ulen, U8); } } @@ -3404,11 +3405,11 @@ if (!SvPADTMP(sv) || SvREADONLY(sv)) { dTARGET; SvUTF8_off(TARG); /* decontaminate */ - sv_setsv(TARG, sv); + sv_setsv_nomg(TARG, sv); sv = TARG; SETs(sv); } - s = (U8*)SvPV_force(sv, slen); + s = (U8*)SvPV_force_nomg(sv, slen); if (*s) { if (IN_LOCALE_RUNTIME) { TAINT; @@ -3419,8 +3420,7 @@ *s = toLOWER(*s); } } - if (SvSMAGICAL(sv)) - mg_set(sv); + SvSETMAGIC(sv); RETURN; } @@ -3431,6 +3431,7 @@ register U8 *s; STRLEN len; + SvGETMAGIC(sv); if (DO_UTF8(sv)) { dTARGET; STRLEN ulen; @@ -3438,7 +3439,7 @@ U8 *send; U8 tmpbuf[UTF8_MAXLEN_UCLC+1]; - s = (U8*)SvPV(sv,len); + s = (U8*)SvPV_nomg(sv,len); if (!len) { SvUTF8_off(TARG); /* decontaminate */ sv_setpvn(TARG, "", 0); @@ -3468,11 +3469,11 @@ if (!SvPADTMP(sv) || SvREADONLY(sv)) { dTARGET; SvUTF8_off(TARG); /* decontaminate */ - sv_setsv(TARG, sv); + sv_setsv_nomg(TARG, sv); sv = TARG; SETs(sv); } - s = (U8*)SvPV_force(sv, len); + s = (U8*)SvPV_force_nomg(sv, len); if (len) { register U8 *send = s + len; @@ -3488,8 +3489,7 @@ } } } - if (SvSMAGICAL(sv)) - mg_set(sv); + SvSETMAGIC(sv); RETURN; } @@ -3500,6 +3500,7 @@ register U8 *s; STRLEN len; + SvGETMAGIC(sv); if (DO_UTF8(sv)) { dTARGET; STRLEN ulen; @@ -3507,7 +3508,7 @@ U8 *send; U8 tmpbuf[UTF8_MAXLEN_UCLC+1]; - s = (U8*)SvPV(sv,len); + s = (U8*)SvPV_nomg(sv,len); if (!len) { SvUTF8_off(TARG); /* decontaminate */ sv_setpvn(TARG, "", 0); @@ -3554,12 +3555,12 @@ if (!SvPADTMP(sv) || SvREADONLY(sv)) { dTARGET; SvUTF8_off(TARG); /* decontaminate */ - sv_setsv(TARG, sv); + sv_setsv_nomg(TARG, sv); sv = TARG; SETs(sv); } - s = (U8*)SvPV_force(sv, len); + s = (U8*)SvPV_force_nomg(sv, len); if (len) { register U8 *send = s + len; @@ -3575,8 +3576,7 @@ } } } - if (SvSMAGICAL(sv)) - mg_set(sv); + SvSETMAGIC(sv); RETURN; } ```
p5pRT commented 21 years ago

From @tux

On Tue 05 Nov 2002 14​:30\, hv@​crypt.org wrote​:

Abhijit Menon-Sen \ams@​wiw\.org wrote​: :Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first :call to pp_lc? Fixing that would solve one class of bugs\, and help me to :fix the other (and\, I suspect\, discover a third).

It is because we're working on $1\, which is magic\, but we're checking utfness before we check for magic.

Could you try some tests with the patch below? This at least starts to fix things\, and may be the whole story.

What kind of tests? A full smoke for HP-UX 11.00?

Hugo --- pp.c.old Sat Oct 19 16​:21​:46 2002 +++ pp.c Tue Nov 5 13​:24​:06 2002 @​@​ -3323\,12 +3323\,13 @​@​ register U8 *s; STRLEN slen;

+ SvGETMAGIC(sv); if (DO_UTF8(sv)) { U8 tmpbuf[UTF8_MAXLEN_UCLC+1]; STRLEN ulen; STRLEN tculen;

- s = (U8*)SvPV(sv\, slen); + s = (U8*)SvPV_nomg(sv\, slen); utf8_to_uvchr(s\, &ulen);

 toTITLE\_utf8\(s\, tmpbuf\, &tculen\);

@​@​ -3342\,7 +3343\,7 @​@​ SETs(TARG); } else { - s = (U8*)SvPV_force(sv\, slen); + s = (U8*)SvPV_force_nomg(sv\, slen); Copy(tmpbuf\, s\, tculen\, U8); } } @​@​ -3350\,11 +3351\,11 @​@​ if (!SvPADTMP(sv) || SvREADONLY(sv)) { dTARGET; SvUTF8_off(TARG); /* decontaminate */ - sv_setsv(TARG\, sv); + sv_setsv_nomg(TARG\, sv); sv = TARG; SETs(sv); } - s = (U8*)SvPV_force(sv\, slen); + s = (U8*)SvPV_force_nomg(sv\, slen); if (*s) { if (IN_LOCALE_RUNTIME) { TAINT; @​@​ -3365\,8 +3366\,7 @​@​ *s = toUPPER(*s); } } - if (SvSMAGICAL(sv)) - mg_set(sv); + SvSETMAGIC(sv); RETURN; }

@​@​ -3377\,7 +3377\,8 @​@​ register U8 *s; STRLEN slen;

- if (DO_UTF8(sv) && (s = (U8*)SvPV(sv\, slen)) && slen && UTF8_IS_START(*s)) { + SvGETMAGIC(sv); + if (DO_UTF8(sv) && (s = (U8*)SvPV_nomg(sv\, slen)) && slen && UTF8_IS_START(*s)) { STRLEN ulen; U8 tmpbuf[UTF8_MAXLEN_UCLC+1]; U8 *tend; @​@​ -3396\,7 +3397\,7 @​@​ SETs(TARG); } else { - s = (U8*)SvPV_force(sv\, slen); + s = (U8*)SvPV_force_nomg(sv\, slen); Copy(tmpbuf\, s\, ulen\, U8); } } @​@​ -3404\,11 +3405\,11 @​@​ if (!SvPADTMP(sv) || SvREADONLY(sv)) { dTARGET; SvUTF8_off(TARG); /* decontaminate */ - sv_setsv(TARG\, sv); + sv_setsv_nomg(TARG\, sv); sv = TARG; SETs(sv); } - s = (U8*)SvPV_force(sv\, slen); + s = (U8*)SvPV_force_nomg(sv\, slen); if (*s) { if (IN_LOCALE_RUNTIME) { TAINT; @​@​ -3419\,8 +3420\,7 @​@​ *s = toLOWER(*s); } } - if (SvSMAGICAL(sv)) - mg_set(sv); + SvSETMAGIC(sv); RETURN; }

@​@​ -3431\,6 +3431\,7 @​@​ register U8 *s; STRLEN len;

+ SvGETMAGIC(sv); if (DO_UTF8(sv)) { dTARGET; STRLEN ulen; @​@​ -3438\,7 +3439\,7 @​@​ U8 *send; U8 tmpbuf[UTF8_MAXLEN_UCLC+1];

- s = (U8*)SvPV(sv\,len); + s = (U8*)SvPV_nomg(sv\,len); if (!len) { SvUTF8_off(TARG); /* decontaminate */ sv_setpvn(TARG\, ""\, 0); @​@​ -3468\,11 +3469\,11 @​@​ if (!SvPADTMP(sv) || SvREADONLY(sv)) { dTARGET; SvUTF8_off(TARG); /* decontaminate */ - sv_setsv(TARG\, sv); + sv_setsv_nomg(TARG\, sv); sv = TARG; SETs(sv); } - s = (U8*)SvPV_force(sv\, len); + s = (U8*)SvPV_force_nomg(sv\, len); if (len) { register U8 *send = s + len;

@​@​ -3488\,8 +3489\,7 @​@​ } } } - if (SvSMAGICAL(sv)) - mg_set(sv); + SvSETMAGIC(sv); RETURN; }

@​@​ -3500\,6 +3500\,7 @​@​ register U8 *s; STRLEN len;

+ SvGETMAGIC(sv); if (DO_UTF8(sv)) { dTARGET; STRLEN ulen; @​@​ -3507\,7 +3508\,7 @​@​ U8 *send; U8 tmpbuf[UTF8_MAXLEN_UCLC+1];

- s = (U8*)SvPV(sv\,len); + s = (U8*)SvPV_nomg(sv\,len); if (!len) { SvUTF8_off(TARG); /* decontaminate */ sv_setpvn(TARG\, ""\, 0); @​@​ -3554\,12 +3555\,12 @​@​ if (!SvPADTMP(sv) || SvREADONLY(sv)) { dTARGET; SvUTF8_off(TARG); /* decontaminate */ - sv_setsv(TARG\, sv); + sv_setsv_nomg(TARG\, sv); sv = TARG; SETs(sv); }

- s = (U8*)SvPV_force(sv\, len); + s = (U8*)SvPV_force_nomg(sv\, len); if (len) { register U8 *send = s + len;

@​@​ -3575\,8 +3576\,7 @​@​ } } } - if (SvSMAGICAL(sv)) - mg_set(sv); + SvSETMAGIC(sv); RETURN; }

-- H.Merijn Brand Amsterdam Perl Mongers (http​://amsterdam.pm.org/) using perl-5.6.1\, 5.8.0 & 633 on HP-UX 10.20 & 11.00\, AIX 4.2\, AIX 4.3\,   WinNT 4\, Win2K pro & WinCE 2.11. Smoking perl CORE​: smokers@​perl.org http​://archives.develooper.com/daily-build@​perl.org/ perl-qa@​perl.org send smoke reports to​: smokers-reports@​perl.org\, QA​: http​://qa.perl.org

p5pRT commented 21 years ago

From @nwc10

On Tue\, Nov 05\, 2002 at 02​:38​:49PM +0100\, H.Merijn Brand wrote​:

On Tue 05 Nov 2002 14​:30\, hv@​crypt.org wrote​:

Abhijit Menon-Sen \ams@​wiw\.org wrote​: :Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first :call to pp_lc? Fixing that would solve one class of bugs\, and help me to :fix the other (and\, I suspect\, discover a third).

It is because we're working on $1\, which is magic\, but we're checking utfness before we check for magic.

Could you try some tests with the patch below? This at least starts to fix things\, and may be the whole story.

What kind of tests? A full smoke for HP-UX 11.00?

This hunk failed​:

--- pp.c.old Sat Oct 19 16​:21​:46 2002 +++ pp.c Tue Nov 5 13​:24​:06 2002 @​@​ -3323\,12 +3323\,13 @​@​ register U8 *s; STRLEN slen;

+ SvGETMAGIC(sv); if (DO_UTF8(sv)) { U8 tmpbuf[UTF8_MAXLEN_UCLC+1]; STRLEN ulen; STRLEN tculen;

- s = (U8*)SvPV(sv\, slen); + s = (U8*)SvPV_nomg(sv\, slen); utf8_to_uvchr(s\, &ulen);

 toTITLE\_utf8\(s\, tmpbuf\, &tculen\);

..../perl -Ilib -wle '$a = chr (201) . chr 256; chop $a; print $a; $b = $a; $c = $a; $b =~ +s/./lc $&/e; print $b; $c =~ s/./lc $&/e; print $c; ' É é é

Well\, that one works now\, as do all variants except ucfirst. I can't see how to apply the rejected hunk.

Nicholas Clark

p5pRT commented 21 years ago

From nick.ing-simmons@elixent.com

\hv@​crypt\.org writes​:

Abhijit Menon-Sen \ams@​wiw\.org wrote​: ​:Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first ​:call to pp_lc? Fixing that would solve one class of bugs\, and help me to ​:fix the other (and\, I suspect\, discover a third).

It is because we're working on $1\, which is magic\, but we're checking utfness before we check for magic.

Could you try some tests with the patch below? This at least starts to fix things\, and may be the whole story.

Hugo --- pp.c.old Sat Oct 19 16​:21​:46 2002 +++ pp.c Tue Nov 5 13​:24​:06 2002 @​@​ -3323\,12 +3323\,13 @​@​ register U8 *s; STRLEN slen;

+ SvGETMAGIC(sv); if (DO_UTF8(sv)) { U8 tmpbuf[UTF8_MAXLEN_UCLC+1]; STRLEN ulen; STRLEN tculen;

- s = (U8*)SvPV(sv\, slen); + s = (U8*)SvPV_nomg(sv\, slen);

......

And presumably doing one SvGETMAGIC at the top should be faster than re-doing it in various spots below :-)

-- Nick Ing-Simmons http​://www.ni-s.u-net.com/

p5pRT commented 21 years ago

From ams@wiw.org

At 2002-11-05 13​:30​:00 +0000\, hv@​crypt.org wrote​:

Could you try some tests with the patch below? This at least starts to fix things\, and may be the whole story.

Now I feel stupid.

#18107 (my earlier patch) and #18109 (your patch with one hunk fixed) fix all four cases in the original bug report\, but the swash_init() problem remains​:

  $_ = "\N{greek​:sigma}foo.bar";   s/(\w+)/uc($1)/ge;   print "broken\n" unless $_ eq "\N{greek​:Sigma}FOO.BAR";

-- ams

p5pRT commented 21 years ago

From @nwc10

On Tue\, Nov 05\, 2002 at 07​:55​:23PM +0530\, Abhijit Menon-Sen wrote​:

At 2002-11-05 13​:30​:00 +0000\, hv@​crypt.org wrote​:

Could you try some tests with the patch below? This at least starts to fix things\, and may be the whole story.

Now I feel stupid.

#18107 (my earlier patch) and #18109 (your patch with one hunk fixed) fix all four cases in the original bug report\, but the swash_init() problem remains​:

$\_ = "\\N\{greek​:sigma\}foo\.bar";
s/\(\\w\+\)/uc\($1\)/ge;
print "broken\\n" unless $\_ eq "\\N\{greek​:Sigma\}FOO\.BAR";

Bah. You need full unicode for your example. That's tantamount to cheating Can I do it in mere 8 bit? No. I can do better than that. Here we go (wrong) in glorious 7 bit ASCII​:

perl5.00503 -lwe '$_ = "foo.bar" . chr 256; chop $_; s/(\w+)/uc($1)/ge; print $_' FOO.BAR

..../perl -Ilib -lwe '$_ = "foo.bar" . chr 256; chop $_; s/(\w+)/uc($1)/ge; print $_' FOO.B0B

Who is this B0B character? :-(

Nicholas Clark

PS I'm not really using chr 256. It's just a quantum effect until it meets   its anticharacter\, chr -256\, and annihilates. :-) [AKA I need a way to   set the utf8 flag on an ASCII scalar]

p5pRT commented 21 years ago

From @hvds

Abhijit Menon-Sen \ams@​wiw\.org wrote​: :At 2002-11-05 13​:30​:00 +0000\, hv@​crypt.org wrote​: :> Could you try some tests with the patch below? This at least starts to :> fix things\, and may be the whole story. : :Now I feel stupid.

Happens to the best of us. :)

:#18107 (my earlier patch) and #18109 (your patch with one hunk fixed)

That one hunk needed fixing only because #18107 changed the context\, right?

:fix all four cases in the original bug report\, but the swash_init() :problem remains​: : : $_ = "\N{greek​:sigma}foo.bar"; : s/(\w+)/uc($1)/ge; : print "broken\n" unless $_ eq "\N{greek​:Sigma}FOO.BAR";

I'll try to look at that one later.

Hugo

p5pRT commented 21 years ago

From @hvds

"H.Merijn Brand" \h\.m\.brand@​hccnet\.nl wrote​: :> Could you try some tests with the patch below? This at least starts to :> fix things\, and may be the whole story. : :What kind of tests? A full smoke for HP-UX 11.00?

No\, I just meant 'try some lc/uc/s{}{}e combinations'. I'm sure the smoke will out in any case.

Hugo

p5pRT commented 21 years ago

From nick.ing-simmons@elixent.com

\hv@​crypt\.org writes​:

Abhijit Menon-Sen \ams@​wiw\.org wrote​: ​:At 2002-11-05 13​:30​:00 +0000\, hv@​crypt.org wrote​: ​:> Could you try some tests with the patch below? This at least starts to ​:> fix things\, and may be the whole story. ​: ​:Now I feel stupid.

Happens to the best of us. :)

With the swash code it happens to _all_ of us.

For the uninitiated SWASHes are hashes that represent property data from the Unicode tables. The hashes are consulted by the C code to to upper/lower case and similar things. BUT the hashes are (lazily) populated by running perl code which uses regular expressions to poke about in the Unicode data files. So we re-enter lots of things (like the RE engine) which were not really expecting it. So in the swash code we attempt to save state - call the perl code - and then restore state.

-- Nick Ing-Simmons http​://www.ni-s.u-net.com/

p5pRT commented 21 years ago

From @andk

On Tue\, 05 Nov 2002 16​:41​:21 +0000\, Nick Ing-Simmons \nick\.ing\-simmons@​elixent\.com said​:

  > \hv@​crypt\.org writes​:

Abhijit Menon-Sen \ams@​wiw\.org wrote​: :At 2002-11-05 13​:30​:00 +0000\, hv@​crypt.org wrote​: :> Could you try some tests with the patch below? This at least starts to :> fix things\, and may be the whole story. : :Now I feel stupid.

Happens to the best of us. :)

  > With the swash code it happens to _all_ of us.

  > For the uninitiated SWASHes are hashes that represent property data   > from the Unicode tables. The hashes are consulted by the C code to   > to upper/lower case and similar things. BUT the hashes are (lazily) populated   > by running perl code which uses regular expressions to poke about   > in the Unicode data files. So we re-enter lots of things (like the RE engine)   > which were not really expecting it. So in the swash code we attempt to save   > state - call the perl code - and then restore state.

While you are at it\, may I ask you what you think about this​:

when code is running in the debugger and triggers that swash code\, it may happen that things become really sloooow. I've seen this but couldn't write a simple test case\, so I delayed talking about it. Do you see any possibility to disable the debugger for *this* perl code?

Thanks for any insight\, -- andreas

p5pRT commented 21 years ago

From ams@wiw.org

At 2002-11-05 14​:44​:28 +0000\, nick@​ccl4.org wrote​:

../perl -Ilib -lwe '$_ = "foo.bar" . chr 256; chop $_; s/(\w+)/uc($1)/ge; print $_' FOO.B0B

Who is this B0B character? :-(

Nicholas\, say goodbye to B0B.

The appended patch teaches save_re_context() to save the existing $1..$n SV's\, and fixes all the problems I've seen so far from the original bug report. I'll apply it shortly\, unless somebody has a better suggestion for how to fetch the SV's to be saved.

(By the way\, I liked p4d2p. Can p4genpatch be made to read from stdin?)

-- ams

==== //depot/perl/regcomp.c#311 - /home/ams/build/perl/current/regcomp.c ==== @​@​ -5070\,6 +5070\,23 @​@​   SAVEVPTR(PL_reg_curpm); /* from regexec.c */   SAVEI32(PL_regnpar); /* () count. */   SAVEI32(PL_regsize); /* from regexec.c */ + + { + /* Save $1..$n (#18107​: UTF-8 s/(\w+)/uc($1)/e); AMS 20021106. */ + int i; + GV *mgv; + REGEXP *rx; + char digits[16]; + + if (PL_curpm && (rx = PM_GETRE(PL_curpm))) { + for (i = 1; i \<= rx->nparens; i++) { + sprintf(digits\, "%lu"\, i); + if ((mgv = gv_fetchpv(digits\, FALSE\, SVt_PV))) + save_scalar(mgv); + } + } + } + #ifdef DEBUGGING   SAVEPPTR(PL_reg_starttry); /* from regexec.c */ #endif

==== //depot/perl/t/op/lc.t#10 - /home/ams/build/perl/current/t/op/lc.t ==== @​@​ -1\,6 +1\,6 @​@​ #!./perl

-print "1..51\n"; +print "1..55\n";

my $test = 1;

@​@​ -136\,3 +136\,18 @​@​ ok(uc("\x{1C5}") eq "\x{1C4}"\, "U+01C5 uc is U+01C4"); ok(uc("\x{1C6}") eq "\x{1C4}"\, "U+01C6 uc is U+01C4\, too");

+# #18107​: A host of bugs involving [ul]c{\,first}. AMS 20021106 +$a = "\x{3c3}foo.bar"; # \x{3c3} == GREEK SMALL LETTER SIGMA. +$b = "\x{3a3}FOO.BAR"; # \x{3a3} == GREEK CAPITAL LETTER SIGMA. + +($c = $b) =~ s/(\w+)/lc($1)/ge; +ok($c eq $a\, "Using s///e to change case."); + +($c = $a) =~ s/(\w+)/uc($1)/ge; +ok($c eq $b\, "Using s///e to change case."); + +($c = $b) =~ s/(\w+)/lcfirst($1)/ge; +ok($c eq "\x{3c3}FOO.bAR"\, "Using s///e to change case."); + +($c = $a) =~ s/(\w+)/ucfirst($1)/ge; +ok($c eq "\x{3a3}foo.Bar"\, "Using s///e to change case.");

p5pRT commented 21 years ago

From @andk

On Wed\, 6 Nov 2002 19​:38​:11 +0530\, Abhijit Menon-Sen \ams@&#8203;wiw\.org said​:

  > (By the way\, I liked p4d2p. Can p4genpatch be made to read from stdin?)

That would do it without re-indentation (to keep the patch short). Not well tested.

* accepts more than one patchnumber on commandline * reads patchnumbers from STDIN\, one per line

-- andreas

Inline Patch ```diff --- p4genpatch@18113 Wed Nov 6 16:01:13 2002 +++ p4genpatch Wed Nov 6 15:58:26 2002 @@ -20,7 +20,7 @@ sub Usage (); $0 =~ s|^.*[\\/]||; -my $VERSION = '0.05'; +my $VERSION = '0.06'; my $TOPDIR = cwd(); my @P4opt; our %OPT = ( "d" => "u", b => "//depot/perl", "D" => "diff" ); @@ -28,13 +28,17 @@ GetOptions(\%OPT, "b=s", "p=s", "d=s", "D=s", "h", "v", "V") or die Usage; print Usage and exit if $OPT{h}; print "$VERSION\n" and exit if $OPT{V}; -die Usage unless @ARGV == 1 && $ARGV[0] =~ /^\d+$/; -my $CHANGE = shift; +unless (@ARGV) { + @ARGV = <>; + chomp @ARGV; +} for my $p4opt (qw(p)) { push @P4opt, "-$p4opt $OPT{$p4opt}" if $OPT{$p4opt}; } +while (my $CHANGE = shift @ARGV) { +die "Illegal CHANGE[$CHANGE]\n", Usage unless $CHANGE =~ /^\d+$/; my $system = "p4 @P4opt describe -s $CHANGE |"; open my $p4, $system or die "Could not run $system"; my @action; @@ -136,6 +140,7 @@ } } print "End of Patch.\n"; +} my($tz_offset); sub correctmtime ($$$) { @@ -155,7 +160,7 @@ } sub Usage () { - qq{Usage: $0 [OPTIONS] patchnumber + qq{Usage: $0 [OPTIONS] [patchnumber]... -p host:port p4 port (e.g. myhost:1666) -d diffopt option to pass to diff(1) @@ -171,6 +176,8 @@ temporary directory with sensible names and sensible modification times and composes a patch to STDOUT using external diff command. Requires repository access. + +If no files are specified, reads patchnumbers from stdin. Examples: perl $0 12345 | gzip -c > 12345.gz ```
p5pRT commented 21 years ago

From @hvds

Abhijit Menon-Sen \ams@&#8203;wiw\.org wrote​: :The appended patch teaches save_re_context() to save the existing $1..$n :SV's\, and fixes all the problems I've seen so far from the original bug :report. I'll apply it shortly\, unless somebody has a better suggestion :for how to fetch the SV's to be saved.

If I correctly understand what this is doing\, this feels like an approach that is in danger of slowing things down substantially. Is there really no cheaper way to fix this?

Can you give a detailed trace of what's going wrong in a sample testcase? Is the problem specific to swash creation\, or does this also affect other intruded perl code\, such as re_eval?

Hugo

p5pRT commented 21 years ago

From @ysth

On Tue\, 05 Nov 2002 13​:30​:00 +0000\, hv@​crypt.org wrote​:

Abhijit Menon-Sen \ams@&#8203;wiw\.org wrote​: ​:Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first ​:call to pp_lc? Fixing that would solve one class of bugs\, and help me to ​:fix the other (and\, I suspect\, discover a third).

It is because we're working on $1\, which is magic\, but we're checking utfness before we check for magic.

Could you try some tests with the patch below? This at least starts to fix things\, and may be the whole story.

Nope :( I think SvPV has to come before DO_UTF8 for AMAGIC to work​:

~/bleadperl/perl $./perl -Ilib -MTest​::More=no_plan -wle' use overload q{""}=>sub{v256}; for $try ( lc(v256)\, lc(bless{}) ) {   is(("v".join"."\,map ord\, split //\, $try)\, "v257") }' ok 1 not ok 2 # Failed test (-e at line 2) # got​: 'v256' # expected​: 'v257' 1..2 # Looks like you failed 1 tests of 2.

Also\, it appears that the utf8 branches for [ul]c(first)? are doing setmagic on the source\, not the target (the non-utf8 branch does sv = TARG except when suborning the source).

And [ul]cfirst are assuming (in the SvPADTMP true branch) that the upper/lower-cased first character will have exactly the same byte length as the original. Is this a valid assumtion?

I'm going to try to come up with first a test patch and then a code patch.

p5pRT commented 21 years ago

From @hvds

sthoenna@​efn.org (Yitzchak Scott-Thoennes) wrote​: :On Tue\, 05 Nov 2002 13​:30​:00 +0000\, hv@​crypt.org wrote​: :>Abhijit Menon-Sen \ams@&#8203;wiw\.org wrote​: :>​:Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first :>​:call to pp_lc? Fixing that would solve one class of bugs\, and help me to :>​:fix the other (and\, I suspect\, discover a third). :> :>It is because we're working on $1\, which is magic\, but we're checking :>utfness before we check for magic. :> :>Could you try some tests with the patch below? This at least starts to :>fix things\, and may be the whole story. : :Nope :( I think SvPV has to come before DO_UTF8 for AMAGIC to work​:

Ah\, yes.

:Also\, it appears that the utf8 branches for [ul]c(first)? are doing :setmagic on the source\, not the target (the non-utf8 branch does :sv = TARG except when suborning the source).

Yup. Though my patch didn't change that.

:And [ul]cfirst are assuming (in the SvPADTMP true branch) that :the upper/lower-cased first character will have exactly the same :byte length as the original. Is this a valid assumtion?

I don't think so. Hmm\, I wonder if there is also any possibility that the locale+nonutf8 branch would end up needing a utf8 result.

:I'm going to try to come up with first a test patch and then a :code patch.

Thank you. In case it is helpful\, I attach below the patch that I think would fix ucfirst; as far as I can tell the C\< utf8_to_uvchr(tmpbuf\, 0); > does nothing\, and the C\< SvUTF8_off(TARG); /* decontaminate */ > should not be needed before an sv_setsv(). This does at least pass existing tests and the moral equivalent of your example overloading test.

Hugo

Inline Patch ```diff --- pp.c.old Tue Nov 5 13:24:06 2002 +++ pp.c Fri Nov 8 12:55:33 2002 @@ -3320,40 +3320,36 @@ { dSP; SV *sv = TOPs; - register U8 *s; STRLEN slen; + register U8 *s = (U8*)SvPV(sv, slen); /* handles GMAGIC, AMAGIC */ - SvGETMAGIC(sv); if (DO_UTF8(sv)) { U8 tmpbuf[UTF8_MAXLEN_UCLC+1]; STRLEN ulen; STRLEN tculen; - s = (U8*)SvPV_nomg(sv, slen); utf8_to_uvchr(s, &ulen); - toTITLE_utf8(s, tmpbuf, &tculen); - utf8_to_uvchr(tmpbuf, 0); - if (!SvPADTMP(sv) || SvREADONLY(sv)) { + if (!SvPADTMP(sv) || SvREADONLY(sv) || ulen != tculen) { dTARGET; sv_setpvn(TARG, (char*)tmpbuf, tculen); sv_catpvn(TARG, (char*)(s + ulen), slen - ulen); SvUTF8_on(TARG); SETs(TARG); + sv = TARG; } else { s = (U8*)SvPV_force_nomg(sv, slen); - Copy(tmpbuf, s, tculen, U8); + Copy(tmpbuf, s, ulen, U8); } } else { if (!SvPADTMP(sv) || SvREADONLY(sv)) { dTARGET; - SvUTF8_off(TARG); /* decontaminate */ sv_setsv_nomg(TARG, sv); + SETs(TARG); sv = TARG; - SETs(sv); } s = (U8*)SvPV_force_nomg(sv, slen); if (*s) { ```
p5pRT commented 21 years ago

From nick.ing-simmons@elixent.com

Yitzchak Scott-Thoennes \sthoenna@&#8203;efn\.org writes​:

And [ul]cfirst are assuming (in the SvPADTMP true branch) that the upper/lower-cased first character will have exactly the same byte length as the original. Is this a valid assumtion?

No.

-- Nick Ing-Simmons http​://www.ni-s.u-net.com/

p5pRT commented 21 years ago

From nick.ing-simmons@elixent.com

\hv@&#8203;crypt\.org writes​:

I don't think so. Hmm\, I wonder if there is also any possibility that the locale+nonutf8 branch would end up needing a utf8 result.

There are problem cases. iso-8859-1 0xff (y diaresis) does not have upper case version anywhere "near" it. Upper-case of accented characters in some locales (French?) drops the accent and makes it shorter in UTF-8.

-- Nick Ing-Simmons http​://www.ni-s.u-net.com/

p5pRT commented 21 years ago

From @ysth

On Fri\, 08 Nov 2002 12​:59​:49 +0000\, hv@​crypt.org wrote​:

sthoenna@​efn.org (Yitzchak Scott-Thoennes) wrote​: ​:On Tue\, 05 Nov 2002 13​:30​:00 +0000\, hv@​crypt.org wrote​: ​:>Abhijit Menon-Sen \ams@&#8203;wiw\.org wrote​: ​:>​:Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first ​:>​:call to pp_lc? Fixing that would solve one class of bugs\, and help me to ​:>​:fix the other (and\, I suspect\, discover a third). ​:> ​:>It is because we're working on $1\, which is magic\, but we're checking ​:>utfness before we check for magic. ​:> ​:>Could you try some tests with the patch below? This at least starts to ​:>fix things\, and may be the whole story. ​: ​:Nope :( I think SvPV has to come before DO_UTF8 for AMAGIC to work​:

Ah\, yes.

​:Also\, it appears that the utf8 branches for [ul]c(first)? are doing ​:setmagic on the source\, not the target (the non-utf8 branch does ​:sv = TARG except when suborning the source).

Yup. Though my patch didn't change that.

Sorry\, I didn't mean to imply it did. Been that way since at least 5.6.1.

​:And [ul]cfirst are assuming (in the SvPADTMP true branch) that ​:the upper/lower-cased first character will have exactly the same ​:byte length as the original. Is this a valid assumtion?

I don't think so. Hmm\, I wonder if there is also any possibility that the locale+nonutf8 branch would end up needing a utf8 result.

Sounded like NI-S thought so. What needs to be done? Will toupper(255) ever really return >255?

​:I'm going to try to come up with first a test patch and then a ​:code patch.

Thank you. In case it is helpful\, I attach below the patch that I think would fix ucfirst; as far as I can tell the C\< utf8_to_uvchr(tmpbuf\, 0); > does nothing\, and the C\< SvUTF8_off(TARG); /* decontaminate */ > should not be needed before an sv_setsv(). This does at least pass existing tests and the moral equivalent of your example overloading test.

Thanks. I'm still working on a test patch but have been busy with other things. Some further notes​:

I'm not sure we should even be doing plain sv_setsv. That's passing through any IV or NV values​:

~/bleadperl/perl $./perl -Ilib -MDevel​::Peek -MScalar​::Util=/./ -we'

Dump dualvar(123\,"abc")' SV = PVNV(0x1034e4a8) at 0x10261164   REFCNT = 1   FLAGS = (TEMP\,IOK\,POK\,pIOK\,pPOK)   IV = 123   NV = 0   PV = 0x102989f8 "abc"\0   CUR = 3   LEN = 4

or in pure perl​:

~/bleadperl/perl $./miniperl -wle'print length(~ucfirst !1)' 20

(which I think should print 0)

p5pRT commented 21 years ago

From @ysth

On Tue\, 12 Nov 2002\, Yitzchak Scott-Thoennes wrote​:

~/bleadperl/perl $./perl -Ilib -MDevel​::Peek -MScalar​::Util=/./ -we'

Dump dualvar(123\,"abc")' SV = PVNV(0x1034e4a8) at 0x10261164 REFCNT = 1 FLAGS = (TEMP\,IOK\,POK\,pIOK\,pPOK) IV = 123 NV = 0 PV = 0x102989f8 "abc"\0 CUR = 3 LEN = 4

I need to work on my cut-and-pasting skills. That should have been​: ~/bleadperl/perl $./perl -Ilib -MDevel​::Peek -MScalar​::Util=/./ -we'

Dump ucfirst dualvar(123\,"abc")' SV = PVNV(0x1034e4c8) at 0x1027391c   REFCNT = 1   FLAGS = (PADTMP\,IOK\,POK\,pIOK\,pPOK)   IV = 123   NV = 0   PV = 0x102989f8 "Abc"\0   CUR = 3   LEN = 4

p5pRT commented 21 years ago

From ams@wiw.org

At 2002-11-06 16​:32​:25 +0000\, hv@​crypt.org wrote​:

The appended patch teaches save_re_context() to save the existing $1..$n SV's [...]

Is there really no cheaper way to fix this?

I don't see any cheaper way (and as Jarkko has shown\, this way doesn't seem unacceptably expensive). One could skimp and save a few selected bits instead of the entire scalar\, but I don't think that's worth it.

Can you give a detailed trace of what's going wrong in a sample testcase?

Perl_to_utf8_case (which is called by toLOWER_utf8 and siblings) calls swatch_init() if the swatch it needs isn't loaded. swatch_init() calls the SWASHNEW method\, which wipes out $1 somewhere in the Perl code.

#0 Perl_call_sv (sv=0x8180454\, flags=64) at perl.c​:1760 #1 Perl_call_method (methname=0x8120ad2 "SWASHNEW"\, flags=0)   at perl.c​:1743 #2 Perl_swash_init (pkg=0x81209e4 "utf8"\, name=0x8120a7e "ToUpper"\,   listsv=0x81302d8\, minbits=4\, none=0) at utf8.c​:1549 #3 Perl_to_utf8_case (p=0x817af90 "foo"\, ustrp=0xbffff5a4 "\030"\,   lenp=0xbffff59c\, swashp=0x8130be4\, normal=0x8120a7e "ToUpper"\,   special=0x8120a6c "utf8​::ToSpecUpper") at utf8.c​:1349 #4 Perl_to_utf8_upper (p=0x817af90 "foo"\, ustrp=0xbffff5a4 "\030"\,   lenp=0xbffff59c) at utf8.c​:1441 #5 Perl_pp_uc () at pp.c​:3454

(From perl -le '$_="foo.bar".chr(256);chop;s/([^.]+)/uc($1)/ge;print')

-- ams

p5pRT commented 21 years ago

From @jhi

I don't see any cheaper way (and as Jarkko has shown\, this way doesn't seem unacceptably expensive). One could skimp and save a few selected bits instead of the entire scalar\, but I don't think that's worth it.

I tried the speed in two ways​:

(1) run op/regexp.t and op/pat.100 both 10 times on an almost zero-load   quad-Alpha box​: the user+system time difference was within plus minus   1.4% (the pat.t was slower\, but regexp.t was faster with the patch\,   but I think the tests were so fast that all that was measuring noise).

(2) Since neither of the above tests really stress test UTF-8 substitutions\, I ran also this (on an empty single-cpu x86 box) (the repetition counts were hand-crafted to give something in the 3..30 seconds range)

use Benchmark; sub inita { $a = (("abc" x $_[0]) . " ") x $_[1]; utf8​::upgrade($a) } inita( 1\, 1000); timethis( 50\, '$a =~ s/(\w+)/\U$1/g'); inita( 1\, 10000); timethis( 1\, '$a =~ s/(\w+)/\U$1/g'); inita( 10\, 100); timethis( 200\, '$a =~ s/(\w+)/\U$1/g'); inita( 10\, 1000); timethis( 10\, '$a =~ s/(\w+)/\U$1/g'); inita( 10\, 10000); timethis( 1\, '$a =~ s/(\w+)/\U$1/g'); inita( 100\, 10); timethis( 200\, '$a =~ s/(\w+)/\U$1/g'); inita( 100\, 100); timethis( 10\, '$a =~ s/(\w+)/\U$1/g'); inita( 100\, 1000); timethis( 5\, '$a =~ s/(\w+)/\U$1/g'); inita( 1000\, 1); timethis( 200\, '$a =~ s/(\w+)/\U$1/g'); inita( 10000\, 1); timethis( 20\, '$a =~ s/(\w+)/\U$1/g');

The results of 2+2 runs cleaned up​:

  a) b) c) d) with1 with2 out1 out2   1 1000 4000 50 3.90 3.89 3.97 3.95   1 10000 40000 1 3.44 3.44 3.45 3.45   10 100 3100 200 5.72 5.71 6.11 6.11   10 1000 31000 10 5.04 5.03 5.25 5.25   10 10000 310000 1 27.01 27.19 27.11 27.46   100 10 3010 200 4.98 4.97 5.38 5.37   100 100 30100 10 2.73 2.73 2.96 2.96   100 1000 301000 5 26.75 26.70 26.36 27.95   1000 1 3001 200 4.86 4.85 5.30 5.31   10000 1 30001 20 4.96 4.96 5.45 5.46

  a) number of "abc"​:s in one (\w+) run   b) number of "abc... "​:in $a   c) total size of $a   d) number of repetitions

Yes\, quite consistently *with* the patch things get *faster*. No\, don't ask me *why*.

-- Jarkko Hietaniemi \jhi@&#8203;iki\.fi http​://www.iki.fi/jhi/ "There is this special biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen

p5pRT commented 21 years ago

From @jhi

Since no better ideas have come along\, I'm applying Abhijit's patch the on maint-5.8/perl branch and marking the issue as resolved.

p5pRT commented 21 years ago

@jhi - Status changed from 'new' to 'resolved'

p5pRT commented 21 years ago

From @hvds

Abhijit Menon-Sen \ams@&#8203;wiw\.org wrote​: :At 2002-11-05 14​:44​:28 +0000\, nick@​ccl4.org wrote​: :> :> ../perl -Ilib -lwe '$_ = "foo.bar" . chr 256; chop $_; s/(\w+)/uc($1)/ge; print $_' :> FOO.B0B :> :> Who is this B0B character? :-( : :Nicholas\, say goodbye to B0B. : :The appended patch teaches save_re_context() to save the existing $1..$n :SV's\, and fixes all the problems I've seen so far from the original bug :report. I'll apply it shortly\, unless somebody has a better suggestion :for how to fetch the SV's to be saved.

Thanks\, belatedly applied as #18266.

Hugo

p5pRT commented 21 years ago

@jhi - Status changed from 'open' to 'resolved'