delayed interpolation of \N{...} charnames escapes in regexes in perl 5.9.x and later causes breakage - they should be resolved and then converted to \x{...} not preserved verbatim

p5pRT commented 15 years ago

Migrated from rt.perl.org#56444 (status was 'resolved')

Searchable as RT56444

p5pRT commented 15 years ago

From chris@pirazzi.net

This is a bug report for perl from perlbug@lurkertech.com, generated with the help of perlbug 1.36 running under perl 5.10.0.

Perl 5.10 (ActiveState ActivePerl Build 1003) breaks the following script, as compared with Perl 5.8.8 (ActiveState ActivePerl Build 822):

    use utf8;
    use strict;
    use English qw( -no_match_vars );
    use charnames ':full';
    my $r1 = qr/\N{THAI CHARACTER SARA I}/;
    my $s1 = "foo";
    $s1 =~ /$r1+/;

The problem is that the last line errs with:

    Constant(\N{THAI CHARACTER SARA I}) unknown: (possibly a missing "use charnames...") in regex; marked by <-- HERE in m/(?-xism:\N{THAI CHARACTER SARA I} <-- HERE )+/ at t.pl line 7.

Note that I did use 'charnames' and that the \N{} DOES work in the first regex. The err is in line 7, the last line, where the correctly compiled regex gets re-interpolated.

In Perl 5.8.8, this script runs without error and the regex works as expected.

I did a bunch of google searches but could not find mention of this.

This might be related\, but it is very old:

http://groups.google.co.th/group/perl.perl5.changes/browse_thread/thread/8a1489441e6e248/835a4e9963ac2011?lnk=st&q=perl+bug+%5CN+missing+charnames#835a4e9963ac2011

A workaround is:

    my $a = "\N{THAI CHARACTER SARA I}";
    my $r1 = qr/$a/;
    $s1 =~ /$r1+/;

However, this is quite inconvenient, as I have hundreds of regexes that would need changing!
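Another possible workaround, in the spirit of the \x{...} conversion suggested in the title, is to resolve the name at run time and interpolate a hex escape rather than the raw character. The following is only a sketch under stated assumptions: it relies on `charnames::vianame` being available, and the helper name `name_to_hex_escape` is made up for illustration and is not part of the original report.

```perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical helper (not from the report): map a character name
# to a \x{...} escape that is safe to interpolate into a pattern.
sub name_to_hex_escape {
    my ($name) = @_;
    my $code = charnames::vianame($name);
    die "Unknown character name: $name\n" unless defined $code;
    return sprintf '\x{%04X}', $code;
}

my $escape = name_to_hex_escape('THAI CHARACTER SARA I');
my $r1     = qr/$escape/;    # the regex engine parses \x{0E34} itself
my $s1     = "foo";

print "escape: $escape\n";
print "matches SARA I\n"     if "\x{0E34}" =~ /$r1+/;
print "does not match foo\n" unless $s1 =~ /$r1+/;
```

Because the interpolated text is an escape sequence and not the character itself, this form should not be affected by the character happening to be a regex metacharacter.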

Thanks.

Flags: category=core severity=medium

Site configuration information for perl 5.10.0:

Configured by SYSTEM at Tue May 13 16:52:25 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration: Platform: osname=MSWin32\, osvers=5.00\, archname=MSWin32-x86-multi-thread uname='' config_args='undef' hint=recommended\, useposix=true\, d_sigaction=undef useithreads=define\, usemultiplicity=define useperlio=define\, d_sfio=undef\, uselargefiles=define\, usesocks=undef use64bitint=undef\, use64bitall=undef\, uselongdouble=undef usemymalloc=n\, bincompat5005=undef Compiler: cc='cl'\, ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32 -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_INC -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -DPERL_MSVCRT_READFIX'\, optimize='-MD -Zi -DNDEBUG -O1'\, cppflags='-DWIN32' ccversion='14.0.50727'\, gccversion=''\, gccosandvers='' intsize=4\, longsize=4\, ptrsize=4\, doublesize=8\, byteorder=1234 d_longlong=undef\, longlongsize=8\, d_longdbl=define\, longdblsize=10 ivtype='long'\, ivsize=4\, nvtype='double'\, nvsize=8\, Off_t='__int64'\, lseeksize=8 alignbytes=8\, prototype=define Linker and Libraries: ld='link'\, ldflags ='-nologo -nodefaultlib -debug -opt:ref\,icf -libpath:"C:\perl\lib\CORE" -machine:x86' libpth=\lib libs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib msvcrt.lib perllibs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib msvcrt.lib libc=msvcrt.lib\, so=dll\, useshrplib=true\, libperl=perl510.lib gnulibc_version='' Dynamic Linking: dlsrc=dl_win32.xs\, dlext=dll\, d_dlsymun=undef\, ccdlflags=' ' cccdlflags=' '\, lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref\,icf -libpath:"C:\perl\lib\CORE" -machine:x86'

Locally applied patches: ACTIVEPERL_LOCAL_PATCHES_ENTRY 33741 avoids segfaults invoking S_raise_signal() (on Linux) 33763 Win32 process ids can have more than 16 bits 32809 Load 'loadable object' with non-default file extension 32728 64-bit fix for Time::Local

@INC for perl 5.10.0: c:/perl/site/lib c:/perl/lib .

Environment for perl 5.10.0: HOME=c:\ LANG (unset) LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=C:\perl\bin;C:\tcl\bin;c:\mysql\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\Program Files\Common Files\Roxio Shared\DLLShared;C:\s\Common7\IDE;C:\s\VC\BIN;C:\s\Common7\Tools;C:\s\Common7\Tools\bin;C:\s\VC\PlatformSDK\bin;C:\s\SDK\v2.0\bin;C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727;C:\s\VC\VCPackages;c:\cygwin\bin;c:\cygwin\usr\X11R6\bin;c:\cygwin\usr\local\bin;c:\bin;c:\stlport\STLport-5.1.5\bin;c:\icu\icu-3.4.1\bin;;C:\graphviz\Graphviz\bin;c:\doxygen\bin;C:\quicktime\QTSystem\ PERL_BADLANG (unset) SHELL=c:\cygwin\bin\zsh.exe

p5pRT commented 15 years ago

From @andk

On Sun, 29 Jun 2008 00:38:27 -0700, "Chris Pirazzi" (via RT) <perlbug-followup@perl.org> said:

>     use utf8;
>     use strict;
>     use English qw( -no_match_vars );
>     use charnames ':full';
>     my $r1 = qr/\N{THAI CHARACTER SARA I}/;
>     my $s1 = "foo";
>     $s1 =~ /$r1+/;
>
> The problem is that the last line errs with:
>
>     Constant(\N{THAI CHARACTER SARA I}) unknown: (possibly a missing "use charnames...") in regex; marked by <-- HERE in m/(?-xism:\N{THAI CHARACTER SARA I} <-- HERE )+/ at t.pl line 7.

Thanks for the report. The patch that broke this script was 28868:

Change 28868 by merijn@merijn-lt09 on 2006/09/19 06:56:36

    Subject: Re: \N{...} in regular expression [PATCH]
    From: demerphq <demerphq@gmail.com>
    Date: Tue, 19 Sep 2006 01:37:19 +0200
    Message-ID: <9b18b3110609181637m796d6c16o1b2741edc5f09eb2@mail.gmail.com>

See also http://rt.cpan.org/Ticket/Display.html?id=34388

HTH,
andreas

p5pRT commented 15 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 15 years ago

From @rgs

2008/6/29 via RT Chris Pirazzi <perlbug-followup@perl.org>:

> Perl 5.10 (ActiveState ActivePerl Build 1003) breaks the following script, as compared with Perl 5.8.8 (ActiveState ActivePerl Build 822):
>
>     use utf8;
>     use strict;
>     use English qw( -no_match_vars );
>     use charnames ':full';
>     my $r1 = qr/\N{THAI CHARACTER SARA I}/;
>     my $s1 = "foo";
>     $s1 =~ /$r1+/;
>
> The problem is that the last line errs with:
>
>     Constant(\N{THAI CHARACTER SARA I}) unknown: (possibly a missing "use charnames...") in regex; marked by <-- HERE in m/(?-xism:\N{THAI CHARACTER SARA I} <-- HERE )+/ at t.pl line 7.
>
> Note that I did use 'charnames' and that the \N{} DOES work in the first regex. The err is in line 7, the last line, where the correctly compiled regex gets re-interpolated.

Interestingly, if we wrap the code from "my $r1" to the end in an eval(""), then it compiles correctly. So that's some kind of time-of-loading problem.

p5pRT commented 15 years ago

From rick@bort.ca

On Jul 07 2008, Rafael Garcia-Suarez wrote:

> 2008/6/29 via RT Chris Pirazzi <perlbug-followup@perl.org>:
>
> >     use utf8;
> >     use strict;
> >     use English qw( -no_match_vars );
> >     use charnames ':full';
> >     my $r1 = qr/\N{THAI CHARACTER SARA I}/;
> >     my $s1 = "foo";
> >     $s1 =~ /$r1+/;
> >
> > [...]
>
> Interestingly, if we wrap the code from "my $r1" to the end in an eval(""), then it compiles correctly. So that's some kind of time-of-loading problem.

I think that may just be because "\N{THAI CHARACTER SARA I}" is interpolated before qr// gets it. It looks like the stringification of qr// references has changed as a side effect of the structure changes.

    use charnames ':full';
    use Devel::Peek;
    $x = qr/\N{THAI CHARACTER SARA I}/;
    print $x;
    Dump $x

5.8.8 output

    (?-xism:ิ)
    SV = RV(0x819cff4) at 0x81495d0
      REFCNT = 1
      FLAGS = (ROK,UTF8)
      RV = 0x8148d54
      SV = PVMG(0x8163050) at 0x8148d54
        REFCNT = 1
        FLAGS = (OBJECT,SMG)
        IV = 0
        NV = 0
        PV = 0
        MAGIC = 0x816b410
          MG_VIRTUAL = 0x8144608
          MG_TYPE = PERL_MAGIC_qr(r)
          MG_OBJ = 0x816b198
          MG_LEN = 12
          MG_PTR = 0x8163c10 "(?-xism:\340\270\264)"
        STASH = 0x81490f0 "Regexp"

blead output

    (?-xism:\N{THAI CHARACTER SARA I})
    SV = IV(0x83ac0fc) at 0x83ac100
      REFCNT = 1
      FLAGS = (ROK,UTF8)
      RV = 0x83ac0e0
      SV = REGEXP(0x83b65f0) at 0x83ac0e0
        REFCNT = 2
        FLAGS = (OBJECT,POK,pPOK,UTF8)
        IV = 0
        PV = 0x83b06b0 "(?-xism:\\N{THAI CHARACTER SARA I})"\0 [UTF8 "(?-xism:\N{THAI CHARACTER SARA I})"]
        CUR = 34
        LEN = 36
        STASH = 0x8397940 "Regexp"

-- Rick Delaney rick@bort.ca

p5pRT commented 15 years ago

From @demerphq

2008/7/8 Rick Delaney <rick@bort.ca>:

> On Jul 07 2008, Rafael Garcia-Suarez wrote:
>
> > 2008/6/29 via RT Chris Pirazzi <perlbug-followup@perl.org>:
> >
> > >     use utf8;
> > >     use strict;
> > >     use English qw( -no_match_vars );
> > >     use charnames ':full';
> > >     my $r1 = qr/\N{THAI CHARACTER SARA I}/;
> > >     my $s1 = "foo";
> > >     $s1 =~ /$r1+/;
> > >
> > > [...]
> >
> > Interestingly, if we wrap the code from "my $r1" to the end in an eval(""), then it compiles correctly. So that's some kind of time-of-loading problem.
>
> I think that may just be because "\N{THAI CHARACTER SARA I}" is interpolated before qr// gets it. It looks like the stringification of qr// references has changed as a side effect of the structure changes.

Yes, this was the intent of the change: to prevent the conversion of named escapes to strings before the regex engine saw them. Otherwise, characters specified by the \N{} notation could/would be treated as regex metachars, with not so cool consequences.
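The metacharacter concern can be illustrated with a small sketch (not taken from the thread's patches; the variable name `$plus` is arbitrary). When a named character reaches the pattern as an already-interpolated string it is parsed like any other pattern text, whereas \N{...} written inside the pattern is expected to be a literal character:

```perl
use strict;
use warnings;
use charnames ':full';

my $plus = "\N{PLUS SIGN}";    # already a plain "+" before any regex sees it

# Interpolated as a string: the "+" acts as a quantifier, so this is /a+b/.
my $as_string = ("aabc" =~ /a${plus}b/) ? "match" : "no match";

# Written in the pattern: \N{PLUS SIGN} should stand for a literal plus sign.
my $as_escape = ("aabc" =~ /a\N{PLUS SIGN}b/) ? "match" : "no match";

# \Q...\E (quotemeta) makes the interpolated form literal as well.
my $as_quoted = ("aabc" =~ /a\Q$plus\Eb/) ? "match" : "no match";

print "interpolated string: $as_string\n";
print "named escape:        $as_escape\n";
print "quoted string:       $as_quoted\n";
```

The same caveat applies to any workaround that interpolates the character through a variable: it is only equivalent to \N{...} when the character is not special to the regex engine, or when it is quoted.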

I really don't understand why it's not working in this case. The second pattern is in the same scope as the first, so this doesn't make a lot of sense to me. I guess someone needs to look at it in the debugger and see why it thinks the charnames declaration isn't in scope.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 15 years ago

From elliot@foobiebletch.com

Created by perl@galumph.com

Putting a variable expansion into a regex with a \N{} escape results in a compilation error in 5.10.0. For example, the following program compiles:

    #!/usr/bin/env perl
    use charnames ':full';
    m/\N{START OF HEADING}/

However, this

    #!/usr/bin/env perl
    use charnames ':full';
    m/$x\N{START OF HEADING}/

results in

    Constant(\N{START OF HEADING}) unknown: (possibly a missing "use charnames ...") in regex;

Perl Info

``` Flags: category=library severity=medium Site configuration information for perl 5.10.0: Configured by elliot at Sun Sep 14 15:07:20 CDT 2008. Summary of my perl5 (revision 5 version 10 subversion 0) configuration: Platform: osname=darwin, osvers=9.4.0, archname=darwin-thread-multi-64int-ld-2level uname='darwin quaquaversal.local 9.4.0 darwin kernel version 9.4.0: mon jun 9 19:30:53 pdt 2008; root:xnu-1228.5.20~1release_i386 i386 ' config_args='-Duse64bitint -Dusethreads -Dinc_version_list=none -Dprefix=/Users/elliot/opt/perl/perl-5.10.0-64bit-threads' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=define, use64bitall=undef, uselongdouble=define usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -I/usr/local/include -I/opt/local/include', optimize='-O3', cppflags='-no-cpp-precomp -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -I/usr/local/include -I/opt/local/include' ccversion='', gccversion='4.0.1 (Apple Inc. 
build 5465)', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16 ivtype='long long', ivsize=8, nvtype='long double', nvsize=16, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='env MACOSX_DEPLOYMENT_TARGET=10.5 cc', ldflags =' -L/usr/local/lib -L/opt/local/lib' libpth=/usr/local/lib /opt/local/lib /usr/lib libs=-ldbm -ldl -lm -lutil -lc perllibs=-ldl -lm -lutil -lc libc=/usr/lib/libc.dylib, so=dylib, useshrplib=false, libperl=libperl.a gnulibc_version='' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' ' cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/usr/local/lib -L/opt/local/lib' Locally applied patches: @INC for perl 5.10.0: /Users/elliot/opt/perl/perl-5.10.0-64bit-threads/lib/5.10.0/darwin-thread-multi-64int-ld-2level /Users/elliot/opt/perl/perl-5.10.0-64bit-threads/lib/5.10.0 /Users/elliot/opt/perl/perl-5.10.0-64bit-threads/lib/site_perl/5.10.0/darwin-thread-multi-64int-ld-2level /Users/elliot/opt/perl/perl-5.10.0-64bit-threads/lib/site_perl/5.10.0 . Environment for perl 5.10.0: DYLD_LIBRARY_PATH (unset) HOME=/Users/elliot LANG=en_US.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/Users/elliot/bin:/Users/elliot/opt/bin:/Users/elliot/opt/perl/perl-5.10.0-64bit-threads/bin:/opt/local/bin:/usr/local/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11R6/bin:/usr/X11/bin PERL_BADLANG (unset) SHELL=/bin/bash ```

p5pRT commented 15 years ago

From @moritz

Elliot Shank wrote:

>     # New Ticket Created by Elliot Shank
>     # Please include the string: [perl #62056]
>     # in the subject line of all future correspondence about this issue.
>     # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=62056 >
>
> This is a bug report for perl from perl@galumph.com, generated with the help of perlbug 1.36 running under perl 5.10.0.
>
> Putting a variable expansion into a regex with a \N{} escape results in a compilation error in 5.10.0. For example, the following program compiles:
>
>     #!/usr/bin/env perl
>     use charnames ':full';
>     m/\N{START OF HEADING}/
>
> However, this
>
>     #!/usr/bin/env perl
>     use charnames ':full';
>     m/$x\N{START OF HEADING}/
>
> results in
>
>     Constant(\N{START OF HEADING}) unknown: (possibly a missing "use charnames ...") in regex;

This worked in perl-5.8.8, and fails in perl-5.10.0. So I bisected it, and this is what git-bisect says is the offending commit:

    fc8cd66c26827f6c2ee1aa00ab2d3b3c320a4a28 is first bad commit
    commit fc8cd66c26827f6c2ee1aa00ab2d3b3c320a4a28
    Author: Yves Orton <demerphq@gmail.com>
    Date:   Tue Sep 19 03:37:19 2006 +0200

        Re: \N{...} in regular expression [PATCH]
        Message-ID: <9b18b3110609181637m796d6c16o1b2741edc5f09eb2@mail.gmail.com>

        p4raw-id: //depot/perl@28868

Cheers,
Moritz

p5pRT commented 15 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 14 years ago

From @nwc10

Dave notes:

regression since 5.8.x
30/12/08 Yves says it's tricky to fix

p5pRT commented 14 years ago

From @schwern

Bizarrely, it works in an eval.

    $ perl5.10.0 -wle 'use charnames ":full"; my $x = ""; print "\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/; print $@'
    Constant(\N{LATIN CAPITAL LETTER E}) unknown: (possibly a missing "use charnames ...") in regex; marked by <-- HERE in m/\N{LATIN CAPITAL LETTER E} <-- HERE / at -e line 1.

    $ perl5.10.0 -wle 'use charnames ":full"; my $x = ""; print eval q{"\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/}; print $@'
    1

p5pRT commented 14 years ago

From @schwern

Looking at the code that would generate that specific error, there are three places it could happen. Turns out it's the one in regcomp.c.

    if (!table || !(PL_hints & HINT_LOCALIZE_HH)) {
        vFAIL2("Constant(\\N{%s}) unknown: "
               "regcomp.c (possibly a missing \"use charnames ...\")",
               SvPVX(sv_name));
    }

Digging further, it's the second clause that trips: the HINT_LOCALIZE_HH bit is not set in PL_hints, so it's got the wrong hints. For /$x\N{...}/ PL_hints has a value of 2**8, and for /\N{...}/ it's 131328, which is 2**8 + HINT_LOCALIZE_HH.
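The arithmetic behind those two values can be checked with a short sketch. It assumes HINT_LOCALIZE_HH is 0x00020000 (2**17 = 131072), which is what 131328 - 2**8 implies; the constant is redefined locally here for illustration and is not read from the perl headers.

```perl
use strict;
use warnings;

# Assumed value, implied by 131328 - 2**8; redefined here for illustration.
use constant HINT_LOCALIZE_HH => 0x00020000;    # 2**17 = 131072

# PL_hints values reported for each form of the pattern.
my @observed = (
    [ '/\N{...}/',   131328 ],
    [ '/$x\N{...}/', 2**8   ],
);

for my $case (@observed) {
    my ($pattern, $hints) = @$case;
    my $state = ($hints & HINT_LOCALIZE_HH) ? 'set' : 'clear';
    printf "%-12s PL_hints = 0x%05X, HINT_LOCALIZE_HH is %s\n",
        $pattern, $hints, $state;
}

print "131328 == 2**8 + HINT_LOCALIZE_HH\n"
    if 131328 == 2**8 + HINT_LOCALIZE_HH;
```

With the bit clear, `!(PL_hints & HINT_LOCALIZE_HH)` is true and the vFAIL2 branch is taken regardless of whether a charnames table exists.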

And that's about as far as I can go.

p5pRT commented 14 years ago

From @schwern

A test for this is available from git://github.com/schwern/perl.git in branch rt.cpan.org-62056. Also supplied here as a patch.

p5pRT commented 14 years ago

From @schwern

0001-Test-rt.cpan.org-62056.patch

```diff
From 7bed143fcc74b8bed3d7ed13de2ef000e5523b9e Mon Sep 17 00:00:00 2001
From: Michael G. Schwern
Date: Sat, 11 Jul 2009 01:49:19 -0700
Subject: [PATCH] Test rt.cpan.org 62056

---
 t/op/pat.t | 24 +++++++++++++++++++++++-
 1 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/t/op/pat.t b/t/op/pat.t
index aa6299f..039ac50 100644
--- a/t/op/pat.t
+++ b/t/op/pat.t
@@ -13,7 +13,7 @@ sub run_tests;
 
 $| = 1;
 
-my $EXPECTED_TESTS = 4065; # Update this when adding/deleting tests.
+my $EXPECTED_TESTS = 4066; # Update this when adding/deleting tests.
 
 BEGIN {
     chdir 't' if -d 't';
@@ -1792,6 +1792,28 @@ sub run_tests {
     }
 
 
+    # rt.cpan.org 62056
+    # Problem with a variable before a \N{...} in a pattern match
+    {
+        package RT::62056;
+
+        # We need fresh_perl_is()
+        require './test.pl';
+
+        # Synch its test count with the one in pat.t
+        $test++;
+        curr_test($test);
+
+        local $RT::62056::TODO = "rt.cpan.org 62056";
+
+        fresh_perl_is(<<'CODE', "Yes", {}, 'variable before \N{...}');
+use charnames ":full";
+$x = "";
+print "Yes" if "$x\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/;
+CODE
+    }
+
+
+
     {
         local $Message = "Final Sigma";
 
-- 
1.6.2.4
```

p5pRT commented 14 years ago

From @schwern

This appears to be a duplicate of 62056

p5pRT commented 14 years ago

From @rgs

2009/7/11 Michael G Schwern via RT <perlbug-followup@perl.org>:

> A test for this is available from git://github.com/schwern/perl.git in branch rt.cpan.org-62056. Also supplied here as a patch.

I don't like this patch. You require() test.pl in a new package RT::62056. Ok, you don't want to stomp on the ok() subroutine already defined here. But then you need to do an awkward setting of $RT::62056::TODO, and if you want to require test.pl in another test in the same file, it will fail, because we have already test.pl in %INC.

I'd probably go for the less effort and put that in a new file.
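The %INC objection can be reproduced in isolation. The sketch below uses a throwaway file, helper.pl, as a hypothetical stand-in for test.pl (the file name and the hello() sub are invented for illustration): a file without its own package statement is compiled into the package that first requires it, and a second require of the same path does nothing because the path is already recorded in %INC.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Write a tiny library with no package statement of its own.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/helper.pl";    # hypothetical stand-in for test.pl
open my $fh, '>', $file or die "Cannot write $file: $!";
print {$fh} "sub hello { return 'hi' }\n1;\n";
close $fh;

{
    package First;
    require $file;    # compiled here, so hello() is defined in First::
}
{
    package Second;
    require $file;    # already in %INC, so nothing is loaded again
}

print "recorded in %INC\n"           if exists $INC{$file};
print "First::hello is defined\n"    if defined &First::hello;
print "Second::hello is undefined\n" unless defined &Second::hello;
```

This is why a second block in the same test file could not get its own copy of the test.pl functions by repeating the same trick in another package.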

p5pRT commented 14 years ago

From @khwilliamson

Rafael Garcia-Suarez wrote:

> 2009/7/11 Michael G Schwern via RT <perlbug-followup@perl.org>:
>
> > A test for this is available from git://github.com/schwern/perl.git in branch rt.cpan.org-62056. Also supplied here as a patch.
>
> I don't like this patch. You require() test.pl in a new package RT::62056. Ok, you don't want to stomp on the ok() subroutine already defined here. But then you need to do an awkward setting of $RT::62056::TODO, and if you want to require test.pl in another test in the same file, it will fail, because we have already test.pl in %INC.
>
> I'd probably go for the less effort and put that in a new file.

Here are 3 more lines for the patch, if you like, that reproduce the similar #56444:

    my $r1 = qr/\N{THAI CHARACTER SARA I}/; #56444
    my $s1 = "foo";
    $s1 =~ /$r1+/;

FWIW, I wrote about this a couple of weeks ago and concluded that both bugs probably share the same root cause. Forcing the code to ignore the HINT_LOCALIZE_HH problem made 62056 stop failing, while 56444 then failed later, with a message suggesting that it would succeed if the root cause were fixed.

p5pRT commented 14 years ago

From @schwern

Rafael Garcia-Suarez wrote:

> 2009/7/11 Michael G Schwern via RT <perlbug-followup@perl.org>:
>
> > A test for this is available from git://github.com/schwern/perl.git in branch rt.cpan.org-62056. Also supplied here as a patch.
>
> I don't like this patch. You require() test.pl in a new package RT::62056. Ok, you don't want to stomp on the ok() subroutine already defined here. But then you need to do an awkward setting of $RT::62056::TODO, and if you want to require test.pl in another test in the same file, it will fail, because we have already test.pl in %INC.
>
> I'd probably go for the less effort and put that in a new file.

I didn't like it either, but I didn't want to clean up the whole file nor lump it into fresh_perl.t.

The real problem is op/pat.t is far too big. I think I'll start by splitting all the charnames tests out. Good a start as any.

git://github.com/schwern/perl.git branch rt.cpan.org-62056 has the changes. On the way I also added note() to test.pl to replace the $Message system used in pat.t. Patches attached.

Karl Williamson wrote:

> Here are 3 more lines for the patch, if you like, that reproduce the similar #56444:
>
>     my $r1 = qr/\N{THAI CHARACTER SARA I}/; #56444
>     my $s1 = "foo";
>     $s1 =~ /$r1+/;
>
> FWIW, I wrote about this a couple of weeks ago and concluded that both bugs probably share the same root cause. Forcing the code to ignore the HINT_LOCALIZE_HH problem made 62056 stop failing, while 56444 then failed later, with a message suggesting that it would succeed if the root cause were fixed.

Yeah, I came to that conclusion too.

A possibly related note:

    $ perl5.10.0 -wl
    use charnames ":full";
    print "Yes" if "\N{THAI CHARACTER SARA I}" =~ /\N{THAI CHARACTER SARA I}+/;
    \N{THAI CHARACTER SARA I}+ matches null string many times in regex; marked by <-- HERE in m/\N{THAI CHARACTER SARA I}+ <-- HERE / at - line 2.
    Yes

    $ perl5.10.1 -wl
    use charnames ":full";
    print "Yes" if "\N{THAI CHARACTER SARA I}" =~ /\N{THAI CHARACTER SARA I}+/;
    \N{THAI CHARACTER SARA I}+ matches null string many times in regex; marked by <-- HERE in m/\N{THAI CHARACTER SARA I}+ <-- HERE / at - line 2.
    Yes

I don't know if that's a bug or a feature, but it is a test now.

-- 60. "The Giant Space Ants" are not at the top of my chain of command. -- The 213 Things Skippy Is No Longer Allowed To Do In The U.S. Army http://skippyslist.com/list/

p5pRT commented 14 years ago

From @schwern

0001-Test-rt.cpan.org-62056.patch

```diff
From 7bed143fcc74b8bed3d7ed13de2ef000e5523b9e Mon Sep 17 00:00:00 2001
From: Michael G. Schwern
Date: Sat, 11 Jul 2009 01:49:19 -0700
Subject: [PATCH 1/4] Test rt.cpan.org 62056

---
 t/op/pat.t | 24 +++++++++++++++++++++++-
 1 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/t/op/pat.t b/t/op/pat.t
index aa6299f..039ac50 100644
--- a/t/op/pat.t
+++ b/t/op/pat.t
@@ -13,7 +13,7 @@ sub run_tests;
 
 $| = 1;
 
-my $EXPECTED_TESTS = 4065; # Update this when adding/deleting tests.
+my $EXPECTED_TESTS = 4066; # Update this when adding/deleting tests.
 
 BEGIN {
     chdir 't' if -d 't';
@@ -1792,6 +1792,28 @@ sub run_tests {
     }
 
 
+    # rt.cpan.org 62056
+    # Problem with a variable before a \N{...} in a pattern match
+    {
+        package RT::62056;
+
+        # We need fresh_perl_is()
+        require './test.pl';
+
+        # Synch its test count with the one in pat.t
+        $test++;
+        curr_test($test);
+
+        local $RT::62056::TODO = "rt.cpan.org 62056";
+
+        fresh_perl_is(<<'CODE', "Yes", {}, 'variable before \N{...}');
+use charnames ":full";
+$x = "";
+print "Yes" if "$x\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/;
+CODE
+    }
+
+
+
     {
         local $Message = "Final Sigma";
 
-- 
1.6.2.4
```

p5pRT commented 14 years ago

From @schwern

0002-Add-note-to-test.pl-like-Test-More-s.patch

```diff
From fcbd249a3f121bc7465f6b95be3ba885d75f14da Mon Sep 17 00:00:00 2001
From: Michael G. Schwern
Date: Sat, 11 Jul 2009 16:25:46 -0700
Subject: [PATCH 2/4] Add note() to test.pl like Test::More's

---
 t/test.pl | 14 +++++++++++---
 1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/t/test.pl b/t/test.pl
index 32c4a37..332fc60 100644
--- a/t/test.pl
+++ b/t/test.pl
@@ -67,16 +67,24 @@ END {
 # Use this instead of "print STDERR" when outputing failure diagnostic
 # messages
 sub _diag {
-    return unless @_;
-    my @mess = map { /^#/ ? "$_\n" : "# $_\n" }
-               map { split /\n/ } @_;
+    my @mess = _comment(@_);
     $TODO ? _print(@mess) : _print_stderr(@mess);
 }
 
+sub _comment {
+    return unless @_;
+    return map { /^#/ ? "$_\n" : "# $_\n" }
+           map { split /\n/ } @_;
+}
+
 sub diag {
     _diag(@_);
 }
 
+sub note {
+    _print _comment(@_);
+}
+
 sub skip_all {
     if (@_) {
         _print "1..0 # Skip @_\n";
-- 
1.6.2.4
```

p5pRT commented 14 years ago

From @schwern

0003-Chop-out-the-tests-from-op-pat.t-which-involve-using.patch

```diff From 18fe84f50714f3413e090b9a7111dd7edb3c6351 Mon Sep 17 00:00:00 2001 From: Michael G. Schwern Date: Sat, 11 Jul 2009 16:25:59 -0700 Subject: [PATCH 3/4] Chop out the tests from op/pat.t which involve using charnames and put them into their own test file. The $Message system is replaced by note(). Most of the special case work to identify which test is which is unnecessary in a shorter test file. may_not_warn() should probably be pushed into test.pl --- t/op/pat.t | 195 +------------------------------------------- t/op/regexp_charnames.t | 208 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 209 insertions(+), 194 deletions(-) create mode 100644 t/op/regexp_charnames.t diff --git a/t/op/pat.t b/t/op/pat.t index 039ac50..192a972 100644 --- a/t/op/pat.t +++ b/t/op/pat.t @@ -13,7 +13,7 @@ sub run_tests; $| = 1; -my $EXPECTED_TESTS = 4066; # Update this when adding/deleting tests. +my $EXPECTED_TESTS = 4002; # Update this when adding/deleting tests. BEGIN { chdir 't' if -d 't'; @@ -1726,95 +1726,6 @@ sub run_tests { { - use charnames ':full'; - local $Message = "Folding 'LATIN LETTER A WITH GRAVE'"; - - my $lower = "\N{LATIN SMALL LETTER A WITH GRAVE}"; - my $UPPER = "\N{LATIN CAPITAL LETTER A WITH GRAVE}"; - - ok $lower =~ m/$UPPER/i; - ok $UPPER =~ m/$lower/i; - ok $lower =~ m/[$UPPER]/i; - ok $UPPER =~ m/[$lower]/i; - - local $Message = "Folding 'GREEK LETTER ALPHA WITH VRACHY'"; - - $lower = "\N{GREEK CAPITAL LETTER ALPHA WITH VRACHY}"; - $UPPER = "\N{GREEK SMALL LETTER ALPHA WITH VRACHY}"; - - ok $lower =~ m/$UPPER/i; - ok $UPPER =~ m/$lower/i; - ok $lower =~ m/[$UPPER]/i; - ok $UPPER =~ m/[$lower]/i; - - local $Message = "Folding 'LATIN LETTER Y WITH DIAERESIS'"; - - $lower = "\N{LATIN SMALL LETTER Y WITH DIAERESIS}"; - $UPPER = "\N{LATIN CAPITAL LETTER Y WITH DIAERESIS}"; - - ok $lower =~ m/$UPPER/i; - ok $UPPER =~ m/$lower/i; - ok $lower =~ m/[$UPPER]/i; - ok $UPPER =~ m/[$lower]/i; - } - - - { - use charnames ':full'; - local $PatchId 
= "13843"; - local $Message = "GREEK CAPITAL LETTER SIGMA vs " . - "COMBINING GREEK PERISPOMENI"; - - my $SIGMA = "\N{GREEK CAPITAL LETTER SIGMA}"; - my $char = "\N{COMBINING GREEK PERISPOMENI}"; - - may_not_warn sub {ok "_:$char:_" !~ m/_:$SIGMA:_/i}; - } - - - { - local $Message = '\X'; - use charnames ':full'; - - ok "a!" =~ /^(\X)!/ && $1 eq "a"; - ok "\xDF!" =~ /^(\X)!/ && $1 eq "\xDF"; - ok "\x{100}!" =~ /^(\X)!/ && $1 eq "\x{100}"; - ok "\x{100}\x{300}!" =~ /^(\X)!/ && $1 eq "\x{100}\x{300}"; - ok "\N{LATIN CAPITAL LETTER E}!" =~ /^(\X)!/ && - $1 eq "\N{LATIN CAPITAL LETTER E}"; - ok "\N{LATIN CAPITAL LETTER E}\N{COMBINING GRAVE ACCENT}!" - =~ /^(\X)!/ && - $1 eq "\N{LATIN CAPITAL LETTER E}\N{COMBINING GRAVE ACCENT}"; - - local $Message = '\C and \X'; - ok "!abc!" =~ /a\Cc/; - ok "!abc!" =~ /a\Xc/; - } - - - # rt.cpan.org 62056 - # Problem with a variable before a \N{...} in a pattern match - { - package RT::62056; - - # We need fresh_perl_is() - require './test.pl'; - - # Synch its test count with the one in pat.t - $test++; - curr_test($test); - - local $RT::62056::TODO = "rt.cpan.org 62056"; - - fresh_perl_is(<<'CODE', "Yes", {}, 'variable before \N{...}'); -use charnames ":full"; -$x = ""; -print "Yes" if "$x\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/; -CODE - } - - - { local $Message = "Final Sigma"; my $SIGMA = "\x{03A3}"; # CAPITAL @@ -1860,46 +1771,6 @@ CODE { - use charnames ':full'; - local $Message = "Parlez-Vous " . 
- "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais?"; - - ok "Fran\N{LATIN SMALL LETTER C}ais" =~ /Fran.ais/ && - $& eq "Francais"; - ok "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais" =~ /Fran.ais/ && - $& eq "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais"; - ok "Fran\N{LATIN SMALL LETTER C}ais" =~ /Fran\Cais/ && - $& eq "Francais"; - # COMBINING CEDILLA is two bytes when encoded - ok "Franc\N{COMBINING CEDILLA}ais" =~ /Franc\C\Cais/; - ok "Fran\N{LATIN SMALL LETTER C}ais" =~ /Fran\Xais/ && - $& eq "Francais"; - ok "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais" =~ /Fran\Xais/ && - $& eq "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais"; - ok "Franc\N{COMBINING CEDILLA}ais" =~ /Fran\Xais/ && - $& eq "Franc\N{COMBINING CEDILLA}ais"; - ok "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais" =~ - /Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais/ && - $& eq "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais"; - ok "Franc\N{COMBINING CEDILLA}ais" =~ /Franc\N{COMBINING CEDILLA}ais/ && - $& eq "Franc\N{COMBINING CEDILLA}ais"; - - my @f = ( - ["Fran\N{LATIN SMALL LETTER C}ais", "Francais"], - ["Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais", - "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais"], - ["Franc\N{COMBINING CEDILLA}ais", "Franc\N{COMBINING CEDILLA}ais"], - ); - foreach my $entry (@f) { - my ($subject, $match) = @$entry; - ok $subject =~ /Fran(?:c\N{COMBINING CEDILLA}?| - \N{LATIN SMALL LETTER C WITH CEDILLA})ais/x && - $& eq $match; - } - } - - - { local $Message = "Lingering (and useless) UTF8 flag doesn't mess up /i"; my $pat = "ABcde"; my $str = "abcDE\x{100}"; @@ -1920,38 +1791,6 @@ CODE { - use charnames ':full'; - local $Message = "LATIN SMALL LETTER SHARP S " . 
- "(\N{LATIN SMALL LETTER SHARP S})"; - - ok "\N{LATIN SMALL LETTER SHARP S}" =~ - /\N{LATIN SMALL LETTER SHARP S}/; - ok "\N{LATIN SMALL LETTER SHARP S}" =~ - /\N{LATIN SMALL LETTER SHARP S}/i; - ok "\N{LATIN SMALL LETTER SHARP S}" =~ - /[\N{LATIN SMALL LETTER SHARP S}]/; - ok "\N{LATIN SMALL LETTER SHARP S}" =~ - /[\N{LATIN SMALL LETTER SHARP S}]/i; - - ok "ss" =~ /\N{LATIN SMALL LETTER SHARP S}/i; - ok "SS" =~ /\N{LATIN SMALL LETTER SHARP S}/i; - ok "ss" =~ /[\N{LATIN SMALL LETTER SHARP S}]/i; - ok "SS" =~ /[\N{LATIN SMALL LETTER SHARP S}]/i; - - ok "\N{LATIN SMALL LETTER SHARP S}" =~ /ss/i; - ok "\N{LATIN SMALL LETTER SHARP S}" =~ /SS/i; - - local $Message = "Unoptimized named sequence in class"; - ok "ss" =~ /[\N{LATIN SMALL LETTER SHARP S}x]/i; - ok "SS" =~ /[\N{LATIN SMALL LETTER SHARP S}x]/i; - ok "\N{LATIN SMALL LETTER SHARP S}" =~ - /[\N{LATIN SMALL LETTER SHARP S}x]/; - ok "\N{LATIN SMALL LETTER SHARP S}" =~ - /[\N{LATIN SMALL LETTER SHARP S}x]/i; - } - - - { # More whitespace: U+0085, U+2028, U+2029\n"; # U+0085, U+00A0 need to be forced to be Unicode, the \x{100} does that. 
@@ -2930,23 +2769,6 @@ CODE { - use charnames ':full'; - - ok 'aabc' !~ /a\N{PLUS SIGN}b/, '/a\N{PLUS SIGN}b/ against aabc'; - ok 'a+bc' =~ /a\N{PLUS SIGN}b/, '/a\N{PLUS SIGN}b/ against a+bc'; - - ok ' A B' =~ /\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}/, - 'Intermixed named and unicode escapes'; - ok "\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}" =~ - /\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}/, - 'Intermixed named and unicode escapes'; - ok "\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}" =~ - /[\N{SPACE}\N{U+0041}][\N{SPACE}\N{U+0042}]/, - 'Intermixed named and unicode escapes'; - } - - - { our $brackets; $brackets = qr{ { (?> [^{}]+ | (??{ $brackets }) )* } @@ -3655,21 +3477,6 @@ CODE { - use charnames ":full"; - ok "\N{ROMAN NUMERAL ONE}" =~ /\p{Alphabetic}/, "I =~ Alphabetic"; - ok "\N{ROMAN NUMERAL ONE}" =~ /\p{Uppercase}/, "I =~ Uppercase"; - ok "\N{ROMAN NUMERAL ONE}" !~ /\p{Lowercase}/, "I !~ Lowercase"; - ok "\N{ROMAN NUMERAL ONE}" =~ /\p{IDStart}/, "I =~ ID_Start"; - ok "\N{ROMAN NUMERAL ONE}" =~ /\p{IDContinue}/, "I =~ ID_Continue"; - ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{Alphabetic}/, "i =~ Alphabetic"; - ok "\N{SMALL ROMAN NUMERAL ONE}" !~ /\p{Uppercase}/, "i !~ Uppercase"; - ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{Lowercase}/, "i =~ Lowercase"; - ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{IDStart}/, "i =~ ID_Start"; - ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{IDContinue}/, "i =~ ID_Continue" - } - - - { # requirement of Unicode Technical Standard #18, 1.7 Code Points # cf. http://www.unicode.org/reports/tr18/#Supplementary_Characters for my $u (0x7FF, 0x800, 0xFFFF, 0x10000) { diff --git a/t/op/regexp_charnames.t b/t/op/regexp_charnames.t new file mode 100644 index 0000000..e128671 --- /dev/null +++ b/t/op/regexp_charnames.t @@ -0,0 +1,208 @@ +#!./perl + +# This is a test of regexes problems which involve charnames. 
+ +use strict; +use warnings; +use 5.010; + +BEGIN { + chdir 't' if -d 't'; + @INC = '../lib'; + require "./test.pl"; +} + +plan tests => 64; + + +sub may_not_warn { + my ($code, $name) = @_; + my $w = ''; + local $SIG {__WARN__} = sub {$w .= join "" => @_}; + use warnings 'all'; + ref $code ? &$code : eval $code; + is $w, "", $name // "Did not warn"; +} + + +{ + use charnames ':full'; + note "Folding 'LATIN LETTER A WITH GRAVE'"; + + my $lower = "\N{LATIN SMALL LETTER A WITH GRAVE}"; + my $UPPER = "\N{LATIN CAPITAL LETTER A WITH GRAVE}"; + + ok $lower =~ m/$UPPER/i; + ok $UPPER =~ m/$lower/i; + ok $lower =~ m/[$UPPER]/i; + ok $UPPER =~ m/[$lower]/i; + + note "Folding 'GREEK LETTER ALPHA WITH VRACHY'"; + + $lower = "\N{GREEK CAPITAL LETTER ALPHA WITH VRACHY}"; + $UPPER = "\N{GREEK SMALL LETTER ALPHA WITH VRACHY}"; + + ok $lower =~ m/$UPPER/i; + ok $UPPER =~ m/$lower/i; + ok $lower =~ m/[$UPPER]/i; + ok $UPPER =~ m/[$lower]/i; + + note "Folding 'LATIN LETTER Y WITH DIAERESIS'"; + + $lower = "\N{LATIN SMALL LETTER Y WITH DIAERESIS}"; + $UPPER = "\N{LATIN CAPITAL LETTER Y WITH DIAERESIS}"; + + ok $lower =~ m/$UPPER/i; + ok $UPPER =~ m/$lower/i; + ok $lower =~ m/[$UPPER]/i; + ok $UPPER =~ m/[$lower]/i; +} + + +{ + use charnames ':full'; + note "Patch 13843"; + note "GREEK CAPITAL LETTER SIGMA vs COMBINING GREEK PERISPOMENI"; + + my $SIGMA = "\N{GREEK CAPITAL LETTER SIGMA}"; + my $char = "\N{COMBINING GREEK PERISPOMENI}"; + + may_not_warn sub {ok "_:$char:_" !~ m/_:$SIGMA:_/i}; +} + + +{ + use charnames ':full'; + + note '\X'; + ok "a!" =~ /^(\X)!/ && $1 eq "a"; + ok "\xDF!" =~ /^(\X)!/ && $1 eq "\xDF"; + ok "\x{100}!" =~ /^(\X)!/ && $1 eq "\x{100}"; + ok "\x{100}\x{300}!" =~ /^(\X)!/ && $1 eq "\x{100}\x{300}"; + ok "\N{LATIN CAPITAL LETTER E}!" =~ /^(\X)!/ && + $1 eq "\N{LATIN CAPITAL LETTER E}"; + ok "\N{LATIN CAPITAL LETTER E}\N{COMBINING GRAVE ACCENT}!" 
+ =~ /^(\X)!/ && + $1 eq "\N{LATIN CAPITAL LETTER E}\N{COMBINING GRAVE ACCENT}"; + + note '\C and \X'; + ok "!abc!" =~ /a\Cc/; + ok "!abc!" =~ /a\Xc/; +} + + +# rt.cpan.org 62056 +# Problem with a variable before a \N{...} in a pattern match +{ + local $::TODO = "rt.cpan.org 62056"; + + fresh_perl_is(<<'CODE', "Yes", {}, 'variable before \N{...}'); +use charnames ":full"; +$x = ""; +print "Yes" if "$x\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/; +CODE +} + + +{ + use charnames ':full'; + note "Parlez-Vous Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais?"; + + ok "Fran\N{LATIN SMALL LETTER C}ais" =~ /Fran.ais/ && + $& eq "Francais"; + ok "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais" =~ /Fran.ais/ && + $& eq "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais"; + ok "Fran\N{LATIN SMALL LETTER C}ais" =~ /Fran\Cais/ && + $& eq "Francais"; + # COMBINING CEDILLA is two bytes when encoded + ok "Franc\N{COMBINING CEDILLA}ais" =~ /Franc\C\Cais/; + ok "Fran\N{LATIN SMALL LETTER C}ais" =~ /Fran\Xais/ && + $& eq "Francais"; + ok "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais" =~ /Fran\Xais/ && + $& eq "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais"; + ok "Franc\N{COMBINING CEDILLA}ais" =~ /Fran\Xais/ && + $& eq "Franc\N{COMBINING CEDILLA}ais"; + ok "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais" =~ + /Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais/ && + $& eq "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais"; + ok "Franc\N{COMBINING CEDILLA}ais" =~ /Franc\N{COMBINING CEDILLA}ais/ && + $& eq "Franc\N{COMBINING CEDILLA}ais"; + + my @f = ( + ["Fran\N{LATIN SMALL LETTER C}ais", "Francais"], + ["Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais", + "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais"], + ["Franc\N{COMBINING CEDILLA}ais", "Franc\N{COMBINING CEDILLA}ais"], + ); + foreach my $entry (@f) { + my ($subject, $match) = @$entry; + ok $subject =~ /Fran(?:c\N{COMBINING CEDILLA}?| + \N{LATIN SMALL LETTER C WITH CEDILLA})ais/x && + $& eq $match; + } +} + + +{ + use charnames ':full'; + 
note "LATIN SMALL LETTER SHARP S (\N{LATIN SMALL LETTER SHARP S})"; + + ok "\N{LATIN SMALL LETTER SHARP S}" =~ + /\N{LATIN SMALL LETTER SHARP S}/; + ok "\N{LATIN SMALL LETTER SHARP S}" =~ + /\N{LATIN SMALL LETTER SHARP S}/i; + ok "\N{LATIN SMALL LETTER SHARP S}" =~ + /[\N{LATIN SMALL LETTER SHARP S}]/; + ok "\N{LATIN SMALL LETTER SHARP S}" =~ + /[\N{LATIN SMALL LETTER SHARP S}]/i; + + ok "ss" =~ /\N{LATIN SMALL LETTER SHARP S}/i; + ok "SS" =~ /\N{LATIN SMALL LETTER SHARP S}/i; + ok "ss" =~ /[\N{LATIN SMALL LETTER SHARP S}]/i; + ok "SS" =~ /[\N{LATIN SMALL LETTER SHARP S}]/i; + + ok "\N{LATIN SMALL LETTER SHARP S}" =~ /ss/i; + ok "\N{LATIN SMALL LETTER SHARP S}" =~ /SS/i; + + note "Unoptimized named sequence in class"; + ok "ss" =~ /[\N{LATIN SMALL LETTER SHARP S}x]/i; + ok "SS" =~ /[\N{LATIN SMALL LETTER SHARP S}x]/i; + ok "\N{LATIN SMALL LETTER SHARP S}" =~ + /[\N{LATIN SMALL LETTER SHARP S}x]/; + ok "\N{LATIN SMALL LETTER SHARP S}" =~ + /[\N{LATIN SMALL LETTER SHARP S}x]/i; +} + + +{ + use charnames ':full'; + + ok 'aabc' !~ /a\N{PLUS SIGN}b/, '/a\N{PLUS SIGN}b/ against aabc'; + ok 'a+bc' =~ /a\N{PLUS SIGN}b/, '/a\N{PLUS SIGN}b/ against a+bc'; + + ok ' A B' =~ /\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}/, + 'Intermixed named and unicode escapes'; + ok "\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}" =~ + /\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}/, + 'Intermixed named and unicode escapes'; + ok "\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}" =~ + /[\N{SPACE}\N{U+0041}][\N{SPACE}\N{U+0042}]/, + 'Intermixed named and unicode escapes'; +} + + +{ + use charnames ":full"; + ok "\N{ROMAN NUMERAL ONE}" =~ /\p{Alphabetic}/, "I =~ Alphabetic"; + ok "\N{ROMAN NUMERAL ONE}" =~ /\p{Uppercase}/, "I =~ Uppercase"; + ok "\N{ROMAN NUMERAL ONE}" !~ /\p{Lowercase}/, "I !~ Lowercase"; + ok "\N{ROMAN NUMERAL ONE}" =~ /\p{IDStart}/, "I =~ ID_Start"; + ok "\N{ROMAN NUMERAL ONE}" =~ /\p{IDContinue}/, "I =~ ID_Continue"; + ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{Alphabetic}/, "i =~ Alphabetic"; + ok 
"\N{SMALL ROMAN NUMERAL ONE}" !~ /\p{Uppercase}/, "i !~ Uppercase"; + ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{Lowercase}/, "i =~ Lowercase"; + ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{IDStart}/, "i =~ ID_Start"; + ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{IDContinue}/, "i =~ ID_Continue" +} + + -- 1.6.2.4 ```

p5pRT commented 14 years ago

From @schwern

0004-Add-tests-for-similar-rt.cpan.org-56444.patch

```diff
From 5fc9a94f55277ea133571f6e2bc52019b7ed240b Mon Sep 17 00:00:00 2001
From: Michael G. Schwern
Date: Sat, 11 Jul 2009 16:43:13 -0700
Subject: [PATCH 4/4] Add tests for similar rt.cpan.org 56444

---
 t/op/regexp_charnames.t |   27 +++++++++++++++++++++++++--
 1 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/t/op/regexp_charnames.t b/t/op/regexp_charnames.t
index e128671..b7d9b88 100644
--- a/t/op/regexp_charnames.t
+++ b/t/op/regexp_charnames.t
@@ -12,7 +12,7 @@ BEGIN {
     require "./test.pl";
 }
 
-plan tests => 64;
+plan tests => 67;
 
 
 sub may_not_warn {
@@ -93,14 +93,37 @@ sub may_not_warn {
 
 # rt.cpan.org 62056
 # Problem with a variable before a \N{...} in a pattern match
+# Regressions in 5.10.0 from 5.8.8.
 {
-    local $::TODO = "rt.cpan.org 62056";
+    use charnames ":full";
+
+    local $::TODO = "rt.cpan.org 62056 and 56444";
 
+    # rt.cpan.org 62056
     fresh_perl_is(<<'CODE', "Yes", {}, 'variable before \N{...}');
 use charnames ":full";
 $x = "";
 print "Yes" if "$x\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/;
 CODE
+
+    fresh_perl_is(<<'CODE', "Yes", {}, 'variable before \N{...}');
+use charnames ":full";
+$x = "";
+print "Yes" if "$x\N{LATIN CAPITAL LETTER E}" =~ /$x \N{LATIN CAPITAL LETTER E}/x;
+CODE
+
+    # rt.cpan.org 56444
+    fresh_perl_is(<<'CODE', "Yes", {}, '\N{...}+' );
+use charnames ":full";
+my $r1 = qr/\N{THAI CHARACTER SARA I}/;
+my $s1 = "\N{THAI CHARACTER SARA I}" x 2;
+print "Yes" if $s1 =~ /$r1+/;
+CODE
+
+    fresh_perl_is(<<'CODE', "Yes", { switches => ['-w'], stderr => 1 });
+use charnames ":full";
+print "Yes" if "\N{THAI CHARACTER SARA I}" =~ /\N{THAI CHARACTER SARA I}+/;
+CODE
 }
 
 
-- 
1.6.2.4
```

p5pRT commented 14 years ago

From @demerphq

2009/1/8 Moritz Lenz \moritz@casella\.verplant\.org:

Elliot Shank wrote:

# New Ticket Created by Elliot Shank # Please include the string: [perl #62056] # in the subject line of all future correspondence about this issue. # \<URL: http://rt.perl.org/rt3/Ticket/Display.html?id=62056 >

This is a bug report for perl from perl@galumph.com\, generated with the help of perlbug 1.36 running under perl 5.10.0.

----------------------------------------------------------------- [Please enter your report here]

Putting a variable expansion into a regex with a \N{} escape results in a compilation error in 5.10.0. For example\, the following program compiles:

#!/usr/bin/env perl use charnames ':full'; m/\N{START OF HEADING}/

However\, this

#!/usr/bin/env perl use charnames ':full'; m/$x\N{START OF HEADING}/

results in

Constant(\N{START OF HEADING}) unknown: (possibly a missing "use charnames ...") in regex;

This worked in perl-5.8.8\, and fails in perl-5.10.0. So I bisected it\, and this is what git-bisect says is the offending commit:

fc8cd66c26827f6c2ee1aa00ab2d3b3c320a4a28 is first bad commit commit fc8cd66c26827f6c2ee1aa00ab2d3b3c320a4a28 Author: Yves Orton \demerphq@gmail\.com Date: Tue Sep 19 03:37:19 2006 +0200

Re: \N{...} in regular expression [PATCH] Message-ID: \9b18b3110609181637m796d6c16o1b2741edc5f09eb2@mail\.gmail\.com

p4raw-id: //depot/perl@28868

I think the right solution to this problem is to fix charnames.

Making charnames lexically scoped poses serious conceptual difficulties in the regex engine\, for IMO very very little benefit.

IMO we should just make \N{} escapes work always. And disable this silly "charnames not in scope" behaviour. At least in regex patterns. I mean what do we gain?

Yes -- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 14 years ago

From @schwern

demerphq wrote:

I think the right solution to this problem is to fix charnames.

Making charnames lexically scoped poses serious conceptual difficulties in the regex engine\, for IMO very very little benefit.

IMO we should just make \N{} escapes work always. And disable this silly "charnames not in scope" behaviour. At least in regex patterns. I mean what do we gain?

(Note: this is all by someone who doesn't really do much Unicode)

Oddly enough\, this is a case where a wild stretch of backwards compatibility was broken.

$ perl5.5.5 -wle 'print "\N{FOO}"' N{FOO}

$ perl5.6.1 -wle 'print "\N{FOO}"' Constant(\N{...}) unknown: (possibly a missing "use charnames ...") at -e line 1\, within string Execution of -e aborted due to compilation errors.

BUT WHAT IF SOMEONE WAS USING "\N{BLAH}" IN THEIR CODE?!

*ahem*

So the argument that a lexical charnames is protecting against code which does not use charnames is bogus since \N{...} is already globally broken without it.

Seems to me the problem is there's not just one charnames. There's lots of them. :full, :short, :alias, greek, cyrillic... and you can even define your own. How do you know which one is in use?

This comes down to how charnames works. There's not a big table somewhere\, you export a "translator" function... which probably looks at some big table on disk. But it means only one translator can be in effect at any given time. This seems to me like overkill.

On the one hand, who cares? It's not like it's bad if there are too many charname symbols. Consider \N{...} a big namespace and leave it up to the charnames authors to be polite and not clobber each other.

If someone writes a charnames extension I'd like to use why do I have to exclude all the others?

use Encode::JP::Mobile::Charnames; use charnames ":full"; binmode STDOUT\, ":utf8"; print "\N{DoCoMo Beer}\n\N{GREEK SMALL LETTER SIGMA}\n"; __END__ Unknown charname 'DoCoMo Beer' at /usr/local/perl/5.10.0/lib/5.10.0/unicore/Name.pl line 1 � σ

Case in point\, Encode::JP::Mobile::Charnames (the only custom charnames module I could find on CPAN) works around this by falling back to charnames::charnames(). Of course this hack only works if I load it AFTER charnames.

So in order to unlexicalize charnames an additive system would have to be put in place. Perhaps something as simple as a list of translation functions to try. Just keep trying until one works.

-- You are wicked and wrong to have broken inside and peeked at the implementation and then relied upon it. -- tchrist in \31832\.969261130@chthon
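
The "list of translation functions to try" idea from the message above could look roughly like the following. This is only a sketch under stated assumptions: `lookup_charname`, `@translators` and the `EXAMPLE BEER MUG` entry (mapped to a private-use code point) are hypothetical names invented for illustration, not part of charnames or of Encode::JP::Mobile::Charnames. The only real library call is `charnames::vianame`.

```perl
use strict;
use warnings;
use charnames ':full';    # loads the standard Unicode name table

# Hypothetical custom table; the name and code point are made up.
my %custom_names = ( 'EXAMPLE BEER MUG' => 0xE000 );

# Each translator takes a name and returns a code point,
# or undef when it has no answer.
my @translators = (
    sub { $custom_names{ $_[0] } },
    sub { charnames::vianame( $_[0] ) },
);

# Try the translators in order and return the first character found.
sub lookup_charname {
    my ($name) = @_;
    for my $translate (@translators) {
        my $code = $translate->($name);
        return chr $code if defined $code;
    }
    return;    # no translator knew the name
}

printf "U+%04X\n", ord( lookup_charname('EXAMPLE BEER MUG') );
printf "U+%04X\n", ord( lookup_charname('GREEK SMALL LETTER SIGMA') );
```

With a chain like this the load order of the modules would stop mattering for names that only one translator knows, although conflicting names would still be decided by position in the list.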

p5pRT commented 14 years ago

From ben@morrow.me.uk

Quoth schwern@pobox.com (Michael G Schwern):

Seems to me the problem is there's not just one charnames. There's lots of them. :full, :short, :alias, greek, cyrillic... and you can even define your own. How do you know which one is in use?

This comes down to how charnames works. There's not a big table somewhere\, you export a "translator" function... which probably looks at some big table on disk. But it means only one translator can be in effect at any given time. This seems to me like overkill.

On the one hand, who cares? It's not like it's bad if there are too many charname symbols. Consider \N{...} a big namespace and leave it up to the charnames authors to be polite and not clobber each other.

But what about

~% perl -E' {use charnames "latin"; say charnames::viacode(ord "\N{upsilon}")} {use charnames "greek"; say charnames::viacode(ord "\N{upsilon}")}' LATIN SMALL LETTER UPSILON GREEK SMALL LETTER UPSILON

and other cases of conflict? One of the points of charnames is to avoid having to say LATIN SMALL LETTER BLAH WITH MANY EXTRA VERY VERBOSE DIACRITICALS all the time\, in favour of shorter but ambiguous names: it would be a shame to lose that.

Ben

p5pRT commented 14 years ago

From @rgs

2009/7/16 Ben Morrow \ben@morrow\.me\.uk:

Quoth schwern@pobox.com (Michael G Schwern):

Seems to me the problem is there's not just one charnames. There's lots of them. :full, :short, :alias, greek, cyrillic... and you can even define your own. How do you know which one is in use?

This comes down to how charnames works. There's not a big table somewhere\, you export a "translator" function... which probably looks at some big table on disk. But it means only one translator can be in effect at any given time. This seems to me like overkill.

On the one hand, who cares? It's not like it's bad if there are too many charname symbols. Consider \N{...} a big namespace and leave it up to the charnames authors to be polite and not clobber each other.

But what about

~% perl -E' {use charnames "latin"; say charnames::viacode(ord "\N{upsilon}")} {use charnames "greek"; say charnames::viacode(ord "\N{upsilon}")}' LATIN SMALL LETTER UPSILON GREEK SMALL LETTER UPSILON

and other cases of conflict? One of the points of charnames is to avoid having to say LATIN SMALL LETTER BLAH WITH MANY EXTRA VERY VERBOSE DIACRITICALS all the time\, in favour of shorter but ambiguous names: it would be a shame to lose that.

Good point. I like Yves' suggestion\, because it's simple. I think that Karl was doing something about the ambiguous names you're pointing to\, but I might be wrong. Also:

$ perl -E ' use charnames qw(greek latin); say charnames::viacode(ord "\N{upsilon}"); ' LATIN SMALL LETTER UPSILON

$ perl -E ' use charnames qw(latin greek); say charnames::viacode(ord "\N{upsilon}"); ' LATIN SMALL LETTER UPSILON

p5pRT commented 14 years ago

From @rgs

2009/7/16 Michael G Schwern \schwern@pobox\.com:

Oddly enough\, this is a case where a wild stretch of backwards compatibility was broken.

$ perl5.5.5 -wle 'print "\N{FOO}"' N{FOO}

$ perl5.6.1 -wle 'print "\N{FOO}"' Constant(\N{...}) unknown: (possibly a missing "use charnames ...") at -e line 1\, within string Execution of -e aborted due to compilation errors.

BUT WHAT IF SOMEONE WAS USING "\N{BLAH}" IN THEIR CODE?!

*ahem*

Seriously, this is getting a bit tiresome. I don't know where this myth of P5P being opposed to any form of compatibility breakage originated, but it's a myth. And not a very flattering one. No dragons are slain in it. (or vampires)

Did you notice that I recently added meaning for \N alone in regexes? (The opposite of /\n/, for the record.) And what if someone was using it before?

p5pRT commented 14 years ago

From @abigail

On Fri\, Jul 17\, 2009 at 10:02:32AM +0200\, Rafael Garcia-Suarez wrote:

2009/7/16 Michael G Schwern \schwern@pobox\.com:

Oddly enough\, this is a case where a wild stretch of backwards compatibility was broken.

$ perl5.5.5 -wle 'print "\N{FOO}"' N{FOO}

$ perl5.6.1 -wle 'print "\N{FOO}"' Constant(\N{...}) unknown: (possibly a missing "use charnames ...") at -e line 1\, within string Execution of -e aborted due to compilation errors.

BUT WHAT IF SOMEONE WAS USING "\N{BLAH}" IN THEIR CODE?!

*ahem*

Seriously, this is getting a bit tiresome. I don't know where this myth of P5P being opposed to any form of compatibility breakage originated, but it's a myth. And not a very flattering one. No dragons are slain in it. (or vampires)

Did you notice that I recently added meaning for \N alone in regexes? (The opposite of /\n/, for the record.) And what if someone was using it before?

Well\, in 5.10\, \N without braces is actually an error:

$ perl -wE '"" =~ /\N/' Missing braces on \N{} in regex; marked by \<-- HERE in m/\N \<-- HERE / at -e line 1. $

as it is in 5.8.9 (and 5.6.2):

$ /opt/perl/5.8.9/bin/perl -wle '"" =~ /\N/' Missing braces on \N{} at -e line 1\, within pattern Execution of -e aborted due to compilation errors. $

You'd have to go back to the 5.005 era (that is\, the previous century) to be able to have a '\N' in your regexp\, and having it mean 'N'.

Also note the line in perlrebackslash:

If the character following the backslash is a letter or a digit\, then the sequence may be special; if so\, it’s listed below. A few letters have not been used yet\, and escaping them with a backslash is safe for now\, but a future version of Perl may assign a special meaning to it. However\, if you have warnings turned on\, Perl will issue a warning if you use such a sequence. [1].

So, it's not that we aren't documenting the fact that a letter preceded by a backslash may get a special meaning in a later version of Perl.

Perhaps the only people who may get bitten are the ones that have been using 'N' as a regexp delimiter.

p5pRT commented 14 years ago

From @demerphq

On Sat Jul 11 00:56:45 2009\, schwern wrote:

Bizarrely\, it works in an eval.

$ perl5.10.0 -wle 'use charnames ":full"; my $x = ""; print "\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/; print $@'
Constant(\N{LATIN CAPITAL LETTER E}) unknown: (possibly a missing "use charnames ...") in regex; marked by <-- HERE in m/\N{LATIN CAPITAL LETTER E} <-- HERE / at -e line 1.

$ perl5.10.0 -wle 'use charnames ":full"; my $x = ""; print eval q{"\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/}; print $@'
1

The reason it works in the string eval is because the \N{...} is interpolated and resolved before eval is called. In raw code the \N{} is not interpolated before it is handed to the regex engine. Somehow the regex engine is not seeing the charnames data.
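
Until the regex engine sees the charnames data itself, the workaround from the original report can be written out as below: resolve the name in an ordinary string literal first, then interpolate the resulting character. This is a minimal sketch; the `\Q...\E` quoting is an addition not in the original report, included on the assumption that some names may resolve to regex metacharacters.

```perl
use strict;
use warnings;
use charnames ':full';

# The named escape is resolved here, in an ordinary string literal,
# while the charnames translator is still in scope.
my $sara_i = "\N{THAI CHARACTER SARA I}";

# \Q...\E quotes the character in case a name resolves to a metacharacter.
my $r1 = qr/\Q$sara_i\E/;

# The compiled qr// now holds the character itself, so re-interpolating
# it into a larger pattern no longer needs a charnames lookup.
my $s1 = $sara_i x 2;
print "matched\n" if $s1 =~ /^$r1+$/;
```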

p5pRT commented 14 years ago

From @demerphq

The core problem here is that delayed evaluation of charnames directly contradicts expectations of charnames behaviour. Specifically\, delayed evaluation may mean that different parts of a pattern are compiled with different charnames associations.

I can see a few ways to handle this

1. (really hard) Store the charnames in effect with each qr. Hack the concatenation logic and regex compilation logic to be able to handle different charnames associations for different subsections of the pattern.
2. (hard) Figure out better semantics for charnames and fix it.
3. (moderate) Restore old early evaluation of charnames in regexes. This has the downside that if you used something like \N{full stop} it would be the same as putting a literal "." in your pattern, and have the same magic side effects as "dot" does normally in a pattern. IMO this is not desirable.
4. (easy?) Use charnames to resolve the character at toker/compile time, but convert it to an \x{...} escape on the fly when storing it in the regex pattern. That way it always expands to the right character later on, regardless of what charnames handlers are in scope at the time.

Option 4 seems to be clearly the best solution. I'm not too familiar with the toker, but my guess is it is probably fairly easy.
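
To make option 4 concrete, here is a rough sketch of the transformation done on pattern text in plain Perl. It is not the toker change itself: `resolve_named_escapes` and `_name_to_hex_escape` are hypothetical helper names, and the sketch assumes single-code-point names and ignores the case of an escaped backslash in front of the `N`.

```perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical helper: rewrite \N{NAME} in pattern *text* to \x{HEX}.
# Simplification: it does not notice an escaped backslash before the N.
sub resolve_named_escapes {
    my ($pattern) = @_;
    $pattern =~ s/\\N\{([^}]+)\}/_name_to_hex_escape($1)/ge;
    return $pattern;
}

sub _name_to_hex_escape {
    my ($name) = @_;
    my $code = $name =~ /^U\+([0-9A-Fa-f]+)$/
        ? hex($1)
        : charnames::vianame($name);

    # Leave unknown names untouched rather than guess.
    return defined $code ? sprintf( '\x{%X}', $code ) : '\N{' . $name . '}';
}

my $resolved = resolve_named_escapes('\N{LATIN CAPITAL LETTER E}+');
print "$resolved\n";
print "matches\n" if "EE" =~ /^$resolved$/;
```

Because the rewritten text contains only `\x{...}`, it can be stored in a qr// and interpolated later without any charnames handler in scope, and a name such as FULL STOP stays a literal character instead of becoming a metacharacter.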

p5pRT commented 14 years ago

From @obra

On Sat Oct 17 05:52:01 2009\, demerphq wrote:

The core problem here is that delayed evaluation of charnames directly contradicts expectations of charnames behaviour. Specifically\, delayed evaluation may mean that different parts of a pattern are compiled with different charnames associations.

I can see a few ways to handle this

1. (really hard) Store the charnames in effect with each qr. Hack the concatenation logic and regex compilation logic to be able to handle different charnames associations for different subsections of the pattern.
2. (hard) Figure out better semantics for charnames and fix it.
3. (moderate) Restore old early evaluation of charnames in regexes. This has the downside that if you used something like \N{full stop} it would be the same as putting a literal "." in your pattern, and have the same magic side effects as "dot" does normally in a pattern. IMO this is not desirable.
4. (easy?) Use charnames to resolve the character at toker/compile time, but convert it to an \x{...} escape on the fly when storing it in the regex pattern. That way it always expands to the right character later on, regardless of what charnames handlers are in scope at the time.

Option 4 seems to be clearly the best solution. I'm not too familiar with the toker, but my guess is it is probably fairly easy.

Yves\,

My recollection is that you said this bug was in the "easy to fix\, but not the end of the world if we ship 5.12.0 without it" category. Is that right?

If so, I'll remove its blocking-ness.

Jesse

p5pRT commented 14 years ago

From @khwilliamson

I have been thinking about this\, and see an issue with the proposed solution: "they should be resolved and then converted to \x{...} not preserved verbatim".

The problem is that qr/\N{LATIN CAPITAL LETTER A}/ currently implies that the regex is to have Unicode semantics, and its resolved equivalent, \x41, does not. Hence, the conversion loses information, and causes breakage.

One solution is to not resolve to \x\, but to resolve to \N{U+41} instead\, so that the information would be preserved\, and this construct does not require charnames to be in scope.

But the problem is: currently the \N{U+...} constructs don't imply Unicode semantics\, neither in toke.c nor in regcomp.c\, the two places they're handled. I view this as a bug because there shouldn't be a semantic difference between the two forms of \N{}\, and intended to fix it when I got a chance.

Another solution is to use the \x conversion\, but pass on that the expression is supposed to have unicode semantics by adding a letter to the modifier list\, so that it looks like (?u-xism:\x{41}). This letter is the direction that Yves was heading for in 5.14 anyway\, and I have it implemented. The letter would mean use utf8 for the regex in 5.12 (but when we get regexes fixed so they can have unicode semantics without being in utf8\, the letter would mean merely to use unicode semantics). And we can defer until a later release allowing users to have a /u

If you want me to work on this\, I will\, whichever way is decided. I think I prefer the first one.

p5pRT commented 14 years ago

From zefram@fysh.org

karl williamson wrote:

The problem is that qr/\N{LATIN CAPITAL LETTER A}/ currently implies
that the regex is to have Unicode semantics, and its resolved
equivalent, \x41, does not.

This sounds like a bug. I wouldn't work too hard to maintain the distinction.

-zefram

p5pRT commented 14 years ago

From @tux

On Mon\, 25 Jan 2010 09:32:55 -0700\, karl williamson \public@khwilliamson\.com wrote:

I have been thinking about this\, and see an issue with the proposed solution: "they should be resolved and then converted to \x{...} not preserved verbatim".

The problem is that qr/\N{LATIN CAPITAL LETTER A}/ currently implies that the regex is to have Unicode semantics, and its resolved equivalent, \x41, does not. Hence, the conversion loses information, and causes breakage.

Isn't \x{0041} (note the 4 positions) guaranteed to be Unicode, where \x41 and \x{41} are not?

One solution is to not resolve to \x\, but to resolve to \N{U+41} instead\, so that the information would be preserved\, and this construct does not require charnames to be in scope.

Nice!

But the problem is: currently the \N{U+...} constructs don't imply Unicode semantics\, neither in toke.c nor in regcomp.c\, the two places they're handled. I view this as a bug because there shouldn't be a semantic difference between the two forms of \N{}\, and intended to fix it when I got a chance.

Another solution is to use the \x conversion\, but pass on that the expression is supposed to have unicode semantics by adding a letter to the modifier list\, so that it looks like (?u-xism:\x{41}). This letter is the direction that Yves was heading for in 5.14 anyway\, and I have it implemented. The letter would mean use utf8 for the regex in 5.12 (but when we get regexes fixed so they can have unicode semantics without being in utf8\, the letter would mean merely to use unicode semantics). And we can defer until a later release allowing users to have a /u

If you want me to work on this\, I will\, whichever way is decided. I think I prefer the first one.

-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using & porting perl 5.6.2\, 5.8.x\, 5.10.x\, 5.11.x on HP-UX 10.20\, 11.00\, 11.11\, 11.23\, and 11.31\, OpenSuSE 10.3\, 11.0\, and 11.1\, AIX 5.2 and 5.3. http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented 14 years ago

From @tux

On Mon\, 25 Jan 2010 17:56:22 +0100\, "H.Merijn Brand" \h\.m\.brand@xs4all\.nl wrote:

On Mon\, 25 Jan 2010 09:32:55 -0700\, karl williamson \public@khwilliamson\.com wrote:

I have been thinking about this\, and see an issue with the proposed solution: "they should be resolved and then converted to \x{...} not preserved verbatim".

The problem is that qr/\N{LATIN CAPITAL LETTER A}/ currently implies that the regex is to have Unicode semantics, and its resolved equivalent, \x41, does not. Hence, the conversion loses information, and causes breakage.

Isn't \x{0041} (note the 4 positions) guaranteed to be Unicode, where \x41 and \x{41} are not?

Apparently not

pc09:/home/merijn 111 > perl -MDP -we'DDump"\x{20ac}"'
SV = PV(0x743298) at 0x745328
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK,UTF8)
  PV = 0x76e630 "\342\202\254"\0 [UTF8 "\x{20ac}"]
  CUR = 3
  LEN = 8
pc09:/home/merijn 112 > perl -MDP -we'DDump"\x{0081}"'
SV = PV(0x743298) at 0x745328
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK)
  PV = 0x76e630 "\201"\0
  CUR = 1
  LEN = 8
pc09:/home/merijn 113 > perl -MDP -we'DDump"\x{0041}"'
SV = PV(0x743298) at 0x745328
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK)
  PV = 0x76e630 "A"\0
  CUR = 1
  LEN = 8
pc09:/home/merijn 114 > perl -MDP -we'DDump"\N{U+41}"'
SV = PV(0x743298) at 0x745328
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK)
  PV = 0x76e630 "A"\0
  CUR = 1
  LEN = 8
pc09:/home/merijn 115 >

-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using & porting perl 5.6.2\, 5.8.x\, 5.10.x\, 5.11.x on HP-UX 10.20\, 11.00\, 11.11\, 11.23\, and 11.31\, OpenSuSE 10.3\, 11.0\, and 11.1\, AIX 5.2 and 5.3. http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
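
The same check can be made without a dumper module, using the built-in `utf8::is_utf8`, which reports the internal UTF8 flag shown under FLAGS above. This is a small sketch covering only the three `\x{...}` literals; `"\N{U+41}"` is deliberately left out because its result may depend on the perl version.

```perl
use strict;
use warnings;

# Map a printable label to the string literal it describes.
my %samples = (
    '\x{20ac}' => "\x{20ac}",
    '\x{0081}' => "\x{0081}",
    '\x{0041}' => "\x{0041}",
);

# Report whether each literal carries the internal UTF8 flag.
for my $label ( sort keys %samples ) {
    printf "%-10s UTF8 flag %s\n", $label,
        utf8::is_utf8( $samples{$label} ) ? 'on' : 'off';
}
```

The number of hex digits inside the braces makes no difference; only a code point above 0xFF forces the flag on in these literals.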

p5pRT commented 14 years ago

From @iabyn

On Mon\, Jan 25\, 2010 at 09:32:55AM -0700\, karl williamson wrote:

I have been thinking about this\, and see an issue with the proposed
solution: "they should be resolved and then converted to \x{...} not
preserved verbatim".

There's another issue with it as well\, principally that \N{NAME} can expand to a multi-codepoint string\, in which case the literal

/\N{FOO}+/

can't simply be translated to

/\x{F001}\x{F002}+/

(assuming that's what FOO expands to)\, and instead needs to be something like

/(?:\x{F001}\x{F002})+/

but then that screws up character classes:

/[\N{FOO}]/

becomes

/[(?:\x{F001}\x{F002})]/ ???

-- The warp engines start playing up a bit\, but seem to sort themselves out after a while without any intervention from boy genius Wesley Crusher. -- Things That Never Happen in "Star Trek" #17
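
Both problems described above can be reproduced with plain `\x{...}` escapes, no charnames involved. The sketch below assumes, as in the message, that FOO expands to U+F001 U+F002; the variable names are made up for the example.

```perl
use strict;
use warnings;

# Assume, as in the message above, that FOO expands to U+F001 U+F002.
my $foo     = "\x{F001}\x{F002}";
my $subject = $foo x 2;

# Naive translation: the + binds only to the last code point.
my $naive = $subject =~ /^\x{F001}\x{F002}+$/ ? 1 : 0;

# Grouped translation: the + applies to the whole sequence.
my $grouped = $subject =~ /^(?:\x{F001}\x{F002})+$/ ? 1 : 0;

# Inside a class the group syntax is just a set of literal characters,
# so the class happily matches an opening parenthesis.
my $class = '(' =~ /^[(?:\x{F001}\x{F002})]$/ ? 1 : 0;

print "naive=$naive grouped=$grouped class_matches_paren=$class\n";
```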

p5pRT commented 14 years ago

From @khwilliamson

Dave Mitchell wrote:

On Mon\, Jan 25\, 2010 at 09:32:55AM -0700\, karl williamson wrote:

I have been thinking about this\, and see an issue with the proposed
solution: "they should be resolved and then converted to \x{...} not
preserved verbatim".

There's another issue with it as well\, principally that \N{NAME} can expand to a multi-codepoint string\, in which case the literal
/\N{FOO}+/
can't simply be translated to
/\x{F001}\x{F002}+/
(assuming that's what FOO expands to), and instead needs to be something like
/(?:\x{F001}\x{F002})+/
but then that screws up character classes:
/[\N{FOO}]/

becomes

/[(?:\x{F001}\x{F002})]/ ???

Actually\, the data file that charnames uses has never had multi-code points in it. I looked a while back to see if user-defined ones could be multi\, and it didn't appear to allow that\, but maybe I overlooked something. I was planning to add support for named sequences (where a multi-code point string is returned) for 5.14\, and have worked out an easy way to get character classes to allow them\, namely to use a transformation based on the fact that [a-c] is a shorthand for a|b|c\, so that /[(?:\x{F001}\x{F002})] is (?:\x{F001}\x{F002})

p5pRT commented 14 years ago

From @khwilliamson

Yves Orton wrote:

On Mon\, 2010-01-25 at 18:18 +0000\, Dave Mitchell wrote:
On Mon\, Jan 25\, 2010 at 09:32:55AM -0700\, karl williamson wrote:

I have been thinking about this\, and see an issue with the proposed
solution: "they should be resolved and then converted to \x{...} not
preserved verbatim". There's another issue with it as well\, principally that \N{NAME} can expand to a multi-codepoint string\, in which case the literal
/\N{FOO}+/
can't simply be translated to
/\x{F001}\x{F002}+/
Hrm\, except that well\, this behaviour is not well defined.

The history of this behaviour is like this:

1. At something like the beginning we had \N{..} escapes being interpolated by the toker at compile time\, significantly before the regex engine compiled the pattern.

2. Later we realized it is incorrect for /a\N{U+7c}b/ to behave the same as /a|b/ so we changed it so that \N{...} parsing was deferred to the regex engine.

3. Now we are in the position where we realized that the lexical nature of charnames binding, combined with how qr// is supposed to work, leaves us in a fundamental quandary. If we want to defer the charnames binding then we have to remember what charnames pragmata were in scope for a pattern, and not only that, we have to deal with the possibility that two qr// objects which were compiled in different scopes may be embedded in another. In this case we may even end up with \N{FOO} meaning two totally different sequences of characters in two different places in the string.

So\, it seems to me that if we do the "convert to \x{..} escapes early" as proposed we are no worse off than we were when \N{...} escapes were introduced to the regex engine\, and we resolve the annoying problem with charnames and qr// entirely\, and we are better off than we were then as \N{U+2E} stops being a regex meta-character equivalent to /./
(assuming that's what FOO expands to)\, and instead needs to be something like
/(?:\x{F001}\x{F002})+/
Yeah, since we are doing pattern translations /anyway/ adding in the (?:..) as well won't hurt anything. However....
but then that screws up character classes:
/[\N{FOO}]/

becomes

/[(?:\x{F001}\x{F002})]/ ???
The behavior of \N{...} in charclasses was never well defined and therefore I do not think that this is a problem.

Nevertheless, IMO the option truest to the state of the regex engine when charnames were invented is the unwrapped raw \x{...} translation.

cheers\, Yves

I don't see any reasons in your post for preferring \x{7c} over \N{U+7C} as a translation of \N{VERTICAL LINE}, though you say you do. My claim is that \N{U+7C} is preferable because it allows us to keep the Unicodeness implied by the original. And no one is proposing that it be changed to have the meta symbol effect of '|'. Please elaborate.
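
For readers following the `|` discussion, this sketch shows the difference between a character that is interpolated into the pattern text early and an escape that reaches the regex compiler intact. The variable names are invented for the example, and the `\N{U+7C}` line assumes a perl that accepts that form inside a pattern (5.12 or later is assumed here).

```perl
use strict;
use warnings;
use charnames ':full';

# Early interpolation: the character becomes part of the pattern text.
my $bar   = "\N{VERTICAL LINE}";
my $early = "a${bar}b";    # the pattern text is now "a|b"

# "a|b" as pattern text is an alternation, so a lone "a" matches.
my $early_matches_a = 'a' =~ /^(?:$early)$/ ? 1 : 0;

# Escapes that reach the regex compiler stay literal characters.
my $hex_matches_a     = 'a'   =~ /^a\x{7c}b$/ ? 1 : 0;
my $hex_matches_a_bar = 'a|b' =~ /^a\x{7c}b$/ ? 1 : 0;

# Assumption: perl 5.12 or later for \N{U+...} inside a pattern.
my $named_matches_a_bar = 'a|b' =~ /^a\N{U+7C}b$/ ? 1 : 0;

print "early=$early_matches_a hex_a=$hex_matches_a ",
    "hex_a_bar=$hex_matches_a_bar named_a_bar=$named_matches_a_bar\n";
```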

p5pRT commented 14 years ago

From @demerphq

On Mon\, 2010-01-25 at 11:48 -0700\, karl williamson wrote: Dave Mitchell wrote:

On Mon\, Jan 25\, 2010 at 09:32:55AM -0700\, karl williamson wrote:

I have been thinking about this\, and see an issue with the proposed solution: "they should be resolved and then converted to \x{...} not preserved verbatim".

There's another issue with it as well\, principally that \N{NAME} can expand to a multi-codepoint string\, in which case the literal
/\N{FOO}+/
can't simply be translated to
/\x{F001}\x{F002}+/
(assuming that's what FOO expands to), and instead needs to be something like
/(?:\x{F001}\x{F002})+/
but then that screws up character classes:
/[\N{FOO}]/

becomes

/[(?:\x{F001}\x{F002})]/ ???
Actually\, the data file that charnames uses has never had multi-code points in it. I looked a while back to see if user-defined ones could be multi\, and it didn't appear to allow that\, but maybe I overlooked something.

They can be multi. And they can actually return a different value every time. Even such that they return a different value on different passes of the regex compiler. Since the regex engine does multi-pass compiling with the first pass being used to calculate storage space\, this means that if we don't preserve the value ourselves then on the second or third pass they can easily construct a buffer overrun\, if not an outright code injection attack. Hence the reason for preserving the results in the hash like that.
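
The caching idea described above can be sketched roughly as follows. This is only an illustration in plain Perl, not the regex engine's actual code: `translator` is a hypothetical, deliberately unstable callback, and `make_cached_translator` is an invented helper name.

```perl
use strict;
use warnings;

# Hypothetical, deliberately unstable translator: it returns a longer
# string on every call, which is the hazard a multi-pass compiler faces.
my $calls = 0;
sub translator {
    my ($name) = @_;
    return "x" x ++$calls;
}

# Wrap a translator so each name is resolved once and the stored result
# is reused on every later lookup (sizing pass and emitting pass alike).
sub make_cached_translator {
    my ($raw) = @_;
    my %seen;
    return sub {
        my ($name) = @_;
        $seen{$name} = $raw->($name) unless exists $seen{$name};
        return $seen{$name};
    };
}

my $cached      = make_cached_translator(\&translator);
my $first_pass  = $cached->("FOO");   # resolved by calling translator once
my $second_pass = $cached->("FOO");   # served from the cache
print length($first_pass) == length($second_pass) ? "stable\n" : "unstable\n";
```

With the cache in place, the size computed on the first lookup and the string emitted on a later lookup cannot disagree, whatever the underlying callback does.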

I was planning to add support for named sequences (where a multi-code point string is returned) for 5.14\, and have worked out an easy way to get character classes to allow them\, namely to use a transformation based on the fact that [a-c] is a shorthand for a|b|c\, so that /[(?:\x{F001}\x{F002})] is (?:\x{F001}\x{F002})

I agree. This is what should happen. Sortof. The problem here is that if this translation is done at compile time it will be done before the regex engine sees the \N{...} and thus you wont be able to tell if they really meant that charclass or not.

So hrm....

Seems to me that we really need a /new/ way to represent a sequence of codepoints\, preferably in a single escape structure. We can use up one of the remaining escape letters for it\, or extend \x{...} to do it\, or extend \N{U+...} to do it. The latter seems the superior option\, as you have pointed out\, as it allows us to preserve the "must flag the pattern as unicode" behaviour which \x{...} does not. So\, we need to support \N{U+1:2:3:4} or something like that.

/[\N{foo}]/

would convert into

/[\N{U+1:2:3:4}]/

and the regex engine would know this really was meant to be a multibyte sequence\, and since it was previously illegal there is no ambiguity with any legal existing syntax. For now it would work fine if it just matched \N{U+1}\, which is I think the current behavior (After warning).

Additionally the non-charclass part of the engine would need to treat \N{U+1:2:3:4} the same as (?:\x{1}\x{2}\x{3}\x{4}) but that should be fairly easy I think.
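
As a rough illustration of that equivalence, the sketch below rewrites the proposed multi-code-point escape into a non-capturing group at the string level. The `\N{U+1:2:3:4}` syntax is only the proposal from this message, and `expand_sequence_escapes` and `to_group` are invented names; a real implementation would live inside the regex compiler rather than in a textual substitution.

```perl
use strict;
use warnings;

# Turn a separator-joined list of hex code points ("1:2:3:4" or
# "1.2.3.4") into a non-capturing group of \x{...} escapes.
sub to_group {
    my ($list) = @_;
    return '(?:' . join('', map { "\\x{$_}" } split(/[:.]/, $list)) . ')';
}

# Matches only the proposed multi-code-point form; a single code point
# such as \N{U+7C} has no separator and is left untouched.
my $seq_re = qr/\\N\{U\+([0-9A-Fa-f]+(?:[:.][0-9A-Fa-f]+)+)\}/;

sub expand_sequence_escapes {
    my ($pattern) = @_;
    $pattern =~ s/$seq_re/to_group($1)/ge;
    return $pattern;
}

my $expanded = expand_sequence_escapes('\N{U+61:62}+');
print "$expanded\n";                     # (?:\x{61}\x{62})+
print "abab" =~ /^$expanded$/ ? "match\n" : "no match\n";
```

Wrapping the sequence in `(?:...)` is what keeps a trailing quantifier applying to the whole sequence rather than to its last code point only.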

cheers\, Yves

p5pRT commented 14 years ago

From @demerphq

2010/1/25 karl williamson \public@khwilliamson\.com:

Yves Orton wrote:

On Mon\, 2010-01-25 at 18:18 +0000\, Dave Mitchell wrote:

On Mon\, Jan 25\, 2010 at 09:32:55AM -0700\, karl williamson wrote:

I have been thinking about this\, and see an issue with the proposed solution: "they should be resolved and then converted to \x{...} not preserved verbatim".

There's another issue with it as well\, principally that \N{NAME} can expand to a multi-codepoint string\, in which case the literal

/\N{FOO}+/

can't simply be translated to

/\x{F001}\x{F002}+/

Hrm\, except that well\, this behaviour is not well defined.

The history of this behaviour is like this:

1. At something like the beginning we had \N{..} escapes being interpolated by the toker at compile time\, significantly before the regex engine compiled the pattern.

2. Later we realized it is incorrect for /a\N{U+7c}b/ to behave the same as /a|b/ so we changed it so that \N{...} parsing was deferred to the regex engine.

3. Now we are in the position where we realized that the lexical nature of charnames binding\, combined with how qr// is supposed to work leaves us in a fundamental quandary. If we want to defer the charnames binding then we have to remember what charnames pragmata were in scope for a pattern\, and not only that\, we have to deal with the possibility that two qr// objects which were compiled in different scopes may be embedded in another. In this case we may even end up with \N{FOO} meaning two totally different sequences of characters in two different places in the string.

So\, it seems to me that if we do the "convert to \x{..} escapes early" as proposed we are no worse off than we were when \N{...} escapes were introduced to the regex engine\, and we resolve the annoying problem with charnames and qr// entirely\, and we are better off than we were then as \N{U+2E} stops being a regex meta-character equivalent to /./

(assuming that's what FOO expands to)\, and instead needs to be something like

/(?:\x{F001}\x{F002})+/

Yeah\, since we are doing pattern translations /anyway/ adding in the (?:..) as well wont hurt anything. However....

but then that screws up character classes:

/[\N{FOO}]/

becomes

/[(?:\x{F001}\x{F002})]/ ???

The behavior of \N{...} in charclasses was never well defined and therefore I do not think that this is a problem.

Nevertheless\, IMO the option truest to the state of the regex engine when charnames were invented is the unwrapped raw \x{...} translation.

cheers\, Yves

I don't see any reasons in your post to preferring \x{7c} over \N{U+7C} as a translation of \N{VERTICAL LINE}\, though you say you do. My claim is that \N{U+7C} is preferable because it allows us to keep the Unicodeness implied by the original. And no one is proposing that it be changed to have the

Sorry\, in the above reply\, when I said "convert to \x{..} escapes early" I was thinking of the abstract sense of "an escape of some sort" not that it must be literally an \x{...}\, just that it must be an escape of some sort and not a literal vertical line/aka "alternation regex-metachar".

The \N{U+7C} escape form is better as you have explained\, sorry for confusing things.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 14 years ago

From tchrist@perl.com

In-Reply-To: Message from demerphq \demerphq@gmail\.com of "Mon\, 25 Jan 2010 20:27:37 +0100." \9b18b3111001251127j5fbfce26p215d40bee5c580b7@mail\.gmail\.com

They can be multi.

They can? Is this due solely to Perl's use of aliases via \N{} charnames\, or does the Unicode standard somewhere specify that this be supported?

I confess I've wanted a way to simply say something like

use charnames qw[:full :alias] => {
    LAQ          => "SINGLE LEFT-POINTING ANGLE QUOTATION MARK",
    RAQ          => "SINGLE RIGHT-POINTING ANGLE QUOTATION MARK",
    d_dental     => [ "LATIN SMALL LETTER D", "COMBINING BRIDGE BELOW" ],
    b_approx     => [ "GREEK SMALL LETTER BETA", "COMBINING DOWN TACK BELOW" ],
    aw_chicago   => [ "SMALL LETTER OPEN O", "COMBINING DIAERESIS", "COMBINING DOWN TACK BELOW" ],
    dezh_breath  => [ "LATIN SMALL LETTER DEZH DIGRAPH", "COMBINING DIAERESIS BELOW" ],
    tezh_affric1 => [ "LATIN SMALL LETTER TESH DIGRAPH", "MODIFIER LETTER SMALL H" ],
    tezh_affric2 => [ "LATIN SMALL LETTER T", "COMBINING DOUBLE INVERTED BREVE", "LATIN SMALL LETTER ESH", "MODIFIER LETTER SMALL H" ],
    tezh_affric  => "tezh_affric1",
};

print "Affricated tezh #1 is \N{LAQ}\N{tezh_affric}\N{RAQ}\n";
print "Affricated tezh #2 is \N{LAQ}\N{tezh_affric2}\N{RAQ}\n";

and have that "just work" without having to go writing a module with a custom import that uses

$^H{charnames} = \&translator;

And they can actually return a different value every time. Even such that they return a different value on different passes of the regex compiler.

Because &translator isn't implicitly memoized.

Since the regex engine does multi-pass compiling with the first pass being used to calculate storage space\, this means that if we dont preserve the value ourselves then on the second or third pass they can easily construct a buffer overrun\, if not an outright code injection attack. Hence the reason for preserving the results in the hash like that.

Gee\, thanks for the nightmares. :)

I was planning to add support for named sequences (where a multi-code point string is returned) for 5.14\, and have worked out an easy way to get character classes to allow them\, namely to use a transformation based on the fact that [a-c] is a shorthand for a|b|c\, so that /[(?:\x{F001}\x{F002})] is (?:\x{F001}\x{F002})

I agree. This is what should happen. Sortof. The problem here is that if this translation is done at compile time it will be done before the regex engine sees the \N{...} and thus you wont be able to tell if they really meant that charclass or not.

/[\N{foo}]/

would convert into

/[\N{U+1:2:3:4}]/

I can easily generate \N{U+1.2.3.4} via

printf "\\N{U+%04vX}"\, $str;

but not the colon version. Given that code points are non-negative integers\, is there a reason to prefer the colon version over the dot version?
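
For comparison, here is a small sketch that produces either separator from the same string without relying on the vector flag's default join. `sequence_escape` is an invented helper, and both output forms are still only proposals at this point in the thread. The last line uses the vector flag with an explicit join string, which perlfunc documents for sprintf (its example formats an IPv6 address with ":"), so the colon form may be just as reachable that way.

```perl
use strict;
use warnings;

# Format the code points of a string as a \N{U+...} escape, joined by
# a caller-chosen separator. Both separators are only proposals here.
sub sequence_escape {
    my ($str, $sep) = @_;
    my @hex = map { sprintf "%04X", ord } split //, $str;
    return '\N{U+' . join($sep, @hex) . '}';
}

my $str = "\x{F001}\x{F002}";
print sequence_escape($str, "."), "\n";   # \N{U+F001.F002}
print sequence_escape($str, ":"), "\n";   # \N{U+F001:F002}

# Vector flag with an explicit join string taken from the argument list.
print sprintf('\N{U+%*vX}', ":", $str), "\n";
```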

and the regex engine would know this really was meant to be a multibyte sequence\, and since it was previously illegal there is no ambiguity with any legal existing syntax. For now it would work fine if it just matched \N{U+1}\, which is I think the current behavior (After warning).

Additionally the non-charclass part of the engine engine would need to treat \N{U+1:2:3:4} the same as (?:\x{1}\x{2}\x{3}\x{4}) but that should be fairly easy I think.

I should think so. Some are harder though\, like these

[X[:digit:]Y[:^word:]\N{many}Z]
[^X[:digit:]Y[:^word:]\N{many}Z]

becoming something like

(?:X|[:digit:]|Y|[:^word:]|\N{many}|Z)
(?:(?!X|[:digit:]|Y|[:^word:]|\N{many}|Z)(?s:.))

although I imagine those would really be \p{prop}\, \P{prop}\, and \N{U+c1:c2:c3} (well\, or \N{U+c1.c2.c3} )\, or lastly some (?u:\x{C1}\x{c2}\x{c3}) equivalent.

--tom

p5pRT commented 14 years ago

From @demerphq

2010/1/25 Tom Christiansen \tchrist@perl\.com:

In-Reply-To: Message from demerphq \demerphq@gmail\.com of "Mon\, 25 Jan 2010 20:27:37 +0100." \9b18b3111001251127j5fbfce26p215d40bee5c580b7@mail\.gmail\.com

They can be multi.

They can? Is this due solely to Perl's uses of aliases in via \N{} charnames\, or does the Unicode standard somewhere specify that this be supported?

I think it was because it seemed like a good idea at the time and maybe it could be useful.

I confess I've wanted a way to simply say something like

use charnames qw[:full :alias] => { LAQ => "SINGLE LEFT-POINTING ANGLE QUOTATION MARK"\, RAQ => "SINGLE RIGHT-POINTING ANGLE QUOTATION MARK"\, d_dental => [ "LATIN SMALL LETTER D"\, "COMBINING BRIDGE BELOW" ]\, b_approx => [ "GREEK SMALL LETTER BETA"\, "COMBINING DOWN TACK BELOW" ]\, aw_chicago => [ "SMALL LETTER OPEN O"\, "COMBINING DIAERESIS"\, "COMBINING DOWN TACK BELOW" ]\, dezh_breath => [ "LATIN SMALL LETTER DEZH DIGRAPH"\, "COMBINING DIAERESIS BELOW" ]\, tezh_affric1 => [ "LATIN SMALL LETTER TESH DIGRAPH"\, "MODIFIER LETTER SMALL H" ]\, tezh_affric2 => [ "LATIN SMALL LETTER T"\, "COMBINING DOUBLE INVERTED BREVE"\, "LATIN SMALL LETTER ESH"\, "MODIFIER LETTER SMALL H" ]\, tezh_affric => "tezh_affric1"\, };

print "Affricated tezh #1 is \N{LAQ}\N{tezh_affric}\N{RAQ}\n"; print "Affricated tezh #2 is \N{LAQ}\N{tezh_affric2}\N{RAQ}\n";

and have that "just work" without having to go writing a module with a custom import that uses

$^H{charnames} = \&translator;

I have to admit I forget the details. I think the test files probably include the best example\, as I'm pretty sure I tested this.

And they can actually return a different value every time. Even such that they return a different value on different passes of the regex compiler.

Because &translator isn't implicitly memoized.

Right.

Since the regex engine does multi-pass compiling with the first pass being used to calculate storage space\, this means that if we dont preserve the value ourselves then on the second or third pass they can easily construct a buffer overrun\, if not an outright code injection attack. Hence the reason for preserving the results in the hash like that.

Gee\, thanks for the nightmares. :)

Just try to pretend I didnt say "third pass"\, and be glad I didnt explain why we need one and be really glad I failed to mention the possible fourth pass which I'm pretty sure sometimes also happens. :-(

I was planning to add support for named sequences (where a multi-code point string is returned) for 5.14\, and have worked out an easy way to get character classes to allow them\, namely to use a transformation based on the fact that [a-c] is a shorthand for a|b|c\, so that /[(?:\x{F001}\x{F002})] is (?:\x{F001}\x{F002})

I agree. This is what should happen. Sortof. The problem here is that if this translation is done at compile time it will be done before the regex engine sees the \N{...} and thus you wont be able to tell if they really meant that charclass or not.

/[\N{foo}]/

would convert into

/[\N{U+1:2:3:4}]/

I can easily generate \N{U+1.2.3.4} via

printf "\\N{U+%04vX}"\, $str;

but not the colon version. Given that code points are non- negative integers\, is there a reason to prefer the colon version over the dot version?

Er\, no not really. I just thought it looked better than commas and it didnt occur to me to try dots. And well\, no reason we cant TMTOWTDI :-)

and the regex engine would know this really was meant to be a multibyte sequence\, and since it was previously illegal there is no ambiguity with any legal existing syntax. For now it would work fine if it just matched \N{U+1}\, which is I think the current behavior (After warning).

Additionally the non-charclass part of the engine engine would need to treat \N{U+1:2:3:4} the same as (?:\x{1}\x{2}\x{3}\x{4}) but that should be fairly easy I think.

I should think so. Some are harder though\, like these

[X[:digit:]Y[:^word:]\N{many}Z]
[^X[:digit:]Y[:^word:]\N{many}Z]

becoming something like

(?:X|[:digit:]|Y|[:^word:]|\N{many}|Z)
(?:(?!X|[:digit:]|Y|[:^word:]|\N{many}|Z)(?s:.))

Im not getting you actually. it could be

[X[:digit:]Y[:^word:]${first_of_many}Z](?:(?<=${first_of_many})$rest_of_many)

It could also be internally:

(?:[X[:digit:]Y[:^word:]Z]|\N{many})

Alternatively\, consider that a character class is equivalent to a trie with depth 1. So charclasses could be eliminated/merged with tries (they nearly are already) and then this would just be a trie node constructed directly.

although I imagine those would really be \p{prop}\, \P{prop}\, and \N{U+c1:c2:c3} (well\, or \N{U+c1.c2.c3} )\, or lastly some (?u:\x{C1}\x{c2}\x{c3}) equivalent.

Like i said\, (?u:...) doesnt save us here. We need a new *escape* for this purpose\, as (?u:...) would always be a legitimate content in a charclass\, and the time we would do the transformation would occur at such an early time that we couldnt tell we were in a charclass\, or so late that we couldnt tell that the string was originally an \N{..}.

So we have to use a new/modified escape sequence. And for the reasons that Karl pointed out earlier\, that either means something new\, or using \N{U+...}.

cheers\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 14 years ago

From @khwilliamson

demerphq wrote:

2010/1/25 Tom Christiansen \tchrist@perl\.com:

In-Reply-To: Message from demerphq \demerphq@gmail\.com of "Mon\, 25 Jan 2010 20:27:37 +0100." \9b18b3111001251127j5fbfce26p215d40bee5c580b7@mail\.gmail\.com

They can be multi. They can? Is this due solely to Perl's uses of aliases in via \N{} charnames\, or does the Unicode standard somewhere specify that this be supported?

I think it was because it seemed like a good idea at the time and maybe it could be useful.

I confess I've wanted a way to simply say something like

use charnames qw[:full :alias] => { LAQ => "SINGLE LEFT-POINTING ANGLE QUOTATION MARK"\, RAQ => "SINGLE RIGHT-POINTING ANGLE QUOTATION MARK"\, d_dental => [ "LATIN SMALL LETTER D"\, "COMBINING BRIDGE BELOW" ]\, b_approx => [ "GREEK SMALL LETTER BETA"\, "COMBINING DOWN TACK BELOW" ]\, aw_chicago => [ "SMALL LETTER OPEN O"\, "COMBINING DIAERESIS"\, "COMBINING DOWN TACK BELOW" ]\, dezh_breath => [ "LATIN SMALL LETTER DEZH DIGRAPH"\, "COMBINING DIAERESIS BELOW" ]\, tezh_affric1 => [ "LATIN SMALL LETTER TESH DIGRAPH"\, "MODIFIER LETTER SMALL H" ]\, tezh_affric2 => [ "LATIN SMALL LETTER T"\, "COMBINING DOUBLE INVERTED BREVE"\, "LATIN SMALL LETTER ESH"\, "MODIFIER LETTER SMALL H" ]\, tezh_affric => "tezh_affric1"\, };

print "Affricated tezh #1 is \N{LAQ}\N{tezh_affric}\N{RAQ}\n"; print "Affricated tezh #2 is \N{LAQ}\N{tezh_affric2}\N{RAQ}\n";

and have that "just work" without having to go writing a module with a custom import that uses

$^H{charnames} = \&translator;

I have to admit i forget the details. I think the test files probably include the best example\, as im pretty sure i tested this.

And they can actually return a different value every time. Even such that they return a different value on different passes of the regex compiler. Because &translator isn't implicitly memoized.

Right.

Since the regex engine does multi-pass compiling with the first pass being used to calculate storage space\, this means that if we dont preserve the value ourselves then on the second or third pass they can easily construct a buffer overrun\, if not an outright code injection attack. Hence the reason for preserving the results in the hash like that. Gee\, thanks for the nightmares. :)

Just try to pretend I didnt say "third pass"\, and be glad I didnt explain why we need one and be really glad I failed to mention the possible fourth pass which I'm pretty sure sometimes also happens. :-(

I was planning to add support for named sequences (where a multi-code point string is returned) for 5.14\, and have worked out an easy way to get character classes to allow them\, namely to use a transformation based on the fact that [a-c] is a shorthand for a|b|c\, so that /[(?:\x{F001}\x{F002})] is (?:\x{F001}\x{F002}) I agree. This is what should happen. Sortof. The problem here is that if this translation is done at compile time it will be done before the regex engine sees the \N{...} and thus you wont be able to tell if they really meant that charclass or not. /[\N{foo}]/ would convert into /[\N{U+1:2:3:4}]/ I can easily generate \N{U+1.2.3.4} via

printf "\\N{U+%04vX}"\, $str;

but not the colon version. Given that code points are non- negative integers\, is there a reason to prefer the colon version over the dot version?

Er\, no not really. I just thought it looked better than commas and it didnt occur to me to try dots. And well\, no reason we cant TMTOWTDI :-)

and the regex engine would know this really was meant to be a multibyte sequence\, and since it was previously illegal there is no ambiguity with any legal existing syntax. For now it would work fine if it just matched \N{U+1}\, which is I think the current behavior (After warning). Additionally the non-charclass part of the engine would need to treat \N{U+1:2:3:4} the same as (?:\x{1}\x{2}\x{3}\x{4}) but that should be fairly easy I think. I should think so. Some are harder though\, like these

[X[:digit:]Y[:^word:]\N{many}Z]
[^X[:digit:]Y[:^word:]\N{many}Z]

becoming something like

(?:X|[:digit:]|Y|[:^word:]|\N{many}|Z)
(?:(?!X|[:digit:]|Y|[:^word:]|\N{many}|Z)(?s:.))

Im not getting you actually. it could be

[X[:digit:]Y[:^word:]${first_of_many}Z](?:(?<=${first_of_many})$rest_of_many)

It could also be internally:

(?:[X[:digit:]Y[:^word:]Z]|\N{many})

Alternatively\, consider that a character class is equivalent to a trie with depth 1. So charclasses could be eliminated/merged with tries (they nearly are already) and then this would just be a trie node constructed directly.

although I imagine those would really be \p{prop}\, \P{prop}\, and \N{U+c1:c2:c3} (well\, or \N{U+c1.c2.c3} )\, or lastly some (?u:\x{C1}\x{c2}\x{c3}) equivalent.

Like i said\, (?u:...) doesnt save us here. We need a new *escape* for this purpose\, as (?u:...) would always be a legitimate content in a charclass\, and the time we would do the transformation would occur at such an early time that we couldnt tell we were in a charclass\, or so late that we couldnt tell that the string was originally an \N{..}.

So we have to use a new/modified escape sequence. And for the reasons that Karl pointed out earlier\, that either means something new\, or using \N{U+...}.

cheers\, Yves

As a relative newcomer to this\, I do say things here that reveal my ignorance\, which some may mistake for stupidity. Sometimes I say things that are wrong and nobody corrects me\, and I discover it after delving deeper in the code. I would appreciate any corrections people know about.

I'm afraid I don't understand some of the things in the above. I don't see the need for adding the syntax \N{U+c1.c2.c3}.

First\, for 5.12 I don't think it is needed at all\, because Yves is right that what happens when a multi-character result is returned from charnames is that in a character class only the first value is used\, and a warning is raised. I know I'm right about what the code does. Am I right that this means that we can put off the discussion of the proper syntax extensions until later?

What I read from Tom's email is that if nothing else you can get multi-character returns from using the $^H{charnames} = \&translator; construct.

Yves some months ago asked for someone who knows the tokenizer to do some work there for his proposal. I don't understand that either\, unless it is to allow the extended syntax. I have checked with gdb that a qr/\N{...}/ doesn't involve the tokenizer for parsing the \N part\, but perhaps some combination of interpolated variables will involve the tokenizer. Please explain.

But in case we do need to talk about the extensions now\, it seems to me that outside a character class\, a multi-character return can be compiled as just \N{U+c1}\N{U+c2}\N{U+c3}\, and that inside a character class it can be compiled as (?:[rest-of-class]|\N{U+c1}\N{U+c2}\N{U+c3}) without loss of information. And so\, no new syntax is needed. Again\, am I wrong?
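
The two rewrites in the paragraph above can be sketched at the string level like this. It is a simplified illustration under stated assumptions: `as_escapes` and `class_with_sequence` are invented names, only a non-negated class with one multi-character member is handled, and it relies on a perl whose regex compiler accepts \N{U+...} in an interpolated pattern.

```perl
use strict;
use warnings;

# Spell a multi-character string as consecutive single-code-point
# escapes, e.g. "ab" becomes \N{U+61}\N{U+62}.
sub as_escapes {
    my ($str) = @_;
    return join '', map { sprintf '\N{U+%X}', ord } split //, $str;
}

# Combine the ordinary members of a class with one multi-character
# member by alternation: (?:[rest-of-class]|sequence).
sub class_with_sequence {
    my ($rest_of_class, $str) = @_;
    return '(?:[' . $rest_of_class . ']|' . as_escapes($str) . ')';
}

my $pattern = class_with_sequence('XYZ', "ab");
print "$pattern\n";                       # (?:[XYZ]|\N{U+61}\N{U+62})
for my $subject ("X", "ab", "a") {
    printf "%-2s %s\n", $subject,
        $subject =~ /^$pattern$/ ? "matches" : "does not match";
}
```

Because the rewrite needs to know it is inside a character class, it has to happen where the class is being parsed, which is the timing concern raised earlier in the thread.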

p5pRT commented 14 years ago

From tchrist@perl.com

Yves gmailed:

They can be multi.

They can? Is this due solely to Perl's uses of aliases in via \N{} charnames\, or does the Unicode standard somewhere specify that this be supported?

I think it was because it seemed like a good idea at the time and maybe it could be useful.

Oh\, it does\, it does. That's why I showed my example. I just hadn't considered the price one might have to pay.

I confess I've wanted a way to simply say something like

use charnames qw[:full :alias] => { LAQ => "SINGLE LEFT-POINTING ANGLE QUOTATION MARK"\, RAQ => "SINGLE RIGHT-POINTING ANGLE QUOTATION MARK"\, d_dental => [ "LATIN SMALL LETTER D"\, "COMBINING BRIDGE BELOW" ]\, b_approx => [ "GREEK SMALL LETTER BETA"\, "COMBINING DOWN TACK BELOW" ]\, aw_chicago => [ "SMALL LETTER OPEN O"\, "COMBINING DIAERESIS"\, "COMBINING DOWN TACK BELOW" ]\, dezh_breath => [ "LATIN SMALL LETTER DEZH DIGRAPH"\, "COMBINING DIAERESIS BELOW" ]\, tezh_affric1 => [ "LATIN SMALL LETTER TESH DIGRAPH"\, "MODIFIER LETTER SMALL H" ]\, tezh_affric2 => [ "LATIN SMALL LETTER T"\, "COMBINING DOUBLE INVERTED BREVE"\, "LATIN SMALL LETTER ESH"\, "MODIFIER LETTER SMALL H" ]\, tezh_affric => "tezh_affric1"\, };

print "Affricated tezh #1 is \N{LAQ}\N{tezh_affric}\N{RAQ}\n"; print "Affricated tezh #2 is \N{LAQ}\N{tezh_affric2}\N{RAQ}\n";

and have that "just work" without having to go writing a module with a custom import that uses

$^H{charnames} = \&translator;

I was planning to add support for named sequences (where a multi-code point string is returned) for 5.14\, and have worked out an easy way to get character classes to allow them\, namely to use a transformation based on the fact that [a-c] is a shorthand for a|b|c\, so that /[(?:\x{F001}\x{F002})] is (?:\x{F001}\x{F002})

That's pretty costly--more than an order of magnitude--under the current system; see below. Do you have some overhaulish plans for that?

/[\N{foo}]/

would convert into

/[\N{U+1:2:3:4}]/

I can easily generate \N{U+1.2.3.4} via

printf "\\N{U+%04vX}"\, $str;

but not the colon version. Given that code points are non- negative integers\, is there a reason to prefer the colon version over the dot version?

Er\, no not really. I just thought it looked better than commas and it didnt occur to me to try dots. And well\, no reason we cant TMTOWTDI :-)

True\, that.

Additionally the non-charclass part of the engine engine would need to treat \N{U+1:2:3:4} the same as (?:\x{1}\x{2}\x{3}\x{4}) but that should be fairly easy I think.

I should think so. Some are harder though\, like these

[X[:digit:]Y[:^word:]\N{many}Z]
[^X[:digit:]Y[:^word:]\N{many}Z]

becoming something like

(?:X|[:digit:]|Y|[:^word:]|\N{many}|Z)
(?:(?!X|[:digit:]|Y|[:^word:]|\N{many}|Z)(?s:.))

Im not getting you actually.

I was winging conversion of [abc] and [^abc] to alternation. The first is easy as (?:want|this)\, but the second requires a notted alternation\, which I've always coded as something like (?:(?!dont|want)(?s:.))
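
A quick way to convince oneself that the two spellings agree is to run both over a handful of single characters. This is only a sanity check on a few inputs, not a proof, and the variable names are illustrative.

```perl
use strict;
use warnings;

# A class and its alternation spelling, plus the negated pair.
my $class     = qr/\A[abc]\z/;
my $alt       = qr/\A(?:a|b|c)\z/;
my $neg_class = qr/\A[^abc]\z/;
my $neg_alt   = qr/\A(?:(?!a|b|c)(?s:.))\z/;

# Count every input on which the two spellings disagree.
my $mismatches = 0;
for my $ch ("a", "b", "c", "d", "=", "\n") {
    $mismatches++ if ($ch =~ $class)     xor ($ch =~ $alt);
    $mismatches++ if ($ch =~ $neg_class) xor ($ch =~ $neg_alt);
}
print "mismatches: $mismatches\n";
```

The `(?s:.)` in the negated form matters for the "\n" input: a plain `.` would refuse a newline that `[^abc]` accepts.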

it could be

[X[:digit:]Y[:^word:]${first_of_many}Z](?:(?<=${first_of_many})$rest_of_many)

I can see why separating out the first character might be more efficiently executed\, but shouldn't a good trie implementation do that on its own?

It could also be internally:

(?:[X[:digit:]Y[:^word:]Z]|\N{many})

Alternatively\, consider that a characterclass is equivalent to a trie with depth 1. So charclasses could be eliminated/merged with tries (they nearly are already) and then this would just be a trie node constructed directly.

Mapping [abc] to (?:a|b|c) is straightforward enough\, and with a good trie implementation should be reasonably efficient. But (naïve) alternation still comes out to be severely expensive.

use Benchmark qw[ :hireswallclock cmpthese ];
$str = "=" x 120 . "super" x 4;
cmpthese($count, {
    CharClass   => q{ $str =~ /[a-z]+/ },
    Alternation => q{ $str =~ /(?:a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)+/ },
});

Gives:

Platform A:       5.8.8     5.10.0     5.11.3     5.11.4
CharClass     1496772/s  1284750/s  1070296/s  1047791/s
Alternation     13816/s    39727/s    38128/s    35856/s

Platform B:       5.8.8     5.10.0     5.10.1     5.11.3
CharClass      238432/s   233372/s   226796/s   178877/s
Alternation      3377/s    10223/s     9930/s     9816/s

So casually converting to alternation has a big performance hit.

With Unicode properties\, it can be even worse.

use Benchmark qw[ :hireswallclock cmpthese ];
$str = "=" x 120 . "super" x 4;
cmpthese($count, {
    CharClass  => q{ $str =~ /[a-z]+/ },
    Alphabetic => q{ $str =~ /\p{Alphabetic}+/ },
    Latin      => q{ $str =~ /\p{Latin}+/ },
});

Gives:

Platform A:       5.8.8     5.10.0     5.11.3     5.11.4
CharClass     1505280/s  1282982/s  1073828/s  1058097/s
Latin          147552/s   146616/s    52130/s    52463/s
Alphabetic     146457/s   145691/s    52118/s    52049/s

Platform B:       5.8.8     5.10.0     5.10.1     5.11.3
CharClass      241630/s   231872/s   231226/s   175170/s
Latin           28525/s    29563/s     7204/s     5655/s
Alphabetic      28291/s    29513/s     7139/s     5303/s

Apart from the big Unicode hit\, (quite nearly) everything keeps getting ever sl.o..w...e....r\, as $] increases. I shouldn't care to project that trend-line out a few more versions. :(

although I imagine those would really be \p{prop}\, \P{prop}\, and \N{U+c1:c2:c3} (well\, or \N{U+c1.c2.c3} )\, or lastly some (?u:\x{C1}\x{c2}\x{c3}) equivalent.

Like i said\, (?u:...) doesnt save us here. We need a new *escape* for this purpose\, as (?u:...) would always be a legitimate content in a charclass\, and the time we would do the transformation would occur at such an early time that we couldnt tell we were in a charclass\, or so late that we couldnt tell that the string was originally an \N{..}.

Drat.

So we have to use a new/modified escape sequence. And for the reasons that Karl pointed out earlier\, that either means something new\, or using \N{U+...}.

Ok.

--tom

p5pRT commented 14 years ago

From yves.orton@booking.com

On Mon\, 2010-01-25 at 18:18 +0000\, Dave Mitchell wrote:

On Mon\, Jan 25\, 2010 at 09:32:55AM -0700\, karl williamson wrote:

I have been thinking about this\, and see an issue with the proposed solution: "they should be resolved and then converted to \x{...} not preserved verbatim".

There's another issue with it as well\, principally that \N{NAME} can expand to a multi-codepoint string\, in which case the literal
/\N{FOO}+/
can't simply be translated to
/\x{F001}\x{F002}+/

Hrm\, except that well\, this behaviour is not well defined.

The history of this behaviour is like this:

1. At something like the beginning we had \N{..} escapes being interpolated by the toker at compile time\, significantly before the regex engine compiled the pattern.

2. Later we realized it is incorrect for /a\N{U+7c}b/ to behave the same as /a|b/ so we changed it so that \N{...} parsing was deferred to the regex engine.

3. Now we are in the position where we realized that the lexical nature of charnames binding\, combined with how qr// is supposed to work leaves us in a fundamental quandary. If we want to defer the charnames binding then we have to remember what charnames pragmata were in scope for a pattern\, and not only that\, we have to deal with the possibility that two qr// objects which were compiled in different scopes may be embedded in another. In this case we may even end up with \N{FOO} meaning two totally different sequences of characters in two different places in the string.

So\, it seems to me that if we do the "convert to \x{..} escapes early" as proposed we are no worse off than we were when \N{...} escapes were introduced to the regex engine\, and we resolve the annoying problem with charnames and qr// entirely\, and we are better off than we were then as \N{U+2E} stops being a regex meta-character equivalent to /./
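
The distinction in point 2 and in the paragraph above can be checked with a short script. It is a sketch that assumes a perl in which \N{U+...} is resolved by the regex compiler as a literal character, which is the behaviour being argued for here; the variable names are illustrative.

```perl
use strict;
use warnings;

# \N{U+7C} names VERTICAL LINE, so it should match a literal "|".
my $literal     = qr/\Aa\N{U+7C}b\z/;
# A bare "|" is the alternation metacharacter.
my $alternation = qr/\Aa|b\z/;
# \N{U+2E} names FULL STOP, so it should not act like the "." wildcard.
my $full_stop   = qr/\Aa\N{U+2E}b\z/;

for my $subject ("a|b", "a", "b") {
    printf "%-3s literal=%s alternation=%s\n", $subject,
        ($subject =~ $literal     ? "yes" : "no"),
        ($subject =~ $alternation ? "yes" : "no");
}
print "a.b ", ("a.b" =~ $full_stop ? "matches" : "does not match"), "\n";
print "axb ", ("axb" =~ $full_stop ? "matches" : "does not match"), "\n";
```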

(assuming that's what FOO expands to)\, and instead needs to be something like
/(?:\x{F001}\x{F002})+/

Yeah\, since we are doing pattern translations /anyway/ adding in the (?:..) as well wont hurt anything. However....

but then that screws up character classes:
/[\N{FOO}]/

becomes

/[(?:\x{F001}\x{F002})]/ ???

The behavior of \N{...} in charclasses was never well defined and therefore I do not think that this is a problem.

Nevertheless\, IMO the option truest to the state of the regex engine when charnames were invented is the unwrapped raw \x{...} translation.

cheers\, Yves

p5pRT commented 14 years ago

From @demerphq

2010/1/26 Tom Christiansen \tchrist@perl\.com:

Yves gmailed:

They can be multi.

They can? Is this due solely to Perl's uses of aliases in via \N{} charnames\, or does the Unicode standard somewhere specify that this be supported?

I think it was because it seemed like a good idea at the time and maybe it could be useful.

Oh\, it does\, it does. That's why I showed my example. I just hadn't considered the price one might have to pay.

I suspect many of perls gnarlier features come from just that problem. :-)

I confess I've wanted a way to simply say something like

use charnames qw[:full :alias] => { LAQ => "SINGLE LEFT-POINTING ANGLE QUOTATION MARK"\, RAQ => "SINGLE RIGHT-POINTING ANGLE QUOTATION MARK"\, d_dental => [ "LATIN SMALL LETTER D"\, "COMBINING BRIDGE BELOW" ]\, b_approx => [ "GREEK SMALL LETTER BETA"\, "COMBINING DOWN TACK BELOW" ]\, aw_chicago => [ "SMALL LETTER OPEN O"\, "COMBINING DIAERESIS"\, "COMBINING DOWN TACK BELOW" ]\, dezh_breath => [ "LATIN SMALL LETTER DEZH DIGRAPH"\, "COMBINING DIAERESIS BELOW" ]\, tezh_affric1 => [ "LATIN SMALL LETTER TESH DIGRAPH"\, "MODIFIER LETTER SMALL H" ]\, tezh_affric2 => [ "LATIN SMALL LETTER T"\, "COMBINING DOUBLE INVERTED BREVE"\, "LATIN SMALL LETTER ESH"\, "MODIFIER LETTER SMALL H" ]\, tezh_affric => "tezh_affric1"\, };

print "Affricated tezh #1 is \N{LAQ}\N{tezh_affric}\N{RAQ}\n"; print "Affricated tezh #2 is \N{LAQ}\N{tezh_affric2}\N{RAQ}\n";

and have that "just work" without having to go writing a module with a custom import that uses

$^H{charnames} = \&translator;

I was planning to add support for named sequences (where a multi-code point string is returned) for 5.14\, and have worked out an easy way to get character classes to allow them\, namely to use a transformation based on the fact that [a-c] is a shorthand for a|b|c\, so that /[(?:\x{F001}\x{F002})] is (?:\x{F001}\x{F002})

That's pretty costly--more than an order of magnitude--under the current system; see below. Do you have some overhaulish plans for that?

/[\N{foo}]/

would convert into

/[\N{U+1:2:3:4}]/

I can easily generate \N{U+1.2.3.4} via

printf "\\N{U+%04vX}"\, $str;

but not the colon version. Given that code points are non-negative integers, is there a reason to prefer the colon version over the dot version?
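For what it is worth, sprintf's vector flag accepts a `*` before the `v` that takes the join string from the argument list, so both spellings can probably be produced the same way. A small sketch, assuming a made-up four-code-point string in `$str` and leaving out the zero padding:

```perl
use strict;
use warnings;

# Hypothetical sample: four code points, U+0001 through U+0004.
my $str = "\x{1}\x{2}\x{3}\x{4}";

# Plain vector flag: the elements are joined with the default ".".
my $dotted = sprintf "\\N{U+%vX}", $str;

# "*" before "v" takes the join string from the argument list instead.
my $colons = sprintf "\\N{U+%*vX}", ":", $str;

print "$dotted\n$colons\n";
```

Either form carries the same information, so the choice between dots and colons looks like a matter of taste rather than of what can be generated.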

Er, no, not really. I just thought it looked better than commas and it didn't occur to me to try dots. And well, no reason we can't TMTOWTDI :-)

True, that.

And well in line for tradition's sake. :-)

Additionally, the non-charclass part of the engine would need to treat \N{U+1:2:3:4} the same as (?:\x{1}\x{2}\x{3}\x{4}), but that should be fairly easy I think.

I should think so. Some are harder though, like these

    [X[:digit:]Y[:^word:]\N{many}Z]
    [^X[:digit:]Y[:^word:]\N{many}Z]

becoming something like

    (?:X|[:digit:]|Y|[:^word:]|\N{many}|Z)
    (?:(?!X|[:digit:]|Y|[:^word:]|\N{many}|Z)(?s:.))

I'm not getting you actually.

I was winging conversion of [abc] and [^abc] to alternation. The first is easy as (?:want|this), but the second requires a notted alternation, which I've always coded as something like (?:(?!dont|want)(?s:.))
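The notted-alternation idiom can be checked against a real negated class on a handful of single characters. This is only a sketch; the sample characters and variable names are arbitrary:

```perl
use strict;
use warnings;

# A negated class and its hand-written alternation equivalent.
my $class       = qr/\A[^abc]\z/;
my $alternation = qr/\A(?:(?!a|b|c)(?s:.))\z/;

# Collect every sample character on which the two patterns disagree.
my @disagree;
for my $ch ("a", "b", "c", "d", "=", "\n", "\x{E34}") {
    my $by_class = $ch =~ $class       ? 1 : 0;
    my $by_alt   = $ch =~ $alternation ? 1 : 0;
    push @disagree, $ch if $by_class != $by_alt;
    printf "U+%04X class=%d alternation=%d\n", ord $ch, $by_class, $by_alt;
}
print "patterns disagree on ", scalar(@disagree), " sample(s)\n";
```

The (?s:.) matters: without the /s behaviour, the alternation form would reject a newline that [^abc] accepts.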

Ah, I was thrown off because the [:^word:] is a special case in the regex engine anyway.

We don't (just) store all the characters involved. We store a bit that says "we match non word characters".

it could be

    [X[:digit:]Y[:^word:]${first_of_many}Z](?:(?<=${first_of_many})$rest_of_many)

I can see why separating out the first character might be more efficiently executed, but shouldn't a good trie implementation do that on its own?

Oh yes. I was just thinking of the non-trie case.

It could also be internally:

(?:[X[:digit:]Y[:^word:]Z]|\N{many})

Alternatively, consider that a character class is equivalent to a trie with depth 1. So charclasses could be eliminated/merged with tries (they nearly are already) and then this would just be a trie node constructed directly.
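To make the "trie with depth 1" remark concrete, here is a toy trie built from nested hashes. It is a sketch only and has nothing to do with how the regex engine stores its tries; `build_trie` and `trie_match_at` are hypothetical names:

```perl
use strict;
use warnings;

# Build a trie as nested hashes; the key "" marks the end of an accepted string.
sub build_trie {
    my @alternatives = @_;
    my %root;
    for my $alt (@alternatives) {
        my $node = \%root;
        for my $ch (split //, $alt) {
            $node->{$ch} ||= {};
            $node = $node->{$ch};
        }
        $node->{""} = 1;
    }
    return \%root;
}

# Return the length of the longest accepted string starting at $pos, or -1.
sub trie_match_at {
    my ($trie, $text, $pos) = @_;
    my ($node, $best, $len) = ($trie, -1, 0);
    while (1) {
        $best = $len if $node->{""};
        last if $pos + $len >= length $text;
        my $ch   = substr($text, $pos + $len, 1);
        my $next = $node->{$ch};
        last unless $next;
        $node = $next;
        $len++;
    }
    return $best;
}

# Every branch has depth 1, so this behaves like the class [XYZ].
my $class_like = build_trie(qw(X Y Z));

# One longer branch added, roughly (?:[XYZ]|ab).
my $mixed = build_trie(qw(X Y Z), "ab");

print trie_match_at($class_like, "aXb", 1), "\n";
print trie_match_at($mixed, "xabz", 1), "\n";
```

In this picture a multi-character \N{many} is simply one more branch that happens to be deeper than the others.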

Mapping [abc] to (?:a|b|c) is straightforward enough, and with a good trie implementation should be reasonably efficient. But (naïve) alternation still comes out to be severely expensive.

    use Benchmark qw[ :hireswallclock cmpthese ];
    $str = "=" x 120 . "super" x 4;
    cmpthese($count, {
        CharClass   => q{ $str =~ /[a-z]+/ },
        Alternation => q{ $str =~ /(?:a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)+/ },
    });

Gives:

    Platform A:      5.8.8     5.10.0     5.11.3     5.11.4
    CharClass    1496772/s  1284750/s  1070296/s  1047791/s
    Alternation    13816/s    39727/s    38128/s    35856/s

    Platform B:      5.8.8     5.10.0     5.10.1     5.11.3
    CharClass     238432/s   233372/s   226796/s   178877/s
    Alternation     3377/s    10223/s     9930/s     9816/s

So casually converting to alternation has a big performance hit.

Which version of perl is this? This shouldn't construct a trie at all, or not a full one. And in that case it should be nearly as fast.

Also, it seems to me you are comparing the time to construct the pattern as well. That is probably not wise and would probably swamp the benefit of using a trie.

With Unicode properties, it can be even worse.

    use Benchmark qw[ :hireswallclock cmpthese ];
    $str = "=" x 120 . "super" x 4;
    cmpthese($count, {
        CharClass  => q{ $str =~ /[a-z]+/ },
        Alphabetic => q{ $str =~ /\p{Alphabetic}+/ },
        Latin      => q{ $str =~ /\p{Latin}+/ },
    });

Gives:

    Platform A:      5.8.8     5.10.0     5.11.3     5.11.4
    CharClass    1505280/s  1282982/s  1073828/s  1058097/s
    Latin         147552/s   146616/s    52130/s    52463/s
    Alphabetic    146457/s   145691/s    52118/s    52049/s

    Platform B:      5.8.8     5.10.0     5.10.1     5.11.3
    CharClass     241630/s   231872/s   231226/s   175170/s
    Latin          28525/s    29563/s     7204/s     5655/s
    Alphabetic     28291/s    29513/s     7139/s     5303/s

That's not a great comparison, unfortunately. The way that Unicode charclasses are handled is *extremely* inefficient. And a Unicode property is really just a special charclass.

Apart from the big Unicode hit, (quite nearly) everything keeps getting ever sl.o..w...e....r, as $] increases. I shouldn't care to project that trend-line out a few more versions. :(

Yes, this is true. However, it also gets more correct, and more powerful. And does things it didn't used to do.

So for instance we used to be faster, but with the trade-off that a regex could segv your perl and uncatchably terminate your process. That doesn't happen anymore. We used to do a number of things that were actually wrong, and fixing stuff often has a negative performance impact. It's really easy to be fast if you don't have to worry about correctness.

Having said that, we do have some very serious breakage. For instance the superlinear cache was broken when we de-recursivized the regex engine, and nobody has had the combination of the tuits, inclination, and skills to fix it.

although I imagine those would really be \p{prop}, \P{prop}, and \N{U+c1:c2:c3} (well, or \N{U+c1.c2.c3} ), or lastly some (?u:\x{C1}\x{c2}\x{c3}) equivalent.

Like I said, (?u:...) doesn't save us here. We need a new *escape* for this purpose, as (?u:...) would always be a legitimate content in a charclass, and the time we would do the transformation would occur at such an early time that we couldn't tell we were in a charclass, or so late that we couldn't tell that the string was originally an \N{..}.

Drat.

So we have to use a new/modified escape sequence. And for the reasons that Karl pointed out earlier, that either means something new, or using \N{U+...}.

Ok.

We have metaphorically painted ourselves into a corner. Luckily it is a hyperroom and if we can slide off into the 7th dimension we probably can get out without messing up our shoes (or our paint job).

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 14 years ago

From @demerphq

2010/1/26 karl williamson <public@khwilliamson.com>:

As a relative newcomer to this, I do say things here that reveal my ignorance, which some may mistake for stupidity. Sometimes I say things that are wrong and nobody corrects me, and I discover it after delving deeper in the code. I would appreciate any corrections people know about.

Bah. You kick ass. And don't forget it. Even being willing to look at this crap earns you a very large gold star.

I'm afraid I don't understand some of the things in the above. I don't see the need for adding the syntax \N{U+c1.c2.c3}.

I think there is a need long term.

First, for 5.12 I don't think it is needed at all, because Yves is right that what happens when a multi-character result is returned from charnames is that in a character class only the first value is used, and a warning is raised. I know I'm right about what the code does. Am I right that this means that we can put off the discussion of the proper syntax extensions until later?

That's a hard call. I'm inclined to say yes, but the spidey sense goes crazy when I start typing. :-)

What I read from Tom's email is that if nothing else you can get multi-character returns from using the $^H{charnames} = \&translator; construct.

Right. And imagine if said translator is written by someone evil.....

Yves some months ago asked for someone who knows the tokenizer to do some work there for his proposal. I don't understand that either, unless it is to allow the extended syntax. I have checked with gdb that a qr/\N{...}/ doesn't involve the tokenizer for parsing the \N part, but perhaps some combination of interpolated variables will involve the tokenizer. Please explain.

Let me recap the history once more.

Originally, and even now with the exception of the regex engine, \N{...} is handled at compile time by the toker. So basically by the time the parser sees the string it has NO IDEA that the source contained \N{...} constructs and uses charnames. This is really the only time that you KNOW FOR SURE which is the correct charnames handler for the \N{...} construct you are using.

This was changed several years ago to exempt the regex engine, and at the same time patches were made to the regex engine so that it could ALSO expand out the \N{...} constructs at run time when the pattern was compiled (which is a run time activity). This was done so that the regex engine didn't treat \N{...} constructs as their literal equivalent and interpret them as regex meta chars.

This seemed all good except for the problem that charnames were never meant to be deferred; they were meant to happen before the parser even saw the input. So deferring them introduced a world of problems.

Which brings us to now. If we defer processing of the \N{...} to the regex engine we have a problem. And so we have to undo the patch that deferred, and modify the old state so that instead of deferring the expansion we convert the charname expansion into some other form of *escape* that the regex engine will not treat as a literal metachar.

Essentially once fixed right the regex engine will no longer need ANY support for \N{...} constructs as it will never ever see them. The tokenizer will have expanded them to a "constant escaped representation" long before the regex engine ever looks at them.
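The "resolve early, hand the engine a constant escape" idea can be imitated by hand in user code, which is also a slightly more general form of the workaround from the original report. A sketch, assuming only the documented charnames::vianame function; the variable names are illustrative:

```perl
use strict;
use warnings;
use charnames ":full";

# Resolve the name to a code point up front, while the charnames
# handler in scope is still known.
my $code_point = charnames::vianame("THAI CHARACTER SARA I");

# Turn it into a constant \x{...} escape that the regex engine can
# parse without needing any charnames handler at all.
my $escape = sprintf "\\x{%X}", $code_point;

# The compiled pattern can now be re-interpolated and quantified safely.
my $r1      = qr/$escape/;
my $matched = "\x{E34}\x{E34}" =~ /\A$r1+\z/ ? 1 : 0;

print "$escape matched=$matched\n";
```

Because the pattern text contains \x{E34} rather than a literal character, the engine never mistakes the expansion for a metacharacter, which is the property the escape form is meant to preserve.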

But in case we do need to talk about the extensions now, it seems to me that outside a character class, a multi-character return can be compiled as just \N{U+c1}\N{U+c2}\N{U+c3}, and that inside a character class it can be compiled as (?:[rest-of-class]|\N{U+c1}\N{U+c2}\N{U+c3}) without loss of information. And so, no new syntax is needed. Again, am I wrong?

Depends on your point of view. What you say literally is not wrong.

However, it doesn't cover the full picture, as the thing that will be doing the charnames expansion will be the toker. And the toker knows almost nothing about regex syntax and in fact should not know anything about regex syntax (think alternate engines). So the thing that really has to do the expansion won't know whether the \N{...} is in a "normal" part of the regex or if it's inside a charclass. So basically it will want to transform the \N{..} into some special syntax that the regex engine will recognize, and since the RE DOES know whether the special syntax is in a charclass or not, it can figure out what is the right thing to do.

So for instance I would say the expedient thing to do now would be to make charnames support in regexes be handled by the toker and have it convert to \N{U+c1.c2.c3} syntax. Inside of the regex engine have the normal parse mode treat that as \N{U+c1}\N{U+c2}\N{U+c3} and then have the charclass parse mode treat it as \N{U+c1} and throw a warning when encountering a multi-byte sequence.
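The two parse modes described above can be mocked up as a plain string transformation. This is a sketch of the proposal only, not of anything in the regex engine; `expand_sequence` is a hypothetical helper and the dotted \N{U+c1.c2.c3} form is the syntax being proposed in this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: expand a proposed \N{U+c1.c2.c3} escape for one of
# the two parse modes (normal pattern text, or inside a character class).
sub expand_sequence {
    my ($escape, $in_charclass) = @_;

    my ($list) = $escape =~ /\A\\N\{U\+([0-9A-Fa-f]+(?:\.[0-9A-Fa-f]+)*)\}\z/
        or die "not a \\N{U+...} escape: $escape";
    my @points = split /\./, $list;

    if ($in_charclass) {
        # Charclass mode: keep only the first code point and warn.
        warn "Using just the first character of a multi-character sequence\n"
            if @points > 1;
        return "\\N{U+$points[0]}";
    }

    # Normal mode: one \N{U+...} per code point, in order.
    return join "", map { "\\N{U+$_}" } @points;
}

print expand_sequence('\N{U+E34.E35}', 0), "\n";
print expand_sequence('\N{U+E34.E35}', 1), "\n";
```

The point of the design is that the toker only ever has to produce the dotted form; deciding what it means is left to whichever part of the engine knows whether it is inside a class.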

cheers, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 14 years ago

From @demerphq

2010/1/27 Eric Brine <ikegami@adaelis.com>:

On Tue, Jan 26, 2010 at 5:06 PM, demerphq <demerphq@gmail.com> wrote:

And the toker [...] should not know anything about regex syntax (think alternate engines). So the thing that really has to do the expansion won't know whether the \N{...} is in a "normal" part of the regex or if it's inside a charclass.

What about if it's in something like (?{ ... })?

I believe that the toker handles discriminating between "regex" mode and "code" mode and that this isn't a problem.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 14 years ago

From @ikegami

On Tue, Jan 26, 2010 at 5:06 PM, demerphq <demerphq@gmail.com> wrote:

And the toker [...] should not know anything about regex syntax (think alternate engines). So the thing that really has to do the expansion won't know whether the \N{...} is in a "normal" part of the regex or if it's inside a charclass.

What about if it's in something like (?{ ... })?

p5pRT commented 14 years ago

From tchrist@perl.com

Yves gmailed:

Oh yes. I was just thinking of the non-trie case.

So casually converting to alternation has a big performance hit.

Which version of perl is this? This shouldn't construct a trie at all, or not a full one. And in that case it should be nearly as fast.

It was the versions of Perl listed at the tops of column headers:

    Platform A: 5.8.8 5.10.0 5.11.3 5.11.4
    Platform B: 5.8.8 5.10.0 5.10.1 5.11.3

Also, it seems to me you are comparing the time to construct the pattern as well. That is probably not wise and would probably swamp the benefit of using a trie.

You're right. Here I've factored that out:

    use Benchmark qw[ :hireswallclock cmpthese ];
    $str = "=" x 120 . "super" x 4;
    %patterns = (
        Alphabetic  => qr{\p{Alphabetic}+},
        Alternation => qr{(?:a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)+},
        CharClass   => qr{[a-z]+},
        PerlWord    => qr{\w+},
    );
    cmpthese 0 => {
        Alphabetic  => q{ $str =~ $patterns{Alphabetic} },
        Alternation => q{ $str =~ $patterns{Alternation} },
        CharClass   => q{ $str =~ $patterns{CharClass} },
        PerlWord    => q{ $str =~ $patterns{PerlWord} },
    };

Here are the results. They start out about 2 orders of magnitude apart. The alternation gets ~3x faster with 5.10, but the two charclass tests drop to half speed. That makes them only about 1 order of magnitude apart, in very rough figures. The \p{Alphabetic} test starts out running about 1 order of magnitude faster than the alternation test, but then something bad happened to it in 5.11, cutting it more than in half. So between that and alternation getting faster, currently \p{Alphabetic} isn't even half again as fast, and it started out 10x as fast.

    perl5.8.8
                     Rate Alternation Alphabetic CharClass PerlWord
    Alternation   13871/s          --       -89%      -99%     -99%
    Alphabetic   132000/s        852%         --      -88%     -89%
    CharClass   1142524/s       8137%       766%        --      -8%
    PerlWord    1244772/s       8874%       843%        9%       --

    perl5.10.0
                     Rate Alternation Alphabetic CharClass PerlWord
    Alternation   36654/s          --       -71%      -94%     -94%
    Alphabetic   127707/s        248%         --      -79%     -80%
    CharClass    605390/s       1552%       374%        --      -4%
    PerlWord     630601/s       1620%       394%        4%       --

    perl5.11.3
                     Rate Alternation Alphabetic CharClass PerlWord
    Alternation   36422/s          --       -28%      -94%     -95%
    Alphabetic    50310/s         38%         --      -92%     -93%
    CharClass    627199/s       1622%      1147%        --      -7%
    PerlWord     674494/s       1752%      1241%        8%       --

    perl5.11.4
                     Rate Alternation Alphabetic CharClass PerlWord
    Alternation   34171/s          --       -32%      -95%     -95%
    Alphabetic    50425/s         48%         --      -92%     -92%
    CharClass    645119/s       1788%      1179%        --      -3%
    PerlWord     665929/s       1849%      1221%        3%       --

I don't understand, without looking in a very dark place until I risk hallucinations from sensory deprivation, why charclasses now take twice as long to run.

That's not a great comparison, unfortunately. The way that Unicode charclasses are handled is *extremely* inefficient. And a Unicode property is really just a special charclass.

So it seems.

Apart from the big Unicode hit, (quite nearly) everything keeps getting ever sl.o..w...e....r, as $] increases. I shouldn't care to project that trend-line out a few more versions. :(

Yes, this is true. However, it also gets more correct, and more powerful. And does things it didn't used to do.

Is matching [a-z] slower because you can't just do a clever compare, now that code points aren't byte-constrained?

So for instance we used to be faster, but with the trade-off that a regex could segv your perl and uncatchably terminate your process. That doesn't happen anymore. We used to do a number of things that were actually wrong, and fixing stuff often has a negative performance impact. It's really easy to be fast if you don't have to worry about correctness.

That's my line. :(

Having said that, we do have some very serious breakage. For instance the superlinear cache was broken when we de-recursivized the regex engine, and nobody has had the combination of the tuits, inclination, and skills to fix it.

I'm afraid that while I believe I understand the de-recursivized issue and why it was done, I don't even know what a superlinear cache *is*. :(

--tom
