Closed p5pRT closed 21 years ago
SUMMARY:
1) Gee\, it'd be nice to support (?\
2) I created a new magic variable $^N\, similar to $+.
3) If this is of interest\, is $^N a good name?
Feeling envy for highly advanced :-) languages like Java and Visual Basic\, whose regex languages allows named captures a'la
(?\
(sets $myval to the digits captured)\, I thought I'd use regex overloading to convert this syntaxt to
(\d+) (?{ $varname = $+ })
(capture to normal parens\, then use $+ to access it and assign to the variable)
This works fine except for a few details\, such as being able to have capturing parens nested within\, since $+ refers to the *hightest-numbered* set of parens used so far\, not the most-recently *closed* set of parens used so far.
So\, I created a new magic variable\, $^N\, that pretty much parallels $+ except it does indeed refer to the most-recently *closed* set of parens.
Now\, converting to
(\d+) (?{ $varname = $^N })
works even with nesting. This is very nice.
I can send the patches if wanted. But if wanted\, what is a good variable name? I picked $^N simply because I saw it was free.
For those interested\, I've appended my package to allow this\, and a short test program.
There are still "issues" with my overloading -- the variables named are not checked at runtime ('use strict' doesn't save you from referring to a nonexistant variable)\, and the variables are not "protected"\, so even if during a match a variable is set\, it won't be unset if the match later fails.
So\, it really would be nice if named captures were officially supported. I spent some hours digging through regcomp and regexec\, and succeeded only in killing massive amounts of neurons )-:\, so I don't think I'll be able to add it.
Jeffrey
Here is a short test script:
---snip------------------------------------------------------------------ use strict; use warnings; use Regex::SupportNamedCapture;
my $areacode; ## both lexical our $exchange; ## and global my $number; ## variables work fine.
"My number is 408-555-1212." =~ m{
\b
(?\
print "phone number is: ($areacode) $exchange-$number\n";
my $fullnumber;
"My number is 408-555-1212." =~ m{
\b
(?\
if ($fullnumber ne "$areacode-$exchange-$number") { print "Bummer\, you don't have \$^N support: fullnumber is [$fullnumber]\n"; } ---snip------------------------------------------------------------------
Here is the package:
---snip------------------------------------------------------------------ package Regex::SupportNamedCapture;
##
## This package allows regular expressions to have named captures\, a'la
##
## (?\
## ## Test to see if my proposed $^N is supported. ## Set $GutsResult to $^N if so\, set to $+ if not. ## my $GutsResult = do { no warnings; "1" =~ m/(1)/; if ($^N) { '$^N'; } else { '$+'; } };
our $OpenParens; ## needed for matching nested parens
my $NestedParenGuts = qr{
(?{ local $OpenParens = 0 }) ## counts the number of nested opens waiting to close
(?> (?> ## stuff not parens\, not escaped [^()\\]+
## escaped stuff | (?s: \\. )
# another opening paren | \( (?{ $OpenParens++ })
# a closing paren\, if we're expecting any | (?(?{ $OpenParens }) (?{ $OpenParens-- }) \)) )* ) }x;
##
## Mimics named capturing parens by converting something like
## (?\
$re =~ s{ (?\<! \\ ) # an unescaped... \(\? # "(?" \< # '\<' (\w+) # $1 - an identifier > # '>' ($NestedParenGuts) # $2 - regex guts \) # ')' }{ my $id = '$' . $1; my $guts = convert($2); "(?:($guts)(?{ $id=$GutsResult }))"; }exg;
return $re; ## return mangled regex }
1; ---snip------------------------------------------------------------------
Well\, since there has been no objection\, here is the patch. Jeffrey
Thanks\, applied. (base/lex.t need tweaking\, too\, since it assumed $^N is an unused variable...)
-----------------------------------------------------------------
diff -u -r .orig/embedvar.h ./embedvar.h --- .orig/embedvar.h Wed Jun 20 11:35:50 2001 +++ ./embedvar.h Tue Jun 26 12:16:35 2001 @@ -113\,6 +113\,7 @@ #define PL_regint_start (vTHX->Tregint_start) #define PL_regint_string (vTHX->Tregint_string) #define PL_reginterp_cnt (vTHX->Treginterp_cnt) +#define PL_reglastcloseparen (vTHX->Treglastcloseparen) #define PL_reglastparen (vTHX->Treglastparen) #define PL_regnarrate (vTHX->Tregnarrate) #define PL_regnaughty (vTHX->Tregnaughty) @@ -821\,6 +822\,7 @@ #define PL_regint_start (aTHXo->interp.Tregint_start) #define PL_regint_string (aTHXo->interp.Tregint_string) #define PL_reginterp_cnt (aTHXo->interp.Treginterp_cnt) +#define PL_reglastcloseparen (aTHXo->interp.Treglastcloseparen) #define PL_reglastparen (aTHXo->interp.Treglastparen) #define PL_regnarrate (aTHXo->interp.Tregnarrate) #define PL_regnaughty (aTHXo->interp.Tregnaughty) @@ -1518\,6 +1520\,7 @@ #define PL_regint_start (aTHX->Tregint_start) #define PL_regint_string (aTHX->Tregint_string) #define PL_reginterp_cnt (aTHX->Treginterp_cnt) +#define PL_reglastcloseparen (aTHX->Treglastcloseparen) #define PL_reglastparen (aTHX->Treglastparen) #define PL_regnarrate (aTHX->Tregnarrate) #define PL_regnaughty (aTHX->Tregnaughty) @@ -1654\,6 +1657\,7 @@ #define PL_Tregint_start PL_regint_start #define PL_Tregint_string PL_regint_string #define PL_Treginterp_cnt PL_reginterp_cnt +#define PL_Treglastcloseparen PL_reglastcloseparen #define PL_Treglastparen PL_reglastparen #define PL_Tregnarrate PL_regnarrate #define PL_Tregnaughty PL_regnaughty diff -u -r .orig/gv.c ./gv.c --- .orig/gv.c Mon Jun 25 08:08:15 2001 +++ ./gv.c Tue Jun 26 12:35:10 2001 @@ -895\,6 +895\,7 @@ case '\006': /* $^F */ case '\010': /* $^H */ case '\011': /* $^I\, NOT \t in EBCDIC */ + case '\016': /* $^N */ case '\020': /* $^P */ case '\024': /* $^T */ if (len > 1) @@ -1764\,6 +1765\,7 @@ case '\010': /* $^H */ case '\011': /* $^I\, NOT \t in EBCDIC */ case '\014': /* $^L */ + case '\016': /* $^N */ case '\020': /* $^P */ case '\023': /* $^S */ case '\024': /* $^T */ diff -u -r .orig/mg.c ./mg.c --- .orig/mg.c Mon Jun 25 07:48:03 2001 +++ ./mg.c Tue Jun 26 12:36:16 2001 @@ -435\,6 +435\,13 @@ goto getparen; } return 0; + case '\016': /* ^N */ + if (PL_curpm && (rx = PM_GETRE(PL_curpm))) { + paren = rx->lastcloseparen; + if (paren) + goto getparen; + } + return 0; case '`': if (PL_curpm && (rx = PM_GETRE(PL_curpm))) { if (rx->startp[0] != -1) { @@ -655\,6 +662\,14 @@ case '+': if (PL_curpm && (rx = PM_GETRE(PL_curpm))) { paren = rx->lastparen; + if (paren) + goto getparen; + } + sv_setsv(sv\,&PL_sv_undef); + break; + case '\016': /* ^N */ + if (PL_curpm && (rx = PM_GETRE(PL_curpm))) { + paren = rx->lastcloseparen; if (paren) goto getparen; } diff -u -r .orig/perlapi.h ./perlapi.h --- .orig/perlapi.h Wed Jun 20 11:35:50 2001 +++ ./perlapi.h Tue Jun 26 12:16:35 2001 @@ -802\,6 +802\,8 @@ #define PL_regint_string (*Perl_Tregint_string_ptr(aTHXo)) #undef PL_reginterp_cnt #define PL_reginterp_cnt (*Perl_Treginterp_cnt_ptr(aTHXo)) +#undef PL_reglastcloseparen +#define PL_reglastcloseparen (*Perl_Treglastcloseparen_ptr(aTHXo)) #undef PL_reglastparen #define PL_reglastparen (*Perl_Treglastparen_ptr(aTHXo)) #undef PL_regnarrate diff -u -r .orig/regexec.c ./regexec.c --- .orig/regexec.c Thu Jun 21 07:16:49 2001 +++ ./regexec.c Tue Jun 26 12:28:14 2001 @@ -147\,7 +147\,7 @@ if (paren_elems_to_push \< 0) Perl_croak(aTHX_ "panic: paren_elems_to_push \< 0");
-#define REGCP_OTHER_ELEMS 5 +#define REGCP_OTHER_ELEMS 6 SSCHECK(paren_elems_to_push + REGCP_OTHER_ELEMS); for (p = PL_regsize; p > parenfloor; p--) { /* REGCP_PARENS_ELEMS are pushed per pairs of parentheses. */ @@ -159\,6 +159\,7 @@ /* REGCP_OTHER_ELEMS are pushed in any case\, parentheses or no. */ SSPUSHINT(PL_regsize); SSPUSHINT(*PL_reglastparen); + SSPUSHINT(*PL_reglastcloseparen); SSPUSHPTR(PL_reginput); #define REGCP_FRAME_ELEMS 2 /* REGCP_FRAME_ELEMS are part of the REGCP_OTHER_ELEMS and @@ -192\,6 +193\,7 @@ assert(i == SAVEt_REGCONTEXT); /* Check that the magic cookie is there. */ i = SSPOPINT; /* Parentheses elements to pop. */ input = (char *) SSPOPPTR; + *PL_reglastcloseparen = SSPOPINT; *PL_reglastparen = SSPOPINT; PL_regsize = SSPOPINT;
@@ -1871\,6 +1873\,7 @@ PL_regstartp = prog->startp; PL_regendp = prog->endp; PL_reglastparen = &prog->lastparen; + PL_reglastcloseparen = &prog->lastcloseparen; prog->lastparen = 0; PL_regsize = 0; DEBUG_r(PL_reg_starttry = startpos); @@ -2562\,6 +2565\,7 @@ cache_re(re); state.ss = PL_savestack_ix; *PL_reglastparen = 0; + *PL_reglastcloseparen = 0; PL_reg_call_cc = &state; PL_reginput = locinput;
@@ -2619\,6 +2623\,7 @@ PL_regendp[n] = locinput - PL_bostr; if (n > *PL_reglastparen) *PL_reglastparen = n; + *PL_reglastcloseparen = n; break; case GROUPP: n = ARG(scan); /* which paren pair */ diff -u -r .orig/regexp.h ./regexp.h --- .orig/regexp.h Sun Apr 22 09:12:37 2001 +++ ./regexp.h Tue Jun 26 12:10:53 2001 @@ -37\,6 +37\,7 @@ I32 prelen; /* length of precomp */ U32 nparens; /* number of parentheses */ U32 lastparen; /* last paren matched */ + U32 lastcloseparen; /* last paren matched */ U32 reganch; /* Internal use only + Tainted information used by regexec? */ regnode program[1]; /* Unwarranted chumminess with compiler. */ diff -u -r .orig/thrdvar.h ./thrdvar.h --- .orig/thrdvar.h Mon Apr 30 05:29:37 2001 +++ ./thrdvar.h Tue Jun 26 12:12:52 2001 @@ -182\,6 +182\,7 @@ PERLVAR(Tregstartp\, I32 *) /* Pointer to startp array. */ PERLVAR(Tregendp\, I32 *) /* Ditto for endp. */ PERLVAR(Treglastparen\, U32 *) /* Similarly for lastparen. */ +PERLVAR(Treglastcloseparen\, U32 *) /* Similarly for lastcloseparen. */ PERLVAR(Tregtill\, char *) /* How far we are required to go. */ PERLVAR(Tregcompat1\, char) /* used to be regprev1 */ PERLVAR(Treg_start_tmp\, char **) /* from regexec.c */
Thanks\, applied.
On Jun 30\, Jeffrey Friedl said:
+so that if the regexp matched\, e.g.\, C\<$2> would contain 'cd' or 'ef'. For +convenience\, perl sets C\<$+> to the string held by the highest numbered +C\<$1>\, C\<$2>\, ... that got assigned (and\, somewhat related\, C\<$^N> to the +value of the C\<$1>\, C\<$2>\, ... most-recently assigned; i.e. the C\<$1>\, +C\<$2>\, ... associated with the rightmost closing parenthesis used in the +match).
Does that make $^N equal to
substr(???\, $-[-1]\, $+[-1] - $-[-1]);
If so\, you might want to include that in the docs for @- and/or @+.
On Jul 26\, Jeff 'japhy/Marillion' Pinyan said:
On Jun 30\, Jeffrey Friedl said:
+so that if the regexp matched\, e.g.\, C\<$2> would contain 'cd' or 'ef'. For +convenience\, perl sets C\<$+> to the string held by the highest numbered +C\<$1>\, C\<$2>\, ... that got assigned (and\, somewhat related\, C\<$^N> to the +value of the C\<$1>\, C\<$2>\, ... most-recently assigned; i.e. the C\<$1>\, +C\<$2>\, ... associated with the rightmost closing parenthesis used in the +match).
Does that make $^N equal to
substr(???\, $-[-1]\, $+[-1] - $-[-1]);
Hmm\, that should be substr(???\, $-[-1]\, $+[$#-] - $-[-1]). Just in case my logic with $^N is wrong\, I've not included it in the following patch. Oh\, speaking of the following patch...
I just found gross misinformation in perlvar.pod. Patch after sig.
-- Jeff "japhy" Pinyan japhy@pobox.com http://www.pobox.com/~japhy/ I am Marillion\, the wielder of Ringril\, known as Hesinaur\, the Winter-Sun. Are you a Monk? http://www.perlmonks.com/ http://forums.perlguru.com/ Perl Programmer at RiskMetrics Group\, Inc. http://www.riskmetrics.com/ Acacia Fraternity\, Rensselaer Chapter. Brother #734 ** Manning Publications\, Co\, is publishing my Perl Regex book **
01-07-27 01.55\, skrev Jeff 'japhy/Marillion' Pinyan på jeffp@crusoe.net följande:
If so\, you might want to include that in the docs for @- and/or @+.
Aren't @- and @+ deprecated?
-- Arthur
On Jul 27\, Arthur Bergman said:
01-07-27 01.55\, skrev Jeff 'japhy/Marillion' Pinyan på jeffp@crusoe.net följande:
If so\, you might want to include that in the docs for @- and/or @+.
Aren't @- and @+ deprecated?
Whoa. When did that happen? They just got here. I'd think they're totally UN-deprecated. You can get at $` and $& and $' by using the offsets they hold.
-- Jeff "japhy" Pinyan japhy@pobox.com http://www.pobox.com/~japhy/ I am Marillion\, the wielder of Ringril\, known as Hesinaur\, the Winter-Sun. Are you a Monk? http://www.perlmonks.com/ http://forums.perlguru.com/ Perl Programmer at RiskMetrics Group\, Inc. http://www.riskmetrics.com/ Acacia Fraternity\, Rensselaer Chapter. Brother #734 ** Manning Publications\, Co\, is publishing my Perl Regex book **
Thanks\, applied.
@cwest - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#7190 (status was 'resolved')
Searchable as RT7190$