Perl / perl5: 🐪 The Perl programming language
https://dev.perl.org/perl5/

A substitution that never terminates #1975

Closed p5pRT closed 19 years ago

p5pRT commented 24 years ago

Migrated from rt.perl.org#3248 (status was 'resolved')

Searchable as RT3248$

p5pRT commented 24 years ago

From @shields

The following test script causes Mail::Header to enter an infinite loop in Mail::Header::_fold_line. The header is malformed, but the failure mode is severe.

```perl
#!/usr/bin/perl

use Mail::Internet;

my $mail = <<'EOF';
To: xxxxxxx@xxxxxxx.xxxxx.xxx'" \xxxxxxx@xxxxxxx\.xxxxx\.xxx, "'xxxxxxx.x@xxxxxxxxx.xx'" \xxxxxxx\.x@xxxxxxxxx\.xx, "'xxxxxxx@xxxxxxx.xxx'" \xxxxxxx@xxxxxxx\.xxx, "'xxxxxxxxxxx@xxxxxxx.xxx'" \xxxxxxxxxxx@xxxxxxx\.xxx, "'xxxx@xxxxxxx.xxx'" \xxxx@xxxxxxx\.xxx
body
EOF

my $message = new Mail::Internet [split /\n/, $mail];

exit 0;
```

The expression that fails to terminate is:

```perl
$x .= $2 . "\n "
    while($_[0] =~ s/^(\s*(
                        [^"]{$min,$max}?[\,\;]
                       |[^"]{1,$max}[\s\n]
                       |[^\s"]+[\s\n]
                       |([^\s"]+("[^"]*")?)+[\s\n]
                      ))
                    //x);
```

I pruned this down to the test case:

```perl
#!/usr/bin/perl

$foo = <<'EOF';
xxxxxxx@xxxxxxx.xxxxx.xxx'" \xxxxxxx@xxxxxxx\.xxxxx\.xxx,
EOF

print "Starting\n";

$foo =~ s/^(\s*(([^\s"]+("[^"]*")?)+[\s\n]))//;

print "Finished\n";

exit 0;
```

This single substitution never terminates on perl 5.005_03, from the Debian Linux package version 5.005.03-06 and also from FreeBSD 3.4-RELEASE.
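The blow-up can be reproduced at a harmless size. A minimal sketch (mine, not from the ticket): on a fragment that ends in a lone quote with no trailing whitespace, the substitution can never succeed, and an unoptimized engine enumerates every way to partition the run of x's before giving up, doubling the work with each extra character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep the run short: 16 chars means roughly 2**16 backtracking paths on a
# 5.005-era engine; lengths near 25 and above take minutes there.
my $bad = ('x' x 16) . '"';    # lone quote, no trailing whitespace

# The alternative requires a final [\s\n], which $bad does not contain,
# so the substitution must fail; the only question is how long that takes.
my $matched = ($bad =~ s/^(\s*(([^\s"]+("[^"]*")?)+[\s\n]))//) ? 1 : 0;
print "matched=$matched\n";    # matched=0
```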

p5pRT commented 24 years ago

From @tamias

On Tue, May 16, 2000 at 04:40:02AM +0000, Michael Shields wrote:

The following test script causes Mail::Header to enter an infinite loop in Mail::Header::_fold_line. The header is malformed, but the failure mode is severe.

```perl
#!/usr/bin/perl

$foo = <<'EOF';
xxxxxxx@xxxxxxx.xxxxx.xxx'" \xxxxxxx@xxxxxxx\.xxxxx\.xxx,
EOF

print "Starting\n";

$foo =~ s/^(\s*(([^\s"]+("[^"]*")?)+[\s\n]))//;

print "Finished\n";

exit 0;
```

This single substitution never terminates on perl 5.005_03, from the Debian Linux package version 5.005.03-06 and also from FreeBSD 3.4-RELEASE.

How long did you wait?

The nested quantifiers make this regex take time exponential in the length of the target string to fail to find a match. The regex should be fixed.

For example:

```perl
s/^(\s*[^\s"]+(?:"[^"]*"[^\s"]+)*[\s])//;
```

Ronald
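A quick sanity check (my own sketch, not from the thread) that the rewrite consumes the same prefix as the original on well-formed input. The `[^\s"]*` variant proposed later in the thread is used here so a token may end with a quoted string; the sample inputs are mine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $agree = 1;
for my $input ('token rest', 'ab"cd ef"gh rest', 'ab"cd"ef more') {
    my ($orig, $new) = ($input, $input);
    $orig =~ s/^(\s*(([^\s"]+("[^"]*")?)+[\s\n]))//;    # original alternative
    $new  =~ s/^(\s*[^\s"]+(?:"[^"]*"[^\s"]*)*\s)//;    # rewrite, no nested quantifier
    $agree = 0 unless $orig eq $new;
}
print $agree ? "rewrite consumes the same prefix\n" : "rewrite differs\n";
```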

p5pRT commented 24 years ago

From @shields

How long did you wait?

Three minutes on a PII-450.

p5pRT commented 24 years ago

From @shields

I will therefore suggest the following patch to Mail::Header (v 1.17), based on your rewritten regex.

```diff
--- Header.pm.orig	Tue May 16 05:17:25 2000
+++ Header.pm	Tue May 16 05:26:15 2000
@@ -110,7 +110,7 @@
 	    [^"]{$min,$max}?[\,\;]
 	   |[^"]{1,$max}[\s\n]
 	   |[^\s"]+[\s\n]
-	   |([^\s"]+("[^"]*")?)+[\s\n]
+	   |[^\s"]+(?:"[^"]*"[^\s"]+)*[\s]
 	  ))
 	//x);
     $x .= $_[0];
```

Thanks.
p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Michael Shields <shields@msrl.com> wrote:

This single substitution never terminates on perl 5.005_03, from the Debian Linux package version 5.005.03-06 and also from FreeBSD 3.4-RELEASE.

Ronald has pointed out that it's taking a long time rather than looping (it takes a few minutes on my platform). But the optimisation seems to have been improved in perl 5.6.0, so it takes no time at all.

Mike Guy

p5pRT commented 24 years ago

From @tamias

On Tue, May 16, 2000 at 05:29:05AM +0000, Michael Shields wrote:

I will therefore suggest the following patch to Mail::Header (v 1.17), based on your rewritten regex.

```diff
--- Header.pm.orig	Tue May 16 05:17:25 2000
+++ Header.pm	Tue May 16 05:26:15 2000
@@ -110,7 +110,7 @@
 	    [^"]{$min,$max}?[\,\;]
 	   |[^"]{1,$max}[\s\n]
 	   |[^\s"]+[\s\n]
-	   |([^\s"]+("[^"]*")?)+[\s\n]
+	   |[^\s"]+(?:"[^"]*"[^\s"]+)*[\s]
                                  ^^^^
```

Oops, I should have written that as \s rather than [\s]; the brackets aren't needed.

\s includes \n, so [\s\n] is redundant; the same change can be made to the other alternatives.
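A two-line check of the claim (my sketch):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# \s already matches newline, so the character classes [\s\n], [\s],
# and bare \s accept exactly the same characters.
my $agree = 1;
for my $ch ("\n", " ", "\t", "x") {
    $agree = 0 unless (($ch =~ /[\s\n]/) ? 1 : 0) == (($ch =~ /\s/) ? 1 : 0);
}
print $agree ? "[\\s\\n] and \\s agree\n" : "classes differ\n";
```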

Ronald

p5pRT commented 24 years ago

From @gbarr

On Tue, May 16, 2000 at 04:40:02AM +0000, Michael Shields wrote:

The following test script causes Mail::Header to enter an infinite loop in Mail::Header::_fold_line. The header is malformed, but the failure mode is severe.

Yes, this regex does go on forever on malformed headers. But I have not had the time to redo the regex in question.

Code has been added to only perform this regex on "structured" headers, but there is an assumption that these headers are correctly formed.

If you have any suggestions let me know.

Thanks, Graham.

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Ronald J Kimball wrote:

How long did you wait?

The nested quantifiers make this regex take time exponential in the length of the target string to fail to find a match. The regex should be fixed.

For example:

```perl
s/^(\s*[^\s"]+(?:"[^"]*"[^\s"]+)*[\s])//;
```

I think that the regex engine could and should be less sensitive to this problem. Yes, it can be fooled. But something not dissimilar to the improvements that Ilya was just talking about in

http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-05/msg00509.html

could address this case. Take the original regex:

```perl
s/^(\s*(([^\s"]+("[^"]*")?)+[\s\n]))//;
```

The key problem is here:

```
...([^\s"]+("[^"]*")?)+...
```

because it searches over ways to partition a string on its way to failing, when there is no way the partition can matter. Now stare at that until you go a little cross-eyed, and let me label with numbers various key positions where you have just matched something and need to try to match a character:

```
1([^\s"]2+("3[^"]4*"5)?)
```

Now at those points, here is what you have to try, in order:

```
1: Match [^\s"] and -> 2
   Fail back

2: Match [^\s"] and -> 2
   Match " and -> 3
   Match [^\s"] and -> 2    # Problem
   Go on to the rest of the RE (not shown)
   Fail back

3: Match [^"] and -> 4
   Match " and -> 5
   Fail back

4: Match [^"] and -> 4
   Match " and -> 5
   Fail back

5: Match [^\s"] and -> 2
   Go on to the rest of the RE (not shown)
   Fail back
```

And, of course, it tries all of this recursively. So if it tries something and that eventually fails back, then it will go on to the next item on the list.

Now the problem, of course, is in 2, which can match [^\s"], fail, then try doing the same work again. If it has to make this decision on every single character, it gets to double its work each time, which makes it exponentially bad and turns into our performance disaster.

Now how could Ilya's suggestion, properly modified, help? Well, what he was talking about was looking for opportunities to do next-character discrimination. Then if someone went to write a regular expression for all of the states, they would only have to test "N" once, at that point either eliminating NC ND NE NH NJ NM NV NY, or else eliminating everything else. However, once you have set up next-character discrimination, you have the opportunity to also identify (and fix) the most common cases of this kind of disaster: just check when your next-character discrimination is going to result in reaching the same state in 2 ways, and when you notice that, eliminate the second way. (Note: be careful not to do this if the exact thing matched by a subpattern you are finishing is later used as a backreference!)

Here is a simple form of another common case:

```
(\s*\w\s*)+
```

Analyzing as above:

```
1(\s2*\w3\s*)+

1: Match \s and -> 2
   Match \w and -> 3
   Fail back

2: Match \s and -> 2
   Match \w and -> 3
   Fail back

3: Match \s and -> 3
   Match \s and -> 2
   Match \w and -> 3
   Go on to the rest of the RE (not shown)
   Fail back
```

Again, when you think about doing next-character discrimination, you have the chance to spot the most common poorly written regular expressions and fix them on the fly.
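The fix for this second case is the usual one: give every run of whitespace exactly one place to match. A sketch (the rewrite is mine, not from the thread) checking that the unambiguous form accepts and rejects the same strings:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $ambiguous   = qr/^(\s*\w\s*)+$/;        # a space can land in either \s*
my $unambiguous = qr/^\s*\w(?:\s*\w)*\s*$/; # each space has exactly one home

my $same = 1;
for my $s ('a', 'a b', '  ab  c ', 'a!', '!', '') {
    my $x = ($s =~ $ambiguous)   ? 1 : 0;
    my $y = ($s =~ $unambiguous) ? 1 : 0;
    $same = 0 unless $x == $y;
}
print $same ? "equivalent on all samples\n" : "languages differ\n";
```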

Were my C as developed as my math skills\, I might try to do this.

But it is not. :-(

Cheers, Ben

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

M.J.T. Guy writes:

Michael Shields <shields@msrl.com> wrote:

This single substitution never terminates on perl 5.005_03, from the Debian Linux package version 5.005.03-06 and also from FreeBSD 3.4-RELEASE.

Ronald has pointed out that it's taking a long time rather than looping (it takes a few minutes on my platform). But the optimisation seems to have been improved in perl 5.6.0, so it takes no time at all.

From a theoretical point of view this optimization is extremely buggy: I deduced the conditions under which the optimization can be applied, but in the heat of coding forgot to implement a check for these conditions.

Unfortunately, up to now I could not construct a test case where the optimization fails. ;-) :-( :-( :-( This means that I do not know how to fix this code so that:

  a) it always produces correct results;
  b) in typical cases the optimization is enabled.

The conditions I had in mind initially are very restrictive; say, they are not applicable to the documentation example

```
((\w{1,5}){1,5}){1,5}
```

Here is the scoop if you want to help me: if a long match is detected, we create a rectangular table with cells

("position in the REx", "position in the input string")

and keep the pairs of positions which were "visited" during the match. If we touch the same cell a second time, it means that the first visit failed, so we fail again (without trying the rest of the REx!).

Here is the problem: there is more state than this. Say, for (\w{1,5}){1,5}$:

  Suppose we are at \w again with the "external" count being 3. This number is part of the state of the match; say, it dictates that we may match \w{1,5} up to 2 more times before $.

But we do not keep this number in the table. Now, if we already visited this position in the string before with a different value of the external count and failed, this does not mean that we will fail now.

Thus the optimization is not viable. Now if somebody could provide an example based on the above hints...

Ilya

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Ben_Tilly@trepp.com writes:

Again, when you think about doing next-character discrimination, you have the chance to spot the most common poorly written regular expressions and fix them on the fly.

Yes, this is why I stole some time and put next-character discrimination code into 5.005_6*. Currently it is enabled for the first char only, to enable the STCLASS optimization (knowing which chars can start a REx).

In short: in some cases we can go as quickly as a DFA, and it is easy to detect these cases, and these cases are typical.

But what to do meanwhile? Answer: use (?>) instead.

Ilya
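Applied to the regex from this ticket, making the inner repetition atomic stops the engine from re-partitioning the x-run once the group has matched greedily. A sketch (the placement of (?>) is mine, not from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $bad = ('x' x 40) . '"';    # malformed: lone quote, no trailing whitespace

# Without (?>), a 5.005-era engine needs on the order of 2**40 steps to
# fail here; with the atomic group, failure is immediate, because the
# engine may not backtrack into the group to try other partitions.
my $matched = ($bad =~ /^(\s*(?>(?:[^\s"]+("[^"]*")?)+)[\s\n])/) ? 1 : 0;
print "matched=$matched\n";    # matched=0
```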

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Ilya wrote:

Ben_Tilly@trepp.com writes:

Again, when you think about doing next-character discrimination, you have the chance to spot the most common poorly written regular expressions and fix them on the fly.

Yes, this is why I stole some time and put next-character discrimination code into 5.005_6*. Currently it is enabled for the first char only, to enable the STCLASS optimization (knowing which chars can start a REx).

Why not enable it for arbitrary numbers of characters?

In short: in some cases we can go as quickly as a DFA, and it is easy to detect these cases, and these cases are typical.

About a year ago I figured out how to do a hybrid algorithm which can do virtually any construct an NFA can, will not have performance problems on any RE that can be done by a DFA, and removes the potential problems with NFAs when you recurse too far. Unfortunately its average performance suffers significantly, some of the features that Perl has (lookaheads and lookbehinds in particular) are problematic, and it is sufficiently different that I am doubtful it is worth trying to fit into Perl.

But then again, I remember a co-worker who was complaining about the abysmal performance of Perl reading comma-separated data; it turned out that he was parsing each line with constructs like /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/

:-(

But what to do meanwhile? Answer: use (?>) instead.

I suspect that virtually anyone with sufficient tuits to understand what (?>) means is likely to have enough tuits to write clear enough regular expressions that they don't need (?>) for the performance. (Though the fine control is useful.) The problem is how to get people who lack those tuits to stay out of trouble.

Personally, I think that careful use of /g to write miniature parsing engines that grab tokens is in some ways an ideal approach. Among other things, it allows the expression to get far more complicated without becoming unmanageable, and it allows exception checking to be put in place quite smoothly. It also provides a direct solution to problems that regular expressions by themselves do not (e.g. balanced tags). But I (for one) do not have a good suggestion for how to get people to use this consistently.
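The token-grabbing /g style can be sketched as follows; the token set and input are hypothetical illustrations, not from the thread:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $input = 'foo "bar baz" 42';
my @tokens;

for ($input) {                       # alias $input to $_ so //gc tracks pos($_)
    while (1) {
        /\G\s+/gc;                   # skip whitespace, if any
        if    (/\G"([^"]*)"/gc) { push @tokens, [ STRING => $1 ] }
        elsif (/\G(\d+)/gc)     { push @tokens, [ NUMBER => $1 ] }
        elsif (/\G(\w+)/gc)     { push @tokens, [ WORD   => $1 ] }
        elsif (/\G\z/gc)        { last }                  # end of input
        else                    { die "parse error at ", pos }
    }
}

print join(' ', map { "$_->[0]=$_->[1]" } @tokens), "\n";
```

The /c modifier keeps pos() in place on failure, so each failed alternative simply falls through to the next, and a final mismatch can raise an exception instead of silently restarting.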

Cheers, Ben

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Ben_Tilly@trepp.com writes:

But what to do meanwhile? Answer: use (?>) instead.

I suspect that virtually anyone with sufficient tuits to understand what (?>) means is likely to have enough tuits to write clear enough regular expressions that they don't need (?>) for the performance. (Though the fine control is useful.) The problem is how to get people who lack those tuits to stay out of trouble.

Personally, I think that careful use of /g to write miniature parsing engines that grab tokens is in some ways an ideal approach. Among other things, it allows the expression to get far more complicated without becoming unmanageable, and it allows exception checking to be put in place quite smoothly. It also provides a direct solution to problems that regular expressions by themselves do not (e.g. balanced tags). But I (for one) do not have a good suggestion for how to get people to use this consistently.

A module that defined a number of variables containing qr patterns for a number of purposes might help. It could also contain functions that construct a pattern based on arguments.

```perl
use re::pat;
...
/$re::pat::balpar/
...
$x = re::pat::balpar( '<([', '>)]' );
```

Such a module would let non-experts:

  1. avoid writing bad REs for common tasks that are provided in the module;
  2. see examples of how to write good REs, as a hint for when they have to deal with a less common task.
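A sketch of what such a pattern library might look like. The package name follows the proposal above; the patterns and names below are hypothetical illustrations, and a real `balpar` for arbitrarily nested delimiters needs more machinery than a plain qr//:

```perl
#!/usr/bin/perl
use strict;
use warnings;

package re::pat;    # hypothetical package, as proposed above

# double-quoted string, backslash escapes allowed inside
our $quoted = qr/"[^"\\]*(?:\\.[^"\\]*)*"/;
# signed decimal number
our $number = qr/[-+]?\d+(?:\.\d+)?/;

package main;

my $ok = (q{"a \"b\" c"} =~ /^$re::pat::quoted$/)
      && ('-3.14'        =~ /^$re::pat::number$/);
print $ok ? "patterns ok\n" : "patterns broken\n";
```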

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

On Tue, May 16, 2000 at 03:40:57PM -0400, Ben_Tilly@trepp.com wrote:

Yes, this is why I stole some time and put next-character discrimination code into 5.005_6*. Currently it is enabled for the first char only, to enable the STCLASS optimization (knowing which chars can start a REx).

Why not enable it for arbitrary numbers of characters?

How would you use it? There was already an infrastructure to use STCLASS info. One needs to create the infrastructure for at least:

  a) optimizing alternation based on the next-char info;

  b) (?>)-izing repetition

     A*B => (?>A*)B

     if A and B cannot match at the same place;

  c) run-time code similar to "b" if there are *some* next-chars which allow A to match but not B, or vice versa.

About a year ago I figured out how to do a hybrid algorithm which can do virtually any construct an NFA can, will not have performance problems on any RE that can be done by a DFA, and removes the potential problems with NFAs when you recurse too far.

I think with a/b/c above (and better visited-position code) we can get very close to this with minimal changes to the current REx engine. (At least when one can optimize "a" into deeper trie-based dispatch trees.)

But then again, I remember a co-worker who was complaining about the abysmal performance of Perl reading comma-separated data; it turned out that he was parsing each line with constructs like /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/

:-(

If one changes the code for fixed substrings to support an arbitrary number of fixed substrings, we can get enough info to optimize *this* too (the optimizer will find all 8 commas in the string). How to make the REx runtime use this info from the optimizer is a separate question...

But what to do meanwhile? Answer: use (?>) instead.

I suspect that virtually anyone with sufficient tuits to understand what (?>) means is likely to have enough tuits to write clear enough regular expressions that they don't need (?>) for the performance.

What for? Just use (?>).

complicated without becoming unmanageable,

How so? Why do you think well-commented Perl code with 17 simple RExen scattered around is better than one equivalent well-commented REx?

Definitely /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/ expresses one's intentions well enough.

and it allows exception checking to be put in place quite smoothly.

This is an interesting observation. Can we do the same with (a modification to) (?{}) ?

It also provides a direct solution to problems that regular expressions by themselves do not (e.g. balanced tags).

Which are handled by the REx engine without any problem.

Ilya

p5pRT commented 24 years ago

From @gbarr

On Tue, May 16, 2000 at 09:40:40AM -0400, Ronald J Kimball wrote:

On Tue, May 16, 2000 at 05:29:05AM +0000, Michael Shields wrote:

I will therefore suggest the following patch to Mail::Header (v 1.17), based on your rewritten regex.

```diff
--- Header.pm.orig	Tue May 16 05:17:25 2000
+++ Header.pm	Tue May 16 05:26:15 2000
@@ -110,7 +110,7 @@
 	    [^"]{$min,$max}?[\,\;]
 	   |[^"]{1,$max}[\s\n]
 	   |[^\s"]+[\s\n]
-	   |([^\s"]+("[^"]*")?)+[\s\n]
+	   |[^\s"]+(?:"[^"]*"[^\s"]+)*[\s]
```

While this may fix the forever loop, it causes test cases to fail.

Oops, I should have written that as \s rather than [\s]; the brackets aren't needed.

\s includes \n, so [\s\n] is redundant; the same change can be made to the other alternatives.

Yes, for some reason I always do that.

Graham.

p5pRT commented 24 years ago

From @tamias

On Tue, May 16, 2000 at 09:55:51PM +0100, Graham Barr wrote:

On Tue, May 16, 2000 at 09:40:40AM -0400, Ronald J Kimball wrote:

On Tue, May 16, 2000 at 05:29:05AM +0000, Michael Shields wrote:

I will therefore suggest the following patch to Mail::Header (v 1.17), based on your rewritten regex.

```diff
--- Header.pm.orig	Tue May 16 05:17:25 2000
+++ Header.pm	Tue May 16 05:26:15 2000
@@ -110,7 +110,7 @@
 	    [^"]{$min,$max}?[\,\;]
 	   |[^"]{1,$max}[\s\n]
 	   |[^\s"]+[\s\n]
-	   |([^\s"]+("[^"]*")?)+[\s\n]
+	   |[^\s"]+(?:"[^"]*"[^\s"]+)*[\s]
```

While this may fix the forever loop, it causes test cases to fail.

I think it fails because the new regex requires at least one [^\s"] after the last quoted string, where the old one didn't.

Try this instead:

```perl
[^\s"]+(?:"[^"]*"[^\s"]*)*\s
```

That should still avoid the exponential failure time.
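The difference between the two rewrites in one test (the sample string is mine):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = 'abc"def" rest';    # token that ends with a quoted string

# first rewrite: requires at least one [^\s"] after the quotes
my $v1 = ($s =~ /^[^\s"]+(?:"[^"]*"[^\s"]+)*\s/) ? 1 : 0;
# corrected rewrite: the quotes may end the token
my $v2 = ($s =~ /^[^\s"]+(?:"[^"]*"[^\s"]*)*\s/) ? 1 : 0;

print "v1=$v1 v2=$v2\n";    # v1=0 v2=1
```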

Oops, I should have written that as \s rather than [\s]; the brackets aren't needed.

\s includes \n, so [\s\n] is redundant; the same change can be made to the other alternatives.

Yes, for some reason I always do that.

It's redundant, but it's also explicit. Works either way. :)

Ronald

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

On Tue, May 16, 2000 at 07:27:51PM -0400, Ben_Tilly@trepp.com wrote:

One needs to create the infrastructure for at least:

a) optimizing alternation based on the next-char info;

b) (?>)-izing repetition

   A*B => (?>A*)B

   if A and B cannot match at the same place;

c) run-time code similar to "b" if there are *some* next-chars which allow A to match but not B, or vice versa.

You only need a) to be useful.

Do you mean "a) alone would be useful too"?

Just make it possible to put off the decision about whether you are following path A or B as long as feasible.

Exactly the opposite. Make this decision as early as possible, so that you do not need to check A, B, ..., C in (A|B|...|C|M|...) if they cannot match.

As for b) and c), I presume you mean that B will match there only if A does, thereby making a run-time test for B matching superfluous...

Exactly the opposite. If A and B cannot match at the same position, then A*B is equivalent to (?>A*)B.

I think with a/b/c above (and better visited-position code) we can get very close to this with minimal changes to the current REx engine. (At least when one can optimize "a" into deeper trie-based dispatch trees.)

DANGER. Better visited-position code needs to know, for each position in the RE, the full relevant state that can matter. This includes, for fixed repetitions, the repetition count, and for backreferences that actually get used, the backreference. This information can get prohibitive very fast if you have a long match which is long because of backreferences. (Read: exponential memory usage!)

This is not how I see it. You determine (at compile time) whether the repetition counts are going to be needed. Say, I would guess they are not needed for {n,infty} limits and for {n,n+1} limits (thus for all the simple cases *, +, ?!). With backreferences it is even easier: in the current implementation, when \1 is accessed, all the visited-position data is marked as invalid.

I would suggest a fix that consists of unravelling all fixed repetitions before compiling, so that the only state that matters is the backreferences that get used in the RE.

Easier: just keep a flag that "we are inside a fixed run of {n,m}". For {n,infty} we can collect the visited-position info even when this flag is set, but we do not *use* this position info while the flag is set.

But then again, I remember a co-worker who was complaining about the abysmal performance of Perl reading comma-separated data; it turned out that he was parsing each line with constructs like /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/

:-(

If one changes the code for fixed substrings to support an arbitrary number of fixed substrings, we can get enough info to optimize *this* too (the optimizer will find all 8 commas in the string). How to make the REx runtime use this info from the optimizer is a separate question...

With sufficient analysis you can optimize anything. Do you want to put that effort out for a special case like this?

Multiple fixed substrings are not a special case. We already handle 2 fixed substrings (one anchored, one floating), and the code for the general case may be even simpler (as happens from time to time). The problem is how to communicate this info to the REx engine, so that it can optimize some A*B if it knows that B cannot match, say, for the next 200 positions.

Besides which, your visited-positions optimization will optimize the above case as well.

No. We do not use this for SIMPLE*; the code for SIMPLE* is very quick now, and this could make it significantly slower.

What for? Just use (?>).

What tool do I reach for first? I mostly write simple REs, so I would rarely benefit.

?? Whenever you want * to be maximally greedy, use (?>). As easy as that.

How so? Why do you think well-commented Perl code with 17 simple RExen scattered around is better than one equivalent well-commented REx?

Different mode of thought. Additionally, the best goal is frequently to populate an arbitrarily complex data structure and then work off of that. RE engines are not generally designed to do that, and even if they were, that would effectively mean learning a second language to figure out how to get that capability.

If you use RExen, it is not a second language. You call RExen from Perl anyway, so calling Perl from RExen would come as a natural extension. Well, some people even think that s/a/ EXPR /e makes the REx engine call EXPR, so they would be easy fodder... ;-)

The basic mode of thought is that regular expressions are a great way to identify the next token in a data stream, but to manipulate said tokens in complex ways, you really want to have a full programming language.

From time to time: yes. So use (?{}). ;-)

And if you need to backtrack, then having one REx would be much easier than calling many RExen from a parser...

Ilya

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

On Tue, May 16, 2000 at 03:40:57PM -0400, Ben_Tilly@trepp.com wrote:

About a year ago I figured out how to do a hybrid algorithm which can do virtually any construct an NFA can, will not have performance problems on any RE that can be done by a DFA, and removes the potential problems with NFAs when you recurse too far.

Maybe *you* can clear this up for me: how do DFAs without memory implement the POSIX semantics? All I can invent without using memory is: color the chars of the string which are inside possible matches red, and make the DFA output the end of the first red region.

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Ronald J Kimball writes:

\s includes \n, so [\s\n] is redundant; the same change can be made to the other alternatives.

Yes, for some reason I always do that.

It's redundant, but it's also explicit. Works either way. :)

And without `use locale' they produce identical RExen...

Ilya

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Ilya wrote:

On Tue, May 16, 2000 at 03:40:57PM -0400, Ben_Tilly@trepp.com wrote:

Yes, this is why I stole some time and put next-character discrimination code into 5.005_6*. Currently it is enabled for the first char only, to enable the STCLASS optimization (knowing which chars can start a REx).

Why not enable it for arbitrary numbers of characters?

How would you use it? There was already an infrastructure to use STCLASS info. One needs to create the infrastructure for at least:

a) optimizing alternation based on the next-char info;

b) (?>)-izing repetition

   A*B => (?>A*)B

   if A and B cannot match at the same place;

c) run-time code similar to "b" if there are *some* next-chars which allow A to match but not B, or vice versa.

You only need a) to be useful. Just make it possible to put off the decision about whether you are following path A or B as long as feasible.

As for b) and c), I presume you mean that B will match there only if A does, thereby making a run-time test for B matching superfluous...

I think with a/b/c above (and better visited-position code) we can get very close to this with minimal changes to the current REx engine. (At least when one can optimize "a" into deeper trie-based dispatch trees.)

DANGER. Better visited-position code needs to know, for each position in the RE, the full relevant state that can matter. This includes, for fixed repetitions, the repetition count, and for backreferences that actually get used, the backreference. This information can get prohibitive very fast if you have a long match which is long because of backreferences. (Read: exponential memory usage!)

I would suggest a fix that consists of unravelling all fixed repetitions before compiling, so that the only state that matters is the backreferences that get used in the RE, and then throwing away visited-position information whenever a backreference that appears anywhere in the RE gets modified. That is both safe and it covers almost all of the common cases that could arise.

But then again, I remember a co-worker who was complaining about the abysmal performance of Perl reading comma-separated data; it turned out that he was parsing each line with constructs like /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/

:-(

If one changes the code for fixed substrings to support an arbitrary number of fixed substrings, we can get enough info to optimize *this* too (the optimizer will find all 8 commas in the string). How to make the REx runtime use this info from the optimizer is a separate question...

With sufficient analysis you can optimize anything. Do you want to put that effort out for a special case like this? I think that it was best in the long run that I had the opportunity to explain the joys of split, which made his code not only faster but a lot clearer as well.

Besides which, your visited-positions optimization will optimize the above case as well.

But what to do meanwhile? Answer: use (?>) instead.

I suspect that virtually anyone with sufficient tuits to understand what (?>) means is likely to have enough tuits to write clear enough regular expressions that they don't need (?>) for the performance.

What for? Just use (?>).

What tool do I reach for first? I mostly write simple REs, so I would rarely benefit. Therefore, when I do need to think about the potential for a problem, I find it easier to just be unambiguous than to go looking in the documentation for an option that I don't remember, which other people who need to read my code are unlikely to have ever heard of either.

Besides which, the simple expedient of writing clear REs is a skill that transfers to all other tools where you might need it. Somehow I find that even "Perl compatible regular expressions" are not truly Ilya compatible. :-)

complicated without becoming unmanageable,

How so? Why do you think well-commented Perl code with 17 simple RExen scattered around is better than one equivalent well-commented REx?

Different mode of thought. Additionally, the best goal is frequently to populate an arbitrarily complex data structure and then work off of that. RE engines are not generally designed to do that, and even if they were, that would effectively mean learning a second language to figure out how to get that capability.

The basic mode of thought is that regular expressions are a great way to identify the next token in a data stream, but to manipulate said tokens in complex ways, you really want to have a full programming language.

Definitely /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/ expresses one's intentions well enough.

Split does it better.
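For the record, the split version of the co-worker's parse runs in linear time and needs no nested quantifiers (the sample line is mine):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line   = 'a,b,,d,e,f,g,h,';
my @fields = split /,/, $line, -1;   # LIMIT of -1 keeps trailing empty fields

print scalar(@fields), " fields\n"; # 9 fields, matching the 9 capture groups
```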

and it allows exception checking to be put in place quite smoothly.

This is an interesting observation. Can we do the same with (a modification to) (?{}) ?

Not that I can see. A trivial example of why not would be an exception that means you need to read more data in. This arises naturally if, for instance, there turned out to be an embedded newline in a quoted field in a CSV file.

It also provides a direct solution to problems that regular expressions by themselves do not (e.g. balanced tags).

Which are handled by the REx engine without any problem.

Right, another of those extensions that I don't know to reach for.

I don't believe that it existed in 5.005 either, which makes it an extension that does not exist in any version of Perl I would feel comfortable putting in production at the moment.

The idea of a parse engine is another idea that I can take with me and use in other languages...

Cheers, Ben

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Ilya Zakharevich wrote:

On Tue, May 16, 2000 at 07:27:51PM -0400, Ben_Tilly@trepp.com wrote:

One needs to create the infrastructure for at least:

a) optimizing alternation based on the next-char info;

b) (?>)-izing repetition

   A*B => (?>A*)B

   if A and B cannot match at the same place;

c) run-time code similar to "b" if there are *some* next-chars which allow A to match but not B, or vice versa.

You only need a) to be useful.

Do you mean "a) alone would be useful too"?

Yup.

Just make it possible to put off the decision about whether you are following path A or B as long as feasible.

Exactly the opposite. Make this decision as early as possible\, so that you do not need to check A\,B\,...\,C in (A|B|...|C|M|...) if they cannot match.

We are talking past each other. If someone writes /Alaska|Alabama|Indiana/\, it would be a good thing to effectively make this the same as /(?:Ala(?:ska|bama))|Indiana/. Discriminate between A.* and Indiana ASAP\, but also don't bother needlessly backtracking on "Allah".
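The hand-factored form is observably equivalent to the flat alternation, which is what a trie optimization would derive automatically (test words are mine):

```perl
# The flat and factored alternations accept exactly the same words;
# only the amount of backtracking differs.
for my $w (qw(Alaska Alabama Indiana Allah)) {
    my $flat     = ($w =~ /^(?:Alaska|Alabama|Indiana)$/)  ? 1 : 0;
    my $factored = ($w =~ /^(?:Ala(?:ska|bama)|Indiana)$/) ? 1 : 0;
    print "$w: flat=$flat factored=$factored\n";
}
```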

As for b) and c)\, I presume you mean that B will match there only if A does\, thereby making a run-time test for B matching superfluous...

Exactly the opposite. If A and B cannot match at the same position\, then A*B is equivalent to (?>A*)B.

I think you missed my point here. Consider the simple expression​: /^(\s*yada\s*)+$/. This is poorly written. When you analyze it carefully\, after matching yada\, if you encounter a space\, that space can match either after yada\, or before the next yada. If it is possible for that space to match before yada and successfully complete that match\, then it is possible for that space to match after yada and complete that match. Which match will the engine actually find?

Knowing this fact\, we can know that after matching yada\, we should never bother backtracking from trying to match a space after yada to trying to match one before - the second is wasted work. Guaranteed.
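Watching the last value of $1 shows which of the two ways the engine actually takes (test string is mine):

```perl
# The greedy \s* *after* "yada" wins: the first iteration captures
# "yada " with the space, so the final iteration captures a bare "yada".
"yada yada" =~ /^(\s*yada\s*)+$/ or die "no match";
print "last group: '$1'\n";   # last group: 'yada'
```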

I think with a/b/c above (and better visited-position code) we can go very close to this with minimal changes to the current REx engine. (At least when one can optimize "a" to deeper trie-based dispatch trees).

DANGER. Better visited position code needs to know for each position in the RE the full relevant state that can matter. This includes for fixed repetitions the repetition count\, and for backreferences that actually get used\, the backreference. This information can get prohibitive very fast if you have a long match which is long because of back references. (Read exponential memory usage!)

This is not how I see it. You determine (in compile time) whether the repetition counts are going to be needed. Say\, I would guess they are not needed for {n\,infty} limits and for {n\,n+1} limits (thus for all the simple cases *+? !). With backreferences it is even easier​: in the current implementation when \1 is accessed\, all the visited-position data is marked as invalid.

Ack! No!

Throwing away the optimization when you notice that \1 is used is fine\, though throwing it away at runtime when you notice a change instead of at compile time will make the engine more robust. I agree that you do not need it for +\, ?\, and *. But neither generalization holds\, and here are simple test cases that show it​:

  "yyyx" =~ /y{1,2}x/
  "yyyx" =~ /y*y{2,}x/
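Both expressions do match in current perls; the point of the examples is that a visited-position cache which ignored repetition counts would wrongly block the retries they need:

```perl
# Both succeed today.  A cache marking "y cannot match at position p"
# without tracking the {n,m} count would abort these backtracking
# retries and turn the matches into failures.
print +("yyyx" =~ /y{1,2}x/)  ? "match\n" : "no match\n";   # match
print +("yyyx" =~ /y*y{2,}x/) ? "match\n" : "no match\n";   # match
```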

I would suggest a fix that consists of unravelling all fixed repetitions before compiling so that the only state that matters is the backreferences that get used in the RE

Easier​: just keep the flag that "we are inside a fixed run of {n\,m}". For {n\,infty} we can collect the matched position info even when this flag is present\, but we do not *use* this position unless this flag is not set.

Good point. :-)

[ ...   Discussion on /(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)/   truncated   ...]

Multiple-fixed-substring is not a special case. We already do 2 fixed substrings (one anchored\, another floating)\, and the code for a general case may be even simpler (as it happens time to time). The problem is how to communicate this info to the REx engine\, so it can optimize some A*B if it knows that B cannot match\, say\, for the next 200 positions.

I see your point but I will have to think about whether I agree.

My understanding is that searching for fixed strings is done first because you can search for long fixed sub-strings faster than you can scan through a string..?

Besides which your visited positions optimization will optimize the above case as well.

No. We do not use this for SIMPLE* - the code for SIMPLE* is very quick now\, and this could make it significantly slower.

Where is the change-over? The code above is pretty slow\, and if you use it for scanning a file it becomes abysmally so. No matter how much you optimize it\, I am glad I got to lecture about the joys of split earlier rather than later.

What for? Just use (?>).

What tool do I reach for first? I mostly write simple REs so I would rarely benefit.

?? Whenever you want * to be maximally-greedy\, use (?>). As easy as that.

Whenever you want * to be maximally-greedy\, make it unambiguously not match anything that what follows can match. As easy as that\, and it works in every tool that provides REs.
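Ben's recipe in practice (example string is mine): make the repeated class unable to consume what follows, and the star becomes effectively non-backtracking in any RE dialect, no (?>) required.

```perl
# [^"]* cannot consume the closing quote, so the star never has to give
# characters back; the unambiguous class does the work of (?>).
my $s = 'say "hello" and "goodbye"';
my ($first) = $s =~ /"([^"]*)"/;
print "$first\n";   # hello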

[...]

Different mode of thought. Additionally the best goal is frequently to populate an arbitrarily complex data structure and then work off of that. RE engines are not generally designed to do that\, and even if they were\, that would effectively mean learning a second language to figure out how to get that capability.

If you use RExen\, it is not a second language. You call RExen from Perl anyway\, so calling Perl from RExen would come as a natural extension. Well\, some people even think that s/a/ EXPR /e makes REx engine call EXPR\, so they would be an easy fodder... ;-)

I agree with something Larry once said\, the point of REs is to provide nicely encapsulated line-noise for pattern matching. I am not a fan of embedding complex logic in line-noise.

The basic mode of thought is that regular expressions are a great way to identify the next token in a data stream\, but to manipulate said tokens in complex ways\, you really want to have a full programming language.

Time-to-time​: yes. So use (?{}). ;-)

And if you need to backtrack\, then having one REx would be much easier than calling many RExen from a parser...

And there I see what made me uncomfortable with the idea.

If you wish to extend (?{}) to be useful\, here are the things that I would need to see​:

1) (?{forward_code}{backtrack_code}). If you have to backtrack\, give   a way for people to undo what they just did with the forward code.

2) (?\/) No backtracking past here. Many parsing problems come down   to identifying tokens\, and then doing things with them. One of the   strengths of a parsing engine like I described is that it lets you   execute code safely in the conviction that you can view a unit of   work as being complete. This greatly simplifies the thinking   process - indeed not understanding backtracking is the cause of   most poorly designed regular expressions.

Without the ability for embedded code to either execute knowing that it does not have to worry about backtracking\, or to execute knowing that it can undo what it did if backtracking happens\, it is very hard to meaningfully use intermediate information in the execution of RExen.

Even then I am not sure how often I would use it since I don't face many complex parsing problems.

Cheers\, Ben

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

On Wed\, May 17\, 2000 at 05​:21​:39PM -0400\, Ben_Tilly@​trepp.com wrote​:

Just make it possible to put off the decision about whether you are following path A or B as long as feasible.

Exactly the opposite. Make this decision as early as possible\, so that you do not need to check A\,B\,...\,C in (A|B|...|C|M|...) if they cannot match.

We are talking past each other. If someone writes /Alaska|Alabama|Indiana/\, it would be a good thing to effectively make this the same as /(?:Ala(?:ska|bama))|Indiana/. Discriminate between A.* and Indiana ASAP\, but also don't bother needlessly backtracking on "Allah".

This is the target\, but I do not see any relationship to your sentence "put off decision as long as possible".

As for b) and c)\, I presume you mean that B will match there only if A does\, thereby making a run-time test for B matching superfluous...

Exactly the opposite. If A and B cannot match at the same position\, then A*B is equivalent to (?>A*)B.

I think you missed my point here. Consider the simple expression​: /^(\s*yada\s*)+$/. This is poorly written. When you analyze it carefully\, after matching yada\, if you encounter a space\, that space can match either after yada\, or before the next yada. If it is possible for that space to match before yada and successfully complete that match\, then it is possible for that space to match after yada and complete that match. Which match will the engine actually find?

Finally\, this is documented in 5.6.0 ;-) (if you do not consider fairy tales about "Backtracking" which worked as documentation substitutes before).

Knowing this fact\, we can know that after matching yada\, we should never bother backtracking from trying to match a space after yada to trying to match one before - the second is wasted work. Guaranteed.

I think this is a fairly complicated optimization. The sniffer would need to know *a lot* to deduce this.

DANGER. Better visited position code needs to know for each position in the RE the full relevant state that can matter. This includes for fixed repetitions the repetition count\, and for backreferences that actually get used\, the backreference. This information can get prohibitive very fast if you have a long match which is long because of back references. (Read exponential memory usage!)

This is not how I see it. You determine (in compile time) whether the repetition counts are going to be needed. Say\, I would guess they are not needed for {n\,infty} limits and for {n\,n+1} limits (thus for all the simple cases *+? !). With backreferences it is even easier​: in the current implementation when \1 is accessed\, all the visited-position data is marked as invalid.

Ack! No!

Throwing away the optimization when you notice that \1 is used is fine\, though throwing it away at runtime when you notice a change instead of at compile time will make the engine more robust.

"Will make"? As I said\, it is doing it now (note "accessed").

I agree that you do not need it for +\, ?\, and *. But neither generalization holds\, and here are simple test cases that show it​:

"yyyx" =~ /y{1,2}x/
"yyyx" =~ /y*y{2,}x/

Shows what?

My understanding is that searching for fixed strings is done first because you can search for long fixed sub-strings faster than you can scan through a string..?

I think this is more than an order of magnitude faster. Probably two.

Besides which your visited positions optimization will optimize the above case as well.

No. We do not use this for SIMPLE* - the code for SIMPLE* is very quick now\, and this could make it significantly slower.

Where is the change-over? The code above is pretty slow\,

But polynomial. Well\, I agree that for all practical purposes n^9 is not a polynomial. ;-)

The changeover is governed by what is repeated by {n\,m}. If it constant-length\, no-side-effects-except-setting-a-group\, then we can match it without any real-backtracking. Such cases are quick\, and they are not involved in the above optimization.

Of course\, if you have 9 such groups in turn\, this gives you n^9 time...

?? Whenever you want * to be maximally-greedy\, use (?>). As easy as that.

Whenever you want * to be maximally-greedy\, make it unambiguously not match anything that what follows can match.

Just compare which recipe is easier-to-use and expresses your intentions better.

If you wish to extend (?{}) to be useful\, here are the things that I would need to see​:

1) (?{forward_code}{backtrack_code}). If you have to backtrack\, give a way for people to undo what they just did with the forward code.

There is a way​: use localization and DESTROY\, or DESTROY on $^R (which is autolocalized). It is not pretty\, but it is there. If a consistent usage mode appears\, one can make a syntactic sugar for this. Without proof that it is needed\, I do not want to make things harder.

Some functionality like this is badly needed\, but I'm not sure how it should actually look from the user point of view.

2) (?\/) No backtracking past here.

This is (?>)\, but (?>) is better​: nobody could explain what the above sentence *actually means*. (?>) provides a scope to limit backtracking-prohibition to\, thus makes things defined.

Ilya

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Ilya wrote​:

On Wed\, May 17\, 2000 at 05​:21​:39PM -0400\, Ben_Tilly@​trepp.com wrote​: [...]

We are talking past each other. If someone writes /Alaska|Alabama|Indiana/\, it would be a good thing to effectively make this the same as /(?:Ala(?:ska|bama))|Indiana/. Discriminate between A.* and Indiana ASAP\, but also don't bother needlessly backtracking on "Allah".

This is the target\, but I do not see any relationship to your sentence "put off decision as long as possible".

While matching /Ala/ you are putting off the decision about whether you are doing the work to wind up matching Alaska or Alabama for 3 characters.

I interpreted your claim about only optimizing one character of next-character discrimination as\, in this situation\, making the match look like /(?:A(?:laska|labama))|Indiana/\, which is an improvement but not a big one.

[...]

I think you missed my point here. Consider the simple expression​: /^(\s*yada\s*)+$/. This is poorly written. When you analyze it carefully\, after matching yada\, if you encounter a space\, that space can match either after yada\, or before the next yada. If it is possible for that space to match before yada and successfully complete that match\, then it is possible for that space to match after yada and complete that match. Which match will the engine actually find?

Finally\, this is documented in 5.6.0 ;-) (if you do not consider fairy tales about "Backtracking" which worked as documentation substitutes before).

It was always documented... in the source code.

Some forms of documentation are easier to read\, that is all.

Knowing this fact\, we can know that after matching yada\, we should never bother backtracking from trying to match a space after yada to trying to match one before - the second is wasted work. Guaranteed.

I think this is a fairly complicated optimization. The sniffer would need to know *a lot* to deduce this.

Read back a few emails.

If it tried to deduce this from scratch\, yes. But if you aggressively pursue trieing first\, the internal representation of this RE should reduce down to a representation in which it is a trivial deduction. (One which does not\, unfortunately\, directly get represented by any RE.)

I laid out detailed examples before. The point is that in trying to optimize the match\, you will wind up merging the two choices of how to match \s* into one state where you are matching \s* and then trying to figure out where you put the parentheses - then it is trivial to notice the optimization.

Try it by hand and you should see what I mean.

[...]

This is not how I see it. You determine (in compile time) whether the repetition counts are going to be needed. Say\, I would guess they are not needed for {n\,infty} limits and for {n\,n+1} limits (thus for all the simple cases *+? !). With backreferences it is even easier​: in the current implementation when \1 is accessed\, all the visited-position data is marked as invalid.

Ack! No!

Throwing away the optimization when you notice that \1 is used is fine\, though throwing it away at runtime when you notice a change instead of at compile time will make the engine more robust.

"Will make"? As I said\, it is doing it now (note "accessed").

I meant that were you to still attempt the optimization on RExen where \1 gets accessed\, but throw away the collected information every time \1 changes value\, then some performance disasters would be fixed.

This is worthwhile.

I agree that you do not need it for +\, ?\, and *. But neither generalization holds\, and here are simple test cases that show it​:

  "yyyx" =~ /y{1,2}x/

Whoops\, I had a braino. My example above should have been

  "yyyx" =~ /y*y{2,3}x/

"yyyx" =~ /y*y{2,}x/

Shows what?

They are regular expressions of the forms you thought you could apply the visited-position optimization to without repetition counts; they match without the optimization and fail to match with it.

In both of them the issue is that when you backtrack on the y* you will fail y{2\,} at first because y* matched all but none\, and all but 1\, and after that you should succeed but the optimization will stop you dead in your tracks.

[...]

The changeover is governed by what is repeated by {n\,m}. If it constant-length\, no-side-effects-except-setting-a-group\, then we can match it without any real-backtracking. Such cases are quick\, and they are not involved in the above optimization.

Of course\, if you have 9 such groups in turn\, this gives you n^9 time...

From your description I understood that if the RE took above some runtime threshold\, then you retried with the optimization on. The example /(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)/ will be sped up by the optimization from n^9 behaviour to n^2. While this is not as good as the linear split\, in many cases it is acceptable.

So\, does the optimization hit in this case?

?? Whenever you want * to be maximally-greedy\, use (?>). As easy as that.

Whenever you want * to be maximally-greedy\, make it unambiguously not match anything that what follows can match.

Just compare which recipe is easier-to-use and expresses your intentions better.

To the computer\, to me\, or to whoever is unlucky enough to read my code? The computer probably likes the cleaner (?>)\, I forget what it means so the clean RE is probably easier for me\, and whoever is unlucky enough to read my code is unlikely to know about (?>) unless they do little but hack Perl.

Plus the skill of writing unambiguous RExen is of general use.

If you wish to extend (?{}) to be useful\, here are the things that I would need to see​:

1) (?{forward_code}{backtrack_code}). If you have to backtrack\, give a way for people to undo what they just did with the forward code.

There is a way​: use localization and DESTROY\, or DESTROY on $^R (which is autolocalized). It is not pretty\, but it is there. If a consistent usage mode appears\, one can make a syntactic sugar for this. Without proof that it is needed\, I do not want to make things harder.

AFAICS localization of variables is insufficient if your goal is to populate a parse tree. Plus it is inefficient to produce and destroy stacks of variables when you could just increment and decrement a counter.

Some functionality like this is badly needed\, but I'm not sure how it should actually look from the user point of view.

2) (?\/) No backtracking past here.

This is (?>)\, but (?>) is better​: nobody could explain what the above sentence *actually means*. (?>) provides a scope to limit backtracking-prohibition to\, thus makes things defined.

This is an independent idea from (?>). Consider this​:

  /
  \G
  (
    (>                 # Hmmm...the shoe fits :-)
      "(?\/)(?:[^"]|"")*"
    |
      [^,\n]*
    )
  )
  (?=,|\n|$)
  /gx

This will match the next field in a .csv format. The point is that if the field starts with a "\, then it is a quoted field\, otherwise it is an unquoted field. How can I say that now?

  /
  \G
  (
    (>                 # Hmmm...the shoe fits :-)
      "(?:[^"]|"")*"
    |
      (?!")[^,\n]*
    )
  )
  (?=,|\n|$)
  /gx

Oops\, I have to embed the rule about what started a quoted field everywhere except where I actually wanted to see it! This is an inefficient way to express myself\, and for a complex format this could quickly get prohibitively complex to keep putting all of the negative lookaheads everywhere that they are needed.

Providing an easy way to guarantee spots in an RE where you will not backtrack can simplify your life\, and the spots I would want to put code would be points where you have just recognized some meaningful token\, which are also points where you would want to not backtrack from...

Programming is not so much the art of being able to do complex things as it is the art of being able to make complexity managable by humans. Beyond a narrow horizon\, backtracking is not very managable for most humans.

Cheers\, Ben

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

On Wed\, May 17\, 2000 at 07​:24​:12PM -0400\, Ben_Tilly@​trepp.com wrote​:

While matching /Ala/ you are putting off the decision about whether you are doing the work to wind up matching Alaska or Alabama for 3 characters.

Thanks\, I think I understand your language now.

I interpreted you claim about only optimizing one character of next character discrimination as in this situation making the match look like /(?​:A(?​:laska|labama))|Indiana/ which is an improvement but not a big one.

Why? Now do the same with (?:laska|labama) etc.

Finally\, this is documented in 5.6.0 ;-) (if you do not consider fairy tales about "Backtracking" which worked as documentation substitutes before).

It was always documented... in the source code. Some forms of documentation are easier to read\, that is all.

Source code is good as documentation only until the next patch in the queue.

If it tried to deduce this from scratch\, yes. But if you aggressively pursue trieing first\, the internal representation of this RE should reduce down to a representation in which it is a trivial deduction.

Do not see how this would work.

I meant that were you to still attempt the optimization on RExen where \1 gets accessed\, but throw away the collected information every time \1 changes value\, then some performance disasters would be fixed.

An interesting idea indeed.

  "yyyx" =~ /y*y{2,3}x/

"yyyx" =~ /y*y{2\,}x/

Shows what?

In both of them the issue is that when you backtrack on the y* you will fail y{2\,} at first because y* matched all but none\, and all but 1\, and after that you should succeed but the optimization will stop you dead in your tracks.

You mean the second y​: probably you suppose that after

  yyyx
  112    # First two chars matched by y*, the third one by y in y{2,3}

the position 3 is marked as "Cannot match y in y{2\,3}"\, right? It would not\, since the "required" count 2 of y's is not fulfilled yet.

From your description I understood that if the RE took above some runtime threshold\, then you retried with the optimization on. The example /(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)/ will be sped up by the optimization from n^9 behaviour to n^2. While this is not as good as the linear split\, in many cases it is acceptable.

So\, does the optimization hit in this case?

There is no instrumenting of MEDIUM{n\,m} or SIMPLE{n\,m} for such an optimization. Just collecting statistics for the switchover might make things significantly slower.

There is a way​: use localization and DESTROY\, or DESTROY on $^R (which is autolocalized). It is not pretty\, but it is there. If a consistent usage mode appears\, one can make a syntactic sugar for this. Without proof that it is needed\, I do not want to make things harder.

AFAICS localization of variables is insufficient if your goal is to populate a parse tree.

Given DESTROY\, you have enough rope for anything you want.

Plus it is inefficient to produce and destroy stacks of variables when you could just increment and decrement a counter.

I do not think questions of efficiency should enter the discussion when we want to get a feeling for what is *desired* to be there.

This is an independent idea from (?>). Consider this​:

  /
  \G
  (
    (>                 # Hmmm...the shoe fits :-)
      "(?\/)(?:[^"]|"")*"
    |
      [^,\n]*
    )
  )
  (?=,|\n|$)
  /gx

This will match the next field in a .csv format. The point is that if the field starts with a "\, then it is a quoted field\, otherwise it is an unquoted field. How can I say that now?

Since I do not know what (?\/) is supposed to do\, it is very hard to guess. But why won't you put parens around the groups separately\, and check which one matched? And put (?>) around (?:[^"]|"")* and [^,\n]* ...

Ilya

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Ilya Zakharevich wrote​:

On Wed\, May 17\, 2000 at 07​:24​:12PM -0400\, Ben_Tilly@​trepp.com wrote​:

While matching /Ala/ you are putting off the decision about whether you are doing the work to wind up matching Alaska or Alabama for 3 characters.

Thanks\, I think I understand your language now.

:-)

I interpreted your claim about only optimizing one character of next-character discrimination as\, in this situation\, making the match look like /(?:A(?:laska|labama))|Indiana/\, which is an improvement but not a big one.

Why? Now do the same with (?:laska|labama) etc.

Now that you understand what I was thinking\, you can see why I was wondering exactly that.

OTOH now I am convinced it is not so simple\, when you open yourself up for this form of optimization\, there are a lot more that come to mind along the same lines...

Also you cannot just trie it arbitrarily far without some extra rules\, else trying to trie \s*\s* would be very dangerous. (I feel in my gut that while trieing the expression this should be recognizable as just \s*\, but I am not sure how to say the rule that implies that.)

Finally\, this is documented in 5.6.0 ;-) (if you do not consider fairy tales about "Backtracking" which worked as documenation substitutes before).

It was always documented..in the source-code. Some forms of documentation are easier to read\, that is all.

Source code is good as documentation only until the next patch in the queue.

True. But the documentation is always less accurate about what is currently true. :-)

If it tried to deduce this from scratch\, yes. But if you aggressively pursue trieing first\, the internal representation of this RE should reduce down to a representation in which it is a trivial deduction.

Do not see how this would work.

Now that I examine it in detail\, I was implicitly first applying a rather complex optimization. :-( If I can figure out how to verbalize it\, I will probably mention it\, but at the moment I cannot. :-( :-(

I meant that were you to still attempt the optimization on RExen where \1 gets accessed\, but throw away the collected information every time \1 changes value\, then some performance disasters would be fixed.

An interesting idea indeed.

Now that I think about it though\, any performance problems which had the cached RExen in place would slow down...

Does the "visited position" optimization only apply to positions where you have some sort of choice that could have been just made? If not then it should since this reduces the tracking work.

[...]

Shows what?

In both of them the issue is that when you backtrack on the y* you will fail y{2\,} at first because y* matched all but none\, and all but 1\, and after that you should succeed but the optimization will stop you dead in your tracks.

You mean the second y​: probably you suppose that after

  yyyx
  112    # First two chars matched by y*, the third one by y in y{2,3}

the position 3 is marked as "Cannot match y in y{2\,3}"\, right? It would not\, since the "required" count 2 of y's is not fulfilled yet.

Ah\, I misunderstood your visited position optimization rule. That is quite slick. :-)

From your description I understood that if the RE took above some runtime threshold\, then you retried with the optimization on. The example /(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)\,(.*)/ will be sped up by the optimization from n^9 behaviour to n^2. While this is not as good as the linear split\, in many cases it is acceptable.

So\, does the optimization hit in this case?

There is no instrumenting of MEDIUM{n\,m} or SIMPLE{n\,m} for such an optimization. Just collecting statistics for the switchover might make things significantly slower.

Thinking more about this example it gets very frustrating. I feel in my gut that this expression can be extremely nicely optimized by something like the trieing. You take the expression and rewrite it so that you have states where you are trying the first greedy match\, possibly are on the second or the first\, possibly any of the first 3\, etc. Essentially note the passing commas as you encounter them\, and put off the question of which .* you are trying to match when you encounter them as late as possible (ie the end of the line).

My thinking feels a little muddy on how you would want to do this\, but I am morally convinced that having a careful way of trieing would let you optimize many regular expressions to the most efficient way that people could do them. (eg In the above turn it into a scan for "\,"\, keeping track of where they are and how many they are until you reach "enough" (ie 8)\, and then just keep track of where they are so you can reconstruct the answer when you are done.)
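A sketch of that hand-optimized strategy (sample line is mine): scan for commas with index(), which is what the trie-style rewrite would reduce the n^9 pattern to.

```perl
# Collect comma positions in one linear pass, then reconstruct the nine
# greedy captures of /(.*),(.*),.../: the first .* grabs as much as it
# can, so the split points are the *last* eight commas.
my $line = "a,b,c,d,e,f,g,h,i";
my @commas;
my $p = -1;
push @commas, $p while ($p = index($line, ',', $p + 1)) >= 0;
die "need at least 8 commas" if @commas < 8;

my @fields;
my $start = 0;
for my $c (@commas[-8 .. -1]) {
    push @fields, substr($line, $start, $c - $start);
    $start = $c + 1;
}
push @fields, substr($line, $start);
print join("|", @fields), "\n";   # a|b|c|d|e|f|g|h|i
```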

[...]

Given DESTROY\, you have enough rope for anything you want.

You mean I can theoretically write it that way. I would not want to though...

[...]

This is an independent idea from (?>). Consider this​:

  /
  \G
  (
    (>                 # Hmmm...the shoe fits :-)
      "(?\/)(?:[^"]|"")*"
    |
      [^,\n]*
    )
  )
  (?=,|\n|$)
  /gx

This will match the next field in a .csv format. The point is that if the field starts with a "\, then it is a quoted field\, otherwise it is an unquoted field. How can I say that now?

Since I do not know what (?\/) is supposed to do\, it is very hard to guess. But why won't you put parens around the groups separately\, and check which one matched? And put (?>) around (?:[^"]|"")* and [^,\n]* ...

Note that with good trieing the above (>..) construct should be superfluous\, the optimizer should notice the fact that it applies without the hint. :-)

As for what it is supposed to do\, that is simple. The engine should never backtrack past (?\/). This is a point of no return. If the pattern reaches this point it either matches from this point forward or else it fails.

A regular expression engine notes points where it *might* make a decision. It has no facility to directly state that you *have finalized* that decision. Here\, for your reference\, is what a *.csv file should look like. It consists of rows of data. Each row consists of fields separated by commas\, and the row ends either with the end of the file or "\n". A field comes in two forms\, quoted and unquoted. If a field begins with a '"' it is a quoted field. A quoted field may contain embedded commas and newlines\, and may also contain '""'s to represent '"'s. It ends at the first unpaired '"'. An unquoted field may not start with '"'\, and extends to the next '\,'\, "\n"\, or the end of the file.

If you have a parsing engine of the form I described\, your work is divided into concrete components and stages. Generally you are searching for some sort of token\, and once you identify a token you will never turn back. In parsing a *.csv file properly you should never turn back after deciding that a field is quoted or unquoted\, after locating where a field finishes\, or after locating where a row finishes.
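A sketch of that staged tokenizer using \G and /gc (function name and sample record are mine; embedded newlines in quoted fields and malformed input are not handled): each branch commits to quoted vs. unquoted on the leading '"', and pos() never moves backwards.

```perl
# Tokenize one CSV line field by field; once a token is recognized,
# /gc keeps pos() where it is and we never turn back.
sub parse_csv_line {
    my ($line) = @_;
    my @fields;
    pos($line) = 0;
    while (1) {
        if ($line =~ /\G"((?:[^"]|"")*)"/gc) {   # quoted field
            (my $f = $1) =~ s/""/"/g;            # unpair the ""s
            push @fields, $f;
        }
        elsif ($line =~ /\G([^,\n]*)/gc) {       # unquoted field
            push @fields, $1;
        }
        last unless $line =~ /\G,/gc;            # comma => another field
    }
    return @fields;
}

print join("|", parse_csv_line(q{a,"b,""c""",d})), "\n";   # a|b,"c"|d
```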

Cheers\, Ben

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

On Thu\, May 18\, 2000 at 03​:44​:40PM -0400\, Ben_Tilly@​trepp.com wrote​:

Also you cannot just trie it arbitrarily far without some extra rules\, else trying to trie \s*\s* would be very dangerous.

Why? (Unless you suppose a wrong implementation. ;-) \s*\s*\w+ should be trie()ed as having a leading [\s\w]. As I said\, this is already implemented.

Source code is good as documentation only until the next patch in the queue.

True. But the documentation is always less accurate about what is currently true. :-)

Personally\, I do not care much about what is currently true (unless in a one-liner). I prefer to know when "If this breaks\, I would report it as a bug".

Does the "visited position" optimization only apply to positions where you have some sort of choice that could have been just made? If not then it should since this reduces the tracking work.

Only before 'H' in (?:HERE){n,m}.

Thinking more about this example it gets very frustrating. I feel in my gut that this expression can be extremely nicely optimized by something like the trieing.

It is called DFA. ;-)

Given DESTROY\, you have enough rope for anything you want.

You mean I can theoretically write it that way. I would not want to though...

Just write a custom REx engine (which see) which converts

  (?{ CODE1 }{ CODE2 })

to

  (?{ local $re::unwind = new re::unwind sub {CODE2}; CODE1 })

with properly defined re::unwind::new and re::unwind::DESTROY.

(
  (?> # Hmmm...the shoe fits :-)
    "(?\/)(?:[^"]|"")*"
   |
    [^,\n]*
  )
)
(?=,|\n|$)

Note that with good trieing the above (>..) construct should be superfluous\, the optimizer should notice the fact that it applies without the hint. :-)

No it would not. Note "" and ".

As for what it is supposed to do\, that is simple. The engine should never backtrack past (?\/). This is a point of no return.

I consider things which work in REX1 but would not work in

  WHATEVER (REX1) WHATEVER_ELSE

not that much useful. If you can design a syntax where you can specify the scope up to which backtracking is prohibited, then YES I WISH THIS TOO.

Ilya

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

On Thu\, May 18\, 2000 at 03​:44​:40PM -0400\, Ben_Tilly@​trepp.com wrote​:

Also you cannot just trie it arbitrarily far without some extra rules\, else trying to trie \s*\s* would be very dangerous.

Why? (Unless you suppose a wrong implementation. ;-) \s*\s*\w+ should be trie()ed as having a leading [\s\w]. As I said\, this is already implemented.

I was pointing out that in this case naive can easily be wrong. :-)

[...]

Personally\, I do not care much about what is currently true (unless in a one-liner). I prefer to know when "If this breaks\, I would report it as a bug".

Might I suggest changing that standard to\, "If this breaks\, Tom would report it as a bug"?

(Just teasing!)

Does the "visited position" optimization only apply to positions where you have some sort of choice that could have been just made? If not then it should since this reduces the tracking work.

Only before 'H' in (?:HERE){n,m}.

:-)

Thinking more about this example it gets very frustrating. I feel in my gut that this expression can be extremely nicely optimized by something like the trieing.

It is called DFA. ;-)

That is an interesting insight. If you take an RE and aggressively trie it\, you get something for that section that really is extremely similar to a DFA with the minor detail that it matches what an NFA does\, and it becomes natural to be able to keep a history. (As we discussed privately\, that is something that DFAs cannot easily do.)

I wonder if this idea can be somehow extended to make NFAs in practice nearly as reliable as DFAs? Somehow I suspect that\, modulo arbitrary limits on how big you will let an RE get\, you probably could get some amazing improvements.

Given DESTROY\, you have enough rope for anything you want.

You mean I can theoretically write it that way. I would not want to though...

Just write a custom REx engine (which see) which converts

(?{ CODE1 }{ CODE2 })

to

(?{ local $re::unwind = new re::unwind sub {CODE2}; CODE1 })

with properly defined re::unwind::new and re::unwind::DESTROY.

(
  (?> # Hmmm...the shoe fits :-)
    "(?\/)(?:[^"]|"")*"
   |
    [^,\n]*
  )
)
(?=,|\n|$)

Note that with good trieing the above (>..) construct should be superfluous\, the optimizer should notice the fact that it applies without the hint. :-)

No it would not. Note "" and ".

Check again. The anchored subpattern terminates either at a '"' or at something else that is not a "\," or "\n". The next thing afterwards is

As for what it is supposed to do\, that is simple. The engine should never backtrack past (?\/). This is a point of no return.

I consider things which work in REX1 but would not work in

WHATEVER (REX1) WHATEVER_ELSE

not that much useful. If you can design a syntax where you can specify the scope up to which backtracking is prohibited, then YES I WISH THIS TOO.

This might not meet even a camel's beauty standard\, but...

Imitate the same block labelling mechanism as in loop labels. Define a block some natural way. For instance independent subpatterns, compiled REs, or anything delimited by the (?:pattern) construct. Add optional labelled blocks that look like (?%LABEL:pattern). Then use (?\/) to lock down the current block, and (?\/LABEL) to lock down the current iteration of block LABEL.

Voila.

(Yes Tom\, lacking aesthetic taste is an occupational hazard among mathematicians...)

Ben

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

On Thu\, May 18\, 2000 at 05​:55​:24PM -0400\, Ben_Tilly@​trepp.com wrote​:

(
  (?> # Hmmm...the shoe fits :-)
    "(?\/)(?:[^"]|"")*"
   |
    [^,\n]*
  )
)
(?=,|\n|$)

Note that with good trieing the above (>..) construct should be superfluous\, the optimizer should notice the fact that it applies without the hint. :-)

No it would not. Note "" and ".

Check again. The anchored subpattern terminates either at a '"' or at something else that is not a "\," or "\n". The next thing afterwards is

You cannot omit (?>) in

  (?> (?:[^"]|"")* )"

Since it changes the semantics. Deeper lookahead on the larger REx would see that the change would not affect the larger thing, but I do not think we want to have a deep lookahead so early in the Perl life. ;-)
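The point is easy to see in runnable form (my own example, modern Perl, hypothetical input): on a string with a doubled quote but no real terminator, the plain pattern backtracks its way to one answer, while the (?>) version locks the star and ends up matching something else entirely, so an optimizer may not add (?>) blindly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Test string: opening quote, body 'a', a doubled quote, no terminator.
my $s = q{"a""};

my ($plain, $atomic) = ('no match', 'no match');
$plain  = $& if $s =~ /"(?:[^"]|"")*"/;      # backtracks: gives back the
                                             # doubled quote, matches "a"
$atomic = $& if $s =~ /"(?>(?:[^"]|"")*)"/;  # atomic: cannot give it back,
                                             # fails there, later matches ""

print "plain:  $plain\n";    # plain:  "a"
print "atomic: $atomic\n";   # atomic: ""
```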

This might not meet even a camel's beauty standard\, but... Imitate the same block labelling mechanism as in loop labels.

E2BIZZARE.

Ilya

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

(Yes Tom\, lacking aesthetic taste is an occupational hazard among mathematicians...)

To the contrary​: elegance and creativity are essential. See "The Art of Mathematics"\, by Jerry King.

--tom

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

(
  (?> # Hmmm...the shoe fits :-)
    "(?\/)(?:[^"]|"")*"
   |
    [^,\n]*
  )
)
(?=,|\n|$)

Note that with good trieing the above (>..) construct should be superfluous\, the optimizer should notice the fact that it applies without the hint. :-)

No it would not. Note "" and ".

Check again. The anchored subpattern terminates either at a '"' or at something else that is not a "\," or "\n". The next thing afterwards is

You cannot omit (?>) in

(?> (?:[^"]|"")* )"

Since it changes the semantics. Deeper lookahead on the larger REx would see that the change would not affect the larger thing, but I do not think we want to have a deep lookahead so early in the Perl life. ;-)

One of us is miscounting parens?

It should not take a particularly deep trieing to trie together the initial '"' of a '""' pair with the attempted match of '"(?=,|\n|$)', after which point you have implicitly found the optimization: there is now no point in the regular expression where your next character gives you a choice about what to do if you match. Ergo you will now reduce down to a single scan of the string.

This might not meet even a camel's beauty standard\, but... Imitate the same block labelling mechanism as in loop labels.

E2BIZZARE.

From you I will take that as a high compliment. :-)

Cheers\, Ben

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Tom Christiansen wrote​:

(Yes Tom\, lacking aesthetic taste is an occupational hazard among mathematicians...)

To the contrary​: elegance and creativity are essential. See "The Art of Mathematics"\, by Jerry King.

But mathematicians often find that the most elegant way to express themselves is to create obscure and precise languages in which they can directly express very obfuscated concepts...

Cheers\, Ben

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

On Wed\, May 17\, 2000 at 07​:24​:12PM -0400\, Ben_Tilly@​trepp.com wrote​:

2) (?\/) No backtracking past here.

This is (?>)\, but (?>) is better​: nobody could explain what the above sentence *actually means*. (?>) provides a scope to limit backtracking-prohibition to\, thus makes things defined.

This is an independent idea from (?>). Consider this​:

Apparently, you mean something like (?x: | (?{ die })) and putting the match inside eval {}?

Ilya

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Ilya Zakharevich wrote​:

On Wed\, May 17\, 2000 at 07​:24​:12PM -0400\, Ben_Tilly@​trepp.com wrote​:

2) (?\/) No backtracking past here.

This is (?>)\, but (?>) is better​: nobody could explain what the above sentence *actually means*. (?>) provides a scope to limit backtracking-prohibition to\, thus makes things defined.

This is an independent idea from (?>). Consider this​:

Apparently, you mean something like (?x: | (?{ die })) and putting the match inside eval {}?

Ick.

After some puzzling I realize that that achieves the effect.

But\, ick.

Ilya\, I have to hand it to you. I would never have come up with that on my own. Be careful what you ask for and all that...

Cheers\, Ben
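For the record, the idiom under discussion runs today; a sketch of my own (modern Perl; a literal (?{ }) block needs no `use re 'eval'`, and the pattern and inputs are hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Once the opening quote is seen, falling through to the (?{ die })
# alternative means the quoted string can never terminate, so the die
# aborts the entire match attempt; eval {} catches it outside.
sub quoted_ok {
    my ($str) = @_;
    my $ok = eval {
        $str =~ /"(?:(?:[^"]|"")*"|(?{ die "unterminated\n" }))/;
    };
    return defined $ok ? $ok : 0;
}

print quoted_ok(q{say "hi" now}) ? "match\n" : "no match\n";  # match
print quoted_ok(q{say "hi now})  ? "match\n" : "no match\n";  # no match
```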

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

On Fri\, May 26\, 2000 at 08​:57​:37AM -0400\, Ben_Tilly@​trepp.com wrote​:

On Wed\, May 17\, 2000 at 07​:24​:12PM -0400\, Ben_Tilly@​trepp.com wrote​:

2) (?\/) No backtracking past here.

Apparently, you mean something like (?x: | (?{ die })) and putting the match inside eval {}?

Ick.

After some puzzling I realize that that achieves the effect.

But\, ick.

Ilya\, I have to hand it to you. I would never have come up with that on my own. Be careful what you ask for and all that...

Well\, I'm not sure that it would work as wanted from inside

@​matches = /$REx/g;

But if *this* is the semantics you want, we can at least think about whether we want to make it work as expected etc.

Btw\, as I said\, the main problem with a scalable approach is how to specify up to which level to stop backtracking. But we already have *some* structure to name groups\, witness $2\, $33 etc. So one can have

  (... ( ... \g{!2} ... ) ...)

which would make the group 2 fail.

Additionally\, I'm thinking about whether we want an escape to disallow forward-tracking up until the given point. Say\,

  / (
      " [\"]* " \g! \g{!1} (?!)  # Do not retry until this point
    |
      /\* .*? \*/
    )
  /

would match comments in the C file without ''-literals, #-directives, and backwacked " inside strings. Here \g! is the hypothetical "do not retry the whole match until this point".

Ilya

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Ilya Zakharevich wrote:

On Fri\, May 26\, 2000 at 08​:57​:37AM -0400\, Ben_Tilly@​trepp.com wrote​:

On Wed\, May 17\, 2000 at 07​:24​:12PM -0400\, Ben_Tilly@​trepp.com wrote​: [...] But if *this* is the semantic you want\, we can at least think whether we want to make work as expected etc.

After seeing that die construct\, I am getting more dubious by the minute...

Btw\, as I said\, the main problem with a scalable approach is how to specify up to which level to stop backtracking. But we already have *some* structure to name groups\, witness $2\, $33 etc. So one can have

(... ( ... \g{!2} ... ) ...)

which would make the group 2 fail.

Hmmm...

That brings up one absolutely huge request.

$yyyymmdd =~ /(?$year: \d{4} )
              (?$month: \d\d )
              (?$day: \d\d )/x;

That both gives you better naming on groups\, as well as giving a nice way to handle referring to named references. (After all the first thing that people usually do with $1\, etc is to drop it into a named variable for later use...)
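A later-Perl footnote: this request is essentially what Perl 5.10 eventually shipped as named captures, populating %+ rather than lexicals directly; the assignment into your own variables stays an explicit list assignment. A sketch (sample date is hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Perl 5.10+: (?<name>...) stores its capture in %+, and a hash slice
# of %+ moves the values into ordinary lexicals in one statement.
my $yyyymmdd = '20000526';

my ($year, $month, $day);
if ($yyyymmdd =~ /^(?<year>\d{4}) (?<month>\d\d) (?<day>\d\d)$/x) {
    ($year, $month, $day) = @+{qw(year month day)};
}
print "$year-$month-$day\n";   # 2000-05-26
```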

Additionally\, I'm thinking about whether we want an escape to disallow forward-tracking up until the given point. Say\,

/ (
    " [\"]* " \g! \g{!1} (?!)  # Do not retry until this point
  |
    /\* .*? \*/
  )
/

would match comments in the C file without ''-literals, #-directives, and backwacked " inside strings. Here \g! is the hypothetical "do not retry the whole match until this point".

OK\, now you have *me* confused...

Cheers\, Ben

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

On Fri\, May 26\, 2000 at 12​:51​:29PM -0400\, Ben_Tilly@​trepp.com wrote​:

That brings up one absolutely huge request.

$yyyymmdd =~ /(?$year: \d{4} ) (?$month: \d\d ) (?$day: \d\d )/x;

I do not think we want this. You are using the symbol table as a hash. I discussed the approach I prefer​:

  $yyyymmdd =~ /(?{year}: \d{4} )
                (?{month}: \d\d )
                (?{day}: \d\d )/x;

which would set $^R{year} etc. Or $^R{date}{year} if it is inside another (?{date}​:) group. Plus additionally there should be a way to replace %^R by a different hash​:

  (?%RESULT​: )

Plus a way to put *all* the matches of GROUP* into an anonymous array​:

  (?[]*​: GROUP)*

Plus a way to put groups inside this one into an anonymous array

  (?[]​: PATTERN )

Ilya

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

[I'm branching different questions on this thread.]

On Fri\, May 26\, 2000 at 12​:51​:29PM -0400\, Ben_Tilly@​trepp.com wrote​:

/ (
    " [\"]* " \g! \g{!1} (?!)  # Do not retry until this point
  |
    /\* .*? \*/
  )
/

would match comments in the C file without ''-literals, #-directives, and backwacked " inside strings. Here \g! is the hypothetical "do not retry the whole match until this point".

OK\, now you have *me* confused...

How to look for C comments? Of course\, the FAQ answer is completely wrong\, since you cannot look for comments separately from all the other constructs. Your REx should recognize char literals\, string literals\, and (possibly) preprocessor constructs.

But when you found a literal\, you need

  a) to fail;

  b) before failing to tell the REx engine that what is currently in   $& cannot contain the start of a valid match\, so the engine   should not retry looking for a match until "this" point.

a) is done by (?!)\, b) is done by \g!.

\g{!1} is just an optimization telling that you cannot match / if you matched ", so one should not retry anything in the group 1. This optimization can be made automatic if I suddenly decide I want to continue doing anything with Perl -- or if somebody else is ready to do it.

Ilya
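A later-Perl footnote: this pair of escapes is essentially what arrived in Perl 5.10 as the backtracking-control verbs, where (*FAIL) plays the role of (?!) and (*SKIP) plays the role of \g! (never retry a match start inside what was just consumed). A runnable sketch with a hypothetical C fragment:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Find C comments while refusing to restart a match inside a string
# literal: the literal branch matches, then (*SKIP)(*FAIL) fails and
# forbids retrying any position within the consumed literal.
my $c = q{x = "/* not a comment */"; /* real comment */};

my @comments;
while ($c =~ m{ " (?:[^"\\]|\\.)* " (*SKIP)(*FAIL)  # skip string literals
              | /\* .*? \*/                         # then take real comments
              }xsg) {
    push @comments, $&;
}
print "@comments\n";   # /* real comment */
```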

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

On May 26\, Ilya Zakharevich said​:

   "  [\"]* " \g! \g{!1} (?!)  # Do not retry until this point

b) before failing to tell the REx engine that what is currently in $& cannot contain the start of a valid match\, so the engine should not retry looking for a match until "this" point.

a) is done by (?!)\, b) is done by \g!.

So -- correct me if I'm wrong (and I'm sure that would be done) -- \g! is kind of like placing where \G would match\, if you were doing /\G.../g?

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

On Fri\, May 26\, 2000 at 02​:03​:41PM -0400\, Jeff Pinyan wrote​:

On May 26\, Ilya Zakharevich said​:

   "  [\"]* " \g! \g{!1} (?!)  # Do not retry until this point

b) before failing to tell the REx engine that what is currently in $& cannot contain the start of a valid match\, so the engine should not retry looking for a match until "this" point.

a) is done by (?!)\, b) is done by \g!.

So -- correct me if I'm wrong (and I'm sure that would be done) -- \g! is kind of like placing where \G would match\, if you were doing /\G.../g?

A goold (gold/good) point. Set pos() here. Yes\, indeed.

Hmm, we need more: do not unset it in the case of backtracking. So in fact it *is* different than what you propose.

What you propose is closer to two other escapes \g< and \g> we badly need. They would start/end what is considered to be $& of the match. As I wrote it before:

  / ab \g< cd \g> ef /x

is equivalent to the current

  / (?<= ab ) cd (?= ef ) /x

but can be used in many other situations as well.

So what you propose is basically \g>. But we *want* \g< and \g> to be reset by backtracking, and it is clear that \g! should behave differently...

Ilya
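A later-Perl footnote: Perl 5.10's \K escape delivers the \g< half of this proposal (it restarts $& at the current position), while ordinary lookahead still covers the \g> half. A small runnable example of my own:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# / ab \K cd (?= ef ) /x behaves like the proposed / ab \g< cd \g> ef /x:
# $& holds only 'cd', so a substitution leaves the context untouched.
my $s = 'xxabcdefyy';

$s =~ /ab\Kcd(?=ef)/;
print "matched: $&\n";            # matched: cd

(my $t = $s) =~ s/ab\Kcd(?=ef)/CD/;
print "after s///: $t\n";         # xxabCDefyy
```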

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Ilya Zakharevich wrote​:

On Fri\, May 26\, 2000 at 12​:51​:29PM -0400\, Ben_Tilly@​trepp.com wrote​:

That brings up one absolutely huge request.

$yyyymmdd =~ /(?$year: \d{4} ) (?$month: \d\d ) (?$day: \d\d )/x;

I do not think we want this. You are using the symbol table as a hash. I discussed the approach I prefer​:

No\, I would want this to work with whatever variables I want. Lexical or global.

$yyyymmdd =~ /(?{year}: \d{4} ) (?{month}: \d\d ) (?{day}: \d\d )/x;

which would set $^R{year} etc. Or $^R{date}{year} if it is inside another (?{date}​:) group. Plus additionally there should be a way to replace %^R by a different hash​:

While slightly nicer than the current\, I would find myself using the good old​:

  $yyyymmdd =~ /(\d{4})(\d\d)(\d\d)/;
  my ($year, $month, $day) = ($1, $2, $3);

and cursing and swearing if I ever want to get this or that block\, depending upon how it matched.

If I am using $year\, $month\, and $day in my code\, those are going to be the variables that I want the RE to populate. Populating a more complex structure is good\, but you should be able to choose what structure that is...

(?%RESULT​: )

Plus a way to put *all* the matches of GROUP* into an anonymous array​:

(?[]*​: GROUP)*

Plus a way to put groups inside this one into an anonymous array

(?[]​: PATTERN )

Hmmm...

What if I wanted an array of arrays\, for each match of the larger group\, everything in the smaller?

What if I wanted an array of hashes\, for each match of the larger group\, a hash of the smaller?

etc.

There is clearly a need for some sort of generic "use this RE to extract structured data into a native Perl data structure" but the need and syntax need careful thought.

Here is a proposal. Run through the RE. Once you are through the RE and understand the match\, then add an optional phase where you run through and apply structured actions. Say use a (?~..) type construct to indicate that this action is tied to the position of said string\, once the match is completed. Then with a bit of syntax you could decide to accept the data as a post-match bit of cleanup​:

(?~$year: \d{4})

(?~{push @year}: \d{4})

The first is good enough for most basic needs. The second is flexible enough to handle extracting any kind of data structure that you want in virtually any form.

Cheers\, Ben

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

On Fri\, May 26\, 2000 at 02​:49​:26PM -0400\, Ben_Tilly@​trepp.com wrote​:

Plus a way to put *all* the matches of GROUP* into an anonymous array​:

(?[]*​: GROUP)*

Plus a way to put groups inside this one into an anonymous array

(?[]​: PATTERN )

What if I wanted an array of arrays\, for each match of the larger group\, everything in the smaller?

Put one such group inside another.

What if I wanted an array of hashes\, for each match of the larger group\, a hash of the smaller?

Likewise.

Ilya

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

On Fri\, May 26\, 2000 at 02​:49​:26PM -0400\, Ben_Tilly@​trepp.com wrote​:

Here is a proposal. Run through the RE. Once you are through the RE and understand the match\, then add an optional phase where you run through and apply structured actions. Say use a (?~..) type construct to indicate that this action is tied to the position of said string\, once the match is completed. Then with a bit of syntax you could decide to accept the data as a post-match bit of cleanup​:

(?~$year​: \d{4})

(?~{push @​year}​: \d{4})

The first is good enough for most basic needs. The second is flexible enough to handle extracting any kind of data structure that you want in virtually any form.

You can do it now with (?{}). If you want syntactic sugar\, use a custom REx engine.

Ilya

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Ilya Zakharevich wrote​:

[I'm branching different questions on this thread.]

On Fri\, May 26\, 2000 at 12​:51​:29PM -0400\, Ben_Tilly@​trepp.com wrote​:

/ (
    " [\"]* " \g! \g{!1} (?!)  # Do not retry until this point
  |
    /\* .*? \*/
  )
/

would match comments in the C file without ''-literals, #-directives, and backwacked " inside strings. Here \g! is the hypothetical "do not retry the whole match until this point".

OK\, now you have *me* confused...

How to look for C comments? Of course\, the FAQ answer is completely wrong\, since you cannot look for comments separately from all the other constructs. Your REx should recognize char literals\, string literals\, and (possibly) preprocessor constructs.

But when you found a literal\, you need

a) to fail;

b) before failing to tell the REx engine that what is currently in $& cannot contain the start of a valid match\, so the engine should not retry looking for a match until "this" point.

a) is done by (?!)\, b) is done by \g!.

*lightbulbs go off*

Gotcha.

(?!) fails because it is the negative of an trivially true assertion.

\g! is a note to the RE that this is where it should restart trying the match.

\g{!1} is just an optimization telling that you cannot match / if you matched ", so one should not retry anything in the group 1. This optimization can be made automatic if I suddenly decide I want to continue doing anything with Perl -- or if somebody else is ready to do it.

Please?

BTW I suspect that a careful job of next-character discrimination along the lines that I came up with earlier would make both of those optimizations automatic. At the least the syntactical hint should not be added until you know that said optimization cannot be made to happen automatically.

Cheers\, Ben

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Ilya Zakharevich wrote​:

On Fri\, May 26\, 2000 at 02​:49​:26PM -0400\, Ben_Tilly@​trepp.com wrote​:

Here is a proposal. Run through the RE. Once you are through the RE and understand the match\, then add an optional phase where you run through and apply structured actions. Say use a (?~..) type construct to indicate that this action is tied to the position of said string\, once the match is completed. Then with a bit of syntax you could decide to accept the data as a post-match bit of cleanup​:

(?~$year​: \d{4})

(?~{push @​year}​: \d{4})

The first is good enough for most basic needs. The second is flexible enough to handle extracting any kind of data structure that you want in virtually any form.

You can do it now with (?{}). If you want syntactic sugar\, use a custom REx engine.

Yes\, but doing so during the match is considerably more complex\, and considerably slower\, than it would be to do so when the match is actually done.

Nevermind. I can live with $1\, $2\, $3.

I will try some playing with custom REx engines though.

Cheers\, Ben

p5pRT commented 23 years ago

From @RandalSchwartz

"Ben" == Ben Tilly <Ben_Tilly@trepp.com> writes:

Ben> $yyyymmdd =~ /(\d{4})(\d\d)(\d\d)/;
Ben> my ($year, $month, $day) = ($1, $2, $3);

Just for the record, this is dangerous code. Never use $1 unless you have checked the success of the match you think has set it. Because if it hasn't, you are getting the *previous* match.

*You* may know that\, but your code surely didn't\, so I judge you on the code you wrote. :)

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

On Fri\, May 26\, 2000 at 12​:49​:05PM -0700\, "Randal L. Schwartz" wrote​:

"Ben" == Ben Tilly <Ben_Tilly@trepp.com> writes:
Ben> $yyyymmdd =~ /(\d{4})(\d\d)(\d\d)/;
Ben> my ($year, $month, $day) = ($1, $2, $3);
Just for the record, this is dangerous code. Never use $1 unless you have checked the success of the match you think has set it. Because if it hasn't, you are getting the *previous* match.

That's why we have​:

  my($year, $month, $day) = ($yyyymmdd =~ /^(\d{4})(\d\d)(\d\d)$/)
    or die "Date could not be parsed: $yyyymmdd\n";

:-)

mark

-- markm@nortelnetworks.com / mark@mielke.cc / markm@ncf.ca | SIR Tools (7H12) | Nortel Networks | Ottawa, Ontario, Canada

  One ring to rule them all\, one ring to find them\, one ring to bring them all   and in the darkness bind them...

  http​://mark.mielke.cc/

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Randal Schwartz wrote​:

"Ben" == Ben Tilly <Ben_Tilly@trepp.com> writes:

Ben> $yyyymmdd =~ /(\d{4})(\d\d)(\d\d)/;
Ben> my ($year, $month, $day) = ($1, $2, $3);

Just for the record\, this is dangerous code. Never use $1 unles you have checked the success of the match you think has set it. Because if it hasn't\, you are getting the *previous* match.

Only sometimes. :-)

$1\, etc are localized so in many cases you may be sure that you are not getting a previous match. Likewise if you precede that by a trivially successful match\, then you know they are not populated. And conversely you may know from the code that you will never fail to match.

*You* may know that\, but your code surely didn't\, so I judge you on the code you wrote. :)

You are right that I was being lazy in the worst possible way in my example. Normally I do some sort of error check\, but I was being lazy. (The wrong kind of lazy.)

Cheers\, Ben

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

At 04​:26 PM 5/26/00 -0400\, Ben_Tilly@​trepp.com wrote​:

$1\, etc are localized

No they aren't. They're mutant pseudo-scalars attached to the underlying optree\, and you may well get a $1 from the last time you matched that regex. (I've got a trivial program that shows the problem with a recursive call)

Don't Do That. :)

  Dan

--------------------------------------"it's like this"------------------- Dan Sugalski even samurai dan@​sidhe.org have teddy bears and even   teddy bears get drunk

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

On Fri\, May 26\, 2000 at 03​:09​:48PM -0400\, Ben_Tilly@​trepp.com wrote​:

You can do it now with (?{}). If you want syntactic sugar\, use a custom REx engine.

Yes\, but doing so during the match is considerably more complex\, and considerably slower\, than it would be to do so when the match is actually done.

Just make your assignments the last thing in the pattern.

Ilya

p5pRT commented 19 years ago

From @smpeters

Using the code with the substitution that never terminates​:

steve@kirk sandbox $ perl subs.pl
Starting
Finished

It appears that this has been fixed by 5.8.5 or earlier.

p5pRT commented 19 years ago

@smpeters - Status changed from 'open' to 'resolved'