Raku / doc

🦋 Raku documentation
https://docs.raku.org/
Artistic License 2.0
287 stars 291 forks source link

Second backtracking example is comprehensivly wrong? #4369

Open codesections opened 1 year ago

codesections commented 1 year ago

The second example in the Backtracking section of the Regex page appears to be entirely wrong. I'm referring to the example that starts:

It is worth noting that disabling backtracking will not prevent the engine to try several ways to match the regular expression.

This example provides code that, according to the text, will result in only two iterations (without :ratchet). Running that code, however, results in 30 iterations. Additionally, the text incorrectly describes how the match proceeds by saying that it "will match against the rightmost word SQL and go forward from there" – which it won't. I cannot tell what the example is trying to communicate. It appears to be ~6 years old; maybe it was documenting old/buggy behavior? I am inclined to favor deleting this second example entirely rather than trying to salvage it.

However, since it's fairly long and well-written, I thought I'd check if anyone else knows what it's trying to do and/or how to get it to do so.

pmichaud commented 1 year ago

I can't speak for the rest of the example (yet), but the author misreads the result of the second example entirely and perhaps that confuses the entire answer.

The reason the example cited fails to match is not because there's not a second "SQL" to be matched, but rather because the first SQL that got matched is now in $0 instead of $1. The author went from these prior examples:

my $string = 'PostgreSQL is an SQL database!';
say $string ~~ /(.+)(SQL) (.+) $1/;      # OUTPUT: «「PostgreSQL is an SQL」␤» 
say $string ~~ / :r (.+)(SQL) (.+) $1/;  # OUTPUT: «Nil␤» 

to this one

my $string = 'PostgreSQL is an SQL database!';
say $string ~~ / (SQL) (.+) $1 /; # OUTPUT: «Nil␤» 

and assumes that the failed match in the last one is because of backtracking, or that the first capture grabbed the second SQL, and other incorrect reasoning.

The reason the match fails in this last example is because the (SQL) capture now appears in $0 instead of $1 as it did in the previous examples. If you correct the backreference everything works as expected, including backtracking:

[0] > my $string = 'PostgreSQL is an SQL database!';
PostgreSQL is an SQL database!
[1] > say $string ~~ / (SQL) (.+) $0 /;
「SQL is an SQL」
 0 => 「SQL」
 1 => 「 is an 」
[1] > 

Because of that fundamental misunderstanding, that the $1 in these later examples is now attempting to match the contents of (.+) instead of (SQL), the analysis and interpretation given just doesn't make any sense.

I agree this example needs to be revisited, and likely just removed, because it's just not an accurate description of what's happening in the regex match. The results have little to do with backtracking ratcheting, and more to do with a pattern that couldn't possibly match with the given input.

Pm

pmichaud commented 1 year ago

I can now speak for more of the example. My observation that the later examples use $1 where a match against (SQL) (in $0) was intended remains true, but I now see where the analysis given is incorrect.

For the code that appears after "It is worth noting that disabling backtracking...", it appears that the author is trying to set up an example where :r doesn't prevent backtracking. The first part is to show results for a regex that doesn't have :r and try to explain it, but the explanation is wrong and that sets an incorrect stage for what follows.

For the code

my $string = 'PostgreSQL is an SQL database!';
say $string ~~ / (SQL) (.+) $1 /; # OUTPUT: «Nil␤» 

the author attributes the the match to "there's no specification for a character before the word SQL", but it's actually the $1 that is controlling things here. The match proceeds by finding and capturing the SQL of "PostgreSQL" into $0, then captures the remainder of the string (" is an SQL database!") into $1. When the author writes "Since there is no repetition of SQL remaining", they are misinterpeting the result. The $1 is going to attempt to match the (.+) capture that is in $1, not the (SQL) capture that is in $0.

Since there's no remaining string left to match, the attempt to match $1 (" is an SQL database!") fails and the $1 backtracks one character to try again. This keeps happening without any match (the string isn't duplicated anywhere), so ultimately the (.+) capture fails and the regex backtracks to where the first SQL was captured in $0.

Unanchored regexes using the slash delimiters have an implicit .*? at the beginning. When the match fails with the $0 set to the first SQL (in "PostgreSQL"), the implicit .*? backtracks until it has consumed all of the characters up to the second instance of SQL in the string. The author claims the regex gets there because of "no specification for a character before the SQL", but that's misleading. It's really that the first SQL failed to complete a match and so now we've backtracked to the second one. Once this second (standalone) instance of SQL is placed in $0, the regex again matches the remainder of the string (" database!") into $1 and attempts to find a second copy to match against. This fails again, backtracking the $1 capture one character at a time down to nothing, at which point it fails this iteration of the regex. The unanchored search at the beginning of the string against scans for another SQL and not finding one fails the match.

I don't know how the author got the claimed output when show-captures is used. It should have shown repeated iterations with the $1 capture getting progressively smaller until the match backtracks on SQL and then ultimately fails. This is what OP claims happened for them, and when I try it I also get 30 iterations, exactly as OP claims. I think the author of this section must have had some other error somewhere that caused them to get the output they wrote in the documentation.

Now, having set up this example (and incorrectly interpreted how the machine processed it), the author then tries the same example adding a :r modifier, sees the same output, and claims that it shows that backtracking still occurs. There is indeed some backtracking going on, but the result is in fact very different and not at all what the author concludes.

With the :r modifier added the regex becomes

my $string = 'PostgreSQL is an SQL database!';
say $string ~~ / :r (SQL) (.+) $1 /; # OUTPUT: «Nil␤» 

If we run this code we get two iterations instead of 30, just one for each time that (SQL) is matched in the string. Note that this shows that :r indeed does modify the backtracking, because instead of 30 iterations we now have 2. The (.+) capture no longer tries all possibilities -- it grabs as many characters as it can and sends it on to the next component to match, and if that doesn't work, the (.+) capture doesn't try any other shorter possibilities.

But, I hear people saying "if the regex immediately fails and :r prevents backtracking, why are there two iterations? Shouldn't there be just one?"

Remember that implicit .*? at the beginning of an unanchored regex match? The :r flag doesn't apply to that implicit search -- it still "backtracks". So, after the match with $0 matching the first SQL fails, the implicit search "backtracks" until $0 matches the second SQL of the target string. When that try fails (with no backtracking taking place for $1 because of :r), there are no more SQLs left and the overall match fails.

So, the :r to prevent backtracking is in fact changing things, but the author didn't notice what had changed. The failed match without :r results in 30 invocations of show-captures, while the one with :r results in 2. Yes, there's some "backtracking" that takes place even with ratchet enabled inside the regex... in this case it's the implicit .*? of the regex that backtracks as it isn't affected by the :r flag.

Beyond that, the :ratchet modifier doesn't actually mean "no backtracking anywhere" -- it just means that quantifiers and alternatives it affects will try things until the first time they succeed, at which point they discard any remaining alternatives to try. The :ratchet just acts as if a : is placed after any quantifier, alternative, or subrule that doesn't have its own modifier to force backtracking.

If you want to fail an unanchored regex on the first time it matches SQL and not backtrack to try additional instances, I believe the way to do that is by using <!> in an alternative to fail a match altogether. I'm rusty on my regex notation, and I can't find <!> in the docs, so hopefully someone can confirm or correct this.

At any rate, if the point of the latter part of that section is to point out that unanchored regexes will still backtrack on the initial string position when searching for a match... it's probably easier to just say that directly and not try to demonstrate it with an example. (Much less an incorrectly presented and interpreted one.)

Hope this helps,

Pm

pmichaud commented 1 year ago

I was incorrect about <!> failing a match altogether, all it does is cause the current regex match path to fail (forcing the engine to start any available backtracking).

The syntax I'm remembering is the triple-colon :::, which appears in roast but all of the tests seem to be marked "skip" as NYI. for rakudo. With the colon, a regex would be able to fail with something like

$string ~~ / (SQL) ::: (.*) $0 /

which means that once the regex matches SQL, you complete a match or fail without trying a later instance of SQL in the string. Again, roast suggests this is NYI in rakudo, and attempting to use it in a regex also gives a NYI error message.

Pm