Raku / problem-solving

🦋 Problem Solving, a repo for handling problems that require review, deliberation and possibly debate
Artistic License 2.0
69 stars 16 forks source link

Should :sigspace enable backtracking on the atom preceeding <.ws>? #390

Open codesections opened 9 months ago

codesections commented 9 months ago

This Problem Solving issue is an attempt to reach consensus about the behavior described in rakudo/rakudo#3726 and discussed further in raku/doc#4356; see also a related StackOverflow post. Note, however, that those issues discuss the problem as relating to rules – in reality, it relates to :sigspace more generally.

The problem is that :sigspace does not currently behave the way that it's described in the documentation or in S05). Specifically, :sigspace currently allows backtracking over the atom preceding the (implicit) <ws> even when :ratchet is enabled. Thus, the following all match:

'a' ~~ regex {:r:s .* a };
'a' ~~ token {:s   .* a };
'a' ~~ rule  {     .* a };

Whereas none of them match if we replace :sigspace with explicit <ws> calls (or with <!ww> \s* calls):

'a' ~~ regex {:r  .*[<!ww> \s*]a};
'a' ~~ regex {:r  .*<.ws>a};

'a' ~~ token {    .*<.ws>a};
'a' ~~ token {    .*[<!ww> \s*]a};

'a' ~~ rule  {:!s .*<.ws>a};
'a' ~~ rule  {:!s .*[<!ww> \s*]a};

The reason that this is a Problem Solving issue is to resolve whether this is ① a Rakudo bug that should be fixed in the normal course; ② a Rakudo bug that has been in place for long enough that fixing it risks breaking code, so a fix should wait for a change in language version, or ③ either not a bug or a bug that's so deeply entrenched that we don't want to fix it and should instead update the docs/Roast to explain and pin down its exact behavior.

codesections commented 9 months ago

After looking into this a bit more, I believe that the following is a correct description of how :sigspace modifies the behavior of the whitespace it applies to:

Each relevant whitespace character is replaced with :!<.ws> – where :! is the "force greedy backtracking" modifier disused in S05. Further, by default <.ws> is defined as token ws { <!ww>\s* }; thus, by default each whitespace is replaced by :! [:ratchet <!ww> \s* ] (with the :ratchet coming automatically from <ws> being a token).

If that's correct – and I'd appreciate anyone telling me differently – then the next question is whether this behavior is intended/appropriate. If so, it needs to be documented; if not, fixed. I'm inclined to believe that this behavior is appropriate: I can imagine it being very easy for a atom to accidentally include part of the whitespace and, without backtracking, to prevent a match. Using :! seems like a fine compromise – or, at the very least, not worth making breaking changes to fix. But please feel free to push back.

If we do reach consensus that the current behavior is correct (or at least correct enough to be worth keeping), here are the next steps (in no particular order):

pmichaud commented 9 months ago

I'm very "out of the loop" with things, so please feel free to ignore this if it doesn't make sense.

It seems like people have accurately diagnosed that :sigspace is disabling ratcheting behavior on the previous atom, forcing backtracking where it should be ratcheting. I have trouble believing that's a deliberate design choice anywhere -- it feels more like something that crept in by accident and should be removed.

It wasn't originally intended to be that way in the beginning. Originally :sigspace was just "replace whitespace with <.ws> subrule calls", and didn't change anything else. In all of the work being done on NQP, MOAR, and the regex engine in the 2013-2014 timeframe, it wouldn't surprise me much if a bug crept in that caused :sigspace to suppress application of a :ratchet flag, and nobody noticed.

If it's intentional that :sigspace turns on backtracking for the previous atom, then there should be some tests for that specific behavior in roast somewhere. I very much hope it's not intentional.

Before going down to the level of "let's document that :sigspace is turning on greediness for the previous atom", it would be better to confirm that this is in fact an intentional behavior, and not just the one that happens to be there because of some unrecognized bug for many years. At least when we started Perl 6/Raku, we consciously decided that the language would be defined by its test suite, and not by what any particular implementation (e.g. Rakudo) did in response to a given set of statements.

Pm

codesections commented 9 months ago

After looking into this issue in more depth, I'm fairly confident both that :sigspace's current behavior is a bug and that it's a bug that we can fix without too much disruption. Here's the evidence I've gathered:

So all of that has me convinced that it's a bug. That only leaves the question of whether it's so fundamental that fixing it would break an unacceptable amount of code (either in Rakudo itself or in the ecosystem more broadly). I'm inclined to say "no" – especially since the change to RakuAST will likely involve some of that sort of pain anyway.

But I'll leave this issue open at least for now in case anyone has a different perspective.

jubilatious1 commented 7 months ago

Possibly irrelevant, but is this related to your issue (@codesections ) filed regarding LookArounds (which I can't find atm)?

jubilatious1 commented 1 month ago

Should :sigspace enable backtracking on the atom preceeding <.ws>?

The question is really, "What constitutes an atom when :sigspace is enabled"?

If one wants to have whitespace respected without enabling :sigspace, one simply quotes it:

 ~ % raku
Welcome to Rakudo™ v2023.05.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2023.05.

To exit type 'exit' or '^D'
[0] > "photo shop" ~~ m/"photo shop"/;
「photo shop」
[1] > "photo shop" ~~ m:sigspace/photo shop/;
「photo shop」
[2] > "photo shop" ~~ m/ :sigspace photo shop/;
「photo shop」

If you only want `:sigspace to apply to part of a match, place the adverb inside the matcher:

[2] > "photo shop" ~~ m/ :sigspace photo shop/;
「photo shop」
[3] > "photo shop" ~~ m/ passport :sigspace photo shop/;
False
[4] > "photo shop" ~~ m/ pass port :sigspace photo shop/;
Potential difficulties:
    Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
    ------> "photo shop" ~~ m/ pass⏏ port :sigspace photo shop/;

Now, get the additional (one word) passport to match:

[5] > "photo shop" ~~ m:sigspace/ pass port :sigspace photo shop/;
False
[6] > "passport photo shop" ~~ m:sigspace/ pass port :sigspace photo shop/;
False
[7] > "passport photo shop" ~~ m:sigspace/ passport :sigspace photo shop/;
「passport photo shop」

Now get the additional (two words) passport to match:

[8] > "pass port photo shop" ~~ m/ pass port :sigspace photo shop/;
Potential difficulties:
    Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
    ------> "pass port photo shop" ~~ m/ pass⏏ port :sigspace photo shop/;
False
[9] > "pass port photo shop" ~~ m:sigspace/ pass port :sigspace photo shop/;
「pass port photo shop」
[10] > "pass port 1234 shop" ~~ m:sigspace/ pass port :sigspace \d+ shop/;
「pass port 1234 shop」
jubilatious1 commented 1 month ago

(Cont'd) Now look at backtracking:

[2] > "motif" ~~ m/ or | if /;
「if」
[3] > "motif" ~~ m/ \w+ or | if /;
「if」
[4] > "motif" ~~ m:r/ \w+ or | if /;
「if」
[5] > "motif" ~~ m:r/ \w+ [or | if] /;
False
[6] > "motif" ~~ m/ \w+ [or | if] /;
「motif」
[7] > "motif" ~~ m/:sigspace \w+ [or | if] /;
False
[8] > "motif" ~~ m/ \w+ if /;
「motif」
[9] > "motif" ~~ m:r/ \w+ if /;
False
[10] > "motif" ~~ m:r:s/ \w+ if /;
False
[11] > "motif" ~~ m:r:s/ [\w+ if] /;
False
[12] > "motif" ~~ m:r:s/ [\w+if] /;
False
[13] > "motif" ~~ m:r:s/ \w+if /;
False
[14] > "motif" ~~ m:s/ \w+if /;
「motif」
[15] > "motif" ~~ m:s/ \w+?if /;
「motif」

Rakudo 2023.05

jubilatious1 commented 1 month ago

(cont'd)

[16] > "money penny" ~~ m:s/ \w+ /;
「money 」
[17] > "money penny" ~~ m:s/ \w+ \w/;
「money p」
[18] > "money penny" ~~ m:s/ \w+ \w**5/;
「money penny」
[19] > "money penny" ~~ m:s/ \w \w**5/;
「y penny」
[20] > "money penny" ~~ m/ \w \w**5/;
False
[21] > "money penny" ~~ m:s/ \w \w**5/;
「y penny」
[22] > "money penny" ~~ m:s/ \w \w**6/;
False
[23] > "money penny" ~~ m:s/ \w**6/;
False
[24] > "money penny" ~~ m/ \w \w**6/;
False
[25] > "money penny" ~~ m:r/ \w \w**6/;
False
[26] > "money penny" ~~ m:s/ \w \w**6/;
False
[27] > "money penny" ~~ m:s:r/ \w \w**6/;
False
[28] > "money penny" ~~ m:s:r/ \w \w**5/;
「y penny」
[29] >

Rakudo 2023.05

Not sure I've hit the issue in question. Things seem to behave like I would expect. Still confused as to how to add <.ws> token to custom character classes.

raiph commented 1 month ago

@jubilatious1

(Your first recent comment is marked "duplicate". What's that about?)

You appear to have misunderstood the issue. I will try to recap it in a helpful way.

Should :sigspace enable backtracking on the atom preceding <.ws>?

The question is really, "What constitutes an atom when :sigspace is enabled"?

That's a different question, one that seems to have led you far off topic.

The discovery that led to this issue being opened is demonstrable for trivial patterns:

say 'a' ~~ rule  { .* a } # 「a」

That matches.

The views expressed by every core dev who has commented on this, including pm above (pm is the original architect of Rakudo, and the person who came up with the idea that turned into :sigspace, and wrote the original code, and was complimented by Larry Wall on the elegance of their idea), and everything written in the design docs, and anyone seeking simplicity, has been consistent with the Expected column in the following table.

(While the above example documents the problem in golfed form, it makes analysis complicated. To simplify I will use plain regexes, anchored patterns with no spaces between the anchors (except for the one :sigspace row) and comment on matching and backtracking behaviors.)

Assume $_ = 'a';:

Code Expected Actual How it matched at first Did it "backtrack"?
/ :r ^.*a$ / Nil Nil .* greedily matched a No
/ :r ^.*<ws>a$ / Nil Nil .* greedily matched a No
/ :r ^.*!a$ / 「a」 「a」 .*! greedily matched a Yes
/ :r ^.*?a$ / 「a」 「a」 .*? frugally matched a Yes (forward tracked)
/ :r ^.*!<ws>a$ / 「a」 「a」 .*! greedily matched a Yes
/ :r:s ^.* a$ / Nil 「a」 .* greedily matched a Yes

The behavior of the pattern in the last row (for any string it's matching) should be the same as the pattern in the second row but instead it is the same as the behavior of the pattern in the second to last row.

This was an enormous shock to my system when @ElectricCoffee first brought it to our attention in 2020. Perhaps due to the stresses of covid in my life at the time, I recall it being an emotional roller coaster. How could such a basic aspect of the regex engine defy expectations based on any and all available documentation: the user doc, which I knew not to rely on, but also the design doc, which was by definition how it was supposed to behave, and was generally adhered to with precision? How could this have escaped notice for so many years? Would changing it break swathes of code (a terrible thing)? Or would it turn out to be negligible, or somewhere in between? Could getting rid of unintended backtracking provide a big speed boost (a great thing)? I'm grateful to @codesections for writing up this calm and carefully conceived issue.

Please read @codesections' last comment, which I think nicely summarizes the current situation. I find the fact that :sigspace does not enable backtracking on the atom preceding <.ws> in RakuAST to be very promising. It's not been needed there, and no one spotted the difference. Sounds like a great sign to me. But I don't want to prematurely get my hopes up high to have them dashed later. Whether changing this behavior turns out to break a lot of code or not, and/or speed up a lot of code or not, we would be well advised to approach the question of what to do about this problem with care.

Hth.

jubilatious1 commented 1 month ago

It's going to take a long time for me to understand this issue.

#Rakudo 2023.05

say  "________no adverbs:\n";

say 'a'  ~~ /  ^\S* a$ /; #「a」
say 'a ' ~~ /  ^\S* a$ /; #Nil
say ' a' ~~ /  ^\S* a$ /; #Nil
say 'a'  ~~ /  ^\S+ a$ /; #Nil
say 'a ' ~~ /  ^\S+ a$ /; #Nil
say ' a' ~~ /  ^\S+ a$ /; #Nil

say  "________:sigspace\n";

say 'a'  ~~ / :s ^\S* a$ /; #「a」
say 'a ' ~~ / :s ^\S* a$ /; #Nil
say ' a' ~~ / :s ^\S* a$ /; #「 a」
say 'a'  ~~ / :s ^\S+ a$ /; #Nil
say 'a ' ~~ / :s ^\S+ a$ /; #Nil
say ' a' ~~ / :s ^\S+ a$ /; #Nil

say  "________:ratchet:sigspace\n";

say 'a'  ~~ / :r:s ^\S* a$ /; #「a」
say 'a ' ~~ / :r:s ^\S* a$ /; #Nil
say ' a' ~~ / :r:s ^\S* a$ /; #「 a」
say 'a'  ~~ / :r:s ^\S+ a$ /; #Nil
say 'a ' ~~ / :r:s ^\S+ a$ /; #Nil
say ' a' ~~ / :r:s ^\S+ a$ /; #Nil

say  "________:ratchet:sigspace\n";

say 'a'  ~~ / :r:s ^\S* <ws>a$ /; #「a」 ws => 「」
say 'a ' ~~ / :r:s ^\S* <ws>a$ /; #Nil
say ' a' ~~ / :r:s ^\S* <ws>a$ /; #「 a」 ws => 「」
say 'a'  ~~ / :r:s ^\S+ <ws>a$ /; #Nil
say 'a ' ~~ / :r:s ^\S+ <ws>a$ /; #Nil
say ' a' ~~ / :r:s ^\S+ <ws>a$ /; #Nil

Continuing:

#Rakudo 2023.05

say 'a'  ~~ / :r:s ^\d* a$ /; #「a」
say 'a ' ~~ / :r:s ^\d* a$ /; #Nil
say ' a' ~~ / :r:s ^\d* a$ /; #「 a」
say 'a'  ~~ / :r:s ^\d+ a$ /; #Nil
say 'a ' ~~ / :r:s ^\d+ a$ /; #Nil
say ' a' ~~ / :r:s ^\d+ a$ /; #Nil

Of course, I defer to experts on this matter such as @b2gills who has helped me with regexes in the past (no disrespect to those already contributing to this conversation).