
Error describing <ws>; it is not the same as <!ww>\s* #4356

Open deoac opened 1 year ago

deoac commented 1 year ago

On Regexes > Predefined Regexes, the page says:

<ws> | no | Whitespace, same as: <!ww>\s*

This is not quite true. As nemokosch explains, \s can backtrack, but <ws> cannot. Here's an example to demonstrate the difference:

say "a\n" ~~ / 'a' <!ww>\s* \n /   #Output 「a
                                   #       」   
say "a\n" ~~ / 'a' <.ws> \n /      #Output Nil
raiph commented 1 year ago

TL;DR ws is a token. Maybe just change "rules" to "tokens": "Besides the built-in character classes, the following other tokens are built into Raku:".


I just tested, and it looks like both the non-zerowidth rules in the doc table are tokens. (For the zerowidth ones it doesn't matter).

Consider:

my regex ws { 4+ }
my regex ident { 4 \w+ }

say '44indent8' ~~ / <ws> <ident> 8 /

displays:

「44indent8」
 ws => 「4」
 ident => 「4indent」

If either (or both) of the user defined regexes are commented out, or changed to tokens, the match fails.

So the ws rule/regex is presumably defined as a token:

token ws { <!ww> \s* }

Thus it doesn't backtrack: quantifiers don't backtrack by default when :ratchet is in effect, so the \s* doesn't backtrack. Likewise the built-in ident rule.
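To tie that back to the opening example, here's a minimal sketch (tws and rws are made-up names). It assumes, as the 44indent8 experiment above suggests, that Rakudo backtracks into regex-declared subrules but not token-declared ones:

# same body, different declarators
my token tws { <!ww> \s* }   # :ratchet implied; \s* keeps the \n
my regex rws { <!ww> \s* }   # backtracking; \s* can give the \n back

say "a\n" ~~ / 'a' <tws> \n /;   # Nil, like the built-in <.ws>
say "a\n" ~~ / 'a' <rws> \n /;   # matches, like bare <!ww>\s*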

The simplest doc edit I suggest is changing "rules" to "tokens": "Besides the built-in character classes, the following other tokens are built into Raku:".

deoac commented 1 year ago

Hi raiph!

Your explanation that <ws> is a token, and hence doesn't backtrack, was very clear and helpful.

Thanks,

Shimon

2colours commented 1 year ago

Well, "rules" also don't backtrack, if we can trust the documentation with this: https://docs.raku.org/language/grammars#Rules

Actually, I couldn't tell by only what we see whether they are implemented as tokens or rules. Perhaps this one?

Either way, I think the most apparent problem is not what we call it but the fact that it is not the same as the snippet given as "the same". I think it should be checked if there is a similar snippet that is indeed the same, and if it starts to get complex, it's better to just describe the behavior.

raiph commented 1 year ago

Well, "rules" also don't backtrack, if we can trust the documentation with this: https://docs.raku.org/language/grammars#Rules

Yeah. Unfortunately you can't trust it. And, more to the point, it's wrong.

rule has backtracked in Rakudos since 2014.01.

I keep forgetting that I haven't yet widely reported/discussed that.

See https://github.com/rakudo/rakudo/issues/3726.

Actually, I couldn't tell by only what we see whether they are implemented as tokens or rules.

Imo it's not important whether they're implemented as tokens or rules, just what their behavior is best described as.

And while the word "token" and keyword token have unfortunate ambiguities, they aren't pertinent in any way I can see here, and the unfortunate ambiguities that "rule" and rule have, and especially the current Rakudo behavior of rules, make "rules" an especially poor pick for describing what these rules (tokens) do.

Perhaps this one?

No. That's the _ws that's called by the ws that's part of the grammar for parsing Raku code.

The ws we want to document for end users is the one that gets invoked by writing <ws> or writing significant space when :sigspace is in effect in a user defined grammar.
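As a minimal sketch of what "significant space" means (the grammar name is made up):

# under :sigspace (as in a rule), whitespace after an atom
# becomes a call to <.ws>
grammar Sketchy {
    rule  TOP  { <word> <word> }   # the spaces here invoke <.ws>
    token word { \w+ }
}
say Sketchy.parse("foo  bar");     # matches, word => 「foo」 and 「bar」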

And, as explained in my related detailed write up of this stuff, that's in NQP, presumably here.

Either way, I think the most apparent problem is not what we call it but the fact that it is not the same as the snippet given as "the same".

I think that, as far as documenting it for an ordinary user is concerned, it should be referred to as the same as the snippet given as "the same", and the key thing that needs to change is to refer to it as a "token" or token.

I have gone down the rabbit hole of analyzing the code for ws like I did for ident in the SO I linked above, but need to call it quits on that for about a week, other than the following summary.

The ws I've linked has this comment:

# skip over any whitespace, fail if between two word chars

Because the whitespace matching ("skip") consumes characters and the other ("between") doesn't, the fact that the comment mentions them in the opposite order from the snippet is technically immaterial, even though it's disconcerting. That is to say, these are logically the same:

<!ww> \s*
\s* <!ww>

Imo it's easier to understand written as <!ww> \s*. But YMMV.

Beyond the comment I've explored around the code a bit, but have concluded it's beyond my paygrade to draw any conclusions beyond that I get the impression the code does what the comment says it does, and that that matches the snippet.

I think it should be checked if there is a similar snippet that is indeed the same, and if it starts to get complex, it's better to just describe the behavior.

I think it's best to do both, but that what matters for closing this ticket is just to refer to the "rules" as "tokens" instead.

For the snippet, I think the two options are:

<!ww> \s*
\s* <!ww>

and I favor the first.
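A quick sanity check that the two orderings agree on simple inputs:

for 'ab', 'a b', "a\tb" -> $s {
    say so($s ~~ / a <!ww> \s* b /) == so($s ~~ / a \s* <!ww> b /);  # True each time
}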


I've got some heavy RL commitments until next week so probably won't be able to follow up further till then.

2colours commented 1 year ago

Yeah. Unfortunately you can't trust it. And, more to the point, it's wrong.

rule has backtracked in Rakudos since 2014.01.

I keep forgetting that I haven't yet widely reported/discussed that.

See rakudo/rakudo#3726.

Well, to me, the linked issue, and the linked discussions in that issue, rather sound like this is simply a bug, moreover a bug that doesn't make rules consistently backtrack, only under certain circumstances. So rather than "fixing" the documentation, what should be noted, if at all, is that Rakudo is buggy with rules. I don't think that has any implications on the terminology.

No. That's the _ws that's called by the ws that's part of the grammar for parsing Raku code.

The ws we want to document for end users is the one that gets invoked by writing <ws> or writing significant space when :sigspace is in effect in a user defined grammar.

And, as explained in my related detailed write up of this stuff, that's in NQP, presumably here.

Given my subjective impression on Rakudo's architecture, even being aware that the ws defined here is for parsing Raku itself, it wouldn't have surprised me if the named tokens simply slipped through from an arbitrary Raku-related grammar Rakudo has... I wouldn't assume common sense with these things, to be honest.

But anyway, thank you for clarification. I checked several NQP source files but didn't think of that one.

I think it's best to do both, but that what matters for closing this ticket is just to refer to the "rules" as "tokens" instead.

Well, what can I say. You haven't convinced me at all about the second part of the sentence. If someone just changed "rules" to "tokens" and called it a day, I would genuinely open another issue because it doesn't address the problem that is even stated in this one.

<ws> is not the same as <!ww> \s*, period. One may say something like "<ws> could be defined as token ws { <!ww> \s* }", that would be accurate at least, and the user could at least get the idea that the word "token" bears special meaning and significance here; that it's not "this simple".

In general, it would be great to set up fewer traps for users in the documentation as well. If somebody already opened an issue for two pieces of syntax inside a regex not being the same, and they are indeed not the same, it's better to not claim that they are the same, with an overly subtle disclaimer. I'm not saying this is deliberate but I think there are loads of communication patterns in the explanations users get from the Raku community that are most similar to gaslighting. Let's be more open in the communication! Let's show the connection, let's maybe offer a pseudo-implementation and let's emphasize the significance of the word, whether that word is "rule" or "token". (I'm fine with "token", especially if "rules" are broken.)

codesections commented 1 year ago

I just opened Raku/problem-solving#390 to help resolve the uncertainty around how rules should behave.

Please note that, as I mention in that issue, the problem is not actually with rules specifically but rather with :sigspace more generally.

raiph commented 1 year ago

Well, to me, the linked issue, and the linked discussions in that issue, rather sound like this is simply a bug

Sure. It's definitely an issue. And I have mentally labelled it as a potential bug for 3 years, and increasingly over that time as preferably a bug. And I hope it's simple.

So it sounds like we're in agreement on that, though I think I'm much less convinced than you about it being simple. (Not so much simple to change the behavior, which is likely simple, but simple to deal with the consequences, which could easily take years to unravel.)

But it seemed to me you thought the doc was right, and I thought it appropriate to let you know it wasn't, and might need to be changed, but definitely needed to not be trusted on this point until we resolve the issue.

moreover a bug that doesn't make rules consistently backtrack, only under certain circumstances.

Yes, but the "certain circumstances" are any time an atom prior to a significant space has a quantifier.
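For concreteness, the shape in question looks like this (a sketch only; the grammar is made up, and whether the \w+ backtracks here is exactly what the Rakudo issue is about):

grammar Sketch {
    # \w+ is a quantified atom directly before significant space,
    # which this rule turns into an implicit <.ws> call
    rule TOP { \w+ ';' }
}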

I haven't attempted an analysis of code in the wild, but my guess is that more than 90% of all grammars that have ever been written, including one line grammars, contain at least one instance of this, and more than 10% contain hundreds of instances.

And each such instance can utterly transform parsing, not only from failing to succeeding or vice-versa, but often from one parse result to a different one, with the change happening silently, and being impossible to determine through any static analysis. The only practical thing to do is compare results over corpora of code pushed through test parses before and after the hoped-for "simple" bug fix. And then I expect problems. And sorting out those problems may take years.

So rather than "fixing" the documentation, what should be noted, if at all, is that Rakudo is buggy with rules.

Agreed we can note that Rakudo is buggy if you wish. I think that goes without saying, but if it floats your boat to say so, go for it. Agreed too that we can be more specific, and mention rules, though for me the word "rules" has always meant methods/rules/tokens/regexes, exactly as was explicitly the case for the specs, so it is ambiguous in a way that's useful if used with awareness of that ambiguity, and a deep problem if used or read without awareness of it. Agreed too that we can be even more specific, and mention rules (though for me the key issue is :sigspace, which is what rules enable). And agreed that we could note nothing. This last was always my view, so if that's your view, then we fully agree.

I don't think that has any implications on the terminology.

Agreed about that too. I guess I wish I never mentioned it. But you said something to the effect of trusting the doc about something it was wrong about. So I brought it up.

What does have an implication on the terminology relevant to this issue is that ws is a token in terms of its behavior. And calling it a "rule" is unfortunate, because the word is ambiguous, and that ambiguity was central to @deoac filing this issue.

The fact is, there is no error in describing <ws>. The fact is it is the same as <!ww> \s*. And @deoac has (presumably) agreed.

So we could perhaps just close this issue.

But if we just closed this issue with zero action, we wouldn't be dealing with the underlying problem that led to it being filed: the ambiguity of "rule" led someone to think <!ww> \s* backtracked. They ignored the relevance of whether the pattern is a regex, token, or rule, and assumed \s* backtracks, when in fact it doesn't, provided it's in the scope of a :ratchet, which is, for good reasons, associated with token (because :ratchet is what tokens enable). So referring to it (and the other items in the table) as a "token", or better still, token, is, imo, what's needed.

All of which leads me back to my original conclusion: make the tiny judicious edit: s/other rules/C<tokens>/. Job done.

Given my subjective impression on Rakudo's architecture, even being aware that the ws defined here is for parsing Raku itself, it wouldn't have surprised me if the named tokens simply slipped through from an arbitrary Raku-related grammar Rakudo has... I wouldn't assume common sense with these things, to be honest.

I wouldn't assume anything, but, having watched the design and architecture and code evolve over two decades, and especially closely for a decade, it would have surprised the heck out of me because those named tokens were designed before their implementation, and the Raku ws is very clearly a custom user defined ws.

(It mentions things like "unspace", which is use of \ in code, specifically Raku code, whereas ws is used in parsing just about anything, eg parsing bioinformatics data. "unspace" would be a million miles out of place. And we're talking about ws. Imagine if that just accidentally had "unspace" stuff in it about 2 decades after it was first specified, when it's just about the most important rule in the entirety of the Raku design!)

But anyway, thank you for clarification. I checked several NQP source files but didn't think of that one.

Right. I only worked out where things were at when I wrote the SO I linked. I think I spent something like 20 hours on it.

what matters for closing this ticket is just to refer to the "rules" as "tokens" instead.

Bingo.

Oh! I thought you had written that. :)

Well, what can I say. You haven't convinced me at all about the second part of the sentence. If someone just changed "rules" to "tokens" and called it a day, I would genuinely open another issue because it doesn't address the problem that is even stated in this one.

Right. The correct resolution is to call them tokens.

<ws> is not the same as <!ww> \s*, period.

I think it is. I agree it's been written in nqp to speed it up. But that's an optimization. We should be documenting behavior, not implementation details, and especially not optimization, and we should be using the lie-to-children that "same" refers to an abstraction you can 100% rely on rather than documenting 15 lines of nqp code that I still didn't understand an hour after I first tried to understand it.

One may say something like "<ws> could be defined as token ws { <!ww> \s* }", that would be accurate at least, and the user could at least get the idea that the word "token" bears special meaning and significance here; that it's not "this simple".

But it is that simple. If we take on board what you're saying, the entirety of all the docs are all lies.

In general, it would be great to set up fewer traps for users in the documentation as well.

Yes. More generally I largely hate the doc, even though I respect the intentions of the many that have poured their hearts into it.

If somebody already opened an issue for two pieces of syntax inside a regex not being the same, and they are indeed not the same, it's better to not claim that they are the same, with an overly subtle disclaimer.

We disagree about whether they are the same. And I provided no disclaimer. I merely suggested changing "other rules" to "C<tokens>". The doc had been fine for years, with no one being "trapped" by what it said. When @deoac posted to say there was an issue, I accepted that, and drilled down to the key change that would have headed off filing an issue about it.

(Someone would have said to themselves "Hang on, it says "tokens". What's that about? Ah, :ratchet. What's that about? Ah, it switches backtracking off. Ah, OK." End of confusion. No need to file an issue.)

I'm not saying this is deliberate but I think there are loads of communication patterns in the explanations users get from the Raku community that are most similar to gaslighting.

I absolutely agree.

Let's be more open in the communication! Let's show the connection, let's maybe offer a pseudo-implementation and let's emphasize the significance of the word, whether that word is "rule" or "token". (I'm fine with "token", especially if "rules" are broken.)

I'd prefer token, unless it's obvious from context that it unambiguously means token.

But imo making ws complicated is gaslighting. It's really dead simple. If the regex engine's current position is between two word characters, then it doesn't match. If it does match, it consumes all contiguous whitespace. And it doesn't backtrack -- it's a token. End of story.

codesections commented 1 year ago

@raiph wrote:

All of which leads me back to my original conclusion: make the tiny judicious edit: s/other rules/C<tokens>/. Job done.

@2colours wrote: <ws> is not the same as <!ww> \s*

I agree that describing <ws> as "the same as" <!ww> \s* is misleading – or at least potentially confusing. When $thing is described as "the same as" $other-thing, I usually think that they can be freely substituted, which isn't the case here.

However, with a tiny extension to @raiph's judicious edit, I think we can avoid the confusion: I propose replacing "Whitespace, same as: <!ww>\s*" with "Whitespace, same as: [:ratchet <!ww>\s*]". (Plus a parallel edit for ident.)
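Checked against the example that opened this issue, the proposed entry behaves like <.ws> (a quick sketch):

say "a\n" ~~ / 'a' [:ratchet <!ww>\s*] \n /;   # Nil, matching <.ws>
say "a\n" ~~ / 'a' <!ww>\s* \n /;              # matches: the current entry's behavior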

Would that solve this issue?

pmichaud commented 1 year ago

I've been semi-following this issue in the background, but it feels like it's gone all wonky.

My suggestion is to just say that the default <ws> is the same as <!ww>\s*: and leave it at that. Then it doesn't matter if the reader is thinking rule or token or regex. The behavior is "correct" in all of those contexts. You don't need to get into the vagaries of :ratchet in the middle of regex slang to figure out what it does.
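Checked against the opening example, the trailing colon does the job (a quick sketch):

say "a\n" ~~ / 'a' <!ww> \s*: \n /;   # Nil, same as <.ws>
say "a\n" ~~ / 'a' <!ww> \s*  \n /;   # matches, because \s* backtracks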

FWIW, I think it's recursive/self-referencing (and incorrect) to think of <ws> as rule ws { <!ww> \s* } , so let's avoid that from now on.

My apologies if this suggestion misses some key point that was made earlier in the thread.

Pm

pmichaud commented 1 year ago

Also, I would agree with @raiph that whenever <ws> is mentioned, it's better to call it a "token" than a "rule". It doesn't need to be pedantically done throughout the documentation, but anytime the docs or someone can say "token" instead of "rule" and still have the correct concept come through it's a level of precision that will greatly improve the documentation, imo.

I also think I understand some of the confusion. The common idiom for formal grammars has historically been that they consist of "rules" as the fundamental element; this idiom existed long before Raku was a thing. Even Wikipedia says "A formal grammar is a set of rules..." In Raku, grammars are really classes with methods written using regex syntax, some of which are "tokens" and most of which are often "rules" in a ratchet/sigspace sense. When describing Raku to someone supposedly having a background in formal grammars, we'll want to use "rule" in its more generic "it describes a syntax" sense rather than the Raku "a regex with ratchet and sigspace enabled" sense.

IMO one of Raku's greatest innovations was recognizing that what everyone traditionally called "rules" and "tokens" in the formal grammar world could actually be specified as regular expressions with certain features enabled by default. Prior to my encounters with Raku (or "Perl 6" as it was known then), almost all parser tools and language descriptions I worked with used one syntax/system to describe tokenization, another syntax/system to describe grammar rules, and handwaved whitespace issues by English descriptions or custom code to handle the special cases. Raku provided a common syntax and description language that cleanly encompassed all three.

To me, Raku's meaning for rule most closely approximates what we used to call "rules" in Backus-Naur form and other formal language syntax notations, so we shouldn't try to change Raku terminology, or even consider it a mistake in the original language design. Coming from a Perl 5 world, it seemed natural to treat regexes as fundamental and look at token/rule as specializations of that. But for an audience of people coming from a formal grammar background, it might've been better to conceptualize it the other way -- a "token" is a rule where whitespace in the syntax isn't significant, and a "regex" is a token where backtracking is enabled by default. (I do think Raku is correct by treating regex as fundamental.)

Even if it had been different, we'd still have the issue that some "rules" (regexes) in a grammar have special features, and whether we choose "rule", "regex", or "method" as the term for its fundamental building block, we still have to handle describing to a new user that something like <ws> is a "token" and not strictly either a "rule" or a "regex".

Thanks for letting me add a (potentially very wrong) perspective to the discussion.

Pm

pmichaud commented 1 year ago

Lastly, and apologies for the third post:

I'm the one that coined <ww>, and it was explicitly coined to make default <ws> behavior easier to describe by using the existing regex syntax. Before <ww> existed it was always kind of difficult to explain how <ws> actually worked.

But behind the scenes <ws> never actually used the <ww> subrule directly, and I suspect it still doesn't. For efficiency reasons <ws> was a hard-coded method that directly checked word characters and consumed whitespace as needed; it didn't make an actual subrule call for <!ww>. When parsing something, the <.ws> subrule tends to get called a lot (like, oh, when parsing Raku source programs), so you want that subrule call to be as absolutely efficient as you can possibly make it.

I spent a long time trying to describe <ws> behavior in regex syntax using only the existing built-in constructs at the time, and I never found a satisfactory answer. Once I came up with <ww> as a new pseudo-builtin it was easy though, and Larry and the rest of the design team ultimately accepted and adopted it. <!ww>\s* describes exactly what we want -- "Fail if between two word characters, otherwise consume any optional whitespace". (It turns out the prose description is also much easier if you think of "between two word characters" as a built-in of some sort... all of the prior English descriptions we had for "optional whitespace except where it isn't" were also either long or convoluted or handwavy.) And when we used <!ww>\s* as a description of <ws> it was always as a token, so ratcheting was assumed. I don't know where that got lost.
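That prose description translates directly into quick checks (a sketch, using the stock <ws>):

say so 'ab'  ~~ / a <.ws> b /;            # False: between two word chars
say so 'a b' ~~ / a <.ws> b /;            # True: consumes the space
say so 'a.b' ~~ / a <.ws> '.' <.ws> b /;  # True: zero-width beside punctuation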

Today, given this ticket, I would definitely describe <ws> as <!ww>\s*: to make the ratchet on the whitespace explicit. (It's also possible that we didn't have the colon modifier fully defined on quantifiers at that time, although I think we did.)

If anyone comes up with a definition for <ws> that doesn't use something like a <ww> subrule, perhaps that'd be a better description for documenting it now. Back then we didn't have the << or >> word boundary built-ins in the language yet, so maybe defining <ws> can be easier now with current built-ins than with what we had defined then. I haven't thought it through deeply.

Regardless, I ultimately believe the default <ws> needs to be a ratcheting expression (a token) and not backtrack by default.

Lastly, even though <ws> didn't actually use <ww> in its implementation due to performance reasons, I did go ahead and implement a built-in <ww> subrule anyway. I did this because (1) it's much easier to do that than to explain that <ww> was primarily intended to succinctly explain what <ws> did without actually using it, and (2) someone creating a custom :sigspace token might want to have a built-in like <ww> readily available, since it's apparently relatively hard to create it using just basic regex syntax.

Pm

2colours commented 1 year ago

But it seemed to me you thought the doc was right, and I thought it appropriate to let you know it wasn't, and might need to be changed, but definitely needed to not be trusted on this point until we resolve the issue.

I still think the doc is right, and Rakudo is wrong. I'm not going to delve into Roast again, nor do I demand that you do, but as long as it's not proven that rules were specified to backtrack, under pretty much unspecified circumstances, I won't deem it a doc issue. Why would it be different from any of the other Rakudo bugs?

I haven't attempted an analysis of code in the wild, but my guess is that more than 90% of all grammars that have ever been written, including one line grammars, contain at least one instance of this, and more than 10% contain hundreds of instances.

The irony is that I picked up this habit of calling Raku a "90% language" whenever I'm criticizing its utterly non-exhaustive design. I can only assume this mentality is coming straight from Perl. Regardless, I have never seen a language where it was okay if something worked 90% of the time, and whenever users made it to the remaining 10%, there was even a magic phrase to invalidate the use case (DIHWIDT, sometimes used abusively).

I think this mentality should go, and after that, eventually the language should move away from hardcoded defaults that are cute 90% of the time and completely ruin the remaining 10%. So please, don't normalize something working one way an undefined 90% of the time and another way the remaining times. If it should backtrack, it should always backtrack, and we should document that. If it shouldn't backtrack, then we are back to a serious bug.

(...) And sorting out those problems may take years.

It doesn't matter. Severity and consequences don't make a bug a feature. If it takes long, that's all the more reason to state that it's buggy.

But you said something to the effect of trusting the doc about something it was wrong about.

Frankly, I'm gonna repeat it as many times as you make that reference: the doc was not wrong about it. Rakudo is wrong about it. In an ideal world where Rakudo doesn't have monopoly over the project, it shouldn't even matter. And I don't know why you thought it added to the discussion in the first place if you even knew that this was a Rakudo bug all along.

What does have an implication on the terminology relevant to this issue is that ws is a token in terms of its behavior. And calling it a "rule" is unfortunate, because the word is ambiguous, and that ambiguity was central to @deoac filing this issue.

I don't think the docs read as using "rule" as a term of art in that sentence, rather than just a rule as in a rule of thumb or a rule to follow. Moreover, it seemed to me that you proposed to use the word "token" under similarly vague conditions, but that's beside the point.

The fact is, there is no error in describing <ws>. The fact is it is the same as <!ww> \s*. And @deoac has (presumably) agreed.

I think this is so far-fetched that it's basically untrue. Somebody who uses the documentation to understand things got the behavior explained and was happy about that. This is a perfectly normal reaction for a user: they want an explanation rather than a fix for the documentation. Hell, actually, I was the one who proposed opening an issue! If it matters so much for the essence of the problem, I'm going to just open a new one.

And I don't know, I frankly cannot get around your usage of "the same". As @codesections said, if they are "the same", it's the least to be expected that they are substitutable for each other. <!ww> \s* is not a token, for starters. Of course they are not substitutable for each other when they aren't even the same category of things.
It seems to me that you expect the user to resolve this apparent contradiction routinely in favor of the word "token", like the users should automatically go from "hm, it cannot be the same and a token at once" to "okay, I get it, it's as if you defined a token with that regex". Why are you so sure this happens? Why do you think their eyes won't catch "the same as (...)" more than the disclaimer somewhere on top about being a token, or won't trust this (wrong) statement more?

it would have surprised the heck out of me because those named tokens were designed before their implementation, and the Raku ws is very clearly a custom user defined ws.

Point is, existing methods slip through grammars and can interfere with tokens you define there. I couldn't be very surprised if there was some global mishmash in the core as well.

We disagree about whether they are the same. And I provided no disclaimer. I merely suggested changing "other rules" to "C<tokens>". The doc had been fine for years, with no one being "trapped" by what it said. When @deoac posted to say there was an issue, I accepted that, and drilled down to the key change that would have headed off filing an issue about it.

It's really hard this way, basically every sentence is a serious obstacle to what I'm trying to get across.

Whether two things are the same or not shouldn't be a matter of opinion; I cannot accept your standpoint on that. I think it's factually proven at this point, back and forth, that they are not the same.

I would seriously advise against drawing positive (as in, actively stating) conclusions from a lack of feedback. We don't know how many people have read that, with what background, and what their reaction was. It could even be that they also thought of free substitution (what "the same" really implies) and just haven't hit a case where it matters. It could be that they had a similarly bad opinion of the docs as you have, and ignored the whole thing.

And as I already said, if it makes things clearer: treat this issue as my issue. A user was baffled about behavior based on what the docs said and it was probably nighttime so I asked them to open a doc issue, rather than doing a PR right away. I'm thankful for catching this issue which I think stays with us, whether the OP cares about it or not.

But imo making ws complicated is gaslighting. It's really dead simple. If the regex engine's current position is between two word characters, then it doesn't match. If it does match, it consumes all contiguous whitespace. And it doesn't backtrack -- it's a token. End of story.

Excellent. This could even be written down, instead of claiming it is "the same as" something that isn't even a token.

2colours commented 1 year ago

My own proposal would still be something like:

Edit: I forgot the proposal of @codesections ; that's also fine by me.

2colours commented 1 year ago

My suggestion is to just say that the default <ws> is the same as <!ww>\s*: and leave it at that.

@pmichaud

I had to read all your messages, then go back to these snippets, confirm they all contain the colon and even mention it... Aaand it is even documented, apparently... https://docs.raku.org/language/regexes#Preventing_backtracking:_:

I have never seen this over the years, neither in code nor in the documentation, and the biggest problem is probably the usual one: I would never have imagined something like this exists, let alone with a colon. I have the vague impression that this never got proposed because the others also never thought of it, or, worst case, genuinely weren't aware of it.

I think it is a great idea actually to use it here but at the moment this colon seems so arcane that we should perhaps deliberately direct some attention to it for the time being, even if by a lame didactic "please mind the colon modifier of the quantifiers".

codesections commented 1 year ago

Yeah, : is under-documented and :? and :! (which I didn't know about until looking into this issue) are entirely undocumented. I'll open a separate issue for them.

2colours commented 1 year ago

By the way, https://docs.raku.org/language/grammars#ws even states the wrong thing explicitly: with regex, instead of token...

pmichaud commented 1 year ago

I have never seen this over the years, neither in code nor in the documentation, and the biggest problem is probably the usual one: I would never have imagined something like this exists, let alone with a colon. I have the vague impression that this never got proposed because the others also never thought of it, or, worst case, genuinely weren't aware of it.

I think it is a great idea actually to use it here but at the moment this colon seems so arcane that we should perhaps deliberately direct some attention to it for the time being, even if by a lame didactic "please mind the colon modifier of the quantifiers".

I agree. The :, ::, and ::: colon modifiers for backtracking control have been present from the earliest versions of the regex language, even (I think) predating "token" and "rule". The fact that we no longer encounter them frequently in actual code perhaps testifies to just how effective "token", "rule", ":ratchet", and ":sigspace" have been. Earlier versions of grammars we wrote would make frequent use of the colon modifiers to try to DTRT, which led to finding ways to make backtracking control more automatic and not have to sprinkle them everywhere.

The colon modifiers just make sense -- they give a way to control backtracking at an atomic level, for those cases where you want to enforce/prevent backtracking for a specific quantifier or backtracking point and not as the default used by the entire regex/token/rule.

: came first, the :! and :? forms came much later as we were coming up with a generic pattern for understanding backtracking control. Things we use all the time like .*? are really a shorthand notation for .*:?.
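For concreteness, a quick before/after of the bare colon (the documented one):

say 'aaa' ~~ / a*: a /;   # Nil: a*: takes all three and won't give one back
say 'aaa' ~~ / a*  a /;   # 「aaa」: plain a* backtracks one step
say 'aaa' ~~ / a*? a /;   # 「a」: frugal, the shorthand for a*:?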

Thanks to whoever added #4370 to get these (highly useful in key contexts) modifiers documented, since their existence had been lost to the sands of time.....

Pm

pmichaud commented 1 year ago

For those who like history, I even found the original March 2009 thread where <ws> was clarified and <ww> got introduced.

https://www.nntp.perl.org/group/perl.perl6.language/2009/03/msg31119.html

Larry's comment three messages later still warms my heart to this day:

From: Larry Wall
Date: March 9, 2009 10:53

I have wanted <!ww> a number of times, particularly after generic tokens that might or might not end in \w. So feel free to spec it.

Pm