Open ishax-kos opened 1 year ago
I'm not sure that your assesment
Using
<-
containing rules that use<
will nullify the effectiveness of the former
is correct. Let's write out your initial grammar, replacing <
with <-
and adding explicit Spacing
:
File <- Spacing Statement_list Spacing eoi
Statement_list <- Statement Inline_spacing (Separator Inline_spacing Statement)*
Separator <- ',' / eol+
Statement <- Spacing '1' Spacing '+' Spacing '2' Spacing
Inline_spacing <- :(' ')*
Spacing <- :(' ' / '\t' / '\r' / '\n' / '\r\n')*
This is equivalent to your grammar. Now, feeding it this input
1 + 2, 1 + 2
1
+2
the second statement on the first line of input will consume the line ending, because Statement
ends with Spacing
. The next token is 1
on the second line of input, but the grammar expects Separator
, which you want to be the line ending that was already consumed. This is the reason that this grammar fails to parse this input.
In cases where spacing has semantic relevance depending on context, like this one, I recommend to not use <
rules and be explicit about the kind of spacing. See this complete example, using the predefined parsers space
and blank
:
/+dub.sdl:
dependency "pegged" version="~>0.4"
+/
import pegged.grammar;
import std.stdio;
mixin(grammar(`
MyGrammar:
File <- Statement_list eoi
Statement_list <- Statement (Separator Statement)*
Separator <- ',' / eol+
Statement <- :space* '1' :blank* '+' :blank* '2' :space*
`));
void main()
{
auto parseTree = MyGrammar("1 + 2");
assert(parseTree.matches == ["1", "+", "2"]);
writeln(parseTree);
writeln;
parseTree = MyGrammar("1 + 2, 1 + 2");
assert(parseTree.matches == ["1", "+", "2", ",", "1", "+", "2"]);
writeln(parseTree);
writeln;
parseTree = MyGrammar("1 + 2, 1 + 2\n1\n+2");
assert(parseTree.matches == ["1", "+", "2", ",", "1", "+", "2", "\n", "1", "+", "2"]);
writeln(parseTree);
writeln;
}
As you see, this parses your input correctly, because Statement
will not consume any trailing line endings, leaving it to be parsed as Separator
.
As for your proposal, we cannot change the current behaviour, as that would break existing grammars. So a new shorthand would have to be introduced. I haven't studied this further, but perhaps parameterized rules may help you avoid the need for a new shorthand.
Hope this helps, Bastiaan.
You reiterated what I described in the br
rule example. Consuming leading and trailing spacing is redundant in the majority of cases and causes specific cases to be more difficult to express.
A new shorthand would certainly work. Perhaps I should fork it and make such a thing. <<
seems appropriate. It may also be of use to be able to specify alternative spacing rules on a per rule basis. And a "flatten" operator.
Thank you for the fast reply!
You reiterated what I described in the
br
rule example.
Sorry about that. That wasn't clear to me when I read your code, but at least we are on the same page :-)
A new shorthand would certainly work. Perhaps I should fork it and make such a thing.
<<
seems appropriate.
I accept pull requests. Beware though that some files are generated, and those should not be edited by hand. See https://github.com/PhilippeSigaud/Pegged/tree/master/pegged/dev
It may also be of use to be able to specify alternative spacing rules on a per rule basis.
That is what I was thinking about with my parameterized rules suggestion. Totally not sure whether that is applicable though.
And a "flatten" operator.
What's that? We have the existing <:
.
Thank you for the fast reply!
My pleasure!
And a "flatten" operator.
What's that? We have the existing
<:
.
The behavior is completely different. What that does is drop the parent AND the child nodes. What I mean is to be able to mark a node for decimation. I tried to pull this off by altering the name the parent with a semantic action, so that it would be culled, but as it happens <{cull}
only operates on the top node INSIDE the rule and not the rule node itself.
And a "flatten" operator.
What's that? We have the existing
<:
.The behavior is completely different. What that does is drop the parent AND the child nodes.
Sorry, I meant <~
.
What I mean is to be able to mark a node for decimation. I tried to pull this off by altering the name the parent with a semantic action, so that it would be culled, but as it happens
<{cull}
only operates on the top node INSIDE the rule and not the rule node itself.
I still don't quite understand what you want. What happens to the children of a culled node? Do they replace it?
If you recall how decimation works, like that. The children are added to the tree. What I ended up doing is using a recursive script that renames everything with a certain prefix.
Problem
Using
<-
containing rules that use<
will nullify the effectiveness of the former. For example, imagine you desire grammar like the following:valid input that yields 3 statements might look like
The intention of this grammar is that only one statement can occur per line. Basically statements can be separated by a number of newlines, or one comma, but not both, while the contents of those statements can be spread across multiple lines. These rules are impossible with this more intuitive setup, because
<
captures leading and trailing white-space. There is a way to pull this off, but it requires instead setting a custom spacing rule with no newlines, and then manually specifying everywhere a newline can occur in the middle of a statement. For instance, here's an excerpt from my own code.Proposal
There is an elegant solution that should save on some spacing checks and enable the code I presented at the top to work as intended. Essentially,
Spacing
is only inserted between characters. Never at the start or end of the rule. There is one exception to this though, which is the entrypoint rule. With this the entrypoint must have non space characters at the first position of the input. The solution for that is to either make an exception where<
on the entrypoint will capture leading whitespace, or just require the user to handle it explicitly with""
or usingSpace
directly. This shouldn't cause breakages for most code, consider:Expands to:
The only case where spacing is missed, is where
A
uses<-
, forcing2
and3
to be adjacent, or allow only what the user wants.