dlang-community / Pegged

A Parsing Expression Grammar (PEG) module, using the D programming language.
533 stars 66 forks source link

Remove Leading and trailing automatic space capture #332

Open ishax-kos opened 1 year ago

ishax-kos commented 1 year ago

Problem

Using <- containing rules that use < will nullify the effectiveness of the former. For example, imagine you desire grammar like the following:

    File < Statement_list eof
    Statement_list <- Statement Inline_spacing (Separator Inline_spacing Statement)*
    Separator <- ',' / eol+
    Statement < '1' '+' '2'
    Inline_spacing <- :(' ')*

valid input that yields 3 statements might look like

1 + 2, 1 + 2
1
+2

The intention of this grammar is that only one statement can occur per line. Basically statements can be separated by a number of newlines, or one comma, but not both, while the contents of those statements can be spread across multiple lines. These rules are impossible with this more intuitive setup, because < captures leading and trailing white-space. There is a way to pull this off, but it requires instead setting a custom spacing rule with no newlines, and then manually specifying everywhere a newline can occur in the middle of a statement. For instance, here's an excerpt from my own code.

    br <- :(eol?)

    Expression < Ex5
    Ex5 < Type? br Ex4
    Ex4 < Ex3 (br '+' br Ex3)*
    Ex3 < Ex2 (br '*' br Ex2)*
    Ex2 < ('-' / '/' br)? Ex1
    Ex1 < '(' br Expression br ')' / Ex0
    Ex0 < lit_number / name_value

Proposal

There is an elegant solution that should save on some spacing checks and enable the code I presented at the top to work as intended. Essentially, Spacing is only inserted between characters. Never at the start or end of the rule. There is one exception to this though, which is the entrypoint rule. With this the entrypoint must have non space characters at the first position of the input. The solution for that is to either make an exception where < on the entrypoint will capture leading whitespace, or just require the user to handle it explicitly with "" or using Space directly. This shouldn't cause breakages for most code, consider:

    A < B C
    B < '1' '2'
    C < '3' '4'

Expands to:

    A <- B Spacing C
    B <- '1' Spacing '2'
    C <- '3' Spacing '4'

The only case where spacing is missed, is where A uses <-, forcing 2 and 3 to be adjacent, or allow only what the user wants.

veelo commented 1 year ago

I'm not sure that your assesment

Using <- containing rules that use < will nullify the effectiveness of the former

is correct. Let's write out your initial grammar, replacing < with <- and adding explicit Spacing:

    File <- Spacing Statement_list Spacing eoi
    Statement_list <- Statement Inline_spacing (Separator Inline_spacing Statement)*
    Separator <- ',' / eol+
    Statement <- Spacing '1' Spacing '+' Spacing '2' Spacing
    Inline_spacing <- :(' ')*
    Spacing <- :(' ' / '\t' / '\r' / '\n' / '\r\n')*

This is equivalent to your grammar. Now, feeding it this input

1 + 2, 1 + 2
1
+2

the second statement on the first line of input will consume the line ending, because Statement ends with Spacing. The next token is 1 on the second line of input, but the grammar expects Separator, which you want to be the line ending that was already consumed. This is the reason that this grammar fails to parse this input.

In cases where spacing has semantic relevance depending on context, like this one, I recommend to not use < rules and be explicit about the kind of spacing. See this complete example, using the predefined parsers space and blank:

/+dub.sdl:
dependency "pegged" version="~>0.4"
+/
import pegged.grammar;
import std.stdio;

mixin(grammar(`
MyGrammar:
    File <- Statement_list eoi
    Statement_list <- Statement (Separator Statement)*
    Separator <- ',' / eol+
    Statement <- :space* '1' :blank* '+' :blank* '2' :space*
`));

void main()
{
    auto parseTree = MyGrammar("1 + 2");
    assert(parseTree.matches == ["1", "+", "2"]);
    writeln(parseTree);
    writeln;

    parseTree = MyGrammar("1 + 2, 1 + 2");
    assert(parseTree.matches == ["1", "+", "2", ",", "1", "+", "2"]);
    writeln(parseTree);
    writeln;

    parseTree = MyGrammar("1 + 2, 1 + 2\n1\n+2");
    assert(parseTree.matches == ["1", "+", "2", ",", "1", "+", "2", "\n", "1", "+", "2"]);
    writeln(parseTree);
    writeln;
}

As you see, this parses your input correctly, because Statement will not consume any trailing line endings, leaving it to be parsed as Separator.

As for your proposal, we cannot change the current behaviour, as that would break existing grammars. So a new shorthand would have to be introduced. I haven't studied this further, but perhaps parameterized rules may help you avoid the need for a new shorthand.

Hope this helps, Bastiaan.

ishax-kos commented 1 year ago

You reiterated what I described in the br rule example. Consuming leading and trailing spacing is redundant in the majority of cases and causes specific cases to be more difficult to express. A new shorthand would certainly work. Perhaps I should fork it and make such a thing. << seems appropriate. It may also be of use to be able to specify alternative spacing rules on a per rule basis. And a "flatten" operator.

Thank you for the fast reply!

veelo commented 1 year ago

You reiterated what I described in the br rule example.

Sorry about that. That wasn't clear to me when I read your code, but at least we are on the same page :-)

A new shorthand would certainly work. Perhaps I should fork it and make such a thing. << seems appropriate.

I accept pull requests. Beware though that some files are generated, and those should not be edited by hand. See https://github.com/PhilippeSigaud/Pegged/tree/master/pegged/dev

It may also be of use to be able to specify alternative spacing rules on a per rule basis.

That is what I was thinking about with my parameterized rules suggestion. Totally not sure whether that is applicable though.

And a "flatten" operator.

What's that? We have the existing <:.

Thank you for the fast reply!

My pleasure!

ishax-kos commented 1 year ago

And a "flatten" operator.

What's that? We have the existing <:.

The behavior is completely different. What that does is drop the parent AND the child nodes. What I mean is to be able to mark a node for decimation. I tried to pull this off by altering the name the parent with a semantic action, so that it would be culled, but as it happens <{cull} only operates on the top node INSIDE the rule and not the rule node itself.

veelo commented 1 year ago

And a "flatten" operator.

What's that? We have the existing <:.

The behavior is completely different. What that does is drop the parent AND the child nodes.

Sorry, I meant <~.

What I mean is to be able to mark a node for decimation. I tried to pull this off by altering the name the parent with a semantic action, so that it would be culled, but as it happens <{cull} only operates on the top node INSIDE the rule and not the rule node itself.

I still don't quite understand what you want. What happens to the children of a culled node? Do they replace it?

ishax-kos commented 1 year ago

If you recall how decimation works, like that. The children are added to the tree. What I ended up doing is using a recursive script that renames everything with a certain prefix.