LanguageDev / Yoakke

A collection of libraries for implementing compilers in .NET.
https://languagedev.github.io/Yoakke/
Apache License 2.0
141 stars 8 forks source link

Ignore elements in parser rules #77

Open LPeter1997 opened 3 years ago

LPeter1997 commented 3 years ago

Is your feature request related to a problem? Please describe. In many cases, many elements in a parser do not make it into the constructed result and are purely there for disambiguation/aesthetics. Still, these elements clutter the argument list, like so:

[Rule("statement : 'return' expression? ';'")]
private static Statement Return(IToken _ret, Expression? value, IToken _sc) => new Statement.Return(value);

The keyword and semicolon are completely unnecessary and just make the function signature more verbose. A bigger example:

[Rule("statement : 'func' Identifier '(' (Identifier (',' Identifier)*)? ')' '{' statement* '}'")]
private static Statement FunctionDef(
    IToken _func, IToken name,
    IToken _lp, Punctuated<IToken<TokenType>, IToken<TokenType>> paramList, IToken _rp,
    IToken _ob, IReadOnlyList<Statement> statements, IToken _cb) => 
    new Statement.FunctionDef(name.Text, paramList.Values.Select(v => v.Text).ToList(), new Statement.Block(statements));

5 of the 8 parameters are unused.

Describe the solution you'd like There should be a way to ignore elements, so they don't get passed in to the transformer.

Describe alternatives you've considered We could introduce some kind of prefix operator that annotates that the given element should be ignored. Something like:

[Rule("statement : #'func' Identifier #'(' (Identifier (',' Identifier)*)? #')' #'{' statement* #'}'")]

I'm not sure about the symbol, I was considering -, ~, #, %, $, !, but any syntax recommendation could work.

LPeter1997 commented 3 years ago

I've thought of an alternative design. Since in most cases we just want to drop all terminals, we could have the annotations say, that we just want to drop all terminals. The parser itself could have a default preference, but rules could override it case-by-case. Example:

[Parser(typeof(TokenType), DiscardTerminals = true)]
public partial class Parser
{
    [Rule("statement : 'return' expression? ';'")]
    private static Statement Return(Expression? value) => new Statement.Return(value);

    [Rule("expression : IntLit", DiscardTerminals = false)]
    private static Expression IntLit(Token t) => new Expression.Literal(t);
}

I think this could be beneficial, because we don't litter the grammar with annotations.

Question: What should happen with terminals in nested constructs?

Let's look at the example from the original issue:

[Rule("statement : 'func' Identifier '(' (Identifier (',' Identifier)*)? ')' '{' statement* '}'")]
private static Statement FunctionDef(
    IToken _func, IToken name,
    IToken _lp, Punctuated<Token, Token> paramList, IToken _rp,
    IToken _ob, IReadOnlyList<Statement> statements, IToken _cb) => 
    new Statement.FunctionDef(name.Text, paramList.Values.Select(v => v.Text).ToList(), new Statement.Block(statements));

If we decide to discard terminals here, what should happen? Should we drop them from the (Identifier (',' Identifier)*)? construct too? That would disallow the Punctuated<...> sugar, which would be pretty nice to keep. We could:

  1. Keep terminals in non-top-level constructs.
  2. Make Punctuated work without punctuation elements.

Problem: We lose benefits by dropping all terminals! If you look at this more closely:

[Rule("statement : 'func' Identifier '(' (Identifier (',' Identifier)*)? ')' '{' statement* '}'")]
private static Statement FunctionDef(
    IToken _func, IToken name,
    IToken _lp, Punctuated<Token, Token> paramList, IToken _rp,
    IToken _ob, IReadOnlyList<Statement> statements, IToken _cb) => 
    new Statement.FunctionDef(name.Text, paramList.Values.Select(v => v.Text).ToList(), new Statement.Block(statements));

We do want to keep a few elements, namely the function name and the parameter names. These are terminals too! I also see a few options here:

  1. Make it possible to ignore only literal, textual matches. Notice, that we only want to keep things that are referenced by their token kind, and discard the literal matches. We want to keep Identifier, but not 'func'. We could have a DiscardLiteral option for this. If we assume that we make Punctuated work without the punctuation characters, the above could become

    [Rule("statement : 'func' Identifier '(' (Identifier (',' Identifier)*)? ')' '{' statement* '}'", DiscardLiterals = true)]
    private static Statement FunctionDef(IToken name, Punctuated<Token> paramList, IReadOnlyList<Statement> statements) => 
    new Statement.FunctionDef(name.Text, paramList.Values.Select(v => v.Text).ToList(), new Statement.Block(statements));
  2. Still have a notation in the grammar to keep certain tokens.

    [Rule("statement : 'func' +Identifier '(' (+Identifier (',' +Identifier)*)? ')' '{' statement* '}'", DiscardTerminals = true)]
    private static Statement FunctionDef(IToken name, Punctuated<Token> paramList, IReadOnlyList<Statement> statements) => 
    new Statement.FunctionDef(name.Text, paramList.Values.Select(v => v.Text).ToList(), new Statement.Block(statements));
skneko commented 3 years ago

I agree with many points of your design! Good job!

To expand it and address some details, I think the ParserAttribute could have a bool DiscardTerminals and then every RuleAttribute would have a bool? DiscardTerminals (notice it's nullable!), where null means "inherit" and a true or false is an explicit override for that rule only. The default value for the parser should be false for the sake of backward compatibility and for being less confusing for new users (it's easier to ignore something than to look up why something you expect is not there). The default value for rules would be null (inherit). We could also use an enum that makes this more explicit (True, False, Inherit?).

I don't think discriminating between string-literal-style and token-name-style terminals is a good idea. Maybe a user whishes to always use the token name for maintainability reasons, e.g. using Open Identifier Close instead of '(' Identifier ')' to avoid changing the parser rules while the lexer tokens are modified. There is no actual semantic difference between both, so we shouldn't add one ourselves, I reckon. I like the idea of using a explicit character (I like +), like this:

[Rule("statement : 'func' +Identifier '(' (+Identifier (',' +Identifier)*)? ')' '{' statement* '}'", DiscardTerminals = true)]
private static Statement FunctionDef(IToken name, Punctuated<Token> paramList, IReadOnlyList<Statement> statements) => 
    new Statement.FunctionDef(name.Text, paramList.Values.Select(v => v.Text).ToList(), new Statement.Block(statements));

Regarding Punctuated, what makes the most sense for me is for it to completely ignore the attribute. Why? Because this is a convenience change and this class already allows the user to efficiently and ergonomically handle the pontentially "useless" tokens separately, discriminating between Values and the separators (sorry I don't remember the field name now). This may sound inconsistent, but I see it as a pragmatic sugar over an already consistent and solid base. Regarding the + marks, we could make it an error to not put them, in order to gain consistency. This way, a punctuated sequence in DiscardTerminals = true mode comes from this, yes or yes: (+Terminal (+Separator +Terminal)*)?.

Alternative: if DiscardTerminals = true and we find (+Terminal (Separator +Terminal)*)? (notice the separator is not marked) it becomes IReadOnlyList<Token>.

Also, maybe we could call it IgnoreTerminals instead of DiscardTerminals? I see upsides and problems with both alternatives, and I am not sure if they explain "I do expect the terminals but I won't pass them to you" to a user who has not read the documentation. Maybe "discard" is better then, now that I think of it.