0x2a-42 / lelwel

Resilient LL(1) parser generator for Rust
Apache License 2.0
102 stars 0 forks source link

More direct CST node elision control #2

Open mullr opened 3 weeks ago

mullr commented 3 weeks ago

First, thank you for making this. It's the Rust parser generator I've always wanted.

Reading through the documentation, I saw that there are a handful of heuristics to determine if a node should be included in the CST or not. They generally make sense, but I feel like it would be better if they were more explicit. This is especially true because the AST functions that you build to extract information from the CST tend to care about its structure, and things can go pretty badly if the structure isn't like you expect.

Have you considered a mechanism to directly control whether a node is elided or not?

One approach that I've seen work very well is the one taken by https://github.com/Engelberg/instaparse. They have an extension to the ebnf that lets you mark if a node should appear in the output or not. You can do this on both when you use a rule, or on the left hand side of a rule definition. They call this "hiding": https://github.com/Engelberg/instaparse?tab=readme-ov-file#hiding-content

0x2a-42 commented 3 weeks ago

Thanks for the feedback. I hope you'll find the project useful.

I completely agree. I'm also not happy with the implicit nature of the rule pattern matching. The language server helps a little, as it will show the pattern when one hovers over the rule name, but it is still quite dangerous. Another problem that can easily occur is an operator precedence rule turning into a left recursive rule, which does not take into account precedence. I plan to use a Pratt parser instead of the precedence climbing method to mitigate this problem in the future.

I think the relevant section from the instaparse documentation would be the one about hiding tags, as hiding content seems to be about hiding tokens in the CST, which would go against the idea of lossless syntax trees. Tree-sitter also supports this style of hiding rules by prepending an underscore to the rule name.

This would be sufficient for what I called "unconditional forwarding", where a rule will derive to exactly one nonterminal e.g.

stmt:
  for_stmt
| while_stmt
| expr_stmt
| return_stmt
;

However, for all the other cases of node elision this would not be enough. For example, the following rule should only elide the node if the second branch is taken.

unary_expr:
  ('+' | '-') unary_expr
| primary_expr
;

Possible Solution

A possible solution would be a "forwarding" operator ^ which can be used on nonterminals and rule declarations. This would basically allow the same behavior as before, but it would makes the node elision explicit.

// allowed at rule if it only forwards to other rules or derives to the empty word, otherwise error
stmt^:
  for_stmt
| while_stmt
| expr_stmt
| return_stmt
;
generic_args^:
  ['<' type_list '>']
;

// allowed at nonterminal if it is possibly the whole result of the rule application, otherwise error
unary_expr:
  ('+' | '-') unary_expr
| primary_expr^ 
;
bin_expr:
  bin_expr ('*' | '/') bin_expr
| bin_expr ('+' | '-') bin_expr
| primary_expr^
;
expr_list:
  expr^ (',' expr)*
;