erikrose / parsimonious

The fastest pure-Python PEG parser I can muster
MIT License

Grammar extensions #208

Open lucaswiman opened 2 years ago

lucaswiman commented 2 years ago

@erikrose @lonnen This PR proposes a method of extending grammars, including Parsimonious' own rule syntax. Implements #30.

Changes

Syntax for referencing/overriding previously-defined rules.

Erik suggested a syntax like this in #30. It seems reasonable, if a little terse. Very open to suggestions here.

The key point here is that to truly extend functionality of other grammars, references cannot be resolved until after ^super expressions have been resolved. This allows e.g. defining a new kind of expression, and having it included anywhere expression was used in the original grammar.

Example:

default = foo*
foo = "bar"
foo = ^foo / "baz"

This is equivalent to the following grammar:

default = foo*
foo = "bar" / "baz"
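To make the resolution order concrete, here is a minimal Python sketch that treats the ^ rewriting as plain text substitution. It is not the code in this PR: resolve_super is a hypothetical helper, and it assumes one rule per line.

```python
import re

def resolve_super(grammar_text):
    """Collapse repeated rule definitions, inlining ^name references.

    Hypothetical helper for illustration only: assumes one rule per line.
    """
    rules = {}  # rule name -> body text; a dict keeps first-insertion order
    for line in grammar_text.splitlines():
        if not line.strip():
            continue
        name, body = (part.strip() for part in line.split("=", 1))

        def inline_previous(match):
            # ^foo stands for whatever foo meant before this line.
            return "(" + rules[match.group(1)] + ")"

        rules[name] = re.sub(r"\^(\w+)", inline_previous, body)
    return "\n".join(f"{name} = {body}" for name, body in rules.items())

merged = resolve_super('default = foo*\nfoo = "bar"\nfoo = ^foo / "baz"')
print(merged)
```

Note that default still refers to foo by name, so it picks up the extended definition. Only the ^foo reference is replaced early; ordinary references are left to be resolved afterwards.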

Syntax for dividing up rule sections

Two or more = or - characters make a new kind of comment. It has no semantic content, though it could be used to refine the inheritance semantics, e.g. around custom rules passed via **more_rules.

default = foo*
foo = "bar"
==============
foo = ^foo / "baz"
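A rough sketch of how such a divider could be recognized, assuming it occupies a whole line and uses a single character type. The DIVIDER pattern and strip_dividers helper are hypothetical and not taken from this PR.

```python
import re

# Hypothetical pattern: a line made only of two or more "=" or "-" characters.
DIVIDER = re.compile(r"^\s*(={2,}|-{2,})\s*$")

def strip_dividers(grammar_text):
    """Drop divider lines, since they carry no semantic content."""
    kept = [line for line in grammar_text.splitlines() if not DIVIDER.match(line)]
    return "\n".join(kept)

sectioned = 'default = foo*\nfoo = "bar"\n==============\nfoo = ^foo / "baz"'
flattened = strip_dividers(sectioned)
print(flattened)
```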

Grammar.extend instance method

Takes the same arguments as the Grammar constructor, but instead extends the existing grammar by concatenating the original grammar definition and the new one. To achieve this, the original arguments passed to the constructor are retained.
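The idea can be sketched without Parsimonious at all. MiniGrammar below is a hypothetical stand-in, not the real Grammar class: it only shows the constructor arguments being retained and a new instance being built from the concatenated text.

```python
class MiniGrammar:
    """Hypothetical stand-in for Grammar; it stores arguments but parses nothing."""

    def __init__(self, rules="", **more_rules):
        # Retain the original constructor arguments so extend() can reuse them.
        self.rules_text = rules
        self.more_rules = dict(more_rules)

    def extend(self, rules="", **more_rules):
        # Concatenate the old and new definitions; later rules come last.
        combined_text = self.rules_text + "\n" + rules
        combined_custom = {**self.more_rules, **more_rules}
        return type(self)(combined_text, **combined_custom)

base = MiniGrammar('default = foo*\nfoo = "bar"')
extended = base.extend('foo = ^foo / "baz"')
print(extended.rules_text)
```

Returning a new instance rather than mutating the original is an assumption of this sketch; it keeps the base grammar usable on its own.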

Class variables on Grammar to define how a grammar is parsed and visited

Each Grammar subclass defines a grammar that parses rules, and a visitor class that visits them.

This allows extensions to parsimonious's syntax without needing to reach consensus on what those extensions should be. Individual users can update the syntax to make a DSL useful for their own purposes.
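As an illustration of the pattern only, here is a minimal sketch in which a class variable selects the visitor and a subclass swaps it to change the rule syntax. All names here (SketchGrammar, visitor_class, EqualsVisitor, ArrowVisitor) are hypothetical and are not the attribute names used in this PR.

```python
class EqualsVisitor:
    """Hypothetical visitor: reads 'name = body' lines into a dict."""
    separator = "="

    def visit(self, text):
        rules = {}
        for line in text.splitlines():
            if line.strip():
                name, body = line.split(self.separator, 1)
                rules[name.strip()] = body.strip()
        return rules

class ArrowVisitor(EqualsVisitor):
    """Same visitor, but rules are written 'name <- body'."""
    separator = "<-"

class SketchGrammar:
    # Class variable: subclasses override this to change how rules are read.
    visitor_class = EqualsVisitor

    def __init__(self, text):
        self.rules = self.visitor_class().visit(text)

class ArrowGrammar(SketchGrammar):
    visitor_class = ArrowVisitor

default_rules = SketchGrammar('foo = "bar"').rules
arrow_rules = ArrowGrammar('foo <- "bar"').rules
print(default_rules, arrow_rules)
```

A user who wants a different rule syntax only subclasses; the base class and its default syntax stay untouched.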

I included an example of a different approach to token grammars that is useful for a particular problem I'm trying to solve. Here, CAPITAL_REFERENCES refer to token types, while lowercase references refer to rules. Attributes of tokens can themselves be matched or parsed with a language similar to xpath.

Limitations

  1. This exposes some parts of the internals as "public" parts of the API, which may be a problem if we need to change those internals. However, this is extremely useful functionality, and would allow making and using proposed syntax changes before or instead of altering the grammar definition DSL. Still, it may make sense to resolve https://github.com/erikrose/parsimonious/issues/199 before shipping this functionality.
  2. The **more_rules construct is a bit wonky or buggy. Consider the following:
    g = Grammar("...", custom_expr=MyCoolCustomExpression())
    g2 = g.extend("""
    custom_expr = ^custom_expr / something_else
    """)

Here the extension doesn't do anything, since the custom rule passed via **more_rules overrides the extended definition of custom_expr. I think there are solutions to this, but they're a bit finicky to implement, so I figured I'd put this up for discussion before continuing.
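The override can be pictured with plain dictionaries. This is a simplified sketch, assuming custom rules are applied to the rule map after the rules parsed from text; build_rule_map is a hypothetical helper and the string values stand in for real expression objects.

```python
def build_rule_map(parsed_rules, **more_rules):
    """Hypothetical helper: merge text rules, then let custom rules win."""
    rule_map = dict(parsed_rules)  # rules from the (already extended) grammar text
    rule_map.update(more_rules)    # custom rules are applied last
    return rule_map

# The extension's definition, as it would come out of the grammar text.
parsed = {"custom_expr": "^custom_expr / something_else"}

# The keyword argument retained from the original constructor call.
rule_map = build_rule_map(parsed, custom_expr="<MyCoolCustomExpression>")
print(rule_map["custom_expr"])
```

Because the keyword argument is applied last, the definition from the extension text is discarded, which matches the behaviour described above.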

That said, it doesn't break any existing use of the **more_rules feature, which is a bit of an advanced/experimental feature anyway.

Still TODO

lonnen commented 2 years ago

I don't have anything to add over Erik's original ideas for statement reference and override in #30.

I can really only offer that the AND/OR precedence issue remains sticky, and it seems to actually be about dealing with backwards incompatibility in general. The extend syntax is going to expand the de facto public API, so some policy or stated expectations would help here.

Erik added "I don't plan on making any backward-incompatible changes to the rule syntax in the future, so you can write grammars with confidence" in version 0.2, ten years ago, and then promptly shipped two breaking versions (0.5, 0.6). It's sometimes necessary. You have the most skin in the game here, and I'm inclined to follow your recommendations, but exposing the internals is going to add some tension with respect to backwards compatibility.