erikrose / parsimonious

The fastest pure-Python PEG parser I can muster
MIT License
1.81k stars 127 forks source link

NodeVisitor API change proposals #114

Open lucaswiman opened 7 years ago

lucaswiman commented 7 years ago

Below are some of my thoughts on how to change the NodeVisitor class so it's more user-friendly. I wanted to write this down while the code was fresh in my head, but there is no rush or urgency. The motivating examples I have in mind are things like the following expressions:

union = (concatenation "|")+ concatenation
lookaround = "(" ("?=" / "?!" / "?<=" / "?<!") re ")"

Which requires visitors like this:

    def visit_union(self, node, children):
        disjuncts = []
        disjuncts_and_pipes, last_disjunct = children
        for (disjunct, superfluous_pipe) in disjuncts_and_pipes:
            disjuncts.append(disjunct)
        disjuncts.append(last_disjunct)
        return reduce(operator.or_, disjuncts)
    def visit_lookaround(self, node, children):
        lparen, [quantifier], lookaround_re, rparen = children
        ...

generic_visit changes

In practice it's quite difficult to write a visitor without a generic_visit method. Almost every use I've seen or written of parsimonious includes the following:

def generic_visit(self, node, visited_children):
    return node.text or visited_children

This is undesirable because it means anonymous rules which can have a range of number of children (eg Optional) can sometimes have different return types: either an (empty) list if nothing matched, or a string if something did. After #113, visitors can be a bit more intelligent to avoid those situations by inspecting the type. I propose making the following breaking changes:


For rules with no visitor method defined, perform the following changes:

Unpacking

Another issue comes in unpacking arguments. In Python <3, you could unpack tuples inside of function arguments. The core team decided to get rid of that nice feature, which means that for all Python >=3 work, you need to have an extra line unpacking the visited children into a useful form. With that in mind, I think changing this line to this:

return method(node, *[self.visit(n) for n in node])

would return some of the syntactic niceness of the python 2 tuple unpacking to the burgeoning Python 3 world.

Specifying that nodes should be ignored

This is stolen from lark, an upstart python parser generator library (for CFGs rather than PEGs). The idea is to add a syntax to the grammar allowing certain expressions to be ignored during visitation. This helps to avoid both verbose unpacking, and tightly coupling the structure of the expression in the grammar to the function that evaluates it. I propose :novisit or :n after any expression to mark it as should-not-visit. See "attribute grammars" for some prior art.


Putting it all together

As an example, consider the contrived language of parenthesized addition operations where whitespace is not semantically meaningful:

Currently this is handled like the following

class Evaluator(NodeVisitor):
    grammar = Grammar(r'''
        expr = _ (addition / number) _
        addition = "(" expr "+" expr ")"
        number = ~"[1-9]\d*"
        _ = ~"\s*"
    ''')
    def visit_number(self, node, children):
        return int(node.text)
    def visit_expr(self, node, children):
        _, [value], _ = children
        return value
    def visit_addition(self, node, children):
        lparen, left, plus, right, rparen = children
        return left + right
    def generic_visit(self, node, children):
        return children or node.text

>>> Evaluator().parse('(( 1 + 2) + 3 )')
6

Using the suggestions here, this could be reduced to something like:

class Evaluator(NodeVisitor):
    grammar = Grammar(r'''
        expr = _ (addition / number) _
        addition = "(":n expr "+":n expr ")":n
        number = ~"[1-9]\d*"
        _:novisit = ~"\s*"
    ''')
    def visit_number(self, node, number):
        return int(number)
    def visit_expr(self, node, value):
        return value
    def visit_addition(self, node, left, right):
        return left + right
lucaswiman commented 7 years ago

Erik: this is the last one today, I promise! :-)

Again, no rush on any of this. Thanks for your work on Parsimonious!

erikrose commented 7 years ago

Finally getting around to giving this a proper look. Thanks for your patience!

erikrose commented 7 years ago

At first glance, this looks very good! One comment so far:

For matched string literals (Literal and Regex) with no visitor method defined, just return the matched text.

Sounds useful. I've never seen anybody use capturing groups in a Parsimonious regex, but that's the functionality RegexNode exists to provide. If, however, we made RegexNode a subclass of str, we could have the best of both worlds. (I did something like this in blessings.) Reaction?

More to come!

erikrose commented 7 years ago

unpacking

The unpacking part looks good to me as well. I'm still mad at the core team for killing that nice feature. :-) If we ever had to add another param ot visitor methods, we can make it a kwarg and, in Python 3, a keyword-only arg so normal args don't accidentally fall into it if somebody provides too many.

erikrose commented 7 years ago

generic_visit changes

All your changes under generic_visit look good. I'd add to them that OneOrMore and ZeroOrMore expressions should always return lists. There's a bug now in which they sometimes don't.

erikrose commented 7 years ago

FWIW, regarding your example of a typical generic_visit, my visitors sometimes do the reverse of your return node.text or visited_children: https://github.com/mozilla/dxr/blob/1666883ed0d28a7d34280c00dd477beac365f6b0/dxr/query.py#L328. I just wanted to provide a counterexample in case it throws off any of your reasoning, though I don't think it should.

erikrose commented 7 years ago

Specifying that nodes should be ignored

Ignoring nodes is part of a larger ticket I've been wanting to work on for some time. Have a look at https://github.com/erikrose/parsimonious/issues/29#issuecomment-48846170 for some thoughts about syntax and the other kinds of transforms we might eventually want to support.

Attribute grammars are interesting; I hadn't seen them before. It looks to me like Parsimonious visitors are simply an alternate spelling of them; they seem completely homomorphic. For example, the .value attribute in Wikipedia's example is just the entire return value of our visitor methods as typically constructed. The equivalent to multiple attributes would be returning a dict from visitor methods and passing that around.

Syntax: I've not yet found one I really like. rlib's <- and >-based syntax is clearly wrong; I've read that page many times before and still I find myself scratching my head over their assignment of semantics. My reuse of that syntax with different semantics is more memorable to me, following a physical metaphor of squishing a term out of existence or expanding a term to replace the whole expression. The downside is that there's nowhere obvious to go if we need to represent additional transformations. Your named syntax gives us that, at the cost of having to use our lexical brains. Hmm. I'd be interested in kicking around more ideas with you. Or if you wanted to implement one syntax but were able to keep it fluid in case we want to change it before landing, that'd be fine too.

erikrose commented 7 years ago

Putting it all together

Looks great. It would also be nice to see what effect the changes have on some of my pet visitors, just to ferret out unexpected corner cases:

Maybe we could split up this work.

That's all I have to say for now! I think you again for your extreme patience and await your response.

lucaswiman commented 7 years ago

tl;dr

Maybe we could split up this work.

The "ignoring" syntax is by far the hardest part, so I'm happy to leave it to you. :-D

The other parts are fairly easy to implement in one go, so I'd be happy to submit a PR for you to try out on your pet NodeVisitors. Regarding backwards compatibility: would you prefer to provide a new name (e.g. NodeVisitor2, with a deprecation schedule) for an easier upgrade path, or just break compatibility? The former would be easier for the codebase at my work, but I understand if you don't want to mess with all that.

More detailed comments

Sounds useful. I've never seen anybody use capturing groups in a Parsimonious regex, but that's the functionality RegexNode exists to provide. If, however, we made RegexNode a subclass of str, we could have the best of both worlds. (I did something like this in blessings.) Reaction?

This would be a clean API, so I mostly like it. There are two issues I see with it:

  1. This would substantially increase the memory usage of regex nodes, since I don't think there's a way to inherit from str without actually making a memory allocation of the requisite size. For my use cases, the strings I'm working with aren't that long, so this is fine by me. However, it might impinge on the original intent of parsimonious to be fast and memory efficient.
    1. It would be a very nice feature to support bytes grammars in Python 3 (see #31), and this would make supporting it harder.

FWIW, regarding your example of a typical generic_visit, my visitors sometimes do the reverse of your return node.text or visited_children

Oops! That was a typo, and the idiom in your link is similar to the one I mean:

def generic_visit(self, node, visited_children):
    return visited_children or node.text

node.text is almost always truthy, so the generic_visit I wrote in the description wouldn't be very useful.


DXR's regex language dialect

:-) I also wrote a regex parser in my regex-reversing library revex (visitor, grammar). DXR looks really cool; I'll have to try it out!

erikrose commented 7 years ago

NodeVisitor compatibility

I lean toward breaking NodeVisitor compatibility but mitigating it with a well-written upgrade guide. It's high time we got Sphinx working on Parsimonious, and somebody has a branch open that does it. That would be a good place to park it. Also, I've been recommending for some time that people pin the exact version of Parsimonious they're using until we get to 1.0; people who do that shouldn't have a problem. Of course, if one NodeVisitor can subclass the other and therefore the number of extra lines of code from keeping both isn't many, I could be persuaded to go the other way. Maybe we even call the new NodeVisitor "NodeVisitor" and the old one "OldNodeVisitor" and make people just change their imports.

erikrose commented 7 years ago

regex grammars

It's wonderful that we both have written these and that you used yours as an example in this PR. It sure makes it easy to understand where you're coming from. :-)

erikrose commented 7 years ago

Maybe we could split up this work.

I meant banging the new visitor against some existing grammars. But still, if you feel like doing the easy bits first and separately, we can merge those in before we have the ignoring, I think.

erikrose commented 7 years ago

This would substantially increase the memory usage of regex nodes

Good point. I think you're right, and I don't want to do that. It's also good to keep the bytes option open; I keep waffling on whether to support that or not. Python 2 can't die soon enough; then at least we don't have 4 slightly different semantics to contend with.

We could probably have our cake and eat it too if we kept emitting RegexNode from expressions.Regex but reached in and yanked the string out of it in the visitor, before passing it to the visitor method. (Someone who wants to play with groupings can override that visitor method to tell it to knock it off.) There are several factorings we could choose if we didn't want to hard-code a special case into the visitor, like having a for_visitor() method on Node and its subclasses. Makes me wish for a multiple-dispatch language so we wouldn't have to pollute a clean class, though. :-) Anyway, if that sounds reasonable, we can think about shapes.

There aren't any situations in which we'd regret having, say, the expr_name unavailable due to receiving text rather than a Node, are there? I haven't thought of any that are unavoidable, but I've only been at it a few minutes.

jonpry commented 5 years ago

It would be nice if OneOf expressions passed some information to the visitor about which rule succeeded. This would be extremely useful in codegen. Perhaps OneOf would create a tuple with an element for every rule. Then (A/B/C) would be parse to (None,B,None) if B matched.

davaya commented 4 years ago

I like the description of Lark as an "upstart" parser -- sometimes you need to get uppity to make progress. In any case, it would be nice if Specifying that nodes should be ignored became part of the codebase.

As for syntax, I don't see a problem with Lark's use of ! to preserve nodes and it's opinionated position that constants (anything known a-priori) are just noise in a parse tree. But that is the perspective of an information modeler, not a grammar maven. Any method of trimming nodes would be better than none.