badicsalex / peginator

PEG parser generator for creating ASTs in Rust
MIT License
34 stars 3 forks source link

Precedence climbing #7

Open oovm opened 2 years ago

oovm commented 2 years ago

I wonder if there are plans to support precedence climbing like https://docs.rs/peg/latest/peg/#precedence-climbing

I know that peg means never left recursive, but this way of writing does have higher readability.

For traditional peg, this is equivalent to a two-phase parsing

stage-1 attempt to capture

expr = term {infix term}
term = {prefix} atom {postfix}
atom = '(' expr ')' | int | str

stage-2 to reorganization

Vec<atom|prefix|infix|postfix>using extern progress, e.g. the pratt parser.

The second stage can even be dynamically adjusted by the user(like swift/haskell), and static parsing (parser-embedded) is sufficient for most languages.

badicsalex commented 2 years ago

I would rather implement left-recursion support (with a directive at the recursion point). It's not that hard anyway, since memoization is already implemented.

(EDIT: I realized that a left-recursive grammar and a precedence climber would produce different parse trees. Also most things that would need left-recursion is already solved by the repetition/closure operator, as seen in the operator example)

badicsalex commented 2 years ago

Also I'm not really sure how the output of a precedence climber would look like in the AST.

Something like this?

enum Expression {
    Number(String),
    InfixExpression(InfixExpression),
    // could be also prefix, postfix, bracket, etc.
}

struct InfixExpression {
    left_side: Box<Expression>,
    operator: Operator,
    left_side: Box<Expression>,
}

enum Operator {
    Add,
    Sub,
    Mul,
    Div,
}

I'm also not really sure how this could be represented in the grammar in a readable way, so that it remains really intuitive what kind of AST will be generated.

oovm commented 2 years ago

I have a question, how should Sequence be handled?

https://github.com/badicsalex/peginator/blob/8dc7c9c442b16d7c14a1494c3a92d60059e50263/grammar.ebnf#L18

Is Sequence an atomic operation or an infix operation?

oovm commented 2 years ago

I think at least self can be introduced for judging direct left recursion.

@precedence
arithmetic =
    = x:self op:"+" y:self
    | x:self op:"-" y:self
    | x:self op:"*" y:self
    | x:self op:"/" y:self
    | x:self op:"^" y:self
    | x:number
    | "(" x:self ")"
    ;

This syntax does not changed, but cannot represent priority groups (like rust-peg).

However, the meaning of priority groups needs to be discussed, and it may be better to form a strict partial order of priorities.


EDIT: I thought about it on the way home, priority groups really don't make sense.

Because you can always define AddLike = '+' | '-', to achieve the same effect

oovm commented 2 years ago

Another question is what is the name of the variant in this case? There is no way to tag.

ANTLR uses # to mark branches, the usage might be

@precedence
arithmetic =
    @Add| x:self op:"+" y:self
    @Sub| x:self op:"-" y:self
    @Mul| x:self op:"*" y:self
    @Div| x:self op:"/" y:self
    @Pow| x:self op:"^" y:self
    @Num| x:number
    @Gro| "(" x:self ")"
    ;

Here I introduced an operator like @VAR_NAME|, It looks weird, but the name of the enum can be determined by analogy with the @:Rule and @NAME:Rule operator.

I think we can think of @ as a macro that acts on |

oovm commented 2 years ago

Another problem is the ternary operator and the sequence operator.

@precedence
arithmetic =
    | x:self op1:"?" y:self op2:":" z:self
    | x:self y:self 
    ;

The writing is consistent but logically unreasonable, and special treatment is required for this writing.

badicsalex commented 2 years ago

Is Sequence an atomic operation or an infix operation?

I'd consider Sequence a special case of infix where there is no actual operator token.

but cannot represent priority groups (like rust-peg).

Since it's such a special syntax, it might as well copy most of the rust-peg syntax with some modification so it looks more like the rest of the syntax. Something like this maybe (with a ton of restrictions, e.g. only specific forms and string literals are allowed):

@precedence
Expression = (
    Add = (@) "+" @;
    Sub = (@) "-" @;
    --
    Mul =  (@) "*" @;
    Div = (@) "/" @;
    --
    Power = @ "^" (@);
    --
    Number = @:NUMBER;
    Group = "(" @ ")";
);

ternary operator

Apparently that's a solvable problem.

badicsalex commented 2 years ago

Then again, at that point it might be better to create a separate library (or use an existing one) that does precedence climbing simply and reliably, and then just pulling that in with @extern

badicsalex commented 2 years ago

Again, with left recursion you can have the following grammar:

Expression =
    left:*Expression (op:Add | op:Sub) right:*Expression |
    left:*Expression (op:Mul | op:Div) right:*Expression |
    left:*Expression op:Power right:*Expression |
    "(" atom:*Expression ")" |
    atom:Number ;

Add = "+";
Sub = "-";
Mul= "*";
Div = "/";
Power = "^";

Resulting types:

struct Expression {
    left: Box<Expression>,
    right: Box<Expression>,
    atom: Box<Expression_atom>,
    op: Expression_op,
}

enum Expression_atom {
    Expression(Box<Expression>),
    Number(Number),
}

enum Expression_op {
    Add(Add), Sub(Sub), ...
}

I think it would be just as easily readable as the precedence climbing proposals above. And it's way more general too, you can fit the ?: and array indexing in nicely.

P.S.: CPython's own parser is also written this way: https://medium.com/@gvanrossum_83706/peg-at-the-core-developer-sprint-8b23677b91e6#4a74 https://github.com/we-like-parsers/pegen/blob/main/data/python.gram#L125

badicsalex commented 2 years ago

I've been playing with left recursion this afternoon, and it's very difficult to get right. See a left-recursive calculator here: https://github.com/badicsalex/peginator/blob/eade5ae60f777e101df173ac95b5509ec235e55f/peginator_test/src/calculator_example/grammar.ebnf

The main problem is that you cannot use Expression as the left side of multiplications, it must be a Term, so you need to have a Term type that only allows other multiplications and things above it in precedence (e.g. power, or groups).

To make it a tiny bit prettier, I tried using >Term instead of @:Term in Expression, but that completely broke the left recursion cache handling (going through Expression and Term have different cache entries). This is why I put a warning in the readme.

But, on the bright side, even though types look horrible when used as literals, the code itself can be very compact, non-redundant and all around pretty.

oovm commented 2 years ago

Operation precedence, and operation associativity, are very special grammar rules.

Widely used in programming languages, it is worth using a special syntax.

The operation priority of the programming language has more than ten levels, if such an expression tree is used.

It is not very convenient to nested match the required level, which is inconsistent with the ADT model of rust.

badicsalex commented 2 years ago

Yeah, you convinced me. I'll definitely implement this when I have the time. Right now a bunch of other projects have more priority.

PRs are welcome too, of course :)