dlang-community / Pegged

A Parsing Expression Grammar (PEG) module, using the D programming language.
534 stars 66 forks source link

Type checking and automatic trimming #129

Closed Shorttail closed 6 years ago

Shorttail commented 10 years ago

I'm trying to type check my parse tree. It's based on a subset of the formal D grammar, and thus it can be very difficult to determine what something is by simply looking at a node's name. I'm trying out semantic actions, since they are applied per rule, and so I will have an easy time determining which rule was taken. Right now I'm trying to determine if a Decl is a function:

Decl < StorageClasses Decl / BasicType Declarators :";" / BasicType Declarator FunctionBody { derp }

It unfortunately applies derp to FunctionBody and not Decl. How do I apply it to Decl without touching all other possible Decl? What derp does it change the name to "function". Perhaps there's a better alternative? Adding an extra child to Decl with the name function? That too would require me to touch Decl and not just its children.

I can do all of this once the tree has been generated, but it seems like a more obvious solution to use semantic actions, since they are part of the grammar.

Also, how do I kill off automatic trimming? For instance:

DeclDef < Declaration

Declaration < Decl

In this case, DeclDef is the name of the node, and both Declaration and Decl are skipped in the tree.

Shorttail commented 10 years ago

Okay, I found out how to keep nodes, but it's a little bothersome if I have to add ^ to almost every grammar rule.

PhilippeSigaud commented 10 years ago

Decl < StorageClasses Decl

/ BasicType Declarators :";" / BasicType Declarator FunctionBody { derp }

It unfortunately applies derp to FunctionBody and not Decl. How do I apply it to Decl without touching all other possible Decl? What derp does it change the name to "function". Perhaps there's a better alternative? Adding an extra child to Decl with the name function? That too would require me to touch Decl and not just its children.

To apply derp to the result of Decl you can either use a special arrow: <{ derp }, or enclose your entire rule in parenthesis:

Decl <{derp}
    StorageClasses Decl
  / BasicType Declarators :";"
  / BasicType Declarator FunctionBody

or

Decl < (StorageClasses Decl
         / BasicType Declarators :";"
         / BasicType Declarator FunctionBody) { derp }

Also, how do I kill off automatic trimming? For instance:

DeclDef < Declaration

Declaration < Decl

In this case, DeclDef is the name of the node, and both Declaration and Decl are skipped in the tree.

There should not be any trimming there. By default, all parse nodes from the called grammar are kept. Are you calling your grammar from another grammar?

Shorttail commented 10 years ago

With following rules:

Module < DeclDefs eoi

DeclDefs < DeclDef*

DeclDef < Declaration

Declaration < Decl

D [0, 25]["int", "main", "return", "0"] +-D.Module [0, 25]["int", "main", "return", "0"] +-D.DeclDefs [0, 25]["int", "main", "return", "0"] +-D.DeclDef [0, 25]["int", "main", "return", "0"] +-D.BasicType [0, 4]["int"] | +-D.BasicTypeX [0, 3]["int"] +-D.Declarator [4, 10]["main"] | +-D.Identifier [4, 8]["main"] +-D.FunctionBody [10, 25]["return", "0"] +-D.StatementList [13, 23]["return", "0"] +-D.Statement [13, 23]["return", "0"] snip

With following rules:

Module <^ DeclDefs eoi

DeclDefs <^ DeclDef*

DeclDef <^ Declaration

Declaration <^ Decl

D [0, 25]["int", "main", "return", "0"] +-D.Module [0, 25]["int", "main", "return", "0"] +-D.DeclDefs [0, 25]["int", "main", "return", "0"] +-D.DeclDef [0, 25]["int", "main", "return", "0"] +-D.Declaration [0, 25]["int", "main", "return", "0"] +-D.Decl [0, 25]["int", "main", "return", "0"] +-D.BasicType [0, 4]["int"] | +-D.BasicTypeX [0, 3]["int"] +-D.Declarator [4, 10]["main"] | +-D.Identifier [4, 8]["main"] +-D.BlockStatement [10, 25]["return", "0"] +-D.StatementList [13, 23]["return", "0"] +-D.Statement [13, 23]["return", "0"] snip

Clearly it has some effect to add the ^. The full D grammar has more rules than this, but I still see rules that contain only one other rule (no such in the full grammar) getting trimmed all the time.

I parse the grammar and dump it in a file. There are no other grammars present.

Also, the <{ derp } applies everything to decl, not to just one rule of decl. Are all decl functions? Either way I'd still like to mark nodes without having to do insane tree crawling post creation.

PhilippeSigaud commented 10 years ago

I'll look into your tree-trimming problem.

Also, the <{ derp } applies everything to decl, not to just one rule of decl. Are all decl functions? Either way I'd still like to mark nodes without having to do insane tree crawling post creation.

If you want to apply derp only to one part of the alternative, then enclose this part between parenthesis:

Rule <- Rule 1 / (Rule2 Rule3) {action} / Rule4

There action will be applied on the result of (Rule2 Rule3), not on Rule1 nor Rule4.

PhilippeSigaud commented 10 years ago

I did some changed in the underlying engine for named rules. That seems to clean the issue for me. Can you pull the latest version and tell me if that's OK with you?

Shorttail commented 10 years ago

My output for arithmetic expression explodes now, which is great. Thanks.

Semantic actions I can't get to work though. Here's a snip of the output for parsing a function:

+-D.Declaration [0, 25]["int", "main", "return", "0"] +-D.Decl [0, 25]["int", "main", "return", "0"] +-D.BasicType [0, 4]["int"] | +-D.BasicTypeX [0, 3]["int"] +-D.Declarator [4, 10]["main"] | +-D.Identifier [4, 8]["main"] +-D.FunctionBody [10, 25]["return", "0"] +-D.BlockStatement [10, 25]["return", "0"] +-D.StatementList [13, 23]["return", "0"] +-D.Statement [13, 23]["return", "0"]

Now, snipping Decl:

Decl < BasicType Declarator FunctionBody {derp} < removes FunctionBody, its child appearing instead Decl < (BasicType Declarator FunctionBody) {derp} < removes Decl, its children taking its place

My semantic action looks like this:

T derp(T)(T t) { t.name = "lala"; return t; }

PhilippeSigaud commented 10 years ago

FunctionBody {derp} < removes BlockStatement, its child appearing instead Decl < (BasicType Declarator FunctionBody) {derp} < does nothing

I can reproduce it:

Test:
    A <- (B C) { foo }
    B <- 'b'
    C <- 'c'

(B C) { foo } is correctly parsed by the Pegged grammar, but there is no action!(foo) in the generated code (whereas it works for B C { foo }.

That's clearly a bug, thanks for finding it!

So as to track issues more clearly with commits and unit tests, could you open a new issue specifically for this, please?

PhilippeSigaud commented 10 years ago

OK, testing again on another computer, I get the correct input. Here is my test grammar:

 Test:
        A <- (D C B) 
           / ((B C D) { foo })
        B <- 'b'
        C <- 'c'
        D <- 'd'

So the DCB branch has no {foo}, whereas {foo} get invoked with BCD.

With the following foo:

T foo(T)(T t)
{
    t.name = "Test.foo";
    return t;
}

I get correct parse trees:

With "dcb":

Test [0, 3]["d", "c", "b"]
 +-Test.A [0, 3]["d", "c", "b"]
    +-Test.D [0, 1]["d"]
    +-Test.C [1, 2]["c"]
    +-Test.B [2, 3]["b"]

With "bcd":

Test [0, 3]["b", "c", "d"]
 +-Test.A [0, 3]["b", "c", "d"]
    +-Test.foo [0, 3]["b", "c", "d"]
       +-Test.B [0, 1]["b"]
       +-Test.C [1, 2]["c"]
       +-Test.D [2, 3]["d"]

Where a Test.foo node was correctly put in the final result.

Note that using t.name == "foo"; does not work. It does create a node with the name "foo", but since nodes external to the grammar are cut from the final output, unless explicitly prefixed by ^, this node then disappears.

Can you test your own grammar with your derp using t.name = "GrammarName.RuleName";?

Shorttail commented 10 years ago

Hah, that works. D.lala is in the tree alright. I guess there's no bug then.

PhilippeSigaud commented 10 years ago

Hah, that works. D.lala is in the tree alright. I guess there's no bug then.

OK, cool.

Another way to get a specially named rule is simply to create such a rule. Here is what I mean: imagine you have:

Root <- Rule1 Rule2
        / Rule3 Rule4 Rule5 # Here I want to flag that
        / Rule5 Rule6 Rule7

Then a simple way to extract the more significant Rule3 Rule4 Rule5 is to make it its own rule:

Root <- Rule1 Rule2
        / Flagged
        / Rule5 Rule6 Rule7
Flagged <- Rule3 Rule4 Rule5