antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License
10.2k stars 3.71k forks source link

Wrong Parse Tree for Python3 Hello World #2251

Open josepaiva94 opened 3 years ago

josepaiva94 commented 3 years ago

The parse tree obtained for Hello World! program in Python3

print("Hello, World!")

is the following

File_input
    Stmt
      Simple_stmt
        Small_stmt
          Expr_stmt
            Testlist_star_expr
              Test
                Or_test
                  And_test
                    Not_test
                      Comparison
                        Expr
                          Xor_expr
                            And_expr
                              Shift_expr
                                Arith_expr
                                  Term
                                    Factor
                                      Power
                                        Atom_expr
                                          Atom
                                            T:NAME[print]
                                          Trailer
                                            T:OPEN_PAREN[(]
                                            Arglist
                                              Argument
                                                Test
                                                  Or_test
                                                    And_test
                                                      Not_test
                                                        Comparison
                                                          Expr
                                                            Xor_expr
                                                              And_expr
                                                                Shift_expr
                                                                  Arith_expr
                                                                    Term
                                                                      Factor
                                                                        Power
                                                                          Atom_expr
                                                                            Atom
                                                                              T:STRING["Hello, World!"]
                                            T:CLOSE_PAREN[)]
        T:NEWLINE[
]
    T:EOF[<EOF>]

I'm using Java to parse and the grammar at https://github.com/antlr/grammars-v4/tree/master/python/python3

After checking the grammar more carefully, I see no function call expression.

josepaiva94 commented 3 years ago

Done. Edited post.

kaby76 commented 3 years ago

The "official" grammar in Python.org doesn't have a *call* named nonterminal, and neither does the current Python3 grammar. What are you expecting?

I do notice that the current Python3 grammar diverges from the "official" grammar in Python.org: the current grammar doesn't have "primary", which was added at some point between 3.3.7 and 3.9.6. python3.g4 was added in 2014. It may be time for a re-scrape of the grammar from the official source.

Expressions in the current grammar are not optimized, but that can be provided as an alternative grammar, derived automatically via grammar refactoring.

I would suggest a specific version should be applied to the grammar. The python/python3 grammars, of which there are seven!!, should be redone to be target independent using the "agnostic" approach.

josepaiva94 commented 3 years ago

I understand there is no *call* named nonterminal, and that a function call is covered by an Atom_expr rule.

However, do you mean it is expected to have a path containing these tree nodes

          Expr_stmt
            Testlist_star_expr
              Test
                Or_test
                  And_test
                    Not_test
                      Comparison
                        Expr
                          Xor_expr
                            And_expr
                              Shift_expr
                                Arith_expr
                                  Term
                                    Factor
                                      Power
                                        Atom_expr

in every function call? Does the path make any sense?

Why is there no grammar rule that would allow the Atom_expr to be linked directly from Expr_stmt?

kaby76 commented 3 years ago

@josepaiva94 The long chain of parse tree nodes is consistent with the 3.3.7 spec grammar or the latest spec grammar. There should be an optimized grammar, preferably the "main" grammar one chooses, but also the spec grammar available for folks who use tools to optimize the grammar in a number of other ways. Expression optimization should be part of that optimized grammar.

@studentmain Sorry, I haven't looked much at your PR. One reason is that I've been waiting for PR2223 which I did three weeks ago, before your PR was submitted. Neither @teverett nor @KvanTTT have reviewed or merged my PR, I assume they're busy with other things. The PR uses the "agnostic" approach, which is an alternative that doesn't require a forked nor pre-processed grammar, except the Dart target. I'm not sure how I stumbled on that blurb in the Antlr Python target, but it seems to work so far. The Dart target would be another candidate for fix-up with your preprocessor because it currently has to be foked--the "agnostic" solution does not work. The other reason is I think your PR had some tests failing. But, your PR seems right on for anything that currently doesn't use the "agnostic" approach.

josepaiva94 commented 3 years ago

@kaby76 Thanks. Is there any automatic tool to optimize grammars? One rule I see is: if there is an intermediate node p in the parse tree's path from x to ysuch that p has a single child, then we can "remove" p from the path (i.e., make a direct link from parent of p to child of p). I believe something has already been developed in this direction, otherwise I can start a script.

kaby76 commented 3 years ago

@josepaiva94 My tool Trash is the only tool that I know that can do the unfolding. But you would have to tell what to unfold--it doesn't automatically detect single occurrences and unfold those. I didn't add that in because it seemed obvious to not write rules that have only a single-use. But, it does do an analysis on references on the symbols in RHS (it uses it to find issues in recursion, some of which Antlr detects--mutual left recursion--and others that it does not) which could be applied to undirected unfolding single references.

kaby76 commented 3 years ago

I tested a modified grammar with expr_stmt changed by hand (I haven't gotten around yet to porting trinsert) with this new rule. Not sure what to call the transform other than for the moment a "short-circuit" because it adds an alt to an inner symbol that is on a single chain of parse tree nodes.

expr_stmt: atom_expr | testlist_star_expr (annassign | augassign (yield_expr|testlist) |
                     ('=' (yield_expr|testlist_star_expr))*);

The tree is flatter, for sure, and all the examples pass.

$ trparse hw.py | trtree
Time to parse: 00:00:00.1355124
# tokens per sec = 44.276390942821465

( file_input
  ( stmt
    ( simple_stmt
      ( small_stmt
    ( expr_stmt
      ( atom_expr
        ( atom
          ( NAME i=0 txt=print tt=42 DEFAULT_TOKEN_CHANNEL
        ) )
        ( trailer
          ( OPEN_PAREN i=5 txt=( tt=54 DEFAULT_TOKEN_CHANNEL
          )
          ( arglist
        ( argument
          ( test
            ( or_test
              ( and_test
            ( not_test
              ( comparison
                ( expr
                  ( xor_expr
                ( and_expr
                  ( shift_expr
                    ( arith_expr
                      ( term
                    ( factor
                      ( power
                        ( atom_expr
                          ( atom
                        ( STRING i=6 txt=\"Hello, World!\" tt=3 DEFAULT_TOKEN_CHANNEL
          ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )
          ( CLOSE_PAREN i=21 txt=) tt=55 DEFAULT_TOKEN_CHANNEL
      ) ) ) ) )
      ( NEWLINE i=22 txt=\r\n tt=41 DEFAULT_TOKEN_CHANNEL
  ) ) )
  ( EOF i=24 txt= tt=-1 DEFAULT_TOKEN_CHANNEL
) )

But, anecdotally, the modified grammar parser is slightly slower.