Open josepaiva94 opened 3 years ago
Done. Edited post.
The "official" grammar in Python.org doesn't have a *call*
named nonterminal, and neither does the current Python3 grammar. What are you expecting?
I do notice that the current Python3 grammar diverges from the "official" grammar in Python.org: the current grammar doesn't have "primary", which was added at some point between 3.3.7 and 3.9.6. python3.g4 was added in 2014. It may be time for a re-scrape of the grammar from the official source.
Expressions in the current grammar are not optimized, but that can be provided as an alternative grammar, derived automatically via grammar refactoring.
I would suggest a specific version should be applied to the grammar. The python/python3 grammars, of which there are seven!!, should be redone to be target independent using the "agnostic" approach.
I understand there is no *call*
named nonterminal, and that a function call is covered by an Atom_expr
rule.
However, do you mean it is expected to have a path containing these tree nodes
Expr_stmt
Testlist_star_expr
Test
Or_test
And_test
Not_test
Comparison
Expr
Xor_expr
And_expr
Shift_expr
Arith_expr
Term
Factor
Power
Atom_expr
in every function call? Does the path make any sense?
Why is there no grammar rule that would allow the Atom_expr
to be linked directly from Expr_stmt
?
@josepaiva94 The long chain of parse tree nodes is consistent with the 3.3.7 spec grammar or the latest spec grammar. There should be an optimized grammar, preferably the "main" grammar one chooses, but also the spec grammar available for folks who use tools to optimize the grammar in a number of other ways. Expression optimization should be part of that optimized grammar.
@studentmain Sorry, I haven't looked much at your PR. One reason is that I've been waiting for PR2223 which I did three weeks ago, before your PR was submitted. Neither @teverett nor @KvanTTT have reviewed or merged my PR, I assume they're busy with other things. The PR uses the "agnostic" approach, which is an alternative that doesn't require a forked nor pre-processed grammar, except the Dart target. I'm not sure how I stumbled on that blurb in the Antlr Python target, but it seems to work so far. The Dart target would be another candidate for fix-up with your preprocessor because it currently has to be foked--the "agnostic" solution does not work. The other reason is I think your PR had some tests failing. But, your PR seems right on for anything that currently doesn't use the "agnostic" approach.
@kaby76 Thanks. Is there any automatic tool to optimize grammars? One rule I see is: if there is an intermediate node p
in the parse tree's path from x
to y
such that p
has a single child, then we can "remove" p
from the path (i.e., make a direct link from parent of p
to child of p
). I believe something has already been developed in this direction, otherwise I can start a script.
@josepaiva94 My tool Trash is the only tool that I know that can do the unfolding. But you would have to tell what to unfold--it doesn't automatically detect single occurrences and unfold those. I didn't add that in because it seemed obvious to not write rules that have only a single-use. But, it does do an analysis on references on the symbols in RHS (it uses it to find issues in recursion, some of which Antlr detects--mutual left recursion--and others that it does not) which could be applied to undirected unfolding single references.
I tested a modified grammar with expr_stmt changed by hand (I haven't gotten around yet to porting trinsert
) with this new rule. Not sure what to call the transform other than for the moment a "short-circuit" because it adds an alt to an inner symbol that is on a single chain of parse tree nodes.
expr_stmt: atom_expr | testlist_star_expr (annassign | augassign (yield_expr|testlist) |
('=' (yield_expr|testlist_star_expr))*);
The tree is flatter, for sure, and all the examples pass.
$ trparse hw.py | trtree
Time to parse: 00:00:00.1355124
# tokens per sec = 44.276390942821465
( file_input
( stmt
( simple_stmt
( small_stmt
( expr_stmt
( atom_expr
( atom
( NAME i=0 txt=print tt=42 DEFAULT_TOKEN_CHANNEL
) )
( trailer
( OPEN_PAREN i=5 txt=( tt=54 DEFAULT_TOKEN_CHANNEL
)
( arglist
( argument
( test
( or_test
( and_test
( not_test
( comparison
( expr
( xor_expr
( and_expr
( shift_expr
( arith_expr
( term
( factor
( power
( atom_expr
( atom
( STRING i=6 txt=\"Hello, World!\" tt=3 DEFAULT_TOKEN_CHANNEL
) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )
( CLOSE_PAREN i=21 txt=) tt=55 DEFAULT_TOKEN_CHANNEL
) ) ) ) )
( NEWLINE i=22 txt=\r\n tt=41 DEFAULT_TOKEN_CHANNEL
) ) )
( EOF i=24 txt= tt=-1 DEFAULT_TOKEN_CHANNEL
) )
But, anecdotally, the modified grammar parser is slightly slower.
The parse tree obtained for Hello World! program in Python3
is the following
I'm using Java to parse and the grammar at https://github.com/antlr/grammars-v4/tree/master/python/python3
After checking the grammar more carefully, I see no function call expression.