antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License
10.15k stars 3.7k forks source link

[python] The Python grammars in this repo are in a terrible state! #3539

Closed kaby76 closed 3 days ago

kaby76 commented 1 year ago

There are 10 python grammars in this repo. I don't know where to even begin on what to pick and maintain because they are all terrible.

The plan to reorganize

kaby76 commented 1 year ago

The python/python3 grammar has a JavaScript port, but it is ripping slow in comparison with the CSharp and Java ports.

06/17-06:42:17 ~/issues/g4-3539/python/python3/Generated-CSharp
$ bash run.sh ../examples/base_events.py
CSharp 0 ../examples/base_events.py success 1.0922569
Total Time: 1.1610637
06/17-06:44:41 ~/issues/g4-3539/python/python3/Generated-CSharp
$ pushd
~/issues/g4-3539/python/python3/Generated-JavaScript ~/issues/g4-3539/python/python3/Generated-CSharp
06/17-06:44:44 ~/issues/g4-3539/python/python3/Generated-JavaScript
$ bash run.sh ../examples/base_events.py
JavaScript 0 ../examples/base_events.py success 101.734
Total Time: 101.746
06/17-06:46:30 ~/issues/g4-3539/python/python3/Generated-JavaScript

The JavaScript port is typically slower than CSharp by a factor of 5, not 100x's.

For example, abb:

$ bash run.sh ../examples/robdata.sys
CSharp 0 ../examples/robdata.sys success 0.0331986
Total Time: 0.0832029
06/17-07:06:31 ~/issues/g4-3539/abb/Generated-CSharp
$ pushd
~/issues/g4-3539/abb/Generated-JavaScript ~/issues/g4-3539/abb/Generated-CSharp ~/issues/g4-3539/python/python3/Generated-CSharp
06/17-07:06:34 ~/issues/g4-3539/abb/Generated-JavaScript
$ bash run.sh ../examples/robdata.sys
JavaScript 0 ../examples/robdata.sys success 0.098
Total Time: 0.105
06/17-07:06:36 ~/issues/g4-3539/abb/Generated-JavaScript
$

The JavaScript port should not be this slow in comparison to the CSharp port. The parser traces of the closure compuations are identical. I do notice several; issues:

1) this grammar does not use optimized Antlr4 syntax for expressions, which can cause the parse trees to have large single child chains; 2) the max-k for atom-expr is huge, and there is a large error count associated with the "trailers*" reference in the rule. I wonder whether there is a missing semantic predicate.

Decision  Rule                    Invocations  Time      Total-k  Max-k  Fallback  Ambiguities  Errors
...
155       atom_expr               2796         2.312377  4922     28     6         0            324
...

3) I wonder whether the JavaScript runtime handles "errors" efficiently.

kaby76 commented 1 year ago

Ther perf data was mislabeled. Here's what it should be. (trperf needs to be fixed.)

Decision  Rule                    Invocations  Time      Total-k  Max-k  Fallback  Ambiguities  Errors  Transitions
...
155       atom_expr               2796         6.195964  4922     28     6         0            0       324

What I'm thinking is that the parser reads too far ahead per trailer in the atom_expr rule. It requires context in the parse 6 times, too, but I don't think that is the problem--most are SLL transitions, not LL. It feels like there is an anti-pattern here.

pramit-j2-sl commented 3 days ago

I have two suggestions:

kaby76 commented 3 days ago

I'm going to close the issue I opened because the python grammars are actually in an "okay state", and they are getting better with @RobEin on top of it.

Create the final output in a single grammar file rather than two files.

This cannot work given the constraints.

The grammar has to be split because the lexer has modes. Antlr does not accept a combined grammar with lexer modes. Further, the lexer needs to be in "target agnostic format" because if it were not, there would be forking of the .g4s for every target type. This is exactly how we got into a bad state to start: people would modify one target and not maintain the other targets. With target agnostic, there is one grammar for all targets.

Fetch from actual Python repo to import gram files and convert them to antlr4-g4 grammar file.

That is the plan.