[python] The Python grammars in this repo are in a terrible state!

kaby76 commented 1 year ago

There are 10 python grammars in this repo. I don't know where to even begin on what to pick and maintain because they are all terrible.

It's disorganized. Which grammar is "the one" to use and maintain?
The python3/ grammar is behind the current version of Python
Some of the grammars are poorly tested. E.g., python3alt has one--yes, one--test! python3-cpp has no tests. python3-ty has no tests. python3-py has two glorious tests.
There are forked grammars for different targets (python2-js, python3-cpp, python3-py, python3-ts). How far are they behind python2 or python3/ or is that python/?? (Maybe the target-specific grammars are ahead in features from these three.)
There are probably a dozen or two open issues for the Python grammars.
Targets are missing for python, python2, and python3 grammars. For example, there is no python3/Cpp. By the way, does python3-cpp/ work? Are the grammars python3/ and python3-cpp/ the same even if we take into account the split/combined difference?

The plan to reorganize

Nuke target-specific grammars like python3-cpp.
Nuke python3-without-actions. It's very unlikely that that grammar works for any reasonable set of input, so we should not pretend that a grammar without actions is going to work.
Nuke python3alt. Bart hasn't updated this for 2 years, and it's far behind the current version of Python3.
Nuke tiny-python/. This grammar hasn't been updated for 3 or 4 years. It contains a version that have actions and one that doesn't. It's really impossible to know what to make of this grammar.
Decide what to do of the remaining grammars (python, python2, python3).
Make ports for all targets.

kaby76 commented 1 year ago

The python/python3 grammar has a JavaScript port, but it is ripping slow in comparison with the CSharp and Java ports.

06/17-06:42:17 ~/issues/g4-3539/python/python3/Generated-CSharp
$ bash run.sh ../examples/base_events.py
CSharp 0 ../examples/base_events.py success 1.0922569
Total Time: 1.1610637
06/17-06:44:41 ~/issues/g4-3539/python/python3/Generated-CSharp
$ pushd
~/issues/g4-3539/python/python3/Generated-JavaScript ~/issues/g4-3539/python/python3/Generated-CSharp
06/17-06:44:44 ~/issues/g4-3539/python/python3/Generated-JavaScript
$ bash run.sh ../examples/base_events.py
JavaScript 0 ../examples/base_events.py success 101.734
Total Time: 101.746
06/17-06:46:30 ~/issues/g4-3539/python/python3/Generated-JavaScript

The JavaScript port is typically slower than CSharp by a factor of 5, not 100x's.

For example, abb:

$ bash run.sh ../examples/robdata.sys
CSharp 0 ../examples/robdata.sys success 0.0331986
Total Time: 0.0832029
06/17-07:06:31 ~/issues/g4-3539/abb/Generated-CSharp
$ pushd
~/issues/g4-3539/abb/Generated-JavaScript ~/issues/g4-3539/abb/Generated-CSharp ~/issues/g4-3539/python/python3/Generated-CSharp
06/17-07:06:34 ~/issues/g4-3539/abb/Generated-JavaScript
$ bash run.sh ../examples/robdata.sys
JavaScript 0 ../examples/robdata.sys success 0.098
Total Time: 0.105
06/17-07:06:36 ~/issues/g4-3539/abb/Generated-JavaScript
$

The JavaScript port should not be this slow in comparison to the CSharp port. The parser traces of the closure compuations are identical. I do notice several; issues:

1) this grammar does not use optimized Antlr4 syntax for expressions, which can cause the parse trees to have large single child chains; 2) the max-k for atom-expr is huge, and there is a large error count associated with the "trailers*" reference in the rule. I wonder whether there is a missing semantic predicate.

Decision  Rule                    Invocations  Time      Total-k  Max-k  Fallback  Ambiguities  Errors
...
155       atom_expr               2796         2.312377  4922     28     6         0            324
...

3) I wonder whether the JavaScript runtime handles "errors" efficiently.

kaby76 commented 1 year ago

Ther perf data was mislabeled. Here's what it should be. (trperf needs to be fixed.)

Decision  Rule                    Invocations  Time      Total-k  Max-k  Fallback  Ambiguities  Errors  Transitions
...
155       atom_expr               2796         6.195964  4922     28     6         0            0       324

What I'm thinking is that the parser reads too far ahead per trailer in the atom_expr rule. It requires context in the parse 6 times, too, but I don't think that is the problem--most are SLL transitions, not LL. It feels like there is an anti-pattern here.

pramit-j2-sl commented 3 days ago

I have two suggestions:

Create the final output in a single grammar file, rather than two files.
Fetch from actual Python repo to import gram files and convert them to antlr4-g4 grammar file.

kaby76 commented 3 days ago

I'm going to close the issue I opened because the python grammars are actually in an "okay state", and they are getting better with @RobEin on top of it.

Create the final output in a single grammar file rather than two files.

This cannot work given the constraints.

The grammar has to be split because the lexer has modes. Antlr does not accept a combined grammar with lexer modes. Further, the lexer needs to be in "target agnostic format" because if it were not, there would be forking of the .g4s for every target type. This is exactly how we got into a bad state to start: people would modify one target and not maintain the other targets. With target agnostic, there is one grammar for all targets.

Fetch from actual Python repo to import gram files and convert them to antlr4-g4 grammar file.

That is the plan.

antlr / grammars-v4

[python] The Python grammars in this repo are in a terrible state! #3539

The plan to reorganize