AiStudent opened 2 months ago
@AiStudent Thanks for this PR. My apologies for not evaluating it earlier.
I checked it out now and tried to experiment with it.
But the dependency `ply` seems hard to install, and it already looks badly maintained.
```
Error: python-ply has been disabled because it does not meet homebrew/core's requirements for Python library formulae! It was disabled on 2024-08-15.
```
> SLY is the more modern project and has more interesting features (such as recently added support for some EBNF grammar extensions). PLY was originally written in 2001 ... there will be no future pip-installable version of PLY. So, if people want to use it, they'd need to copy the code here.
It does not seem wise to add a new backend for a parser generator that is already phased out.
Could you instead target a parser generator that has a more promising future?
SLY, by the same author, also does not seem well maintained; there is no release on the Python package index: https://pypi.org/project/sly/
CC @aarneranta
Alternatives?
Retargeted to Lark with LALR. @andreasabel
Testsuite results: The example grammars work (no layout supported and `internal` has no effect). The regression tests yield:
Failures:
Errors:
Great! So the failure list is unchanged over the PLY target. I'll have another look.
> So the failure list is unchanged over the PLY target.
Note it is slightly different: 249_unicode -> 358_MixFixLists for example.
N.B. The CI failures are caused by dependencies:
> Testsuite results: The example grammars work (no layout supported and `internal` has no effect). The regression tests yield:
I went through all of the failed tests:
* expected: `bar 42 . 42 . (41 + 1)`
* but got: `bar 42 . 42 . 41 + 1`
Since `bar` takes an `Exp2` list, an `Exp1` like `41 + 1` needs to be printed in parentheses.
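For reference, the usual remedy in a generated printer is to thread a precedence level through the rendering and parenthesize any subterm whose own level is too low. A minimal sketch of the idea (hypothetical names, not the backend's actual code):

```python
def paren(rendered: str, level: int, required: int) -> str:
    """Parenthesize a rendered subexpression whose precedence level
    is lower than the level its context requires."""
    return f"({rendered})" if level < required else rendered

# `41 + 1` is an Exp1 (level 1), but a list of Exp2 requires level 2:
print(paren("41 + 1", 1, 2))  # -> (41 + 1)
print(paren("42", 2, 2))      # -> 42
```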
It seems that whitespace in separators is not honored by the printer:
* expected: `1 2 3 4 5 6 7 8 9`
* but got: `1 2 3 4 5 6 7 8 9`
The generated printer seems to emit too many spaces:
-"A" . "B" . "QED" . "line1\n\tline2" . "\"%s\" is not an int" . semi-exprs: 1 + 1;
+"A" . "B" . "QED" . "line1\n\tline2" . "\"%s\" is not an int" . semi-exprs: 1 + 1;
* expected: `2 ^ (2 ** 2)`
* but got: `2 ** (2 ** 2)`
Grammar is:
```
Eexp. Exp ::= Exp "^" Exp1 ;
EExp. Exp ::= Exp "**" Exp1 ;
```
The reason for failure is that the generated method names in `ParsingDefs.py` lose the case:
grammar = r"""
...
?exp: exp "^" exp1 -> r_eexp
| exp "**" exp1 -> r_eexp
| exp1
...
"""
...
#transformer
class TreeTransformer(Transformer):
@v_args(inline=True)
def r_eexp(self, exp_1, exp_2):
return Eexp(exp_1, exp_2)
@v_args(inline=True)
def r_eexp(self, exp_1, exp_2):
return EExp(exp_1, exp_2)
Same problem as in 278_Keywords (see below).
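One way out is to make the lowercasing injective instead of lossy, for instance by encoding uppercase letters, with a collision check as a safety net. A sketch (hypothetical helper, not the PR's actual code):

```python
def rule_alias(label: str, used: set) -> str:
    """Map a BNFC label to a unique lowercase Lark alias. Uppercase
    letters become '_' + lowercase, so 'Eexp' and 'EExp' stay distinct."""
    base = "r_" + "".join("_" + c.lower() if c.isupper() else c for c in label)
    name, i = base, 1
    while name in used:  # safety net for residual collisions
        name, i = f"{base}_{i}", i + 1
    used.add(name)
    return name

used = set()
print(rule_alias("Eexp", used), rule_alias("EExp", used))  # r__eexp r__e_exp
```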
Clashes with Python's `True` and `False` are not avoided.
Missing Python module name sanitization:
```python
from bnfcPyGen222_Integer-List-1.ParsingDefs import *
```
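Module names could be sanitized in the same pass by replacing every character that is illegal in a Python identifier; a sketch under that assumption:

```python
import re

def module_name(raw: str) -> str:
    """Turn an arbitrary grammar name into a legal Python module name."""
    name = re.sub(r"\W", "_", raw)  # '-' and friends become '_'
    return "_" + name if name[0].isdigit() else name

print(module_name("bnfcPyGen222_Integer-List-1"))  # bnfcPyGen222_Integer_List_1
```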
The grammar has:
```
token SpaceInAlts '\'' ["01x "]* '\'' ;
token Triple upper letter ["]-\'"] '\'';
token Name (char - [ "(){};.@\"" ] - [ " \n\t" ]) + ;
```
This is translated to:
```
SPACEINALTS: /\'(\ |0|1|x)*\'/
TRIPLE: /[A-Z][A-Za-z](\'|\-|\])\'/
NAME: /[^\t\n\ \"\(\)\.\;\@\{\}]+/
```
Start symbol is:
```
Main ::= SpaceInAlts Name "/\\" Name "and" Triple;
```
Input is:
```
' x10 0' [x:4] /\ 'funky+identifier-umlaute:äöüß//\%#FOO and Un]'
```
Error is:
```
Unexpected token Token('NAME', "'") at line 1, column 1.
Expected one of:
	* SPACEINALTS
```
I can get the test accepted if I give SPACEINALTS and TRIPLE a higher priority than NAME:
```
SPACEINALTS.1: /\'(\ |0|1|x)*\'/
TRIPLE.1: /[A-Z][A-Za-z](\'|\-|\])\'/
NAME: /[^\t\n\ \"\(\)\.\;\@\{\}]+/
```
That is weird, because the docs (https://lark-parser.readthedocs.io/en/stable/grammar.html#notes-for-when-using-a-lexer) say that the order is:
> Literals are matched according to the following precedence:
>
> 1. Highest priority first (priority is specified as: TERM.number: …)
> 2. Length of match (for regexps, the longest theoretical match is used)
> 3. Length of literal / pattern definition
> 4. Name
So if there is just one priority, the length of the match should win. Clearly, SPACEINALTS can match a longer string (`' x10 0'`) than NAME (`'`).
Is this a bug in Lark or in its docs?
- Parameterized tests:Python:Python:278_Keywords
This testcase aims at name clashes with the host language:
bnfcPyGentest/Absyn.py", line 143
class False:
^^^^^
SyntaxError: invalid syntax
You need to make sure you sanitise Python keywords and reserved identifiers.
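The standard library already knows the full list: `keyword.iskeyword` covers `True`, `False`, `None`, and the rest, so the generator only needs a renaming convention, e.g. a trailing underscore (the convention is a suggestion, not what the PR does):

```python
import keyword

def class_name(label: str) -> str:
    """Append '_' to AST class names that collide with Python keywords."""
    return label + "_" if keyword.iskeyword(label) else label

print(class_name("False"), class_name("Foo"))  # False_ Foo
```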
Running this test reports:
```
Unexpected token Token('MYPOSID', 'Foo') at line 1, column 4160.
Expected one of:
	* IDENT
```
The grammar has start symbol:
```
Init. Main ::= ... Ident MyId MyPosId ...
```
and the input is:
```
... Foo_ Bar1 Bar! ...
```
The generated lexer lexes `Foo_` as MYPOSID, probably because the regex generated for IDENT is not correct. IDENT covers valid Haskell identifiers, which include underscores.
```
MYID: /[A-Za-z](\d|[A-Za-z])*/
MYPOSID: /[A-Za-z](\!|(\d|[A-Za-z]))*/
IDENT: /[A-Za-z]\w*/
```
This produces some ill-formed Python:
ParsingDefs.py", line 104, in d2_r_esqrt4
return where(where(exp_))
^^^^^
NameError: name 'where' is not defined
Thanks for the extensive feedback. Currently there exist at least two issues.
Testsuite results:
It seems that the length of the regex decides the precedence, not the length of the matched tokens. In the test program below I try to tokenize `"ab"` with the terminals A and AB:
```python
from lark import Lark

grammar = r"""
start: A
     | AB
A: /a(c)*/
AB: /ab/
"""

parser = Lark(grammar, start='start', parser='lalr', lexer='basic', priority='auto')
print(list(parser.lex("ab")))
```
Which yields:
```
lark.exceptions.UnexpectedCharacters: No terminal matches 'b' in the current parser context, at line 1 col 2

ab
 ^
Expected one of:
	* A
	* AB

Previous tokens: Token('A', 'a')
```
In order to make AB have higher precedence, one may add a priority level number, such as `AB.1`, make the regex of AB longer than that of A, like `AB: /ab(c)*/`, or remove `(c)*` from A. I have not found a way to make the lexer use the longest matched expression; one alternative, for the one who dares, seems to be to use Lark's custom lexer option.
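For what it's worth, the custom-lexer hook (a `Lexer` subclass passed as `lexer=` together with `parser='lalr'`) seems sufficient for true longest-match behavior. A rough sketch, assuming the terminal regexes can be read off `lexer_conf.terminals`, and ignoring `%ignore` handling and line/column tracking:

```python
import re
from lark import Lark, Token
from lark.lexer import Lexer

grammar = r"""
start: A
     | AB
A: /a(c)*/
AB: /ab/
"""

class LongestMatchLexer(Lexer):
    """Toy longest-match lexer: at each position, try every terminal
    and emit the longest match (ties go to the first declaration)."""
    def __init__(self, lexer_conf):
        self.patterns = [(t.name, re.compile(t.pattern.to_regexp()))
                         for t in lexer_conf.terminals]

    def lex(self, data):
        pos = 0
        while pos < len(data):
            if data[pos].isspace():  # crude whitespace skipping
                pos += 1
                continue
            name, match = None, None
            for term, rx in self.patterns:
                m = rx.match(data, pos)
                if m and (match is None or m.end() > match.end()):
                    name, match = term, m  # strictly longer match wins
            if match is None or match.end() == pos:
                raise ValueError(f"no terminal matches {data[pos]!r} at {pos}")
            yield Token(name, match.group(0))
            pos = match.end()

parser = Lark(grammar, start='start', parser='lalr', lexer=LongestMatchLexer)
print(parser.parse("ab").pretty())  # now AB wins over A
```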
@AiStudent wrote:
> It seems that the length of the regex decides the precedence, not the length of the matched tokens.
But this then seems to be a bug in Lark, at least according to the specification: https://lark-parser.readthedocs.io/en/stable/grammar.html#notes-for-when-using-a-lexer
> Literals are matched according to the following precedence:
>
> - Highest priority first (priority is specified as: TERM.number: …)
> - Length of match (for regexps, the longest theoretical match is used)

Or maybe this is what is meant by

> (for regexps, the longest theoretical match is used)
In any case, not prioritising the actual longest match seems awkward and not in line with standard lexing tools. I wonder whether others have bumped into this issue. Is there any discussion of this issue on bug trackers etc.?
If you have really nailed the issue, it might be worthwhile raising it to the developers of LARK.
@AiStudent wrote:
> Recursion error, a list with 10k elements breaks the recursive list concatenation.
Interesting, is this because Python has a limit on recursion depth?
Anyway, lots of concatenations are usually done not directly but via concepts like `StringBuilder`/`StringBuffer`, for performance reasons. (In Haskell, one can use difference lists; that's what the Haskell backend does.)
Is there a blessed way in Python for assembling strings from many pieces?
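For reference, the blessed idioms are to collect the pieces in a list and `"".join` them at the end, or to write into an `io.StringIO` buffer; both are linear overall, whereas repeated concatenation of a long string can degrade to quadratic:

```python
import io

pieces = [str(i) for i in range(10000)]

# Idiom 1: join a list of fragments in one pass.
s1 = " ".join(pieces)

# Idiom 2: an explicit buffer, the closest analogue of StringBuilder.
buf = io.StringIO()
for p in pieces:
    buf.write(p)
s2 = buf.getvalue()
```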
Python backend using Python Lex Yacc, PLY.
1 - Requires Python 3.10+ and `ply`.
2 - Getting-started documentation added to `docs/user_guide.rst`.
3 - I'll not pledge to be an active maintainer for 3 years, but who really knows.
4 - Questions (especially tricky or critical) are welcome.
Testsuite results: The example grammars work (no layout supported). The regression tests yield: