BNFC / bnfc

BNF Converter
http://bnfc.digitalgrammars.com/

Python backend #485

Open AiStudent opened 2 months ago

AiStudent commented 2 months ago

Python backend using Python Lex-Yacc (PLY).

1 - Requires Python 3.10+ and ply.

2 - Getting started documentation added to docs/user_guide.rst.

3 - I'll not pledge to be an active maintainer for 3 years, but who really knows.

4 - Questions (especially tricky or critical) are welcome.

Testsuite results: The example grammars work (no layout supported). The regression tests yield:

andreasabel commented 3 weeks ago

@AiStudent Thanks for this PR. My apologies for not evaluating it earlier. I checked it out now and tried to experiment with it. But the dependency ply seems hard to install, and it already looks badly maintained.

It does not seem wise to add a new backend for a parser generator that is already phased out.

Could you instead target a parser generator that has a more promising future?

SLY by the same author also does not seem well maintained; there is no release on the Python package index: https://pypi.org/project/sly/

CC @aarneranta

andreasabel commented 3 weeks ago

Alternatives?

AiStudent commented 2 weeks ago

Retargeted to Lark with LALR. @andreasabel

Testsuite results: The example grammars work (no layout supported and internal has no effect). The regression tests yield:

Failures:

Errors:

andreasabel commented 2 weeks ago

Great! So the failure list is unchanged over the PLY target. I'll have another look.

AiStudent commented 2 weeks ago

So the failure list is unchanged over the PLY target.

Note it is slightly different: 249_unicode -> 358_MixFixLists for example.

andreasabel commented 2 weeks ago

N.B. The CI failures are caused by dependencies:

andreasabel commented 2 weeks ago

Testsuite results: The example grammars work (no layout supported and internal has no effect). The regression tests yield:

I went through all of the failed tests:

Parameterized tests:Python:Python:100_coercion_lists

* expected: bar 42 . 42 . (41 + 1)

* but got:  bar 42 . 42 . 41 + 1

Since bar takes an Exp2 list, an Exp1 like 41 + 1 needs to be printed in parentheses.
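
A minimal sketch of the precedence-aware printing this calls for (the Node type, the prt helper, and the concrete levels are hypothetical illustrations, not the generated code):

from dataclasses import dataclass

@dataclass
class Node:
    text: str    # rendering of the node without outer parentheses
    level: int   # precedence level of the production that built the node

def prt(node: Node, required_level: int) -> str:
    # parenthesize when the node's own level is lower than its context requires
    return f"({node.text})" if node.level < required_level else node.text

# bar takes an Exp2 list, so each element is printed with required_level = 2;
# 41 + 1 was parsed as an Exp1 (level 1), hence it gets parentheses:
print("bar " + " . ".join(prt(e, 2) for e in [Node("42", 2), Node("42", 2), Node("41 + 1", 1)]))
# -> bar 42 . 42 . (41 + 1)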

Parameterized tests:Python:Python:70_WhiteSpaceSeparator

It seems that whitespace in separators is not honored by the printer (illustrated after the expected/actual output below):

* expected: 1  2  3  4  5  6  7  8  9

* but got:  1 2 3 4 5 6 7 8 9
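
Assuming the grammar declares the list separator with two embedded spaces (a guess based on the expected output), the printer should emit that separator string verbatim instead of normalizing it to a single space:

elems = [str(i) for i in range(1, 10)]
print("  ".join(elems))  # expected: the declared separator used verbatim
print(" ".join(elems))   # current behavior: collapsed to a single space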

Parameterized tests:Python:Python:358_MixFixLists

The generated printer seems to emit too many spaces:

-"A" . "B" . "QED" . "line1\n\tline2" . "\"%s\" is not an int" . semi-exprs: 1 + 1;
+"A"  . "B"  . "QED"  . "line1\n\tline2"  . "\"%s\" is not an int"  . semi-exprs: 1 + 1;

Parameterized tests:Python:Python:479_LabelsCaseSensitive

* expected: 2 ^ (2 ** 2)

* but got:  2 ** (2 ** 2)

Grammar is:

Eexp.  Exp  ::= Exp "^"  Exp1 ;
EExp.  Exp  ::= Exp "**" Exp1 ;

The reason for the failure is that the generated method names in ParsingDefs.py lose the case distinction; a possible disambiguation is sketched after the snippet:

grammar = r"""
...
  ?exp: exp "^" exp1 -> r_eexp
  | exp "**" exp1 -> r_eexp
  | exp1
...
"""
...
#transformer
class TreeTransformer(Transformer):
  @v_args(inline=True)
  def r_eexp(self, exp_1, exp_2):
    return Eexp(exp_1, exp_2)

  @v_args(inline=True)
  def r_eexp(self, exp_1, exp_2):
    return EExp(exp_1, exp_2)
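
One way to avoid the collision (a sketch, not what the generator currently does) is to track the lowered names already in use and disambiguate with a numeric suffix, so that Eexp and EExp map to distinct rule aliases and transformer methods:

# Hypothetical helper: lower-casing Eexp and EExp both yields "r_eexp",
# so later collisions get a numeric suffix to keep the names distinct.
def unique_rule_name(label: str, taken: set) -> str:
    base = "r_" + label.lower()
    name, i = base, 1
    while name in taken:
        name = f"{base}_{i}"
        i += 1
    taken.add(name)
    return name

taken = set()
print(unique_rule_name("Eexp", taken))  # r_eexp
print(unique_rule_name("EExp", taken))  # r_eexp_1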

Parameterized tests:Python:Python:289_LexerKeywords

Same problem as in 278_Keywords (see below): clashes with Python's True and False are not avoided.

Parameterized tests:Python:Python:222_IntegerList

Missing Python module name sanitization:

from bnfcPyGen222_Integer-List-1.ParsingDefs import *
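
A sketch of the kind of sanitization this needs (a hypothetical helper, not existing generator code): map the grammar name to a valid Python identifier before using it as a module name:

import re

def module_name(name: str) -> str:
    # "-" and other non-identifier characters are not legal in module names
    name = re.sub(r"\W", "_", name)
    # identifiers may not start with a digit
    return "_" + name if name[:1].isdigit() else name

print(module_name("bnfcPyGen222_Integer-List-1"))  # bnfcPyGen222_Integer_List_1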

Parameterized tests:Python:Python:256_Regex

The grammar has:

token SpaceInAlts '\'' ["01x "]* '\'' ;
token Triple upper letter ["]-\'"] '\'';
token Name (char - [ "(){};.@\"" ] - [ " \n\t" ]) + ;

This is translated to:

  SPACEINALTS: /\'(\ |0|1|x)*\'/
  TRIPLE: /[A-Z][A-Za-z](\'|\-|\])\'/
  NAME: /[^\t\n\ \"\(\)\.\;\@\{\}]+/

Start symbol is:

Main ::= SpaceInAlts Name "/\\" Name "and" Triple;

Input is:

' x10 0' [x:4] /\ 'funky+identifier-umlaute:äöüß//\%#FOO and Un]'

Error is:

Unexpected token Token('NAME', "'") at line 1, column 1.
Expected one of: 
    * SPACEINALTS

I can get the test accepted if I give SPACEINALTS and TRIPLE a higher priority than NAME:

  SPACEINALTS.1: /\'(\ |0|1|x)*\'/
  TRIPLE.1: /[A-Z][A-Za-z](\'|\-|\])\'/
  NAME: /[^\t\n\ \"\(\)\.\;\@\{\}]+/

That is weird, because the docs say the order is as follows: https://lark-parser.readthedocs.io/en/stable/grammar.html#notes-for-when-using-a-lexer

Literals are matched according to the following precedence:
1. Highest priority first (priority is specified as: TERM.number: …)
2. Length of match (for regexps, the longest theoretical match is used)
3. Length of literal / pattern definition
4. Name

So if there is just one prio, the length of the match should win. Clearly, SPACEINALTS can match a longer string (' x10 0') than NAME (').

Is this a bug in LARK or its docs?

Parameterized tests:Python:Python:278_Keywords

This testcase aims at name clashes with the host language:

bnfcPyGentest/Absyn.py", line 143
    class False:
          ^^^^^
SyntaxError: invalid syntax

You need to make sure you sanitise Python keywords and reserved identifiers.
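
A minimal sketch of such a check using the standard keyword module (the underscore-suffix renaming is just an assumption for illustration, not a requirement on the backend):

import builtins
import keyword

def sanitize(name: str) -> str:
    # rename labels that clash with Python keywords (True, False, class, ...)
    # or with builtin names (list, type, ...)
    if keyword.iskeyword(name) or hasattr(builtins, name):
        return name + "_"
    return name

print(sanitize("False"))  # False_
print(sanitize("Foo"))    # Foo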

Parameterized tests:Python:Python:235_SymbolsOverlapTokens

Running this test reports:

Unexpected token Token('MYPOSID', 'Foo') at line 1, column 4160.
Expected one of: 
    * IDENT

The grammar has start symbol:

Init.        Main ::= ... Ident MyId MyPosId ...

and the input is:

... Foo_ Bar1 Bar! ...

The generated lexer lexes Foo_ as MYPOSID, probably because the regex generated for IDENT is not correct. IDENT should match valid Haskell identifiers, which include underscores.

  MYID: /[A-Za-z](\d|[A-Za-z])*/
  MYPOSID: /[A-Za-z](\!|(\d|[A-Za-z]))*/
  IDENT: /[A-Za-z]\w*/

Parameterized tests:Python:Python:266_define

This produces some ill-formed Python:

ParsingDefs.py", line 104, in d2_r_esqrt4
    return where(where(exp_))
           ^^^^^
NameError: name 'where' is not defined

AiStudent commented 1 week ago

Thanks for the extensive feedback. Currently there exist at least two issues.

Testsuite results:

It seems that the length of the regex decides the precedence, not the length of the matched tokens. In the test program below I try to tokenize "ab" with the literals A and AB:

from lark import Lark

# A can theoretically match a longer string ("ac", "acc", ...), while AB
# matches exactly "ab"; lexing "ab" shows which terminal the basic lexer picks.
grammar = r"""
    start: A
     | AB

    A:  /a(c)*/
    AB: /ab/
"""
parser = Lark(grammar, start='start', parser='lalr', lexer='basic', priority='auto')
print(list(parser.lex("ab")))

Which yields:

lark.exceptions.UnexpectedCharacters: No terminal matches 'b' in the current parser context, at line 1 col 2

ab
 ^
Expected one of: 
    * A
    * AB

Previous tokens: Token('A', 'a')

In order to make AB have higher precedence, one may add a priority level number, such as AB.1, make the regex of AB longer than that of A, like AB: /ab(c)*/, or remove (c)* from A. I have not found a way to make the lexer use the longest matched expression; one alternative, for whoever dares, seems to be to use Lark's custom lexer option (see the sketch below).
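
For reference, a rough sketch of what such a custom lexer could look like (this assumes Lark's LexerConf exposes the terminal definitions via .terminals, with .name and .pattern.to_regexp(); it skips whitespace naively and ignores terminal priorities):

import re
from lark import Token
from lark.lexer import Lexer

class LongestMatchLexer(Lexer):
    # Sketch: at each position, emit the terminal with the longest actual match.
    def __init__(self, lexer_conf):
        self.terminals = [(t.name, re.compile(t.pattern.to_regexp()))
                          for t in lexer_conf.terminals]

    def lex(self, data):
        pos = 0
        while pos < len(data):
            if data[pos].isspace():          # naive whitespace handling
                pos += 1
                continue
            best = None
            for name, regex in self.terminals:
                m = regex.match(data, pos)
                if m and (best is None or m.end() > best[1].end()):
                    best = (name, m)
            if best is None:
                raise ValueError(f"no terminal matches at position {pos}")
            yield Token(best[0], best[1].group(0))
            pos = best[1].end()

# Using it with the grammar from the test program above:
# parser = Lark(grammar, start='start', parser='lalr', lexer=LongestMatchLexer)
# print(parser.parse("ab"))   # "ab" now lexes as a single AB token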

andreasabel commented 1 week ago

@AiStudent wrote:

It seems that the length of the regex decides the precedence, not the lengths of the matched tokens.

But then this seems to be a bug in LARK, at least according to the specification: https://lark-parser.readthedocs.io/en/stable/grammar.html#notes-for-when-using-a-lexer

Literals are matched according to the following precedence:

  1. Highest priority first (priority is specified as: TERM.number: …)
  2. Length of match (for regexps, the longest theoretical match is used)

Or maybe, this is what is meant by

(for regexps, the longest theoretical match is used)

In any case, not prioritising the actual longest match seems awkward and not in line with standard lexing tools. I wonder whether others have bumped into this issue. Is there any discussion of this issue on bug trackers etc.?

If you have really nailed the issue, it might be worthwhile raising it to the developers of LARK.

@AiStudent wrote:

Recursion error, a list with 10k elements breaks the recursive list concatenation.

Interesting. Is this because Python has a limit on recursion depth? Anyway, doing lots of concatenations is usually not done directly but via concepts like StringBuilder / StringBuffer, for performance reasons. (In Haskell one can use difference lists; that is what the Haskell backend does.) Is there a blessed way in Python to assemble strings from many pieces?
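
For reference, CPython's default recursion limit is 1000, which fits a 10k-element list overflowing it. The usual Python idiom for assembling a string from many pieces is to collect them in a list and join once at the end (a small sketch, not tied to the generated printer code):

# str.join is linear in the total size and involves no recursion,
# unlike building the string through repeated or recursive concatenation.
parts = []
for i in range(10_000):
    parts.append(str(i))
result = " ".join(parts)
print(len(result))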