Easy to reproduce baffling parse failure when using semantic predicates. Bug in parser generator?

antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

http://antlr.org

BSD 3-Clause "New" or "Revised" License

17.22k stars 3.29k forks source link

Easy to reproduce baffling parse failure when using semantic predicates. Bug in parser generator? #4660

Open weetmuts opened 4 months ago

weetmuts commented 4 months ago

I have a grammar that parses mathematical formulas. It uses semantic predicates to steer parsing based on a dynamic symbol table. (I.e. is this string a variable or a constant or a set symbol etc...) This has worked very well.

Now I am trying to add infix dynamic operators. It ought to be a piece of cake, since the semantic predicates for the other kind of symbols has worked so well. However, there is a behaviour of the generated parser that looks baffling and is perhaps an error.

The minimized grammar can be seen here: https://github.com/viklauverk/EventBTool/blob/AntlrReproducer/src/main/antlr4/com/viklauverk/eventbtools/core/EvBFormula.g4

./reproduce.sh console -q 'eb.show.formula x+y' q Builds an expression and prints: x+y

./reproduce.sh console -q 'eb.show.formula false & true' q Builds a predicate and prints: ⊥∧⊤

Now test semantic pred that detects "ip" is infix predicate operator, just like &. However this operator "ip" takes expressions as arguments instead of predicates.

./reproduce.sh console -q 'eb.show.formula x ip y' q Builds a predicate and prints: x ip y

Now test semantic pred that detects "ie" an infix expression operator, just like +.

But here is the problem: ./reproduce.sh console -q 'eb.show.formula x ie y' q Should build an expression and print x ie y However it fails.

But very strangely this works: ./reproduce.sh console -q 'eb.show.formula (x ie y)+y' q Builds an expression and prints: (x ie y) + y

This behaviour looks very strange. Why fail with (x ie y) but succeed with (x ie y) + y Is this a bug in the generated parser, or have I made a mistake in the grammer?

Man thanks for your help!

How to build on Linux/Mac: git clone https://github.com/viklauverk/EventBTool.git cd EventBTool/ git checkout AntlrReproducer mvn package ./reproduce.sh

(or just ./configure; make)

Update: I have simplifie the grammar further, now the variable symbols x and y , the operator symbols ip and ie are hardcoded in the grammar. The grammar has no external java package dependencies any more.

Update: By changing the grammar from:

operator_ip: { _input.LT(1).getText().equals("ip") }? SYMBOL;
operator_ie: { _input.LT(1).getText().equals("ie") }? SYMBOL;

to:

operator_ip: { _input.LT(1).getText().equals("ip") }? 'ip';
operator_ie: { _input.LT(1).getText().equals("ie") }? SYMBOL;

then suddenly it succeeds in parsing x ie y

weetmuts commented 4 months ago

Run ./reproduce.sh console -q --trace=formula 'eb.show.formula --tree x ip y' q to get succesful parse:

(formula) parsing x ip y
enter   start, LT(1)=x
enter   predicate, LT(1)=x
enter   expression, LT(1)=x
consume [@0,0:0='x',<12>,1:0] rule expression
exit    expression, LT(1)=ip
enter   operator_ip, LT(1)=ip
consume [@2,2:3='ip',<12>,1:2] rule operator_ip
exit    operator_ip, LT(1)=y
PASSED IP operator
enter   expression, LT(1)=y
consume [@4,5:5='y',<12>,1:5] rule expression
exit    expression, LT(1)=<EOF>
exit    predicate, LT(1)=<EOF>
consume [@5,6:5='<EOF>',<-1>,1:6] rule start
exit    start, LT(1)=<EOF>
<OPERATOR_INFIX_PREDICATE_SYMBOL <VARIABLE_SYMBOL x> ip <VARIABLE_SYMBOL y>>

weetmuts commented 4 months ago

Run ./reproduce.sh console -q --trace=formula 'eb.show.formula --tree x ie y' q to get failed parse:

(formula) parsing x ie y
enter   start, LT(1)=x
enter   predicate, LT(1)=x
enter   expression, LT(1)=x
consume [@0,0:0='x',<12>,1:0] rule expression
exit    expression, LT(1)=ie
enter   expression, LT(1)=ie
enter   operator_ie, LT(1)=ie
consume [@2,2:3='ie',<12>,1:2] rule operator_ie
exit    operator_ie, LT(1)=y
PASSED IE operator
enter   expression, LT(1)=y
consume [@4,5:5='y',<12>,1:5] rule expression
exit    expression, LT(1)=<EOF>
exit    expression, LT(1)=<EOF>
enter   operator_ip, LT(1)=<EOF>
line 1:6 rule operator_ip failed predicate: { _input.LT(1).getText().equals("ip")
               /* symbol_table.isOperatorSymbol(OperatorNotationType.INFIX, OperatorType.PREDICATE, _input.LT(1).getText()) */
             }?
exit    operator_ip, LT(1)=<EOF>
PASSED IP operator
enter   expression, LT(1)=<EOF>
exit    expression, LT(1)=<EOF>
exit    predicate, LT(1)=<EOF>
consume [@5,6:5='<EOF>',<-1>,1:6] rule start
exit    start, LT(1)=<EOF>
Could not parse formula :
    x ie y

weetmuts commented 4 months ago

Run ./reproduce.sh console -q --trace=formula 'eb.show.formula --tree (x ie y) + y' q to get a successful parse of x ie y

(formula) parsing (x ie y) + y
enter   start, LT(1)=(
enter   expression, LT(1)=(
consume [@0,0:0='(',<1>,1:0] rule expression
enter   expression, LT(1)=x
consume [@1,1:1='x',<12>,1:1] rule expression
exit    expression, LT(1)=ie
enter   expression, LT(1)=ie
enter   operator_ie, LT(1)=ie
consume [@3,3:4='ie',<12>,1:3] rule operator_ie
exit    operator_ie, LT(1)=y
PASSED IE operator
enter   expression, LT(1)=y
consume [@5,6:6='y',<12>,1:6] rule expression
exit    expression, LT(1)=)
exit    expression, LT(1)=)
consume [@6,7:7=')',<2>,1:7] rule expression
exit    expression, LT(1)=+
enter   expression, LT(1)=+
consume [@8,9:9='+',<10>,1:9] rule expression
enter   expression, LT(1)=y
consume [@10,11:11='y',<12>,1:11] rule expression
exit    expression, LT(1)=<EOF>
exit    expression, LT(1)=<EOF>
consume [@11,12:11='<EOF>',<-1>,1:12] rule start
exit    start, LT(1)=<EOF>
<ADDITION <PARENTHESISED_EXPRESSION (<OPERATOR_INFIX_EXPRESSION_SYMBOL <VARIABLE_SYMBOL x> ie <VARIABLE_SYMBOL y>>)>+<VARIABLE_SYMBOL y>>

jimidle commented 4 months ago

It is more likely that your grammar is slightly misconfigured. The semantic precedent is kind of an on/off switch, but you cannot usually use them to select an alt based on context because ANTLR will generate a a predict() for the alt, and not a simple token lookahead (most of the time). You could try SLL mode, but I think your grammar is likely incorrect.

Sll mode:

``java

If you are using the predicate for precedence, then remember that in left recursive rules, the higher precedence alt should be at the top of the list:


e: e '*' e
  | e '+' e
  | INT
  | FLOAT
 ;

 - if you are using it to try and lookahead to force an alt selection, this is likely not going to work. You will have to make your grammar unambiguous or pre-parse it. 

 It is also possible, that your language does not lend itself well to generated parsers. Mathematical formulae tend to be like that.

weetmuts commented 4 months ago

I think my use of semantic predicates is correct, I use it to distinguish between syntactically equivalent but semantically different parses. I.e. the formula "expression SYMBOL expression" can become a predicate if the text lexed into a SYMBOL (eg "ip") is marked as a predicate operator, but can become an expression if the lexed SYMBOL ("ie") is marked as an expression operator.

From the antlr3 manual:

One of the most common semantic predicates uses a symbol table to help
distinguish between syntactically, but semantically different productions. In FORTRAN, array references
and function calls look the same, but may be distinguished by checking what the type of the identifier is.
expr : {isVar(LT(1))}? ID LPAREN args RPAREN // array ref
| {isFunction(LT(1))}? ID LPAREN args RPAREN // func call
;

The FORTRAN parsers uses a symbol table to check if ID is a variable or a function. My parser uses a symbol table to check if SYMBOL is a predicate operator, or an expression operator.

jimidle commented 4 months ago

This is ANTLR4. However, I will take a look at your grammar. If the grammar is not standalone though, it may be difficult to diagnose.

In general, don't try to make semantic judgements in the parse. Create a second pass: parser -> parse tree -> AST -> Execute/Codegen.

jimidle commented 4 months ago

After a quick glance, I think your expression rule is specifying incorrect precedence:


expression
   : ETRUE                                     # ExpressionTRUE
   | EFALSE                                    # ExpressionFALSE
   | BOOL                                      # BOOLSet
   | {
       _input.LT(1).getText().equals("x") || _input.LT(1).getText().equals("y")
       /*symbol_table.isVariableSymbol(_input.LT(1).getText()) */
     }?   variable=SYMBOL  # ExpressionVariable
   | left=expression operator=ADD  right=expression  # Addition
   | '(' inner=expression ')'                       # ExpressionParentheses
   | left=expression operator=operator_ie  { System.out.println("PASSED IE operator"); } right=expression # OperatorInfixExpression
    ;

Try:


// The earlier a variant is specified in this rule, the higher the precedence and will resolve first
expression
   :  '(' inner=expression ')'                       # ExpressionParentheses
      | left=expression operator=ADD  right=expression  # Addition

   | left=expression operator=operator_ie  { System.out.println("PASSED IE operator"); } right=expression # OperatorInfixExpression
   | {
       _input.LT(1).getText().equals("x") || _input.LT(1).getText().equals("y")
       /*symbol_table.isVariableSymbol(_input.LT(1).getText()) */
     }?   variable=SYMBOL  # ExpressionVariable
   | ETRUE                                     # ExpressionTRUE
   | EFALSE                                    # ExpressionFALSE
   | BOOL                                      # BOOLSet
;

However, without knowing more about your language, I cannot say for certain. Have you specified it, or is it already extant?

When you must do tricks like this, it means the language is not well specified, or the parser grammar isn't. Or, because you expect the parser to do one thing in a certain semantic case and another in another case, it can become impossible to correctly cover all possibilities. Hence try not to do this in the parser itself. Accept anything that looks like it is valid syntax, then work out the semantics in the next pass.

The other thing is that you are looking to hoist the predicates from the operator_ie rule. I don't think that will work, and anyway, this rule is going to produce a prediction rather than look at the next tokens (look at the generated code). You can keep playing, but I think that this is the wrong approach for a left recursive ANTLR4 expression chain.

You might have better luckk expanding the expression not to be left recursive in a traditional LL hierarchy where the LHS and RHS reference the next rule in the chain. This is what we did in ANTLR3.

jimidle commented 4 months ago

I would also say, that if you specify without the predicates, you might find that ANTLR resolves it for via its own predictions.

Edit to clarify...

Decelare IP and IE as lexer tokens and add them as operators in your expression tree with the correct precedence.

Then create a value rule:

value: VALUE | IP | IE (| any more keyword);

And use that in expression instead of just value.

Make sure your expression is correct because what you have does not deal with DOT and does not seem to have the ability to parse: X Y without an operator as in your example. DOT you can do withL

| expression DOT expression

Also, I would tend to favor banning the use of your keywords as VALUES, for these kinds of reasons.

weetmuts commented 4 months ago

Thank you Jim for taking the time to look at this. I think I have found the reason for the problem based on your answers and on a detailed reading of the manuals.

The important sentence from the antl4 manual is this: "Predicates can appear anywhere within a parser rule just like actions can, but only those appearing on the left edge of alternatives can affect prediction (choosing between alternatives)"

All my previous uses of semantic predicates have placed the predicate on the left edge of >>alternatives<<. However the failing operators are infix operators, ie the semantic predicate cannot be at the left edge of an >>alternative<< since the explicit left recursion is further to the left of the semantic predicate.

The grammar and language is fixed and pretty straightforward, the sempreds are used to detect variable types and for entering into local scopes with variable name/type overrides. If you are interested, the full grammar is here: https://github.com/viklauverk/EventBTool/blob/main/src/main/antlr4/com/viklauverk/eventbtools/core/EvBFormula.g4 (You have the link to the minimized grammar which is standalone,in the previous post.)

The new support I am writing for dynamically added operators that are prefix operators work fine with the semantic predicate detection, since the sempred is at the left edge of alternatives, ie like this: (pseudo code below) | ... | { isPrefixPredicatOp(LT(1)) }? operator=SYMBOL '(' expression ')' | ..

But for dynamically added infix operators we have: | ... | expression { isInfixPredicateOp(LT(1)) }? operator=SYMBOL expression | ...

They look so similar, but alas apparantly work so differently, since the sempred now is not used to discern between alternatives, instead if becomes a value checker of sorts, which clearly can fail if the parse steers into this path the wrong way. This creates very weird parse failures.

Also the antlr4 ALL(*) parser can apparently pickup tokens after the first expression to steer the parse right sometimes, which perhaps is why parsing "(a ie b)" fails, but "(a ie b) + b" succeeds.

Perhaps the good old { }?=> syntax to indicate that you want to use the sempred for parsing decisions was a useful, since you knew if the sempred was choice-making or value-checking. Now you have to know that the position of the sempred influences its purpose.

Fortunately there was a simple fix, move the predicate into the lexer instead, and override the token.

SYMBOL: [a-zA-Z][a-zA-Z0-9_]* { if (isOpIE(getText()) setType(EvBFormulaParser.OP_IE); else if (isOpIP(getText()) setType(EvBFormulaParser.OP_IP);
} ; OP_IE: ; OP_IP: ;

This will override the symbols alread in the lexer for the infix operators, and then the grammar rule becomes: | ... | expression operator=OP_IE expression | ... which the Antlr ALL(*) parser handles perfectly as usual. This is an ok solution since the dynamic operators do not change during the parse. (Contrary to local scope variables that get push/popped during the parse.)

Do you agree with this analysis Jim?