antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License
17.01k stars 3.26k forks source link

Single-token deletion error node not in parse tree #1298

Closed lionelplessis closed 7 years ago

lionelplessis commented 7 years ago

Hi there,

When single-token deletion kicks in and skips a token in a rule with alternative labels (it entered the rule but didn't find the alternative yet) the error node that is created is not attached to the parse tree. If the extra token is present after a valid rule alternative is found then the error node representing the token is correctly present in the parse tree.

This is due to the ParseRuleContext.CopyFrom method not copying children (at least error nodes). See the comments on the generated code:

private VariableExprContext variableExpr(int _p) throws RecognitionException {
   ParserRuleContext _parentctx = _ctx;
   int _parentState = getState();
   VariableExprContext _localctx = new VariableExprContext(_ctx, _parentState); // Base class of alt rule with labels (correct alt not found yet).
   VariableExprContext _prevctx = _localctx;
   int _startState = 6;
   enterRecursionRule(_localctx, 6, RULE_variableExpr, _p);
   int _la;
   try {
      int _alt;
      enterOuterAlt(_localctx, 1);
      {
      setState(36);
      _errHandler.sync(this); // Single-token deletion skips extra token and adds an error node to the children of _localctx (VariableExprContext).
      switch ( getInterpreter().adaptivePredict(_input,3,_ctx) ) {  // A valid alternative is found.
      case 1:
         {
         _localctx = new LiteralExprContext(_localctx); // Simple fields(properties are copied here but not children (which include the sinlge-token deletion error node).
         _ctx = _localctx;
         _prevctx = _localctx;

         setState(33);
         literal();
         }
         break;

Some more context and a concrete example: We are using Antlr 4.5.3 to build a proper grammar/parser for an existing template engine. Target languages are C# and java but we have been working mainly with C# so far. We have both a lexer and parser files to use Antlr modes. We are writing and testing the grammar using the Antlr plugin for IntelliJ. Note that for this issue the IntelliJ plugin is misleading as it shows the missing error node in the tree which is not the case for the TestRig (-gui) (parse tree vs listener issue??).

In our template engine, dynamic expressions in text must be inside ${ expr } . Expressions mainly support basic operators, a few functions, identifiers and number literals. Our lexer file has a catch-all rule to match any unrecognized character and we skip (hidden channel) whitespace characters.

So expression ${abc} is valid. Expression ${ ? abc ? } with 2 single invalid '?' tokens before and after a valid labeled alternative produces a parse tree without the error node for the first token. See the tree for ${ ? abc ? } : java_2016-09-29_15-46-58

Please note that to reproduce the issue with this simpler version of our grammar I had to generate it with the -xForce-atn option, otherwise it was generating LL1 code which doesn't call errHandler.sync(this) before finding alternative (see the string template for C# extracts below for LL1 and LL). Which leads me to another issue/question: *_why does LL1 generated code not do single-token deletion?**

https://github.com/antlr/antlr4/blob/master/tool/resources/org/antlr/v4/tool/templates/codegen/CSharp/CSharp.stg Line 597

// LL(*) stuff
AltBlock(choice, preamble, alts, error) ::= <<
State = <choice.stateNumber>;
ErrorHandler.Sync(this); // CALLING THE ERROR RECOVERY STRATEGY?
<if(choice.label)><labelref(choice.label)> = TokenStream.Lt(1);<endif>
<preamble; separator="\n">
switch ( Interpreter.AdaptivePredict(TokenStream,<choice.decision>,Context) ) {
<alts:{alt |
case <i>:
<alt>
break;}; separator="\n">
}
>>

Line 538

LL1AltBlock(choice, preamble, alts, error) ::= <<
State = <choice.stateNumber>; // NO CALL TO ErrorHandler.Sync(this);
<if(choice.label)><labelref(choice.label)> = TokenStream.Lt(1);<endif>
<preamble; separator="\n">
switch (TokenStream.La(1)) {
<choice.altLook,alts:{look,alt| <cases(ttypes=look)>
<alt>
break;}; separator="\n">
default:
<error>
}
>>

One more issue/question: if no rule alternative is found (e.g. with ${???} ), the node that will be created on the parse tree will be of type "alternative" base class (here VariableExprContext). But there is no overload in the generated visitor to visit this class. I understand it could be a bit confusing at first but still I think it makes sense to have it (or find a way to make this base class abstract which could solve both the visitor and missing node issues).

marcospassos commented 7 years ago

I can confirm diverging results from the IntelliJ plugin and TestRig, see #1299 and https://github.com/antlr/intellij-plugin-v4/issues/180#issuecomment-246037433.

marcospassos commented 7 years ago

ping @parrt

parrt commented 7 years ago

Hi. I hope to return to ANTLR early Nov after I finish teaching, present a paper at a conference.

lionelplessis commented 7 years ago

Hi, thanks @marcospassos for confirming one problem; thanks @parrt for the info and labelling this ticket, looking forward to Nov then. And good luck with your paper šŸ˜ƒ.

marcospassos commented 7 years ago

I've found more 5 cases where the tree generated by the IntelliJ plugin diverges from the one generated by the TestRig. In all cases, the results generated by the TestRig are much better by far.

parrt commented 7 years ago

yikes! Can you help me narrow this down? I'm now back on ANTLR for least a few weeks. I just integrated the Go target code generation for example. Can you give me a simple grammar and small input? I will chase it to the end of the earth! @sharwell would be interested as well, if there is an issue with the parser interpreter.

marcospassos commented 7 years ago

Done!

https://github.com/antlr/intellij-plugin-v4/issues/180#issuecomment-258651443

parrt commented 7 years ago

I resolved the different error messages for this fix: https://github.com/antlr/antlr4/issues/1337

parrt commented 7 years ago

@lionelplessis can you try the attached 4.6 snapshot jar (unzip to get jar)? It should now give same error message but what about tree construction? Is it ok now for you?

antlr-4.6-complete.jar.zip

lionelplessis commented 7 years ago

hi @parrt, 4.6 doesn't fix the tree. It makes it easier to create a simple grammar though (thanks to the sync() call in LL(1)). So here goes:

grammar TestGrammar;

r : '${' v '}' ;

v : A #altA
  | B #altB
  ;

A : 'a' ;
B : 'b' ;

WHITESPACE : [ \n\t\r]+ -> channel(HIDDEN) ;

ERROR : . ;

And test input: ${ ? a ?}

Gives correct error messages:

line 1:3 extraneous input '?' expecting {'a', 'b'}
line 1:7 extraneous input '?' expecting '}'

But the tree still doesn't have the first ? as error node: (r ${ (v a) ? }).

See my comments inline the Antlr generated code below:

public final VContext v() throws RecognitionException {
        VContext _localctx = new VContext(_ctx, getState());
        enterRule(_localctx, 2, RULE_v);
        try {
            setState(10);
            _errHandler.sync(this); // Single-token deletion skips extra token and adds an error node to the children of _localctx (VContext).
            switch (_input.LA(1)) {
            case A:
                _localctx = new AltAContext(_localctx); // Valid alternative found. Children of initial _localctx are lost.
                enterOuterAlt(_localctx, 1);
                {
                setState(8);
                match(A);
                }
                break;
            case B:
                _localctx = new AltBContext(_localctx);
                enterOuterAlt(_localctx, 2);
                {
                setState(9);
                match(B);
                }
                break;
            default:
                throw new NoViableAltException(this);
            }
        }
marcospassos commented 7 years ago

We can confirm it. We face the same problem today when parsing the expression 1 > > 3.

@parrt, we'd to fix it before releasing 4.6. Could you please guide us in how we may help you fixing it?

parrt commented 7 years ago

i want to fix too :) i'll get to it

marcospassos commented 7 years ago

Please, let me know if we can help in some way :)

parrt commented 7 years ago

@marcospassos you say In all cases, the results generated by the TestRig are much better by far. here but in https://github.com/antlr/antlr4/issues/1299 you say that the TestRig is incorrect.

marcospassos commented 7 years ago

Just a typo, I meant In all cases, the results generated by the interpreter are much better by far

parrt commented 7 years ago

Good to know. I think I've solved this one at least.

marcospassos commented 7 years ago

Could you please provide a preview build for testing?

parrt commented 7 years ago

antlr4-4.6-SNAPSHOT.jar.zip

marcospassos commented 7 years ago

@parrt yes, this error has been fixed! \o/

We've made an extensive test suite to cover most of the error cases that we've found during the development of our proprietary language and now looks like #1299 is last issue still open. So, Antlr 4.6 is gonna be huge!

parrt commented 7 years ago

@marcospassos hooray! I think the #1299 is fixed too ;)

lionelplessis commented 7 years ago

Coming back to this one really late but still wanted to they thanks @parrt and for the 4.6 release as well šŸ˜ƒ. I tested it for C# and it didn't work so I investigated a bit and saw that the fix was not ported to C#. Not sure how the Antlr community keeps the different target languages in sync. For this one I submitted a PR.