antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

Parsing somehow succeeds on second try even though the input is not valid syntax - Lexer throws exception from nextToken() at EOF #1948

Open hakanai opened 7 years ago

hakanai commented 7 years ago

Grammar:

grammar ReducedQuery;

startExpression : WS? expression WS? EOF ;

expression : booleanExpression | queryFragment ;
booleanExpression : nested += queryFragment (WS nested += queryFragment)+ ;
queryFragment : unquotedQuery | quotedQuery ;
unquotedQuery : UNQUOTED ;
quotedQuery : QUOTED ;

UNQUOTED : UnquotedChar+ ;

fragment UnquotedChar
  : EscapeSequence
  | ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\' | '"' )
  ;

QUOTED : '"' QuotedChar* '"' ;

fragment QuotedChar
  : EscapeSequence
  | ~( '\\' | '"' | '\r' | '\n' )
  ;

fragment EscapeSequence : '\\' . ;

WS : ( ' ' | '\r' | '\t' | '\u000C' | '\n' )+;

Calling code:

public class ReducedQuerySyntaxParser
{
    public ParseTree parse(String query)
    {
        try
        {
            CharStream input = CharStreams.fromString(query);
            ReducedQueryLexer lexer = new ReducedQueryLexer(input);
            lexer.removeErrorListeners();
            lexer.addErrorListener(LoggingErrorListener.get());

            CommonTokenStream tokens = new CommonTokenStream(lexer);

            ReducedQueryParser parser = new ReducedQueryParser(tokens);
            parser.removeErrorListeners();
            parser.addErrorListener(LoggingErrorListener.get());
            parser.setErrorHandler(new BailErrorStrategy());

            // Performance hack as per the ANTLR v4 FAQ
            parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
            ParseTree expression;
            try
            {
                expression = parser.startExpression();
            }
            catch (Exception e)
            {
                // Ignoring the exception, the faster way supposedly "doesn't always work", although many people say
                // that the only time they get here is for actually invalid text.
                parser.reset();
                parser.getInterpreter().setPredictionMode(PredictionMode.LL);
                expression = parser.startExpression();
            }

            return expression;
        }
        catch (ParseCancellationException | RecognitionException e)
        {
            throw new IllegalArgumentException("Invalid query: " + query, e);
        }
    }
}
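One thing worth noting about the calling code above: the inner `catch (Exception e)` treats every failure of the SLL pass as a reason to retry, including runtime errors that have nothing to do with prediction. The following is a minimal sketch of the same two-stage idea in plain Java, with no ANTLR dependency, where only the anticipated failure triggers the retry. `TwoStageSketch`, `parseTwoStage`, and `FastPassFailed` are hypothetical names for illustration; with `BailErrorStrategy` installed, the anticipated type would be `ParseCancellationException`.

```java
import java.util.function.Supplier;

class TwoStageSketch {
    /** Hypothetical stand-in for the exception the fast pass throws when it gives up. */
    static class FastPassFailed extends RuntimeException {}

    /**
     * Runs the fast pass first and falls back to the full pass only when the
     * fast pass reports the anticipated failure. Any other runtime exception
     * is not caught here and reaches the caller unchanged.
     */
    static <T> T parseTwoStage(Supplier<T> fastPass, Supplier<T> fullPass) {
        try {
            return fastPass.get();
        } catch (FastPassFailed e) {
            // Anticipated failure: retry with the slower, more thorough pass.
            return fullPass.get();
        }
    }
}
```

With a catch this narrow, an unexpected exception such as the one investigated below would surface from the first pass instead of being hidden by a retry.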

Test case:

    @Test
    public void testSyntaxErrorAtLexer_FailedToThrow() throws Exception
    {
        ReducedQuerySyntaxParser parser = new ReducedQuerySyntaxParser();

        // Some edge case where they're swallowing the error somehow and omitting the trailing text,
        // which we work around by checking for omitted text.
        try
        {
            parser.parse("A \"B");
            fail("I expected an exception");
        }
        catch (IllegalArgumentException e)
        {
            // Expected
        }
    }

The input string is clearly invalid syntax (the quote is never closed), so the test expects an exception. Instead, the test fails because the query is somehow parsed successfully.
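The comment in the test mentions working around the problem by checking for omitted text. A rough sketch of such a check in plain Java follows, assuming every token (including whitespace, which this grammar does not skip) is available in order with inclusive start and stop character indexes. In ANTLR those indexes could be read from `Token.getStartIndex()` and `Token.getStopIndex()`, leaving out the EOF token. `CoverageCheckSketch` and `coversWholeInput` are hypothetical names.

```java
import java.util.List;

class CoverageCheckSketch {
    /**
     * Returns true if the spans, taken in order, cover every character index
     * from 0 to inputLength - 1 with no gap and no overlap.
     * Each span is {startIndex, stopIndexInclusive}.
     */
    static boolean coversWholeInput(int inputLength, List<int[]> spans) {
        int next = 0; // index the next span is expected to start at
        for (int[] span : spans) {
            if (span[0] != next) {
                return false; // some text was skipped, or spans overlap
            }
            next = span[1] + 1;
        }
        return next == inputLength;
    }
}
```

For the four-character input `A "B`, spans covering only `A` and the space would be rejected, because indexes 2 and 3 are never accounted for.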

The investigation so far:

The inner catch block in the calling code (the one that retries in LL mode) catches a StringIndexOutOfBoundsException:

java.lang.StringIndexOutOfBoundsException: String index out of range: 5
    at java.lang.String.checkBounds(String.java:385)
    at java.lang.String.<init>(String.java:462)
    at org.antlr.v4.runtime.CodePointCharStream$CodePoint8BitCharStream.getText(CodePointCharStream.java:160)
    at org.antlr.v4.runtime.Lexer.notifyListeners(Lexer.java:360)
    at org.antlr.v4.runtime.Lexer.nextToken(Lexer.java:144)
    at org.antlr.v4.runtime.BufferedTokenStream.fetch(BufferedTokenStream.java:169)
    at org.antlr.v4.runtime.BufferedTokenStream.sync(BufferedTokenStream.java:152)
    at org.antlr.v4.runtime.BufferedTokenStream.consume(BufferedTokenStream.java:136)
    at org.antlr.v4.runtime.atn.ParserATNSimulator.execATN(ParserATNSimulator.java:537)
    at org.antlr.v4.runtime.atn.ParserATNSimulator.adaptivePredict(ParserATNSimulator.java:393)
    at ourpackage.ReducedQueryParser.expression(ReducedQueryParser.java:189)
    at ourpackage.ReducedQueryParser.startExpression(ReducedQueryParser.java:131)
    at ourpackage.ReducedQuerySyntaxParser.parse(ReducedQuerySyntaxParser.java:43)

After this exception is caught and the parser is reset, the '"B' token is no longer present in the token stream, and asking for the next token gives the EOF token instead. So the invalid token appears to vanish from the input, and the second attempt sees only the remaining valid tokens.
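The number in the exception message is consistent with an off-by-one at the end of the input. The sketch below is only a guess at the mechanism, not ANTLR's code: it assumes the lexer asks for the inclusive interval from the start of the failed token (index 2, the opening quote) to the current position (index 4, one past the last character), and that the text is copied out of a 4-byte buffer without clamping the length. `textOf` and `textOfClamped` are hypothetical helpers.

```java
import java.nio.charset.StandardCharsets;

class IntervalTextSketch {
    /** Hypothetical unclamped read of buffer[start..stopInclusive]. */
    static String textOf(byte[] buffer, int start, int stopInclusive) {
        int len = stopInclusive - start + 1;
        // Throws if start + len runs past the end of the buffer.
        return new String(buffer, start, len, StandardCharsets.ISO_8859_1);
    }

    /** Same read, but the length is clamped to what is left in the buffer. */
    static String textOfClamped(byte[] buffer, int start, int stopInclusive) {
        int from = Math.min(start, buffer.length);
        int len = Math.min(stopInclusive - start + 1, buffer.length - from);
        return new String(buffer, from, Math.max(len, 0), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] input = "A \"B".getBytes(StandardCharsets.ISO_8859_1); // 4 bytes
        System.out.println(textOfClamped(input, 2, 4));
        try {
            textOf(input, 2, 4); // offset 2 + length 3 = 5, but only 4 bytes exist
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unclamped read failed: " + e.getMessage());
        }
    }
}
```

Under these assumptions the unclamped read needs bytes up to offset 2 + 3 = 5 in a buffer of 4, which matches the "String index out of range: 5" in the trace.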

I noticed that Lexer#nextToken is in this stack trace. According to the docs on TokenSource#nextToken(), the method is not supposed to throw an error:

Do not fail/return upon lexing error; keep chewing on the characters until you get a good one; errors are not passed through to the parser.

The wording "keep chewing on the characters until you get a good one" implies that a good character will always be found, which cannot be guaranteed for bad input that runs to the end of the stream. The docs do not describe what is supposed to happen at EOF, so the correct behaviour is unclear. I am raising this ticket so that at least the docs get updated, even if it turns out we are using the API incorrectly.
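To make the question concrete, here is a toy lexer in plain Java that follows one possible reading of that contract: on a bad character it records an error, drops that character, and keeps going, and once the input is exhausted it returns an EOF marker on every call instead of throwing. This is a simplified illustration, not ANTLR's implementation. `ChewingLexerSketch`, its token rules, and the one-character recovery are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class ChewingLexerSketch {
    static final String EOF = "<EOF>";

    private final String input;
    private int pos = 0;
    /** Errors are collected here rather than thrown from nextToken(). */
    final List<String> errors = new ArrayList<>();

    ChewingLexerSketch(String input) {
        this.input = input;
    }

    /** Returns the text of the next token, or EOF once the input is exhausted. */
    String nextToken() {
        while (pos < input.length()) {
            int start = pos;
            char c = input.charAt(pos);
            if (c == ' ') {
                // Whitespace run.
                while (pos < input.length() && input.charAt(pos) == ' ') pos++;
                return input.substring(start, pos);
            }
            if (c == '"') {
                int close = input.indexOf('"', pos + 1);
                if (close >= 0) {
                    pos = close + 1;
                    return input.substring(start, pos);
                }
                // Unterminated quote: record it, drop one character, keep chewing.
                errors.add("unterminated quote at index " + start);
                pos++;
                continue;
            }
            // Unquoted run: everything up to the next space or quote.
            while (pos < input.length() && input.charAt(pos) != ' ' && input.charAt(pos) != '"') pos++;
            return input.substring(start, pos);
        }
        return EOF; // repeated calls at end of input keep returning EOF
    }
}
```

In this model the caller learns about bad input from the `errors` list, so whether the parse should be rejected no longer depends on an exception escaping from `nextToken()`.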

srosenthal commented 6 years ago

It's been several months, but I ran into the same error, on a different project/grammar. I was using ANTLR 4.7. I upgraded to ANTLR 4.7.1 and the bug no longer occurs.