Calling code:
public class ReducedQuerySyntaxParser
{
    public ParseTree parse(String query)
    {
        try
        {
            CharStream input = CharStreams.fromString(query);
            ReducedQueryLexer lexer = new ReducedQueryLexer(input);
            lexer.removeErrorListeners();
            lexer.addErrorListener(LoggingErrorListener.get());
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            ReducedQueryParser parser = new ReducedQueryParser(tokens);
            parser.removeErrorListeners();
            parser.addErrorListener(LoggingErrorListener.get());
            parser.setErrorHandler(new BailErrorStrategy());
            // Performance hack as per the ANTLR v4 FAQ
            parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
            ParseTree expression;
            try
            {
                expression = parser.startExpression();
            }
            catch (Exception e)
            {
                // Ignoring the exception, the faster way supposedly "doesn't always work", although many people say
                // that the only time they get here is for actually invalid text.
                parser.reset();
                parser.getInterpreter().setPredictionMode(PredictionMode.LL);
                expression = parser.startExpression();
            }
            return expression;
        }
        catch (ParseCancellationException | RecognitionException e)
        {
            throw new IllegalArgumentException("Invalid query: " + query, e);
        }
    }
}
Test case:
@Test
public void testSyntaxErrorAtLexer_FailedToThrow() throws Exception
{
    ReducedQuerySyntaxParser parser = new ReducedQuerySyntaxParser();
    // Some edge case where they're swallowing the error somehow and omitting the trailing text,
    // which we work around by checking for omitted text.
    try
    {
        parser.parse("A \"B");
        fail("I expected an exception");
    }
    catch (IllegalArgumentException e)
    {
        // Expected
    }
}
The input string is clearly invalid syntax, so the test expects an exception. Instead the test fails, because the query is somehow parsed successfully.
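The comment in the test mentions working around the problem by checking for omitted text. The following is only a rough sketch of what such a check might look like, not the code we actually use. The class `OmittedTextCheck`, the `Span` type and the method `firstOmittedIndex` are hypothetical names invented for illustration. `Span` stands in for the inclusive start/stop character indexes that each token reports, and the sketch assumes the grammar skips whitespace, which may not hold for every grammar.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: detects input text that no token accounts for. */
final class OmittedTextCheck {

    /** Inclusive character span of one token (stand-in for a token's start/stop indexes). */
    static final class Span {
        final int start;
        final int stop;

        Span(int start, int stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    /**
     * Returns the index of the first non-whitespace character that is not
     * covered by any token span, or -1 if every such character is covered.
     */
    static int firstOmittedIndex(String input, List<Span> spans) {
        boolean[] covered = new boolean[input.length()];
        for (Span span : spans) {
            // Clamp to the input bounds so a bad span cannot index past the end.
            int from = Math.max(0, span.start);
            int to = Math.min(input.length() - 1, span.stop);
            for (int i = from; i <= to; i++) {
                covered[i] = true;
            }
        }
        for (int i = 0; i < input.length(); i++) {
            // Assumption: whitespace is skipped by the grammar, so it never needs a token.
            if (!covered[i] && !Character.isWhitespace(input.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Only the token for "A" survived; the text starting at the quote was dropped.
        List<Span> spans = Arrays.asList(new Span(0, 0));
        System.out.println(firstOmittedIndex("A \"B", spans));
    }
}
```

A caller could treat any result other than -1 as a syntax error and throw, so that text silently dropped by the lexer still makes the query invalid.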
The investigation so far:
The inner catch (Exception e) block in the calling code catches a StringIndexOutOfBoundsException:
java.lang.StringIndexOutOfBoundsException: String index out of range: 5
at java.lang.String.checkBounds(String.java:385)
at java.lang.String.<init>(String.java:462)
at org.antlr.v4.runtime.CodePointCharStream$CodePoint8BitCharStream.getText(CodePointCharStream.java:160)
at org.antlr.v4.runtime.Lexer.notifyListeners(Lexer.java:360)
at org.antlr.v4.runtime.Lexer.nextToken(Lexer.java:144)
at org.antlr.v4.runtime.BufferedTokenStream.fetch(BufferedTokenStream.java:169)
at org.antlr.v4.runtime.BufferedTokenStream.sync(BufferedTokenStream.java:152)
at org.antlr.v4.runtime.BufferedTokenStream.consume(BufferedTokenStream.java:136)
at org.antlr.v4.runtime.atn.ParserATNSimulator.execATN(ParserATNSimulator.java:537)
at org.antlr.v4.runtime.atn.ParserATNSimulator.adaptivePredict(ParserATNSimulator.java:393)
at ourpackage.ReducedQueryParser.expression(ReducedQueryParser.java:189)
at ourpackage.ReducedQueryParser.startExpression(ReducedQueryParser.java:131)
at ourpackage.ReducedQuerySyntaxParser.parse(ReducedQuerySyntaxParser.java:43)
After this exception, the '"B' text is no longer present in the token stream, and asking for the next token returns EOF instead. The invalid text therefore appears to vanish from the input, which would explain why the retry in LL mode parses the query without error.
I noticed that Lexer#nextToken is in this stack trace. According to the docs on TokenSource#nextToken(), the method is not supposed to throw an error:
Do not fail/return upon lexing error; keep chewing on the characters until you get a good one;
errors are not passed through to the parser.
The wording "keep chewing on the characters until you get a good one" implies that a good character will always be found, though this is not going to be the case for bad input. The docs do not describe what is supposed to happen at EOF, so the correct behaviour is unclear. I figured I would raise the ticket to at least get the docs updated even if we turn out to be using it incorrectly.
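To make the question concrete, here is a toy lexer showing one possible reading of that contract. It is a sketch under my own assumptions and is not ANTLR code: `ChewingLexer` and its token rules (words made of letters, double-quoted strings) are invented for illustration. Under this reading, bad input is reported to an error list rather than thrown, and once no good token can be found the lexer returns EOF.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy lexer illustrating one reading of the nextToken() contract; not ANTLR code. */
final class ChewingLexer {
    static final String EOF = "<EOF>";

    private final String input;
    private int pos = 0;

    /** Lexing errors are recorded here instead of being thrown to the caller. */
    final List<String> errors = new ArrayList<>();

    ChewingLexer(String input) {
        this.input = input;
    }

    /** Returns the next good token, or EOF once none is left. Does not throw on bad input. */
    String nextToken() {
        while (pos < input.length()) {
            char c = input.charAt(pos);
            int start = pos;
            if (Character.isWhitespace(c)) {
                pos++; // whitespace is skipped silently
            } else if (Character.isLetter(c)) {
                while (pos < input.length() && Character.isLetter(input.charAt(pos))) {
                    pos++;
                }
                return input.substring(start, pos);
            } else if (c == '"') {
                int close = input.indexOf('"', pos + 1);
                if (close >= 0) {
                    pos = close + 1;
                    return input.substring(start, pos);
                }
                // Unterminated string: report the remaining text and consume it.
                // substring(start) cannot run past the end of the input.
                errors.add("unterminated string at index " + start + ": " + input.substring(start));
                pos = input.length();
            } else {
                // Any other character: report it, drop it, and keep chewing.
                errors.add("unexpected character '" + c + "' at index " + start);
                pos++;
            }
        }
        return EOF; // no good token left
    }

    public static void main(String[] args) {
        ChewingLexer lexer = new ChewingLexer("A \"B");
        System.out.println(lexer.nextToken());
        System.out.println(lexer.nextToken());
        System.out.println(lexer.errors);
    }
}
```

In this sketch the parser never sees the error directly, so whether the bad input is rejected depends entirely on the caller inspecting the recorded errors. That is the part the docs leave unspecified.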
It's been several months, but I ran into the same error, on a different project/grammar. I was using ANTLR 4.7. I upgraded to ANTLR 4.7.1 and the bug no longer occurs.