antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License
17.29k stars 3.3k forks source link

Upgrade to 4.7.2 lead to change in parser behavior. #2650

Open nikhilvaishnavi opened 5 years ago

nikhilvaishnavi commented 5 years ago

While upgrading from ANTLR 4.6 to 4.7.2, we have noticed changed in parser behavior and some text which earlier didn't used to be matched gets matched with no change in grammar.

I am attaching a simplified version of grammar which parses "Filter" object for our application used in our queries for searches. The usual format of this filter could be - key operator value, with or without brackets.

After upgrading the 4.7.2, we noticed quite a few discrepancies - the following negative cases are working and accepted in the grammar, while they all failed in 4.6 with no change in grammar or generated visitor implementation.

  1. key eq "value") - notice only right parenthesis is given here.
  2. key eq "value") anotherValue
  3. title eq "Emplyee" 1234 true
  4. key1 eq "val1"key2 eq "val2"
  5. keyA eq "value" and keyB pr keyC eq "valc"

Note - we used generated visitor implementation and not listener. On debugging, noticed the exception used to come from ANTLR parser but now it just ignores the bad part of the text. So. e.g., key eq "value") gets converted to key eq "value" We use Java for this implementation.

Below is the Filter grammar file. Filter.g4.txt

nikhilvaishnavi commented 5 years ago

Appreciate any leads on this from the community or antlr4 team.

davesisson commented 5 years ago

This rule looks suspect to me:

STRING : DOUBLE_QUOTE ( '\"' | . )+? DOUBLE_QUOTE //"example@com","tony shark", "foo \"1\" bar"

Periods are usually used to match anything in a regex and so that's where it might be picking up the garbage characters. When I want everything except a set of characters I use an inverted character class such as:

[^"]+

sharwell commented 5 years ago

Your start rule does not end with an explicit EOF. Therefore, ANTLR is allowed to stop parsing before the end of the token stream in an attempt to successfully match a subset of the input. There is a good chance that all of your example negative cases involve trailing tokens that can be dropped to make a successful match, but they weren't being caught due to #118.

nikhilvaishnavi commented 5 years ago

@sharwell - thanks for your inputs. though adding start : appfilter EOF ; leads to the all matched values to be null. Any tips to do it correctly.

sharwell commented 5 years ago

@nikhilvaishnavi I'm not sure why that is occurring. Is the affected project open source where I can see the complete example?

nikhilvaishnavi commented 5 years ago

@sharwell I did give an almost complete example which I had shared in the example above which recreates the issue. My project is not open source.

We are converting an input text coming to our REST API in query parameters to a java class - Filter.

if I give a good source data which should match : key eq "value" --> this is leading to the Parser matching to null ParseTree tree = parser.start(); While in case of bad source data key eq "value") "value2" --> leads to error after adding EOF; as suggested by you.

Earlier without EOF, the good input data worked fine but the bad data also didn't produce any error, which is basically negative scenario. We want to catch such bad input and fail.

This worked for 4.6 release but started failing when we tried to move to 4.7+

sharwell commented 5 years ago

.. leading to the Parser matching to null ...

I do not understand what this particular statement means. Can you provide an example to help explain it?

nikhilvaishnavi commented 5 years ago

So, we have written a Deserializer code which converts this text - key op value to a java class : Filter using ANTLR 4.7.2

When we have the below in our ANTLR script : start : appfilter ; the following code (with good filter) : Filter filter = FilterTextDeserializer.readFilter ("keyA eq \"value\""); return a filter with following output (toString) :

FilterImpl{propertyName='keyA', op=eq, propertyVal='value', logicalOp=null, left=null, right=null}

While with a bad filter (notice the right parentheses is missing) :

 `Filter filter = FilterTextDeserializer.readFilter ("(keyA eq \"value\"");`

this fails with expected error : Invalid filter, Syntax error at position 16

But if the left parenthesis is missing, no such error is given. This behavior started in 4.7.2 and was working in 4.6 Filter filter = FilterTextDeserializer.readFilter ("keyA eq \"value\")"); output :

FilterImpl{propertyName='keyA', op=eq, propertyVal='value', logicalOp=null, left=null, right=null}


This is where I changed the script to add EOF as suggested by you - start : appfilter EOF; And now both the bad filters above are captured correctly and failing but the good filter condition returns a null value back - `Filter filter = FilterTextDeserializer.readFilter ("keyA eq \"value\""); return - filter.toString() --> "null"

We are using --no-listener & -visitor as options while generating the antlr code from the script. Let me know if more details are required and I can go over this on zoom if required. Not sure sending the whole code is a feasible option here.

sharwell commented 5 years ago

The next step would be seeing the implementations of:

nikhilvaishnavi commented 5 years ago

@sharwell please the below - Apologize for the delay.

public static Filter readFilter(String filterStr) {
        if (filterStr == null || filterStr.isEmpty()) {
            return null;
        }

        CharStream inputStream = CharStreams.fromString(filterStr);
        FilterLexer lexer = new FilterLexer(inputStream);
        injectErrorHandler(lexer);

        CommonTokenStream tokenStream = new CommonTokenStream(lexer);
        FilterParser parser = new FilterParser(tokenStream);
        injectErrorHandler(parser);

        ParseTree tree = parser.start();
        FilterBaseVisitorImpl filterBaseVisitor = new FilterBaseVisitorImpl();
        return filterBaseVisitor.visit(tree);
}
sharwell commented 5 years ago

I'm fairly certain this is the result of FilterBaseVisitorImpl not providing an implementation of visitStart. The default implementation returns the result of visiting the last child of a rule. Prior to the change, the last child was an AppFilterSyntax, but now it's a token. See visitChildren and aggregateResult for details:

https://github.com/antlr/antlr4/blob/3fcb6da1f6909f1e12279fcc3bc4ac34647ccf28/runtime/Java/src/org/antlr/v4/runtime/tree/AbstractParseTreeVisitor.java#L21-L37

https://github.com/antlr/antlr4/blob/3fcb6da1f6909f1e12279fcc3bc4ac34647ccf28/runtime/Java/src/org/antlr/v4/runtime/tree/AbstractParseTreeVisitor.java#L90-L109

nikhilvaishnavi commented 5 years ago

You are right, we do not have an implementation for visitStart, so you are suggesting to override this ? Can you suggest how different this overridden code should be as we don't have any application specific logic at this point and this should only need to visit all available children nodes.

sharwell commented 5 years ago

It should just return the result of visiting appFilter.

nikhilvaishnavi commented 5 years ago

Will work on this and get back to you if more support is required. Thanks for your time/tips