apache / directory-scimple

Apache Directory - SCIMple
https://directory.apache.org/scimple/
Apache License 2.0
66 stars 37 forks source link

Special Characters Not Handled Correctly in Filters #564

Open joshuapsteele opened 3 months ago

joshuapsteele commented 3 months ago

It appears that special characters are not handled correctly in Filters. For example, changing the name to "Bílbo Bággins" by adding accents here causes the test to fail:

https://github.com/apache/directory-scimple/blob/develop/scim-spec/scim-spec-schema/src/test/java/org/apache/directory/scim/spec/filter/FilterBuilderTest.java#L33-L40

line 1:18 token recognition error at: '"Bí'
line 1:21 extraneous input 'lbo' expecting CompValue

org.apache.directory.scim.spec.filter.FilterParseException: Failed to parse filter: name.givenName EQ "Bílbo" AND name.familyName EQ "Bággins"

    at org.apache.directory.scim.spec.filter.Filter.parseFilter(Filter.java:92)
    at org.apache.directory.scim.spec.filter.Filter.setFilter(Filter.java:70)
    at org.apache.directory.scim.spec.filter.Filter.<init>(Filter.java:55)
    at org.apache.directory.scim.spec.filter.FilterBuilderTest.testSimpleAnd(FilterBuilderTest.java:39)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
Caused by: java.lang.IllegalStateException: failed to parse at line 1:21 due to extraneous input 'lbo' expecting CompValue
    at org.apache.directory.scim.spec.filter.Filter$1.syntaxError(Filter.java:81)
    at org.antlr.v4.runtime.ProxyErrorListener.syntaxError(ProxyErrorListener.java:41)
    at org.antlr.v4.runtime.Parser.notifyErrorListeners(Parser.java:544)
    at org.antlr.v4.runtime.DefaultErrorStrategy.reportUnwantedToken(DefaultErrorStrategy.java:377)
    at org.antlr.v4.runtime.DefaultErrorStrategy.singleTokenDeletion(DefaultErrorStrategy.java:548)
    at org.antlr.v4.runtime.DefaultErrorStrategy.recoverInline(DefaultErrorStrategy.java:467)
    at org.antlr.v4.runtime.Parser.match(Parser.java:208)
    at org.apache.directory.scim.spec.filter.FilterParser.filterExpression(FilterParser.java:459)
    at org.apache.directory.scim.spec.filter.FilterParser.filter(FilterParser.java:128)
    at org.apache.directory.scim.spec.filter.Filter.parseFilter(Filter.java:86)
    ... 6 more
bdemers commented 3 months ago

Hi @joshuapsteele!

I only had a minute to poke around at this, but my guess is it's related to the ANTLR grammar, the def for STRING looks like it's only parsing ASCII, and would need to be improved.

It looks like there is a better/improved grammar here: https://github.com/antlr/grammars-v4/blob/master/json/JSON.g4 Which contains:

STRING
    : '"' (ESC | SAFECODEPOINT)* '"'
    ;

fragment ESC
    : '\\' (["\\/bfnrt] | UNICODE)
    ;

fragment UNICODE
    : 'u' HEX HEX HEX HEX
    ;

fragment HEX
    : [0-9a-fA-F]
    ;

fragment SAFECODEPOINT
    : ~ ["\\\u0000-\u001F]
    ;

I did a quick replacement in src/main/antlr4/imports/Json.g4 and it seemed to fix the parsing error, but test failed still because of an equality check of a filter with B\u00EDlbo and Bílbo. (my guess is test needs to be tweaked a bit 🤷, but I don't have time to dig into it more today)

Anyway, great find, hopefully the above info helps!