JSQLParser / JSqlParser

JSqlParser parses an SQL statement and translate it into a hierarchy of Java classes. The generated hierarchy can be navigated using the Visitor Pattern
https://github.com/JSQLParser/JSqlParser/wiki
Apache License 2.0
5.34k stars 1.34k forks source link

4.1 Parser fails on Nested Comments /* foo /*bar*/ */ #1175

Open manticore-projects opened 3 years ago

manticore-projects commented 3 years ago

JSQLParser 4.1:

-- Nested Comments failure
select /*this is a /*nested!!!*/ comment*/ 1
from dual;

at net.sf.jsqlparser.parser.CCJSqlParserUtil.parseStatements(CCJSqlParserUtil.java:206) at net.sf.jsqlparser.parser.CCJSqlParserUtil.parseStatements(CCJSqlParserUtil.java:194)

at net.sf.jsqlparser.parser.CCJSqlParser.generateParseException(CCJSqlParser.java:28225) at net.sf.jsqlparser.parser.CCJSqlParser.jj_consume_token(CCJSqlParser.java:28064) at net.sf.jsqlparser.parser.CCJSqlParser.Statements(CCJSqlParser.java:623) at net.sf.jsqlparser.parser.CCJSqlParserUtil.parseStatements(CCJSqlParserUtil.java:204)

This is ugly, but valid SQL for H2 and Oracle so maybe we will want to support nested comments eventually.

I see 2 options: 1) fix the parser itself 2) or stripping any comments from the SQL before handing it over to the parser (per Regular Expressions)

Please let me know which way you'd prefer and I would have a look.

xsgao-github commented 3 years ago

This is not nested comment. Block comment ends as soon as */ appears. This applies to not only sql, but as well as other languages. Not sure if you can escape special characters in comment.

manticore-projects commented 3 years ago

This is a Nested Comment by definition. If we consider it legal and want to be able to parse it is a different question. But definitely this was not the hill I would want to die on.

xsgao-github commented 3 years ago

There is no such thing called nested comment.

manticore-projects commented 3 years ago

Thanks buddy for encouraging me to provide a PR for. Until now I felt like it was not worth it.

xsgao-github commented 3 years ago

Sorry for my ignorant, actually SQL/DB2/Postgres support nested comments, Oracle and MySql do not.

wumpz commented 3 years ago

To be honest I was surprised, that this works.

So the change would be in the grammar line 404:

< MULTI_LINE_COMMENT: "/*" (~["*"])* "*" ("*" | (~["*","/"] (~["*"])* "*"))* "/">

to make this hierarchical.

I recognized that at least in postgresql this

select /*this is a /*nested!!! comment*/ 1

returns an error. (I removed one */ closing comment) So the processing is indeed strictly hierarchical.

Now there are databases out there, that would parse this successfully. We have to decide, if we should change the default behaviour or make it optional like for instance bracket parsing.

Changing the default behaviour IMHO would be a drawback since there are so many instances of JSqlParser running out there, that I expect a lot of unexpected crashing after an update.

manticore-projects commented 3 years ago

While I am no big fan of nested comments, I actually hit myself with it many times, when I insert block comments into large statements in order to try something out. When this "deactivated parts" contain comments already, the parser and so the formatter would fail.

In my opinion, it is less relevant if the RDBMS can handle it, because JSQLParser will drop the comments in any way. But getting the relevant content parsed would be just nice!

Alternatively, I can find such nested comments in JSQLFormatters pre-parser and clean it up, e.g. turning it into /* -- -- */

In my opinion, we should expect that nested comments are closed properly and symmetric: /* /* */ * is ok, but /* /* */ is not ok (although I have no problem when both versions "survive" the parser.

manticore-projects commented 3 years ago

Just tested H2: select /* comment 1 /* comment 2 */ */ 1 is legal, but unclosed nested comment select /* comment 1 /* comment 2 */ 1 is not.

manticore-projects commented 3 years ago

Just tested in Oracle: select /* comment 1 /* comment 2 */ */ 1 from dual is NOT legal, but unclosed nested comment select /* comment 1 /* comment 2 */ 1 from dual is legal. fck Oracle.

wumpz commented 3 years ago

Sorry for my ignorant, actually SQL/DB2/Postgres support nested comments, Oracle and MySql do not.

Oracle he already mentioned.

wumpz commented 3 years ago

Both versions could not live at the same time.

/*    /*      */    ...  '*/'

The parser has to guess how to proceed and Murphy kicks in.

revusky commented 3 years ago

Both versions could not live at the same time.

You might consider this: https://javacc.com/2021/02/01/javacc-21-has-a-preprocessor/

    #if support_nested_comments
                   ....
    #else
                  ...
    #endif
manticore-projects commented 3 years ago

Thank you Jonathan,

myself, I came to the same conclusion, that a kind of Preprocessor was needed in order to solve: 1) different statement separators (at the moment, only ; is supported) 2) different quoting characters 3) the nested comments mentioned above and preserving comments for the deparser 4) squared brackets interpretation (although @wumpz implemented a beautiful work-around for that)

In my limited understanding, all these challenges have one common denominator: they happen on Token Definition Level, where Java CC does not provide any support for configuration or flexibility. (@revusky: feature request, I want configurable/dynamically generated tokens please!)

I know and accept that @wumpz is not to keen to solve these challenges outside of the JSQLParser Grammar definition. But thinking about that for many days, I see only the following solutions:

a) Create a specific Grammar/Parser for each Configurable Feature (and if I understand that correctly, the mentioned JavaCC21 Preparser would simplify that because we would need only 1 Grammar, but still ended up with many generated Parsers, one for each use case)

b) Generate the Parser dynamically and on the fly before parsing. So instead of delivering pre-generated JSQLParser.class inside the JAR, we would have to include the JavaCC.jar and deliver the Grammar definition and the generate the JSQLParser on the fly and compile it with Java before we parse).

Interestingly, there is such a software for ANTLR: https://github.com/julianthome/inmemantlr and I use this successfully to parse Groovy/Java Scripts in our ETL Software Module.

c) Use additional Java Code to pre-parse the SQL (setting/bending it according to the defined Tokens) before we send it to JSQLParser, then parse it and optionally post-process the deparsed SQL again

For JSQLFormatter I go the last route c) because I prefer the flexibility. I have written a kind of wrapper around JSQLParser which mangles all SQL content first (based on REGEX) before sending it into the parser. This way I was able to solve the 4 challenges mentioned on top.

revusky commented 3 years ago

In my limited understanding, all these challenges have one common denominator: they happen on Token Definition Level, where Java CC does not provide any support for configuration or flexibility. (@revusky: feature request, I want configurable/dynamically generated tokens please!)

I still have to write a blog post on this, but I recently (VERY recently, like in the last few hours) implemented the ability to activate/deactivate tokens. I decided to be very aggressive and already use it internally. To see what I'm talking about, see:

https://github.com/javacc21/javacc21/blob/master/examples/java/Java.javacc#L267 https://github.com/javacc21/javacc21/blob/master/examples/java/Java.javacc#L411 https://github.com/javacc21/javacc21/blob/master/examples/java/Java.javacc#L547

This uses the new ACTIVATE_TOKENS/DEACTIVATE_TOKENS instruction in JavaCC21. The above three points are the newer and better solution to the problem described here that requires a 5-part kludge in legacy JavaCC.

Maybe you would find this feature useful.

manticore-projects commented 3 years ago

Thank you very much @revusky.

ShiftExpression :

           AdditiveExpression

           ACTIVATE_TOKENS(RSIGNEDSHIFT,RUNSIGNEDSHIFT)

           (

              ("&lt;&lt;" | "&gt;&gt;" | "&gt;&gt;&gt;")

              AdditiveExpression

           )*

        ;

It looks very useful, if we can ACTIVATE/DEACTIVATE based on a condition. Will a semantic LOOKAHEAD like:

LOOKAHEAD({myCondition==true}) ( ACTIVATE_TOKENS(...) )

be possible? And would it apply to the generated Parser dynamically (what I hope for) or would it generate two Parsers (one switched on and one switched off).

@Wumpz: to me this looks like the possibility, to define different Tokens for: 1) StatementSeparators ; or GO 2) different column quote characters " or [ and 3) different String quote characters ' or \

and to switch it ON/OFF based on the Configurable Features per semantic lookahead.

revusky commented 3 years ago

It looks very useful, if we can ACTIVATE/DEACTIVATE based on a condition. It looks very useful, if we can ACTIVATE/DEACTIVATE based on a condition. Well, yeah, it works. You can write:

[
      {SCAN {condition}}#
      => ACTIVATE_TOKENS(....)
]

The # above means that this also applies within a lookahead. The problem is that you typically want this to work both in regular parsing and in a lookahead. The condition will be ignored in a lookahead if you don't have the # after it.

Well, the other thing to note is that you can also do this in Java code. The ACTIVATE_TOKENS/DEACTIVE_TOKENS instructions are really just a bit of syntactic sugar around the methods activateTokens and deactivateTokens. So, in a code action you can also write: { .... if (condition) activateTokens(SOME_DELIMITER, ANOTHER_DELIMITER); ... }#

Again, there is this sort of caveat about whether a code action or condition also applies in a lookahead and the # says that it does. There is a set of tricky aspects to this that really need to be documented better somewhere. If you use the ACTIVATE_TOKENS(...) instruction it automatically works both inside a lookahead and regular parsing so iy is usually a bit more foolproof, at least if you just use it uncomditionally.

In general, one could think of:

   DEACTIVATE_TOKENS(....)

as just an abbreviated way of writing:

     {deactivateTokens(...);}#
revusky commented 3 years ago

See: https://javacc.com/2021/06/13/activating-de-activating-tokens/