Error while parsing literals with backslash

xccui commented 3 years ago

Describe the bug Can't properly parse the statement when

Backslash is the last character of a literal;
The literal is not the last value of a sequence.

To Reproduce Steps to reproduce the behavior:

Example: CCJSqlParserUtil.parseStatements("""INSERT INTO "a"."b"("c", "d", "e") VALUES ('c c\', 'dd', 'ee\')""") (Kotlin)
Exception:

net.sf.jsqlparser.JSQLParserException: Encountered unexpected token: "dd" <S_IDENTIFIER>
    at line 1, column 53.

Was expecting one of:

    "&"
    ")"
    ","
    "::"
    "<<"
    ">>"
    "COLLATE"
    "["
    "^"
    "|"

Note that CCJSqlParserUtil.parseStatements("""INSERT INTO "a"."b"("c", "d", "e") VALUES ('c c', 'dd', 'ee\')""") won't trigger the exception.

Expected behavior

The literals 'c c\' and 'dd'should be properly parsed.

System

Database: Oracle
- Kotlin Version: 1.4.21-release-IJ2020.2-1
- Java Version: 1.8
- JSqlParser version: 4.0

manticore-projects commented 3 years ago

First of all a proper documentation of the challenge in H2:

VALUES ('c c\', 'dd', 'ee\');

returns

C1	C2	C3
c c\	dd	ee\

But JSQLParser complains:

I will look into that, there is also another similar case.

manticore-projects commented 3 years ago

The problem is simple: JSQLParser treats both characters ' and \ as escape character (which should be a good thing):

-- correctly parsed with JSQLParser
select 'c c\\', 'dd', 'ee\\'
from dual

Unfortunately, this is not the interpretation by Oracle or H2:

'CC\'	'DD'	'EE\'
c c\	dd	ee\

So as stated in another issue, we would need to define an Parser Configuration Feature, which allows only one Escape character at a time (preferably ').

manticore-projects commented 3 years ago

@wumpz: I read through a lot of stuff in order to find a solution. My current suggestion was to

1) define only one Quoting Character inside the Grammar (an maybe that might be the \) 2) move the Configuration of Quoting Characters outside of the Grammar into the Abstract Parser or Parser Utils 3) there we handle/replace the Quoting Character BEFORE parsing and after parsing.

And if you ask me, we should do the same for the squared brackets and any other phony stuff, because at least to me this looks terrifying:

{ if ( !configuration.getAsBoolean(Feature.allowSquareBracketQuotation) && matchedToken.image.charAt(0) == '[' ) {
         matchedToken.image = "[";
         for (int i=0;i<CCJSqlParserConstants.tokenImage.length;i++) {
            if (CCJSqlParserConstants.tokenImage[i].equals("\"[\"")) {
                matchedToken.kind = i;
            }
         }
         input_stream.backup(image.length() - 1);
       }
    }

And I also suspect that it might have performance implications.

Please let me know what you think about that kind of PRE/POST processing. I could easily implement that but only when you did not oppose to the idea.

xccui commented 3 years ago

Thanks for your insights! @manticore-projects

To temporarily fix this problem, I replace all \ with \\ before parsing the SQL and replace them back after the parsing process. Do you think this is a feasible workaround?

manticore-projects commented 3 years ago

Yes, this was exactly what I have suggested to the @wumpz. As kind of Pre-Processing and Post-Processing outside of the Parser itself.

On Mon, 2021-05-24 at 21:06 -0700, Xingcan Cui wrote:

Thanks for your insights! @manticore-projects To temporarily fix this problem, I replace all \ with \ before parsing the SQL and replace them back after the parsing process. Do you think this is a feasible workaround? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

wumpz commented 3 years ago

@manticore-projects To be honest I have no idea how you want to externalize this, since that would result in some kind of optional tokens. There was a huge discussion about this at least for square brackets (#677). For this escaping you have to change token definition. So it would a great benefit to do it, but the only way I saw was this backup token thing you mentioned.

manticore-projects commented 3 years ago

What I was refering to was to Modify the SQL String in Java before handing it over to the Parser itself. Something like simple Prep- Parser. I do this for the Comments already.

So we would transform:

select 'a\'

into

select 'a\'

then parse it.

But we would have to wrap a few Methods, such as Statement(), Statements() and the Constructors() and this is the reason why I asked if you are ok with that approach. And also we would need to change all getName() methods of the Objects and ensure returning the correct name (based on the requested Quoting Character).

On Tue, 2021-05-25 at 15:33 -0700, Tobias wrote:

@manticore-projects To be honest I have no idea how you want to externalize this, since that would result in some kind of optional tokens. There was a huge discussion about this at least for square brackets (#677). For this escaping you have to change token definition. So it would a great benefit to do it, but the only way I saw was this backup token thing you mentioned. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

wumpz commented 3 years ago

Ah. Ok. I get it. I was more referring to this bracket thing.

I prefer to build a correct parser. This kind of wrappers feels like a hack.

manticore-projects commented 3 years ago

Ok, clear. So I won't waste my time. Although I do not see a good solution then because onces your token is defined you are stuck with it.

JavaCC21 seems to have implemented Context Sensitive Phony Tokens and I also got our Grammar running on JavaC21. But I still have concerns on the performance and the maturity of JavaCC21. I would like to conclude on the outstanding PRs this week and slize them into smaller pieces making it easier for you to digest.

Maybe you would like to release the 4.1 by June because the performance fixes affect a lot of other applications (like DBeaver, Squirrel SQL).

And maybe then we would like to discuss if JavaCC21 was a strategical option (compiling a Pros/Cons list).

On Wed, 2021-05-26 at 16:31 -0700, Tobias wrote:

Ah. Ok. I get it. I was more referring to this bracket thing. I prefer to build a correct parser. This kind of wrappers feels like a hack. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

wheredevel commented 3 years ago

@wumpz,

I didn't get if there's an issue here or not. Because, I'm able to successfully parse:

SELECT 'c c\', 'dd', 'ee\'

and also this

assertSqlCanBeParsedAndDeparsed(
            "INSERT INTO \"a\".\"b\" (\"c\", \"d\", \"e\") VALUES ('c c\', 'dd', 'ee\')"
        );

If you solved this in the past, could you, please, explain how was it achieved? A reference to the corresponding commit?

wheredevel commented 3 years ago

@wumpz,

I didn't get if there's an issue here or not. Because, I'm able to successfully parse:
SELECT 'c c\', 'dd', 'ee\'
and also this
assertSqlCanBeParsedAndDeparsed(
          "INSERT INTO \"a\".\"b\" (\"c\", \"d\", \"e\") VALUES ('c c\', 'dd', 'ee\')"
      );
If you solved this in the past, could you, please, explain how was it achieved? A reference to the corresponding commit?

Let me elaborate on my own question: TokenManager is supposed to emit a result of the longest matching regex.
So, for example in:

SELECT 'c c\', 'dd', 'ee\'

that would be

'c c\', '

because \' is an escaped ' - so matching shouldn't stop on it.

So, please, explain how did you manage to persuade JavaCC to parse the above corectly?

manticore-projects commented 3 years ago

@wumpz,

I didn't get if there's an issue here or not.

There is the issue, that \ is understood as escape character in JSQLParser (always), but not by every RDBMS. (Similar to Regex and Java Strings, where you will always have to add a \ in the Java String, which won't be present in the original Regex.)

Because, I'm able to successfully parse:
SELECT 'c c\', 'dd', 'ee\'

Which version of JSQLParser are you using? It is possible that you use a Version, which does not define \ as escape character yet.

If you solved this in the past, could you, please, explain how was it achieved? A reference to the corresponding commit?

It is not about solving anything, but about the introduction of a feature (which is escaping as per \). We could simply remove that feature, but then we could not escape properly anymore.

For a really good solution, we would make the escape character configurable (e. g. \ or '). Unfortunately that is not easy and never clean because of the nature of JavaCC and the way, how Token work. Once a Token is defined in the grammar it is final and the Parser will be generated with exactly this Token.

wumpz commented 3 years ago

Could it be as simple as this, that you tried something like this:

assertSqlCanBeParsedAndDeparsed("SELECT 'c c\', 'dd', 'ee\'");

Here we have a single backslash escaping the quotation which results in a quotation. So the parser gets effectively

SELECT 'c c', 'dd', 'ee'

The problematic statement is

assertSqlCanBeParsedAndDeparsed("SELECT 'c c\\', 'dd', 'ee\\'");

where the parser indeed gets a backslashed quotation. And this statement is unfortunately not (yet) parseable. Your understanding of the regular expression is right and it fails like this:

Call:   Statement
  Call: SingleStatement
    Call:   Select
      Call: SelectBody
        Call:   SetOperationList
          Call: PlainSelect
            Consumed token: <<K_SELECT>: "SELECT" at line 1 column 1>
            Call:   getOracleHint
            Return: getOracleHint
            Call:   SelectItemsList
              Call: SelectItem
                Call:   SelectExpressionItem
                  Call: Condition
                    Call:   SQLCondition
                      Call: SimpleExpression
                        Call:   ConcatExpression
                          Call: BitwiseAndOr
                            Call:   AdditiveExpression
                              Call: MultiplicativeExpression
                                Call:   BitwiseXor
                                  Call: PrimaryExpression
                                    Consumed token: <<S_CHAR_LITERAL>: "\'c c\\\', \'" at line 1 column 8>
                                  Return: PrimaryExpression
                                Return: BitwiseXor
                              Return: MultiplicativeExpression
                            Return: AdditiveExpression
                          Return: BitwiseAndOr
                        Return: ConcatExpression
                      Return: SimpleExpression
                    Return: SQLCondition
                  Return: Condition
                  Call: Alias
                    Call:   RelObjectNameWithoutStart
                      Call: RelObjectNameWithoutValue
                        Consumed token: <<S_IDENTIFIER>: "dd" at line 1 column 17>
                      Return: RelObjectNameWithoutValue
                    Return: RelObjectNameWithoutStart
                  Return: Alias
                Return: SelectExpressionItem
              Return: SelectItem
            Return: SelectItemsList
          Return: PlainSelect
        Return: SetOperationList
      Return: SelectBody
    Return: Select
  Return: SingleStatement
Return: Statement

The real solution to this case would be to introduce a fallback. The parser should be preconfigured not to use this \' escape. Then parser (like for bracket quotation) should look into the token (<S_CHAR_LITERAL>: "\'c c\\\', \'" at line 1 column 8>) and if this escaping is not allowed should crop this literal (<S_CHAR_LITERAL>: "\'c c\\\'" at line 1 column 8>).

Unfortunately, there is no solution for both cases at the same time. I used (IMHO) the actual way with the most succeeding cases (without counting).

wheredevel commented 3 years ago

@wumpz ,

Yep, I already figured that. Had to print char-by-char in CharStream to discover single \ is not interpreted correctly.

I want to understand: is it correct what @manticore-projects says:

we would need to define an Parser Configuration Feature, which allows only one Escape character at a time (preferably ').

In this example:

SELECT 'c c\', 'dd', 'ee\'

Is the idea to tell in the config \' is no longer an escape for ' ?

Now, I came from the following case that didn't parse:

SELECT * FROM T WHERE t1 LIKE '' ESCAPE '\' AND (t2 = 'a')

I think the above config alone wouldn't help here either. Once \' is allowed as an escape in tokenization - the above will fail. It would "eat" '\' AND (t2 = ' as a token, and fail on the a'. On the other hand, we could contextually tokenize the char following ESCAPE in the grammar differently. I tried to solve it, but it's different from this code, for example:

{ if ( !configuration.getAsBoolean(Feature.allowSquareBracketQuotation) && matchedToken.image.charAt(0) == '[' ) {
         matchedToken.image = "[";
         for (int i=0;i<CCJSqlParserConstants.tokenImage.length;i++) {
            if (CCJSqlParserConstants.tokenImage[i].equals("\"[\"")) {
                matchedToken.kind = i;
            }
         }
         input_stream.backup(image.length() - 1);
       }
    }

because this code is a part of TokenManager (with all the variables and methods in scope). While I'm trying to do similar thing in the parser context:

        ...
        <K_ESCAPE>
        {
                //...
        }

Can you suggest how to solve it?

manticore-projects commented 3 years ago

The idea was to: 1) Define one Quote Character (default maybe ') 2) to allow setting this Quote Character when calling the parser (as configuration) (so either \ or ' was the Quote Character) 3) to parse accordingly

The challenge is, that we would need to modify the Token based on this Configuration but that is not possible. Inside a production we can use LOOKAHEAD for simple IF .. ELSE .. instructions. But there is no such thing for Token definitions, they are constants and the Parser will be built for these constants.

manticore-projects commented 3 years ago

There are 2 alternatives:

a) 2 different Grammar definitions, one for Quote Character ' and one for \, then generate 2 Parsers and select the applicable one as per Configuration

b) use only one predefined Quote Charater in the Grammer like ' and replace the actually desired Quote Character in the SQL before and after Parsing

The same would apply for Statement Separator (if you do not want to use ;).

wumpz commented 3 years ago

I tested a variant of your problem SQL with Postgresql:

SELECT * FROM T WHERE t1 LIKE 'test\m' ESCAPE '\'

Using this I could confirm, that the database looked for something like 'testm'. The escaping is in place But Postgresql does not parse this

SELECT * FROM T WHERE t1 LIKE 'test\'' ESCAPE '\'

Therefore at least for this database using Escape you are not able to switch escaping of quotes. To be honest I somehow expected that, because as you are describing one has to be context sensitive even backwards context sensitive to solve this which would make the parsing much more complex.

Switching the escape sequence in Postgresql is IMHO only possible using Posix styled string literals like

 SELECT * FROM T WHERE t1 LIKE E'test\''

This one at least could be tackled by a token definition.

wumpz commented 3 years ago

So the question is, is this statement of yours with a quotation correct? For which database?

SELECT * FROM T WHERE t1 LIKE 'test\'' ESCAPE '\' AND (t2 = 'a')

wheredevel commented 3 years ago

So the question is, is this statement of yours with a quotation correct? For which database?
SELECT * FROM T WHERE t1 LIKE 'test\'' ESCAPE '\' AND (t2 = 'a')

SQL Server. Like in this ticket, for example.

wheredevel commented 3 years ago

This is my hack:

    public void CommonTokenAction(Token t) {
        String STR = "\'\\\'";
        if(t.image.length() > 3 && t.image.startsWith(STR)) {
            input_stream.backup(t.image.length() - STR.length());
            t.image = STR;
        }
        t.absoluteBegin = getCurrentTokenAbsolutePosition();
        t.absoluteEnd = t.absoluteBegin + t.image.length();
    }

mostly works... :)

I think it should be further conditioned on a "parsing state" (which is not something provided by JavaCC as oppsed to lexical state).
For example, "we're now parsing ESCAPE char". I understand how this can be added, but, for now, it's good enough.
I'll be happy to have your comments on that.

manticore-projects commented 3 years ago

Pavel,

thank you for your contribution. Although I am just an interested observer, please allow me to ask: 1) What does "mostly works" mean? What are the constraints? 2) What are the use cases where it fails?

Cheers!

wheredevel commented 3 years ago

For example, these now fail:

net.sf.jsqlparser.statement.select.SelectTest.testIssue167_singleQuoteEscape()
net.sf.jsqlparser.statement.select.SelectTest.testIssue167_singleQuoteEscape2()

To complete the solution (and allow the above to parse correctly) I would limit this solution to "parsing state" - in my case the state should say "we're now parsing ESCAPE char". An example solution is to have such a flag in the parser and turn it on during LikeExpression parsing (after ESCAPE and turn off later).

But, even after that, it solves my case, but not this one:

SELECT 'c c\', 'dd', 'ee\'

wheredevel commented 3 years ago

A little better version - to NOT fail the above mentioned tests. (Still no "parsing state" used).

    public void CommonTokenAction(Token t) {
        String STR = "\'\\\'";
        if(t.image.length() > 3 && t.image.startsWith(STR) && Character.isWhitespace(t.image.charAt(3))) {
            input_stream.backup(t.image.length() - STR.length());
            t.image = STR;
        }
        t.absoluteBegin = getCurrentTokenAbsolutePosition();
        t.absoluteEnd = t.absoluteBegin + t.image.length();
    }

wumpz commented 3 years ago

@wheredevel This statement does not run on SqlServer

SELECT * FROM T WHERE t1 LIKE 'test\'' ESCAPE '\' AND (t2 = 'a')

for the same reasons like Postgresql. \' is not valid in SqlServer as well in standard string literals. Using escape you are only replacing the escape character for escapable characters :).

Therefore a complete solution seems to be:

to integrate an option to disable an escape single quote and then ala bracket parsing correct the accepted token. Since the used regular expression for our token accepts more or the correct length.
introduce a new quoted single character for escape, to insert a backslash without thinking it would escape something here

wheredevel commented 3 years ago

Tried this:

SELECT 
  *
FROM 
  sakila.customer 
WHERE 
  first_name NOT LIKE 'ID\\A' ESCAPE '\' AND last_name = 'ANDREWS'

Works on SQL Server.

wumpz commented 3 years ago

@wheredevel That's right, but you are not escaping a single quote. The problem here is to parse escape '\'. That's why I introduced point 2.

wheredevel commented 3 years ago

@wheredevel That's right, but you are not escaping a single quote. The problem here is to parse escape '\'. That's why I introduced point 2.

This is a legit query, it runs and produces results:

SELECT 
  *
FROM 
  sakila.customer 
WHERE 
  first_name LIKE 'IDA' ESCAPE '\' AND last_name = 'ANDREWS'

So, ESCAPE '\' should be parsed here...

wumpz commented 3 years ago

@wheredevel

The problem here is to parse escape '\'. That's why I introduced point 2.

revusky commented 3 years ago

mostly works... :)

I don't know whether you're aware of this (maybe you are aware of it, but your comment shows no awareness of it...) but the CommonTokenAction disposition in legacy JavaCC is quite fundamentally broken. I explain this in gory detail here. The main reason is that it doesn't work in conjunction with LOOKAHEAD.

You see, when you're in a lookahead routine,, the tokens are cached in the token.next and token.next.next etc. fields so, when you're parsing whatever construct, it is quite possible that there was a previous lookahead routine that already tokenized the input a certain way. In such a case, these tokens are cached and your CommonTokenAction will simply never be called. The aforementioned article explains that and how this is resolved in JavaCC21.

That is the main problem with the legacy CommonTokenAction but another problem is that you can only have ONE of them! So if you're munging tokens a lot, you end up with a very big CommonTokenAction and the whole thing tends to be become quite unmaintainable. In JavaCC21, you inject a TOKEN_HOOK method and it generates a unique name for it. So you can have as many of them as you want. Actually, that becomes a very big consideration in JavaCC21, because it has an INCLUDE directive. So if you include a grammarfile that already has a CommonTokenAction method defined, then there is a real first-order problem! Of course, that doesn't happen in legacy JavaCC for the simple reason that it does not have an INCLUDE directive!

In other matters, Pavel, after noticing that you starred the JavaCC21 project on github, and I wrote you an email at the address bp*****e(at)gmail.com. That was on April 18. Did you receive that email?

revusky commented 3 years ago

Hi Andreas. For a variety of reasons, I am a bit reluctant to weigh in on these things, because finally, what I can I say? People just have to figure out things for themselves, I guess... but in this case, finally, I feel like I have to make certain points about this.

But I still have concerns on the performance and the maturity of JavaCC21.

Andreas, based on my interaction with you over the last month, you're not a bad guy, but I have no choice but to point out that the above sentence embodies some outright logical fallacies. For starters, consider the old English homily about not looking a gift horse in the mouth. You know, in German: Einem geschenkten Gaul schaut man nicht ins Maul! Now, it's not my main point here, but I would say in passing that, even just on that level, if I've done all this work of fixing every outstanding issue that there is in the old crufty legacy JavaCC... and I have... then... it's a gift! Take it, use it, be happy... already, there is just that common-sensical level of things in the situation.

BUT again, that's not my main point even.... no, by all means... do look the gift horse in the mouth... sure, be my guest... But I would put it to you: Irrespective of what you find when you look the gift horse in the mouth, can that ever justify a decision to hitch your cart to a horse that is DEAD!!??

You see, the implicit fallacy is just too glaring for me to let it slide with no comment. And there are other problems here. One really does have to understand that legacy JavaCC and JavaCC 21 are not two different projects in the same way that ANTLR and JavaCC are two different projects. There is no technical reason for JavaCC and JavaCC 21 to be two separate projects. This state of affairs came about for entirely non-technical reasons -- specifically the intransigence and overall uselessness of the people sitting on the existing JavaCC project. I'm loath to repeat myself too much which is why I wrote the whole sordid history of the thing here.

So, the bottom line is that JavaCC 21 is simply a more advanced version of the legacy JavaCC project. So, to be talking about its "maturity" is really a bit much. However "mature" or "immature" it is, it is necessarily more mature than a more backward version of the SAME THING! And again, whatever problems you see when you look the gift horse in the mouth, that never is an argument for opting for a horse that is DEAD!

But it also seems like it's based on a conceptual confusion between something being "mature" and just being "old". A group of people, all born in 1996, are all of the same chronological age, but "maturity" is obviously another question. That varies. Drastically. The legacy JavaCC project was created in 1996 and developed actively for just a couple of years after that. The JJTree part is probably from the first half of 1997. And there just has not been a meaningful new feature since then! And I think it is pretty safe to say that no real development has taken place since 1998. Well, maybe 1999 at the latest... Regardless, it was basically abandoned in whatever half-baked state it was in until mid-2003 when Sun decided to open-source it. Unfortunately, the utterly wrong kinds of people took hold of the thing and nothing was ever really done. (No need to take my word for that. Just scan through their [commit history]().) JavaCC is kind of an archetypal nothingburger project.

In short, this is a project that yes, is old, but it is not mature at all! There is a period of a couple of years, maybe 1996-1998 in which it was actively developed and the real truth of the matter is that it never had a legitimate 1.0 release. For them to say that they are entering an 8.x release cycle is IMHO very hard to characterize as anything other than an outright fraud! If it was honestly labeled, the version number, when it was open-sourced in 2003, would have been about 0.3, and in the following 18 years, there has not been enough forward movement to justify taking it to 0.4. Granted version numbers are inherently arbitrary, but I think what I say basically stands...

So, to be expressing doubts about the maturity of a tool that is simply a more advanced version of that thing, that was NEVER mature in any real sense in the first place.... It's really a complete misrepresentation of what is really going on here.

Now, to be completely fair, it is correct to say that some parts of the current JavaCC21 codebase are immature. For example, the new machinery for generating fault-tolerant parsers could be described quite accurately as not very mature. BUT... in the context of a comparison (or implicit comparison) with the legacy JavaCC, this is totally beside the point because this is functionality the legacy tool simply does not have!

And it never will. (You could bet your house that.)

Okay, so there it is. I feel I've said what needs to be said. But, in closing... cutting to the chase, as it were.... just consider the following question.

What is the use case for the legacy JavaCC?

And I mean technically speaking, of course. Why would somebody ever opt to use the legacy JavaCC in the full knowledge that JavaCC21 exists? Just for example, why would you use a version of JavaCC that only supports the Java language up to JDK 7 (10 years old at this point) in the full knowledge that there is a version available that is up to date with JDK 16. Or, as I just explained to Pavel Beckerman, how CommonTokenAction basically doesn't work in legacy JavaCC. (Well, sometimes, maybe... not really....) Why would you ever use the legacy JavaCC in the full knowledge that this is broken AND _in the full knowledge that there is a fixed version?

By the way, it should be needless to say, but when I pose the above question, I do mean based on real, legitimate technical considerations... For example, that there is some set of people who want to pretend that they're doing something and, for some mysterious reason, you don't want to hurt their feelings.... that would not count, in my view as a "legitimate technical consideration". You know what I mean... By the same token, something like "I don't personally know the guy but somebody told me that that Revusky guy is a real asshole"... that would not be a "legitimate technical consideration" either.

So I leave that question open to anybody who happens on this. It's not just for Andreas here. I have to think that just trying to focus on that question, what is the real use-case for the legacy JavaCC, that might really just shed some light on the situation, no?

revusky commented 3 years ago

What does "mostly works" mean? What are the constraints?

What are the use cases where it fails?

My earlier response to Pavel explains the situation, I think. The general problem is that this CommonTokenAction hook doesn't really work properly in legacy JavaCC. I explain that in all gory detail here.

It's quite possible that this feature I just introduced could provide the most elegant solution to this. But at the risk of beating a now dead horse, I feel I should point out that, regardless of how mature this new feature is.... (Frankly, not very. I just implemented it a few days ago...) that's kind of beside the point.

If what it is addressing is something that is fundamentally broken in the legacy tool, like this CommonTokenAction disposition is, then really, the maturity of the functionality in question is kind of a moot point. Surely you can see that, can't you?

I'm sorry to be a bit heavy-handed on this, but this kind of thing does get my back up. Just imagine yourself in my shoes, Andreas. Really... You fix every longstanding bug in the legacy tool and somehow your fixes are not "mature" enough. Well, okay, sure... the bugs are plenty mature, I guess. (Though even that entails using the word "mature" in a rather screwy way!)

manticore-projects commented 3 years ago

On Thu, 2021-06-17 at 12:36 +0000, Jonathan Revusky wrote:

what is the real use-case for the legacy JavaCC

Hi Jonathan,

I can answer only for myself: My use-case for legacy JavaCC is to get the fastest possible generic SQL Parser for parsing SQL while editing. This is the part where performance really hurts and my minimum requirement is, that it parses faster than I can write the SQL code.

If I wanted to write a better parser, I would not use a Parser Generator, but write a specific parser.

If I wanted to (re-)write the existing Grammar in a easier way, I would go for ANTLR and sacrifice a bit of performance.

JavaCC21 eventually might provide me with a good trade-off, but during my last tests the performance was not as good as it is with legacy JavaCC right now and this diminished the benefit of having a cleaner Grammar.

Think of it like a simple score card, where we put much weight on performance and quite some more weight on continuity.

That said, I will re-run the performance tests again as soon as I have a time and would reconsider as soon as it is fast enough for the near time parsing.

Best regards Andreas

revusky commented 3 years ago

Andreas, I really really hope you're not one of these.... I dunno how to put it... I'll say Monty Python argument shop skit kinds of guys. You know what I mean? These people who just love arguing for its own sake and will even wilfully misunderstand your point in order to keep arguing. I say this in part because there's one such guy in my community now and I'm about to ban his ass. I'll refrain from saying his name, I guess... (but his initials are MM. :-))

On Thu, 2021-06-17 at 12:36 +0000, Jonathan Revusky wrote: what is the real use-case for the legacy JavaCC Hi Jonathan, I can answer only for myself: My use-case for legacy JavaCC is to get the fastest possible generic SQL Parser for parsing SQL while editing.

Andreas, what you're talking about doesn't relate to the question that I posed. This is not a "use-case" for legacy JavaCC. You see, modern computing is built on layers, right? The fact that you need an SQL parser and that this particular one, JSQLParser, seems like a good option... that's... that's a use-case for JSQLParser or for an SQL parser generally, but it's not a use-case for legacy JavaCC. Because, for one thing...

Modern computing is built on layers.

Or here's another example. The open source software that I was the main author of that really made a hit is FreeMarker, still a very widely used template engine. I did most of that work about 2002-2004. So it was developed against JDK 1.4 mostly and the FreeMarker parser was created using JavaCC 2.1 or something.... (It doesn't even matter because there is no difference between the versions!)

Your reasoning is like, say, if somebody is in the market for a template engine and they like FreeMarker (it's still widely used out there!) and since FreeMarker was built using JDK 1.4 and JavaCC 2.1, therefore that person has a "use-case" for those things. But no, that's a ridiculous framing of the situation. In that instance, the person has a "use-case" for a template engine and possibly for FreeMarker specifically, but to say that he has a use-case for JavaCC 2.1 is really quite a distortion of the situation. Because.... again, Andreas...

Modern computing is built on layers!

You see, whether consciously or not (and I really hope it's not conscious) you're shifting the conversation in a quite tricky way onto a completely different layer, and thus, making a completely illegitimate argument.

So, yes, the FreeMarker template engine was built against JDK 1.4, and lots of other software was, much of which is still working quite well, but that does not mean, certainly not in this context, that there is currently, in 2021, a use-case for JDK 1.4! Well, you can believe that if you want, but Oracle does not. The oldest Java version which they support is JDK 8, from 2014. If you file a bug report against JDK 7, they're not really interested! Because you're choosing to use something that is obsolete and even a company like Oracle has limited resources. And, by the same token, that FreeMarker was built using JavaCC 2.1 or something is not at all the same as saying that there is currently a use-case for that.

Because there really just isn't.

Yes, there is a lot of perfectly good software out there that was built against libraries and runtimes and so forth, that are pretty clearly obsolete now. And that does not mean that the software is not useful as-is. And it also doesn't mean that those people in the day did anything wrong, because whatever they used was typically the state of the art when they were using it.

This is the part where performance really hurts and my minimum requirement is, that it parses faster than I can write the SQL code.

Well, first of all, parsers generated using either the legacy JavaCC or JavaCC21 simply tend to be blazingly fast. The Java grammar that is included with JavaCC21 can parse (with full tree-building and validation) all the code in the JDK 16 src.zip in about half a minute. That is 14,448 files. The average source file takes slightly over 2 milliseconds to parse. And, if putting it this way is clearer: that is well over 100,000 lines a second. Meanwhile, even the fastest touch typists only type about 10 keystrokes a second. That's very fast, 600 keystrokes a minute, maybe 120 wpm. I actually do type about that fast, but only in short bursts, because, believe it or not, I actually have to stop and think frequently when writing code! But even in these fast bursts, there is about typically 100 ms between keystrokes. This Java parser, on hardware that is nothing special, can parse over 10,000 lines of code in that time.

And there is really no reason that a competently written SQL Parser should be any slower than that. Or not very different. Not as far as I can see. In short, you have to be doing something really god-awful to not be able to keep up with a touch typist.

It shouldn't even be close!

And I already explained what is going on with this case. First of all, there is no very significant difference in performance between legacy JavaCC and JavaCC21. Both of them, unless you're really misusing them, tend to generate blazingly fast parsers. Because algorithmically, it's the same damned thing. It's the same recursive descent algorithm and one is not going to generate code with one that is very different in performance from the other. Maybe a tiny bit, but it's not going to be a very big difference.

Now, the specific case you're talking about is quite perverse really because it relates to a fix to a longstanding bug in JavaCC. I describe that here. JavaCC simply does NOT generate nested syntactic lookaheads. And that is clearly a BUG. You can see that because it does generate nested semantic lookaheads. There is no sensible basis to be generating nested semantic lookahead and NOT generate nested syntactic lookaheads. There is no conceivable argument for it. It's just a BUG!

So, the result is that these unbounded syntactic lookaheads that JSQLParser uses (or abuses) frequently have much more severe performance implications in JavaCC21 than in legacy JavaCC, because JavaCC21 is CORRECTLY generating nested syntactic lookahead.

I've explained this to you and I did give you a stopgap workaround even (that should work) but it's an infuriating situation because the comparison you're making is so inherently unfair. A fair comparison has to be apples-to-apples or oranges-to-oranges. You can't say that legacy JavaCC has some huge advantage in performance based on the fact that if you write something like:

    LOOKAHEAD(Foo()) Foo()

it runs faster because syntactic nested lookahead is ignored! It's like saying that somebody is a faster distance runner than another guy, but it turns out that the guy managed to avoid running the latter half of the race course!

Like, get a grip!

You're insistently framing the question as being that JavaCC21 has all these big performance problems based on this bug having been fixed! (That's really quite insane!) So, in this case, JSqlParser achieves all this extra speed by virtue (or vice maybe???) of a known BUG!

I mean, it's utterly perverse. You're essentially arguing that you have a use-case for legacy JavaCC because these nothingburger artists (how wonderfully prescient of them!) never fixed this bug! (Of course, they never fixed ANY bug! I guess, just to be on the safe side, eh?) But this is wonderful, because it allows what is frankly quite poorly written code to run faster than it would otherwise!

So, the first order problem here is that there is code in their system that relies on the presence of a certain bug in the legacy JavaCC tool! Aside from the fact that it is not very good software engineering (to say the least!) it really is just perverse to now lay that at my doorstep. You have a performance issue and it's my fault. Why? Because I fixed this longstanding bug.

So, I tell you, and I would hope you relayed this to the other people here, that I'm willing to help you get it all straightened out, but really, you do have to stop coming at with me with this nonsense. Because it is starting to get egregious. And it can just get too aggravating and I don't have much of an investment in the situation and could finally just walk away, figuring I have better things to do.

But getting back to the question at hand of whether there is a legitimate use-case for legacy JavaCC, I'm quite sure there isn't. In terms of these unbounded lookaheads on deeply nested constructs, this is something that this project would have to refactor properly anyway. At some point... You can't really be building on top of something that is a known bug and then saying: "Look how fast our software runs".

If I wanted to write a better parser, I would not use a Parser Generator, but write a specific parser.

Well, that's your call, but you're now deviating even further from the question of whether there is really a use-case for the legacy JavaCC -- basically abandoned to rot for the last 20 years -- when there is a new version in which all the various issues are fixed! Just for starters, I am quite satisfied that, in the year 2021, there is not a use-case for a Java parser generator that only handles the Java language up to JDK 7 in the full knowledge that there is a more advanced version of the same tool that works with the current Java language.

And by opting to ignore the real work taking place in the space, not only is one cutting oneself off from the newer features that are already implemented, but also the newer features to come. Already there is a first-pass implementation of fault-tolerant parsing. I know that you know about that. It's not just that the legacy JavaCC cannot generate fault-tolerant parsers.

It never will.

If I wanted to (re-)write the existing Grammar in a easier way, I would go for ANTLR and sacrifice a bit of performance.

Well, that's fine, but it's likely to be more than just a bit! Well, actually, if performance is your sine qua non, I would suggest that you forget about ANTLR!

JavaCC21 eventually might provide me with a good trade-off, but during my last tests the performance was not as good as it is with legacy JavaCC right now

Well, yah-dee-dah, but the essence of the situation is that it's about the same work to switch whether you do it now or you do it later, so it's drop-dead simple that the better course of action is to do the switch now. Also, at a later point, you may not have my collaboration.

So, look, Andreas, you really have to stop this now. I have explained to you the situation now, more than once. There is not a significant performance difference between legacy JavaCC and JavaCC21. Well, granted, you're seeing one, but it's not based on a correct or fair comparison. The extra performance is spurious, because it's based on the presence of a BUG!

and this diminished the benefit of having a cleaner Grammar. Think of it like a simple score card, where we put much weight on performance and quite some more weight on continuity. That said, I will re-run the performance tests

Well, fine. Run as many performance tests as you want, but I'm giving you the straight dope here. I'm explaining the situation. AND I've always been quite willing to help you and to help this community get this all working properly.

This is a very one-sided relationship, you know. I'm giving you plenty and not getting much for myself in return. Well, fine, maybe you don't have to much to offer me right now. All I would ask now is that you please, please, please, try your best to stop talking shit.

Can you please do that? Is that too much to ask? Because if you're characterologically incapable of that, then...

Regards,

Jon

again as soon as I have a time and would reconsider as soon as it is fast enough for the near time parsing. Best regards Andreas

manticore-projects commented 3 years ago

This is a very one-sided relationship, you know. I'm giving you plenty and not getting much for myself in return. Well, fine, maybe you don't have to much to offer me right now. All I would ask now is that you please, please, please, try your best to stop talking shit.

Totally my mistake, but no worries. I won't bother you any further so you can fully focus on your work and your more qualified users.

Let this thread/issue be on parsing Quoted Strings in the different SQL Dialects solely.

revusky commented 3 years ago

This is a very one-sided relationship, you know. I'm giving you plenty and not getting much for myself in return. Well, fine, maybe you don't have to much to offer me right now. All I would ask now is that you please, please, please, try your best to stop talking shit.

Totally my mistake, but no worries. I won't bother you any further so you can fully focus on your work and your more qualified users.

Andreas, you really need to stop this kind of thing. It's really not a good thing, not a good behavior pattern.

Look, you're now insinuating something about me that has no basis in reality. For the record, I never said anything about you or any other user being "qualified". And I never would say such a thing because that does not even correspond to my whole way of thinking. You must be projecting your own inferiority complex.

Regardless, this kind of thing has a sleazy sort of feel to it, kind of makes my skin crawl.

I didn't really want to get into this conversation even. Truth told, I basically wrote off this community long ago. I only entered the conversation because you were making some very bizarre misstatements. For me, the real tipping point in this was when you casually said you had "doubts about JavaCC21's 'maturity'"

Well, as I recall, you said that you had read the full history of this that I wrote. So, surely you understand that JavaCC21 (previously FreeCC when it was initially forked back 2008) is a more advanced version of the tool that these people already happen to be using! And it's the result of systematically fixing every well-known, longstanding bug in the abandoned legacy tool.

To be clear, generally speaking it is not invalid to question the "maturity" of JavaCC21 (or whatever thing.) BUT... when we're in a context in which the only real point of comparison is a far more backward of the very same tool... that's just.... And finally, it's very hard to interpret this generously. Anybody can look at the issues page on JavaCC21 and see that I was very responsive and helpful to you. After that, I don't think you should be here (or anywhere else!) spreading this kind of sleazy FUD. I can't turn a blind eye to this.

Back in my earlier run at open source, the FreeMarker days, I put my trust in some very low-quality, treacherous people, mostly I'm thinking of Daniel Dekany, a real back-stabber. I try to learn from the past and any sign that somebody is a treacherous backstabber, I intend to extirpate the problem very quickly this time round. So I just banned you from my world. But never mind. It's not really personal.

(Actually, the above is formulaic. It is personal.)

Let this thread/issue be on parsing Quoted Strings in the different SQL Dialects solely.

JSQLParser / JSqlParser

Error while parsing literals with backslash #1209