Closed ryanmurf closed 5 years ago
Simplified Query
SELECT * FROM tbl WHERE (ROUND((((((period_diff(date_format(tbl.CD, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(),25200)) - MONTH('2012-02-01')) - 1) / 3)))) = -3)
From the first query in (ms) jsqlparser-0.7.0.jar: Test Query: Iterations 5: mean time: 178.40 SD: 79.32 Var: 6291.04
jsqlparser-0.8.5.jar: Test Query: Iterations 5: mean time: 721.00 SD: 257.08 Var: 66090.80
jsqlparser-0.8.6.jar: Test Query: Iterations 5: mean time: 672.80 SD: 396.60 Var: 157295.36
jsqlparser-0.8.8.jar: Test Query: Iterations 5: mean time: 725.00 SD: 442.55 Var: 195848.00
jsqlparser-0.8.9.jar: Test Query: Iterations 5: mean time: 781.80 SD: 473.62 Var: 224313.76
jsqlparser-0.9.0.jar: Test Query: Iterations 5: mean time: 1598.00 SD: 651.82 Var: 424868.00
jsqlparser-0.9.1.jar: Test Query: Iterations 5: mean time: 2548.20 SD: 771.55 Var: 595285.76
jsqlparser-0.9.2.jar: Test Query: Iterations 5: mean time: 14975.80 SD: 515.50 Var: 265743.76
jsqlparser-0.9.3.jar: Test Query: Iterations 5: mean time: 4359.20 SD: 253.88 Var: 64457.36
jsqlparser-0.9.4.jar: Test Query: Iterations 5: mean time: 5013.60 SD: 370.49 Var: 137263.44
jsqlparser-0.9.5.jar: Test Query: Iterations 5: mean time: 1557.40 SD: 164.84 Var: 27171.44
It was quite suprising for me to see those large time differences between versions of JSqlParser. I will introduce a performance measure tool for JSqlParser to detect those jumps more consistently. I hope this will led us to some hotspots work on.
After digging around using a not released performance measure tool for JSqlParser I found some points to improve.
So I think now I need more SQLs to check.
What would be the best way to get those to you?
You could email me a text file. SQLs separated by ';'. It would be a good thing if you would allow to integrate these SQLs as test cases into JSqlParser or JSqlParser-Perf.
The most impact would be a rewrite of the SelectBody - production. Due to production lookahead - I dont know the exact name used by JavaCCs manual - the complete SQL statement is parsed twice at the minimum. So for normal selects this would give a guaranteed speedup of 50 percent. A next problem comes with the PrimaryExpression rule that is used so much that each minimal speedup would accumulate to some significant percents speedup of the whole parse process.
One could check the new performance test tool repository within JSqlParsers github organisation. This could be used for any JavaCC - grammar.
@wumpz do you still need additional queries for testing? I have a bunch I might be able to send over.
@timehat: To be honest. After not getting any answer to my last comment, which was meant to start a discussion, I considered this issue not as that important anymore.
@wumpz So I just noticed that the OP is actually a coworker of mine. I'm not a seasoned Java dev, so I can't really give any feedback on your comment from May 16th. But I did take another query that I saw taking about 23 seconds to parse and trimmed it down until I think I found (for my particular query) where some of the slowness comes in.
I'll give some stats on that in a second, but my main question is: what is your interest level on reports like this? In our case, we're parsing queries written by our users/customers for a data analytics/visualization tool. Some of what people come up with is surprising. We'd like to do our best effort to fulfill whatever we can (indeed some of these queries end up taking <1 second to execute against the database, but sometimes tens of seconds to parse). Are these too much of an edge case such that you don't want the noise?
Here's the reduction of what I found doing some testing today:
On my machine, this takes 71 seconds to parse:
SELECT CASE WHEN ( CASE WHEN ( CASE WHEN ( CASE WHEN ( 1 ) THEN 0 END ) THEN 0 END ) THEN 0 END ) THEN 0 END FROM a
Removing one layer of CASE statement and it drops to 7 seconds:
SELECT CASE WHEN ( CASE WHEN ( CASE WHEN ( 1 ) THEN 0 END ) THEN 0 END ) THEN 0 END FROM a
One less than that comes in around a 1/3 of a second.
Adding an extra layer on top of the 71 second version comes in at... well, it's still running. (Ha: just finished: 822 seconds). But I think you get the idea. There appears to be exponential growth in the time necessary in parsing this structure of query as the nesting of case statements increases. I'm sure if I added a few more, I'd essentially be locking up the entire thread and could eventually cause a DoS in my application.
Removing the parenthesis from the query that took 822 seconds to parse dropped the parsing time down to 2 seconds.
@timehat Aha, he is a seasoned Java dev. That explains the communication stop.
I started digging around again. Strangly I could not parse your statement or is this not the parsed statement itself.
As I mentioned a big impact would be a rewrite of SelectBody production. I am on it, at least to test its possible gain.
@timehat Test of your first sql is down to 80ms ;). SelectBody alone was not the whole thing.
@timehat Nevertheless the original sql is down to 3secs, but the large examples take still time.
@timehat ryanmurf sent me an 100kb SQL file, that consist of ca. 10 SQLs. This file is parsed taking 170secs. So going from not working to successfully processing with need of time. You could see it https://github.com/JSQLParser/javacc-perf-diag.
@wumpz thank you for looking into this. Where can I find your code change that made the perf improvements? I'm not seeing any commits on either repo. Or are you still finishing that up?
Correct. The code is still local. I need to make some tests as I am not yet sure if the parenthesis parsing results in the same output as before. Nevertheless all tests succeed.
Still there is work to do within the Primary expression production.
Here are statistics of JSqlParsers speed test for 0.9.7 release and this actual 0.9.8-SNAPSHOT:
42500 statements parsed in 3968 milliseconds (10710 statements per second, 0,0000934 seconds per statement ) 38500 select scans for table name executed in 102 milliseconds (377450 select scans for table name per second, 0,0000026 seconds per select scans for table name)
42500 statements parsed in 3032 milliseconds (14017 statements per second, 0,0000713 seconds per statement ) 38500 select scans for table name executed in 59 milliseconds (652542 select scans for table name per second, 0,0000015 seconds per select scans for table name)
at least this seems to be fixed
Noticed that large queries can take > 5 seconds. Below is an example of one that takes about 5 seconds. I have one that will take > 30 second.
SELECT 'CR' AS
^CR
, (1 - (SUM((CASE WHEN (tbl
.AS
= 'Cancelled') THEN (CASE WHEN (ROUND((((((period_diff(date_format(tbl
.CD
, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(), 25200)) - MONTH('2012-02-01')) - 1) / 3)))) = -3) THEN 1 ELSE 0 END) ELSE 0 END)) / SUM((CASE WHEN (tbl
.AS
= 'Active') THEN (CASE WHEN (ROUND((((((period_diff(date_format(tbl
.OCD
, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(), 25200)) - MONTH('2012-02-01')) - 1) / 3)))) <= -3) THEN 1 ELSE 0 END) ELSE 0 END)))) AS^P3Q TRR
, (1 - (SUM((CASE WHEN (tbl
.AS
= 'Cancelled') THEN (CASE WHEN (ROUND((((((period_diff(date_format(tbl
.CD
, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(), 25200)) - MONTH('2012-02-01')) - 1) / 3)))) = -2) THEN 1 ELSE 0 END) ELSE 0 END)) / SUM((CASE WHEN (tbl
.AS
= 'Active') THEN (CASE WHEN (ROUND((((((period_diff(date_format(tbl
.OCD
, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(), 25200)) - MONTH('2012-02-01')) - 1) / 3)))) <= -2) THEN 1 ELSE 0 END) ELSE 0 END)))) AS^P2Q TRR
, (1 - (SUM((CASE WHEN (tbl
.AS
= 'Cancelled') THEN (CASE WHEN (ROUND((((((period_diff(date_format(tbl
.CD
, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(), 25200)) - MONTH('2012-02-01')) - 1) / 3)))) = -1) THEN 1 ELSE 0 END) ELSE 0 END)) / SUM((CASE WHEN (tbl
.AS
= 'Active') THEN (CASE WHEN (ROUND((((((period_diff(date_format(tbl
.OCD
, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(), 25200)) - MONTH('2012-02-01')) - 1) / 3)))) <= -1) THEN 1 ELSE 0 END) ELSE 0 END)))) AS^PQ TRR
, (1 - (SUM((CASE WHEN ((ROUND((((((period_diff(date_format(tbl
.CD
, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(), 25200)) - MONTH('2012-02-01')) - 1) / 3)))) = 0) AND (tbl
.AS
= 'Cancelled')) THEN 1 ELSE 0 END)) / SUM((CASE WHEN (tbl
.AS
= 'Active') THEN 1 ELSE 0 END)))) AS^CQ TRR
FROMtbl
GROUP BY 'CR' LIMIT 25000