JSQLParser / JSqlParser

JSqlParser parses an SQL statement and translate it into a hierarchy of Java classes. The generated hierarchy can be navigated using the Visitor Pattern
https://github.com/JSQLParser/JSqlParser/wiki
Apache License 2.0
5.39k stars 1.34k forks source link

Performance - large sql #235

Closed ryanmurf closed 5 years ago

ryanmurf commented 8 years ago

Noticed that large queries can take > 5 seconds. Below is an example of one that takes about 5 seconds. I have one that will take > 30 second.

SELECT 'CR' AS ^CR, (1 - (SUM((CASE WHEN (tbl.AS = 'Cancelled') THEN (CASE WHEN (ROUND((((((period_diff(date_format(tbl.CD, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(), 25200)) - MONTH('2012-02-01')) - 1) / 3)))) = -3) THEN 1 ELSE 0 END) ELSE 0 END)) / SUM((CASE WHEN (tbl.AS = 'Active') THEN (CASE WHEN (ROUND((((((period_diff(date_format(tbl.OCD, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(), 25200)) - MONTH('2012-02-01')) - 1) / 3)))) <= -3) THEN 1 ELSE 0 END) ELSE 0 END)))) AS ^P3Q TRR, (1 - (SUM((CASE WHEN (tbl.AS = 'Cancelled') THEN (CASE WHEN (ROUND((((((period_diff(date_format(tbl.CD, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(), 25200)) - MONTH('2012-02-01')) - 1) / 3)))) = -2) THEN 1 ELSE 0 END) ELSE 0 END)) / SUM((CASE WHEN (tbl.AS = 'Active') THEN (CASE WHEN (ROUND((((((period_diff(date_format(tbl.OCD, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(), 25200)) - MONTH('2012-02-01')) - 1) / 3)))) <= -2) THEN 1 ELSE 0 END) ELSE 0 END)))) AS ^P2Q TRR, (1 - (SUM((CASE WHEN (tbl.AS = 'Cancelled') THEN (CASE WHEN (ROUND((((((period_diff(date_format(tbl.CD, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(), 25200)) - MONTH('2012-02-01')) - 1) / 3)))) = -1) THEN 1 ELSE 0 END) ELSE 0 END)) / SUM((CASE WHEN (tbl.AS = 'Active') THEN (CASE WHEN (ROUND((((((period_diff(date_format(tbl.OCD, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(), 25200)) - MONTH('2012-02-01')) - 1) / 3)))) <= -1) THEN 1 ELSE 0 END) ELSE 0 END)))) AS ^PQ TRR, (1 - (SUM((CASE WHEN ((ROUND((((((period_diff(date_format(tbl.CD, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(), 25200)) - MONTH('2012-02-01')) - 1) / 3)))) = 0) AND (tbl.AS = 'Cancelled')) THEN 1 ELSE 0 END)) / SUM((CASE WHEN (tbl.AS = 'Active') THEN 1 ELSE 0 END)))) AS ^CQ TRR FROM tbl GROUP BY 'CR' LIMIT 25000

ryanmurf commented 8 years ago

Simplified Query

SELECT * FROM tbl WHERE (ROUND((((((period_diff(date_format(tbl.CD, '%Y%m'), date_format(SUBTIME(CURRENT_TIMESTAMP(), 25200), '%Y%m')) + month(SUBTIME(CURRENT_TIMESTAMP(), 25200))) - MONTH('2012-02-01')) - 1) / 3) - ROUND((((month(SUBTIME(CURRENT_TIMESTAMP(),25200)) - MONTH('2012-02-01')) - 1) / 3)))) = -3)

ryanmurf commented 8 years ago

From the first query in (ms) jsqlparser-0.7.0.jar: Test Query: Iterations 5: mean time: 178.40 SD: 79.32 Var: 6291.04

jsqlparser-0.8.5.jar: Test Query: Iterations 5: mean time: 721.00 SD: 257.08 Var: 66090.80

jsqlparser-0.8.6.jar: Test Query: Iterations 5: mean time: 672.80 SD: 396.60 Var: 157295.36

jsqlparser-0.8.8.jar: Test Query: Iterations 5: mean time: 725.00 SD: 442.55 Var: 195848.00

jsqlparser-0.8.9.jar: Test Query: Iterations 5: mean time: 781.80 SD: 473.62 Var: 224313.76

jsqlparser-0.9.0.jar: Test Query: Iterations 5: mean time: 1598.00 SD: 651.82 Var: 424868.00

jsqlparser-0.9.1.jar: Test Query: Iterations 5: mean time: 2548.20 SD: 771.55 Var: 595285.76

jsqlparser-0.9.2.jar: Test Query: Iterations 5: mean time: 14975.80 SD: 515.50 Var: 265743.76

jsqlparser-0.9.3.jar: Test Query: Iterations 5: mean time: 4359.20 SD: 253.88 Var: 64457.36

jsqlparser-0.9.4.jar: Test Query: Iterations 5: mean time: 5013.60 SD: 370.49 Var: 137263.44

jsqlparser-0.9.5.jar: Test Query: Iterations 5: mean time: 1557.40 SD: 164.84 Var: 27171.44

wumpz commented 8 years ago

It was quite suprising for me to see those large time differences between versions of JSqlParser. I will introduce a performance measure tool for JSqlParser to detect those jumps more consistently. I hope this will led us to some hotspots work on.

wumpz commented 8 years ago

After digging around using a not released performance measure tool for JSqlParser I found some points to improve.

So I think now I need more SQLs to check.

ryanmurf commented 8 years ago

What would be the best way to get those to you?

wumpz commented 8 years ago

You could email me a text file. SQLs separated by ';'. It would be a good thing if you would allow to integrate these SQLs as test cases into JSqlParser or JSqlParser-Perf.

wumpz commented 8 years ago

The most impact would be a rewrite of the SelectBody - production. Due to production lookahead - I dont know the exact name used by JavaCCs manual - the complete SQL statement is parsed twice at the minimum. So for normal selects this would give a guaranteed speedup of 50 percent. A next problem comes with the PrimaryExpression rule that is used so much that each minimal speedup would accumulate to some significant percents speedup of the whole parse process.

One could check the new performance test tool repository within JSqlParsers github organisation. This could be used for any JavaCC - grammar.

timehat commented 7 years ago

@wumpz do you still need additional queries for testing? I have a bunch I might be able to send over.

wumpz commented 7 years ago

@timehat: To be honest. After not getting any answer to my last comment, which was meant to start a discussion, I considered this issue not as that important anymore.

timehat commented 7 years ago

@wumpz So I just noticed that the OP is actually a coworker of mine. I'm not a seasoned Java dev, so I can't really give any feedback on your comment from May 16th. But I did take another query that I saw taking about 23 seconds to parse and trimmed it down until I think I found (for my particular query) where some of the slowness comes in.

I'll give some stats on that in a second, but my main question is: what is your interest level on reports like this? In our case, we're parsing queries written by our users/customers for a data analytics/visualization tool. Some of what people come up with is surprising. We'd like to do our best effort to fulfill whatever we can (indeed some of these queries end up taking <1 second to execute against the database, but sometimes tens of seconds to parse). Are these too much of an edge case such that you don't want the noise?

Here's the reduction of what I found doing some testing today:

On my machine, this takes 71 seconds to parse: SELECT CASE WHEN ( CASE WHEN ( CASE WHEN ( CASE WHEN ( 1 ) THEN 0 END ) THEN 0 END ) THEN 0 END ) THEN 0 END FROM a

Removing one layer of CASE statement and it drops to 7 seconds: SELECT CASE WHEN ( CASE WHEN ( CASE WHEN ( 1 ) THEN 0 END ) THEN 0 END ) THEN 0 END FROM a

One less than that comes in around a 1/3 of a second.

Adding an extra layer on top of the 71 second version comes in at... well, it's still running. (Ha: just finished: 822 seconds). But I think you get the idea. There appears to be exponential growth in the time necessary in parsing this structure of query as the nesting of case statements increases. I'm sure if I added a few more, I'd essentially be locking up the entire thread and could eventually cause a DoS in my application.

Removing the parenthesis from the query that took 822 seconds to parse dropped the parsing time down to 2 seconds.

wumpz commented 7 years ago

@timehat Aha, he is a seasoned Java dev. That explains the communication stop.

I started digging around again. Strangly I could not parse your statement or is this not the parsed statement itself.

As I mentioned a big impact would be a rewrite of SelectBody production. I am on it, at least to test its possible gain.

wumpz commented 7 years ago

@timehat Test of your first sql is down to 80ms ;). SelectBody alone was not the whole thing.

wumpz commented 7 years ago

@timehat Nevertheless the original sql is down to 3secs, but the large examples take still time.

wumpz commented 7 years ago

@timehat ryanmurf sent me an 100kb SQL file, that consist of ca. 10 SQLs. This file is parsed taking 170secs. So going from not working to successfully processing with need of time. You could see it https://github.com/JSQLParser/javacc-perf-diag.

timehat commented 7 years ago

@wumpz thank you for looking into this. Where can I find your code change that made the perf improvements? I'm not seeing any commits on either repo. Or are you still finishing that up?

wumpz commented 7 years ago

Correct. The code is still local. I need to make some tests as I am not yet sure if the parenthesis parsing results in the same output as before. Nevertheless all tests succeed.

wumpz commented 7 years ago

Still there is work to do within the Primary expression production.

wumpz commented 7 years ago

Here are statistics of JSqlParsers speed test for 0.9.7 release and this actual 0.9.8-SNAPSHOT:

0.9.7

42500 statements parsed in 3968 milliseconds (10710 statements per second, 0,0000934 seconds per statement ) 38500 select scans for table name executed in 102 milliseconds (377450 select scans for table name per second, 0,0000026 seconds per select scans for table name)

0.9.8-SNAPSHOT

42500 statements parsed in 3032 milliseconds (14017 statements per second, 0,0000713 seconds per statement ) 38500 select scans for table name executed in 59 milliseconds (652542 select scans for table name per second, 0,0000015 seconds per select scans for table name)

wumpz commented 5 years ago

at least this seems to be fixed