arnaudframmery opened this issue 2 years ago
We can notice a strange behaviour depending on whether limit equals -1 or not. This can be explained by the fact that, depending on the limit value, our code does not call the same SQL API: we use between() if limit != -1 and geq() otherwise.
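The branching described above can be sketched in plain Python (the function and argument names here are illustrative, not the project's actual helpers):

```python
def offset_limit_bounds(limit, offset):
    """Return the (lower, upper) row-number bounds used to filter rows.

    Mirrors the branching described above: with a concrete limit the range
    can be bounded on both sides (between), while limit = -1 means "no
    upper bound", so only a lower bound applies (geq).
    """
    if limit != -1:
        # between(offset, offset + limit - 1): closed range on both ends
        return (offset, offset + limit - 1)
    # geq(offset): lower bound only, no upper bound
    return (offset, None)


def apply_bounds(row_numbers, bounds):
    """Keep only the row numbers that fall inside the bounds."""
    lower, upper = bounds
    if upper is None:
        return [n for n in row_numbers if n >= lower]
    return [n for n in row_numbers if lower <= n <= upper]
```

Because the two branches produce structurally different predicates, Spark may well plan (and optimize) them differently, which would explain the distinct flamegraphs below.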
Here is the Flamegraph of the SQL API call with limit = 100 000
Here is the Flamegraph of the SQL API call with limit = -1
=> We can see that the function calls differ between these two graphs.
Nevertheless, the flamegraph of the raw SQL stays the same regardless of the limit value.
You could also try Spark UI to get access to the Spark query plans; it can help in understanding the plans and what the differences are. I found this doc about Spark UI: https://spark.apache.org/docs/3.0.0-preview/web-ui.html but I haven't found a way to access it from your tests yet.
With the Databricks cluster:
Method | Time |
---|---|
SQL API | 833,819 ± 151,347 ms |
raw SQL | 956,818 ± 203,762 ms |
Method | Time |
---|---|
SQL API | 833,815 ± 165,537 ms |
raw SQL | 1759,694 ± 924,455 ms |
Method | Time |
---|---|
SQL API | 2745,061 ± 1392,481 ms |
raw SQL | 5848,527 ± 5638,261 ms |
Take into consideration that the error margins are quite large! A good idea would be to take the times reported by Spark UI, which cover the query execution only.
When we look at the times given by Spark UI (on the Databricks cluster):
Method | Time |
---|---|
SQL API | 0.45 s |
raw SQL | 0.31 s |
Method | Time |
---|---|
SQL API | 0.47 s |
raw SQL | 0.22 s |
Method | Time |
---|---|
SQL API | 0.5 s |
raw SQL | 0.5 s |
Here are the query plans generated by Spark
Running on the Databricks cluster (1 node or 3 nodes); the figures are given by the Spark UI.
The query is the following:

```sql
SELECT Severity, COUNT(*) AS severity_count FROM us_accidents WHERE Crossing = "true" GROUP BY Severity
```
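For reference, here is what that query computes, sketched in plain Python over a made-up in-memory sample (the real data is the Kaggle dataset linked below):

```python
from collections import Counter

# Tiny stand-in for us_accidents rows (Severity and Crossing columns only)
rows = [
    {"Severity": 2, "Crossing": "true"},
    {"Severity": 3, "Crossing": "false"},
    {"Severity": 2, "Crossing": "true"},
    {"Severity": 4, "Crossing": "true"},
]

# WHERE Crossing = "true", then GROUP BY Severity with COUNT(*)
severity_count = Counter(
    r["Severity"] for r in rows if r["Crossing"] == "true"
)
```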
Method | 1 node | 3 nodes | Speed up |
---|---|---|---|
SQL API | 0.57 s | 0.51 s | 1.12 |
raw SQL | 0.56 s | 0.48 s | 1.17 |
Method | 1 node | 3 nodes | Speed up |
---|---|---|---|
SQL API | 5 s | 3 s | 1.67 |
raw SQL | 5 s | 2.7 s | 1.85 |
Method | 1 node | 3 nodes | Speed up |
---|---|---|---|
SQL API | 13.9 s | 5 s | 2.78 |
raw SQL | 14 s | 4.8 s | 2.92 |
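The Speed up column above is simply the 1-node time divided by the 3-node time; for example, for the aggregation query:

```python
# (1-node seconds, 3-node seconds) from the aggregation table above
timings = {
    "SQL API": (13.9, 5.0),
    "raw SQL": (14.0, 4.8),
}

# speedup = time on 1 node / time on 3 nodes
speedups = {name: round(t1 / t3, 2) for name, (t1, t3) in timings.items()}
```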
The benchmarks are done with JMH, with 3 warmup iterations followed by 10 measurement iterations in single-shot-time mode.
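The JMH setup above can be approximated like this (a plain-Python analogue for illustration, not the actual benchmark harness):

```python
import time

def single_shot_benchmark(fn, warmups=3, iterations=10):
    """Rough analogue of the JMH single-shot-time setup described above:
    run `fn` a few times to warm up (results discarded), then time each
    of `iterations` single invocations independently."""
    for _ in range(warmups):
        fn()  # warmup: trigger caching/JIT effects, discard the result
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    mean = sum(samples) / len(samples)
    return mean, samples
```

Single-shot mode measures each invocation once rather than looping, which suits these queries since a single run already takes hundreds of milliseconds; it also explains the large error margins reported earlier.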
The dataset used is this one : https://www.kaggle.com/sobhanmoosavi/us-accidents It's about 570 MB and 1.5 million lines.
The hardware is, for now, a laptop with a Ryzen 4500H and 20 GB of RAM.
- List query with limit and offset (limit = 100 000, offset = 0)
- List query with limit and offset (limit = 100 000, offset = 100 000)
- List query with limit and offset (limit = -1, offset = 100 000)
- List query with conditions
- Aggregation query + show