activeviam / par-student-spark-atoti

Project with students from CentraleSupelec to explore Spark API in order to power atoti with Spark
1 stars 0 forks source link

Benchmarks to compare Spark SQL API and Spark raw SQL #11

Open arnaudframmery opened 2 years ago

arnaudframmery commented 2 years ago

The benchmarks are done with JMH with 3 warmups iterations and then 10 iterations with the mode single shot time

The dataset used is this one : https://www.kaggle.com/sobhanmoosavi/us-accidents It's about 570 MB and 1.5 million lines.

The hardware is for now a laptop with a Ryzen 4500H and 20 Go of RAM

List query with limit and offset (limit = 100 000, offset = 0)

Method Time
SQL API 3469,058 ± 24,771 ms
raw SQL 302,646 ± 9,981 ms

List query with limit and offset (limit = 100 000, offset = 100 000)

Method Time
SQL API 2992,163 ± 19,237 ms
raw SQL 521,539 ± 15,745 ms

List query with limit and offset (limit = -1, offset = 100 000)

Method Time
SQL API 4103,849 ± 123,673 ms
raw SQL 4084,750 ± 72,480 ms

List query with conditions

Method Time
SQL API 9688,635 ± 64,259 ms
raw SQL 4376,306 ± 65,468 ms

Aggregation query + show

Method Time
SQL API 9527,319 ± 82,806 ms
raw SQL 3556,624 ± 36,587 ms
arnaudframmery commented 2 years ago

List query Comparison in function of limit value

We can notice a strange behaviour when limit equal -1 or not. This can be explained by the fact that in function of limit value, the SQL API used will not be the same in our code (we use between() if limit != -1 and geq() otherwise)

Here is the Flamegraph of the SQL API call with limit = 100 000 sql_api_limit_100000_offset_100000

Here is the Flamegraph of the SQL API call with limit = -1 sql_api_limit_-1_offset_100000

=> We can see that the function calls are not the same between these two graphs

Nevertheless, the Flamegraph of the raw SQL keeps the same whatever the limit value raw_sql_limit_100000_offset_100000

OPeyrusse commented 2 years ago

You could also try to access Spark UI to have access to Spark query plans. It can help understanding the plan, what are the differences. I found this doc about Spark UI : https://spark.apache.org/docs/3.0.0-preview/web-ui.html but I haven't found a way to access it from your tests yet

arnaudframmery commented 2 years ago

With the databricks cluster

List query with limit and offset (limit = 100 000, offset = 0)

Method Time
SQL API 833,819 ± 151,347 ms
raw SQL 956,818 ± 203,762 ms

List query with limit and offset (limit = 100 000, offset = 100 000)

Method Time
SQL API 833,815 ± 165,537 ms
raw SQL 1759,694 ± 924,455 ms

List query with limit and offset (limit = -1, offset = 100 000)

Method Time
SQL API 2745,061 ± 1392,481 ms
raw SQL 5848,527 ± 5638,261 ms

Take into consideration that the errors are quite huge ! A good idea would be to take the time provided by Spark UI for the requests execution only

arnaudframmery commented 2 years ago

When we are looking at the time given by Spark UI (in the Databricks cluster)

List query with limit and offset (limit = 100 000, offset = 0)

Method Time
SQL API 0.45 s
raw SQL 0.31 s

List query with limit and offset (limit = 100 000, offset = 100 000)

Method Time
SQL API 0.47 s
raw SQL 0.22 s

List query with limit and offset (limit = -1, offset = 100 000)

Method Time
SQL API 0.5 s
raw SQL 0.5 s
arnaudframmery commented 2 years ago

Here are the query plans generated by Spark

SQL API (same for all sets of param)

Databricks Shell - Details for Query 25

raw SQL (limit = 100 000, offset = 0)

Databricks Shell - Details for Query 105 (1)

SQL API (limit = 100 000, offset = 100 000)

Databricks Shell - Details for Query 131

SQL API (limit = -1, offset = 100 000)

Databricks Shell - Details for Query 153

arnaudframmery commented 2 years ago

The new values for the implementation of conditions and aggregations with Scala operators

Running on the Databricks cluster (1 node or 3 nodes) and figures are given by the Spark UI.

The query is like the following : SELECT Severity, COUNT(*) AS severity_count FROM us_accidents WHERE Crossing = "true" GROUP BY Severity

Dataset with 1.5 millions lines (47 columns)

Method 1 node 3 nodes Speed up
SQL API 0.57 s 0.51 s 1.12
raw SQL 0.56 s 0.48 s 1.17

Dataset with 15 millions lines (47 columns)

Method 1 node 3 nodes Speed up
SQL API 5 s 3 s 1.67
raw SQL 5 s 2.7 s 1.85

Dataset with 150 millions lines (47 columns)

Method 1 node 3 nodes Speed up
SQL API 13.9 s 5 s 2.78
raw SQL 14 s 4.8 s 2.92