activeviam / par-student-spark-atoti

Project with students from CentraleSupelec to explore Spark API in order to power atoti with Spark

1 stars 0 forks source link

Benchmarks to compare Spark SQL API and Spark raw SQL #11

Open arnaudframmery opened 2 years ago

arnaudframmery commented 2 years ago

The benchmarks are done with JMH with 3 warmups iterations and then 10 iterations with the mode single shot time

The dataset used is this one : https://www.kaggle.com/sobhanmoosavi/us-accidents It's about 570 MB and 1.5 million lines.

The hardware is for now a laptop with a Ryzen 4500H and 20 Go of RAM

List query with limit and offset (limit = 100 000, offset = 0)

Method	Time
SQL API	3469,058 ± 24,771 ms
raw SQL	302,646 ± 9,981 ms

List query with limit and offset (limit = 100 000, offset = 100 000)

Method	Time
SQL API	2992,163 ± 19,237 ms
raw SQL	521,539 ± 15,745 ms

List query with limit and offset (limit = -1, offset = 100 000)

Method	Time
SQL API	4103,849 ± 123,673 ms
raw SQL	4084,750 ± 72,480 ms

List query with conditions

Method	Time
SQL API	9688,635 ± 64,259 ms
raw SQL	4376,306 ± 65,468 ms

Aggregation query + show

Method	Time
SQL API	9527,319 ± 82,806 ms
raw SQL	3556,624 ± 36,587 ms

arnaudframmery commented 2 years ago

List query Comparison in function of limit value

We can notice a strange behaviour when limit equal -1 or not. This can be explained by the fact that in function of limit value, the SQL API used will not be the same in our code (we use between() if limit != -1 and geq() otherwise)

Here is the Flamegraph of the SQL API call with limit = 100 000 sql_api_limit_100000_offset_100000

Here is the Flamegraph of the SQL API call with limit = -1 sql_api_limit_-1_offset_100000

=> We can see that the function calls are not the same between these two graphs

Nevertheless, the Flamegraph of the raw SQL keeps the same whatever the limit value raw_sql_limit_100000_offset_100000

OPeyrusse commented 2 years ago

You could also try to access Spark UI to have access to Spark query plans. It can help understanding the plan, what are the differences. I found this doc about Spark UI : https://spark.apache.org/docs/3.0.0-preview/web-ui.html but I haven't found a way to access it from your tests yet

arnaudframmery commented 2 years ago

With the databricks cluster

List query with limit and offset (limit = 100 000, offset = 0)

Method	Time
SQL API	833,819 ± 151,347 ms
raw SQL	956,818 ± 203,762 ms

List query with limit and offset (limit = 100 000, offset = 100 000)

Method	Time
SQL API	833,815 ± 165,537 ms
raw SQL	1759,694 ± 924,455 ms

List query with limit and offset (limit = -1, offset = 100 000)

Method	Time
SQL API	2745,061 ± 1392,481 ms
raw SQL	5848,527 ± 5638,261 ms

Take into consideration that the errors are quite huge ! A good idea would be to take the time provided by Spark UI for the requests execution only

arnaudframmery commented 2 years ago

When we are looking at the time given by Spark UI (in the Databricks cluster)

List query with limit and offset (limit = 100 000, offset = 0)

Method	Time
SQL API	0.45 s
raw SQL	0.31 s

List query with limit and offset (limit = 100 000, offset = 100 000)

Method	Time
SQL API	0.47 s
raw SQL	0.22 s

List query with limit and offset (limit = -1, offset = 100 000)

Method	Time
SQL API	0.5 s
raw SQL	0.5 s

arnaudframmery commented 2 years ago

Here are the query plans generated by Spark

SQL API (same for all sets of param)

Databricks Shell - Details for Query 25

raw SQL (limit = 100 000, offset = 0)

Databricks Shell - Details for Query 105 (1)

SQL API (limit = 100 000, offset = 100 000)

Databricks Shell - Details for Query 131

SQL API (limit = -1, offset = 100 000)

Databricks Shell - Details for Query 153

arnaudframmery commented 2 years ago

The new values for the implementation of conditions and aggregations with Scala operators

Running on the Databricks cluster (1 node or 3 nodes) and figures are given by the Spark UI.

1 node = 4 cores & 14 GB
3 warmups and 10 iterations

The query is like the following : SELECT Severity, COUNT(*) AS severity_count FROM us_accidents WHERE Crossing = "true" GROUP BY Severity

Dataset with 1.5 millions lines (47 columns)

Method	1 node	3 nodes	Speed up
SQL API	0.57 s	0.51 s	1.12
raw SQL	0.56 s	0.48 s	1.17

Dataset with 15 millions lines (47 columns)

Method	1 node	3 nodes	Speed up
SQL API	5 s	3 s	1.67
raw SQL	5 s	2.7 s	1.85

Dataset with 150 millions lines (47 columns)

Method	1 node	3 nodes	Speed up
SQL API	13.9 s	5 s	2.78
raw SQL	14 s	4.8 s	2.92